Add Semantic Caching to Your LLM App with Redis

Build a Python layer that stores LLM responses by embedding and retrieves them by semantic similarity, so paraphrased questions skip the API entirely.

Mariana Souza
Senior Editor · Jun 30, 2026 · 8 min read

What You'll Build

A drop-in Python caching layer that embeds incoming queries, searches Redis for a semantically similar stored response, and only calls the LLM when nothing close enough exists. Identical intent, different wording: cache hit, no API call.
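
Before Redis enters the picture, the control flow can be sketched entirely in memory. This is a toy stand-in, not the module built below: the class name InMemorySemanticCache, the injected embed and llm callables, and the hand-written 2-D "embeddings" are all assumptions made up for illustration.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Toy cache: stored (vector, answer) pairs scanned linearly."""

    def __init__(self, embed: Callable[[str], Vector],
                 llm: Callable[[str], str], threshold: float = 0.15) -> None:
        self.embed = embed
        self.llm = llm
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, vec: Vector) -> Optional[str]:
        """Return the closest stored answer if it is within the threshold."""
        best_answer, best_dist = None, float("inf")
        for stored_vec, answer in self.entries:
            dist = cosine_distance(vec, stored_vec)
            if dist < best_dist:
                best_answer, best_dist = answer, dist
        return best_answer if best_dist < self.threshold else None

    def ask(self, query: str) -> tuple[str, bool]:
        """Return (answer, from_cache); call the LLM only on a miss."""
        vec = self.embed(query)
        cached = self.lookup(vec)
        if cached is not None:
            return cached, True
        answer = self.llm(query)
        self.entries.append((vec, answer))
        return answer, False


# Hand-written vectors, made up for the demo (not real model output).
FAKE_EMBEDDINGS = {
    "What is the capital of France?": [1.0, 0.0],
    "Which city serves as France's capital?": [0.95, 0.1],
    "How do I reverse a string in Python?": [0.0, 1.0],
}
llm_calls: list[str] = []


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"


cache = InMemorySemanticCache(FAKE_EMBEDDINGS.__getitem__, fake_llm)
results = [cache.ask(q)[1] for q in FAKE_EMBEDDINGS]
print(results, len(llm_calls))
```

The paraphrase reuses the first answer, so the fake LLM is called twice for three queries. The rest of the tutorial swaps the list for a Redis vector index and the fake callables for real API calls.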

Prerequisites

  • Python 3.10 or newer
  • Docker (any runtime; commands below are Linux/macOS shell syntax)
  • An OpenAI API key set as OPENAI_API_KEY in your environment
  • Basic Redis familiarity helpful but not required

Step 1: Start Redis Stack

Redis releases before 8.0 don't ship with vector search. Redis Stack bundles RediSearch, which adds the FT.* commands and vector indexing. Start it with Docker:

docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Port 8001 is RedisInsight, a browser UI at http://localhost:8001 that's useful for inspecting cache keys. The container has no persistence by default; add -v /your/path:/data if you need durability across restarts.

Step 2: Install Dependencies

pip install "redis>=4.3.0" openai numpy

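The 4.3.0 floor is where redis.commands.search (the Python wrapper for FT.*) stabilized. Newer redis-py releases also expose the index-definition module as redis.commands.search.index_definition; if the indexDefinition import in Step 3 fails on your version, switch to that spelling. numpy handles the float32 embedding arrays and the tobytes() call that packs them into Redis hashes.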

Step 3: Build the Cache Module

Create semantic_cache.py:

import hashlib
import os

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from openai import OpenAI

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
DISTANCE_THRESHOLD = 0.15  # cosine distance: 0 = identical, 2 = opposite

r = redis.from_url(
    os.getenv("REDIS_URL", "redis://localhost:6379"),
    decode_responses=False,  # required: embedding bytes cannot be UTF-8 decoded
)
openai_client = OpenAI()  # reads OPENAI_API_KEY from environment

INDEX_NAME = "sem_cache"
KEY_PREFIX = "sem_cache:"


def ensure_index() -> None:
    """Create the vector index once; no-op if it already exists."""
    try:
        r.ft(INDEX_NAME).info()
    except redis.exceptions.ResponseError:
        r.ft(INDEX_NAME).create_index(
            fields=[
                TextField("query"),
                TextField("response"),
                VectorField(
                    "embedding",
                    "HNSW",
                    {
                        "TYPE": "FLOAT32",
                        "DIM": EMBEDDING_DIM,
                        "DISTANCE_METRIC": "COSINE",
                    },
                ),
            ],
            definition=IndexDefinition(
                prefix=[KEY_PREFIX], index_type=IndexType.HASH
            ),
        )


def embed(text: str) -> np.ndarray:
    result = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return np.array(result.data[0].embedding, dtype=np.float32)


def cache_get(vec: np.ndarray) -> str | None:
    q = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("response", "dist")
        .dialect(2)
    )
    results = r.ft(INDEX_NAME).search(q, query_params={"vec": vec.tobytes()})
    if not results.total:
        return None
    hit = results.docs[0]
    # With decode_responses=False, distance and string fields come back as bytes.
    dist = float(hit.dist.decode() if isinstance(hit.dist, bytes) else hit.dist)
    if dist < DISTANCE_THRESHOLD:
        return hit.response.decode() if isinstance(hit.response, bytes) else hit.response
    return None


def cache_set(query: str, response: str, vec: np.ndarray) -> None:
    key = KEY_PREFIX + hashlib.sha256(query.encode()).hexdigest()
    r.hset(
        key,
        mapping={
            "query": query,
            "response": response,
            "embedding": vec.tobytes(),
        },
    )


def ask(query: str) -> tuple[str, bool]:
    vec = embed(query)
    cached = cache_get(vec)
    if cached is not None:  # an empty-string answer is still a valid hit
        return cached, True
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    response_text = completion.choices[0].message.content
    cache_set(query, response_text, vec)
    return response_text, False

A few things worth calling out:

decode_responses=False is the safe setting here. Embedding vectors are raw float32 bytes; with decode_responses=True the client tries to UTF-8 decode every reply, so any reply that includes the embedding field (an HGETALL on a cache key, for example) raises UnicodeDecodeError. Writes are unaffected either way, because redis-py sends bytes values as-is. The flip side: depending on the redis-py version, fields returned by FT.SEARCH, including the distance score, may arrive as bytes or as str. That's why cache_get checks the type before decoding hit.dist and passing it to float().
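
To see why embedding bytes and UTF-8 decoding don't mix, here is a small standard-library sketch. It uses struct rather than numpy, which on little-endian machines produces the same layout as tobytes() on a float32 array. The helper names pack_float32, unpack_float32, and check_blob are hypothetical, not part of semantic_cache.py.

```python
import struct


def pack_float32(values: list[float]) -> bytes:
    """Pack floats as little-endian float32, 4 bytes per component."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_float32(blob: bytes) -> list[float]:
    """Inverse of pack_float32."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def check_blob(blob: bytes, dim: int) -> None:
    """Raise if the blob is not exactly `dim` float32 values long."""
    expected = dim * 4
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes for DIM={dim}, got {len(blob)}")


blob = pack_float32([1.0, -0.5, 0.25])
check_blob(blob, dim=3)  # passes: 3 components * 4 bytes = 12 bytes
roundtrip = unpack_float32(blob)  # these values are exactly representable

# 1.0 packs to bytes that include 0x80, which is not valid UTF-8 on its own.
try:
    blob.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(len(blob), roundtrip, decodable)
```

A length check like check_blob is also a cheap guard before writing: a FLOAT32 vector field expects DIM * 4 bytes, and a blob of any other size will not be indexed as intended.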

HNSW vs FLAT: HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor algorithm, fast at query time with a small accuracy tradeoff. FLAT does exact linear search and is fine up to roughly 5,000 vectors. For a cache that grows, HNSW is the right default.
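
If you'd rather choose the algorithm from the expected cache size, one option is a small helper that returns the pair of arguments VectorField takes. This is a sketch under assumptions: vector_index_args and FLAT_CUTOFF are hypothetical names, the cutoff is just the rough figure quoted above, and the HNSW values shown are what I understand to be RediSearch's defaults, so check them against the docs for your version.

```python
from typing import Any

EMBEDDING_DIM = 1536
FLAT_CUTOFF = 5_000  # rough figure from the text above, not a hard limit


def vector_index_args(expected_vectors: int) -> tuple[str, dict[str, Any]]:
    """Pick an algorithm name and attribute dict for a vector field."""
    attrs: dict[str, Any] = {
        "TYPE": "FLOAT32",
        "DIM": EMBEDDING_DIM,
        "DISTANCE_METRIC": "COSINE",
    }
    if expected_vectors <= FLAT_CUTOFF:
        return "FLAT", attrs  # exact linear scan, no tuning knobs required
    # Optional HNSW tuning knobs (assumed defaults; verify for your version).
    attrs.update({"M": 16, "EF_CONSTRUCTION": 200, "EF_RUNTIME": 10})
    return "HNSW", attrs


algorithm, attributes = vector_index_args(expected_vectors=50_000)
print(algorithm, attributes)
```

In ensure_index, the pair would slot in as VectorField("embedding", algorithm, attributes). Raising EF_RUNTIME trades query speed for recall; changing the algorithm or DIM means dropping and rebuilding the index.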

The distance threshold: Cosine distance in Redis is 1 - cosine_similarity, so 0.15 is tight but practical for "same question, different phrasing." Push it toward 0.25-0.30 for broader matching, but watch for false positives where different questions get each other's answers.
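
To get a feel for what a given threshold admits, it can help to convert it back to a cosine similarity and an angle between embeddings. A minimal sketch; the helper names are hypothetical and the candidate thresholds are the ones mentioned above.

```python
import math

DISTANCE_THRESHOLD = 0.15  # same value as in semantic_cache.py


def distance_to_similarity(distance: float) -> float:
    """Invert the COSINE metric: distance = 1 - cosine_similarity."""
    return 1.0 - distance


def max_angle_degrees(threshold: float) -> float:
    """Largest angle between two embeddings that still counts as a hit."""
    return math.degrees(math.acos(distance_to_similarity(threshold)))


def is_hit(distance: float, threshold: float = DISTANCE_THRESHOLD) -> bool:
    """Mirror the strict comparison used in cache_get."""
    return distance < threshold


for threshold in (0.15, 0.25, 0.30):
    print(
        f"threshold {threshold:.2f}: similarity > "
        f"{distance_to_similarity(threshold):.2f}, "
        f"angle < {max_angle_degrees(threshold):.1f} degrees"
    )
```

Moving from 0.15 to 0.30 widens the accepted angle from roughly 32 degrees to roughly 46 degrees, which is why false positives climb quickly as the threshold loosens.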

Step 4: Test It

Add this to the bottom of semantic_cache.py:

if __name__ == "__main__":
    ensure_index()

    queries = [
        "What is the capital of France?",
        "Which city serves as France's capital?",  # close enough to hit cache
        "How do I reverse a string in Python?",    # new topic, cache miss
    ]

    for q in queries:
        answer, from_cache = ask(q)
        label = "CACHE" if from_cache else "LLM  "
        print(f"[{label}] {q}\n       {answer[:100]}\n")

Then run it:

python semantic_cache.py

Verify It Works

Expected output (the exact LLM wording will vary):

[LLM  ] What is the capital of France?
       Paris is the capital of France...

[CACHE] Which city serves as France's capital?
       Paris is the capital of France...

[LLM  ] How do I reverse a string in Python?
       You can reverse a string in Python using slicing...

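The second query should be a cache hit, provided its embedding lands within 0.15 cosine distance of the first. If it shows up as LLM instead, the paraphrase fell just outside the threshold; loosen DISTANCE_THRESHOLD slightly and rerun. You can confirm stored keys with: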

redis-cli -p 6379 KEYS "sem_cache:*"

Or browse them in RedisInsight at http://localhost:8001.

Troubleshooting

ConnectionError: Error 111 connecting to localhost:6379
Redis Stack isn't running. Check docker ps; if the container is stopped, run docker start redis-stack.

ResponseError: Unknown Index name (cache_get called before ensure_index)
Call ensure_index() once at application startup, not inside the request path. In a FastAPI or Flask app, put it in the startup hook (FastAPI's lifespan handler, for example) or the app factory.

Index schema conflict (ResponseError on FT.CREATE or FT.SEARCH)
A prior run created an index with a different schema under the same name. Because ensure_index() is a no-op when the index already exists, the old schema stays in place and can surface as query errors or missing hits rather than a failure at creation. Drop it and recreate: redis-cli FT.DROPINDEX sem_cache DD (the DD flag also deletes the indexed documents), then rerun ensure_index().

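Cache never hits, even for the same query pasted twice
The hash is probably not being indexed. A FLOAT32 vector field expects a blob of exactly DIM * 4 bytes, so check that embed() returns float32 (not NumPy's default float64) and that EMBEDDING_DIM matches the model's output size. redis-cli FT.INFO sem_cache reports num_docs and hash_indexing_failures; a failure count that rises with each write confirms the problem. Also confirm that keys start with the sem_cache: prefix the index watches. decode_responses is not the culprit here: redis-py writes bytes values unchanged under either setting.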

Next Steps

  • Add TTLs: Call r.expire(key, 86400) after r.hset(...) to auto-expire entries after 24 hours so stale answers don't persist indefinitely.
  • Hybrid filtering: RediSearch supports pre-filtering by tag or numeric field before the KNN step, so you can scope the cache per user, model, or topic. See the RediSearch query syntax docs.
  • Swap the embedding model: Replace embed() with any provider (Cohere, Voyage, or a local model via sentence-transformers). Update EMBEDDING_DIM to match the new model's output size and rebuild the index.
  • LangChain ships a RedisSemanticCache class that follows the same pattern if you prefer a maintained abstraction over rolling your own.
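
As a sketch of the TTL suggestion above, the write and the expiry can go through one pipeline so an entry is never stored without its TTL. The function name cache_set_with_ttl is hypothetical; client is assumed to be a redis-py client such as the r object in semantic_cache.py.

```python
from typing import Any, Mapping

ONE_DAY_SECONDS = 86_400


def cache_set_with_ttl(
    client: Any,
    key: str,
    fields: Mapping[str, Any],
    ttl_seconds: int = ONE_DAY_SECONDS,
) -> None:
    """Write the hash and set its expiry in a single MULTI/EXEC transaction."""
    pipe = client.pipeline(transaction=True)
    pipe.hset(key, mapping=dict(fields))
    pipe.expire(key, ttl_seconds)
    pipe.execute()
```

Inside cache_set, the r.hset(...) call would become cache_set_with_ttl(r, key, {"query": query, "response": response, "embedding": vec.tobytes()}).
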
Written by
Mariana Souza · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.
