Add Semantic Caching to Your LLM App with Redis

Build a Python layer that stores LLM responses by embedding and retrieves them by semantic similarity, so paraphrased questions skip the chat completion call entirely.

Mariana Souza

Senior Editor · Jun 30, 2026 · 8 min read

What You'll Build

A drop-in Python caching layer that embeds incoming queries, searches Redis for a semantically similar stored response, and only calls the LLM when nothing close enough exists. Identical intent, different wording: cache hit, no completion call. Every query still costs one embedding request, which is typically much cheaper than a completion.

Prerequisites

Python 3.10 or newer
Docker (any runtime; commands below are Linux/macOS shell syntax)
An OpenAI API key set as OPENAI_API_KEY in your environment
Basic Redis familiarity helpful but not required

Step 1: Start Redis Stack

Redis releases before 8.0 don't ship with vector search. Redis Stack bundles RediSearch, which adds the FT.* commands and vector indexing. Start it with Docker:

docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Port 8001 is RedisInsight, a browser UI at http://localhost:8001 that's useful for inspecting cache keys. The container has no persistence by default; add -v /your/path:/data if you need durability across restarts.

Step 2: Install Dependencies

pip install "redis>=4.3.0" openai numpy

redis>=4.3.0 is when redis.commands.search (the Python wrapper for FT.*) stabilized. numpy builds the float32 embedding arrays and packs them into raw bytes with tobytes() before they are written to Redis hashes.
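For intuition about what tobytes() produces, here is a standard-library sketch of the same float32 packing. It assumes a little-endian layout (the native layout on common x86 and ARM machines), and the helper names pack_float32 and unpack_float32 are illustrative, not part of the tutorial's module.

```python
import struct


def pack_float32(values: list[float]) -> bytes:
    """Pack floats as consecutive little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_float32(blob: bytes) -> list[float]:
    """Inverse of pack_float32; the blob length must be a multiple of 4."""
    if len(blob) % 4:
        raise ValueError("blob length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


blob = pack_float32([0.5, -1.0, 0.25])
print(len(blob))             # 12: three values at 4 bytes each
print(unpack_float32(blob))  # [0.5, -1.0, 0.25]
print(1536 * 4)              # 6144: byte length of one 1536-dimension embedding
```

A float64 array would produce 8 bytes per value and no longer match a FLOAT32 index of the same DIM, which is why the tutorial's embed() sets dtype=np.float32 explicitly.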

Step 3: Build the Cache Module

Create semantic_cache.py:

import hashlib
import os

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from openai import OpenAI

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
DISTANCE_THRESHOLD = 0.15  # cosine distance: 0 = identical, 2 = opposite

r = redis.from_url(
    os.getenv("REDIS_URL", "redis://localhost:6379"),
    decode_responses=False,  # keep replies as bytes: raw embedding bytes are not valid UTF-8
)
openai_client = OpenAI()  # reads OPENAI_API_KEY from environment

INDEX_NAME = "sem_cache"
KEY_PREFIX = "sem_cache:"


def ensure_index() -> None:
    """Create the vector index once; no-op if it already exists."""
    try:
        r.ft(INDEX_NAME).info()
    except redis.exceptions.ResponseError:
        r.ft(INDEX_NAME).create_index(
            fields=[
                TextField("query"),
                TextField("response"),
                VectorField(
                    "embedding",
                    "HNSW",
                    {
                        "TYPE": "FLOAT32",
                        "DIM": EMBEDDING_DIM,
                        "DISTANCE_METRIC": "COSINE",
                    },
                ),
            ],
            definition=IndexDefinition(
                prefix=[KEY_PREFIX], index_type=IndexType.HASH
            ),
        )


def embed(text: str) -> np.ndarray:
    result = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return np.array(result.data[0].embedding, dtype=np.float32)


def cache_get(vec: np.ndarray) -> str | None:
    q = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("response", "dist")
        .dialect(2)
    )
    results = r.ft(INDEX_NAME).search(q, query_params={"vec": vec.tobytes()})
    if not results.total:
        return None
    hit = results.docs[0]
    # Fields may arrive as bytes or str depending on client version; handle both.
    dist = float(hit.dist.decode() if isinstance(hit.dist, bytes) else hit.dist)
    if dist < DISTANCE_THRESHOLD:
        return hit.response.decode() if isinstance(hit.response, bytes) else hit.response
    return None


def cache_set(query: str, response: str, vec: np.ndarray) -> None:
    key = KEY_PREFIX + hashlib.sha256(query.encode()).hexdigest()
    r.hset(
        key,
        mapping={
            "query": query,
            "response": response,
            "embedding": vec.tobytes(),
        },
    )


def ask(query: str) -> tuple[str, bool]:
    vec = embed(query)
    cached = cache_get(vec)
    if cached is not None:  # an empty cached answer still counts as a hit
        return cached, True
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    response_text = completion.choices[0].message.content
    cache_set(query, response_text, vec)
    return response_text, False

A few things worth calling out:

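decode_responses=False keeps replies as raw bytes. Embedding vectors are stored as raw float32 bytes, which are generally not valid UTF-8. The setting only affects how replies are decoded, so writes succeed either way; with True, any reply that includes the raw vector (an HGETALL on a cache key, for example) makes the client attempt UTF-8 decoding and raise UnicodeDecodeError. The flip side: with decode_responses=False, fields returned by FT.SEARCH, including the distance score, may arrive as bytes rather than str depending on the redis-py version. That's why cache_get checks the type and decodes hit.dist before passing it to float().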
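To see why raw vector bytes and UTF-8 decoding don't mix, here is a small standard-library sketch. The helper is_utf8_decodable is a hypothetical name used only for illustration, and the two-value vector stands in for a real embedding.

```python
import struct


def is_utf8_decodable(blob: bytes) -> bool:
    """Return True if blob is valid UTF-8, False if decoding would raise."""
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text_field = "Paris is the capital of France.".encode("utf-8")
vector_field = struct.pack("<2f", -1.0, 0.5)  # b"\x00\x00\x80\xbf\x00\x00\x00?"

print(is_utf8_decodable(text_field))    # True
print(is_utf8_decodable(vector_field))  # False: byte 0x80 cannot start a UTF-8 sequence
```

Some float patterns happen to decode cleanly (0.5 on its own does), so the failure is data-dependent, which makes it easy to miss in a quick test.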

HNSW vs FLAT: HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor algorithm: fast at query time, with a small recall tradeoff. FLAT does an exact brute-force scan of every vector, which is fine for small indexes (roughly 5,000 vectors as a rule of thumb). For a cache that grows, HNSW is the right default.
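As a sketch of how that choice could be expressed in code, the hypothetical helper below returns the algorithm name and attribute dictionary for a VectorField based on the expected cache size. The 5,000 cutoff is the rule of thumb from the text; M, EF_CONSTRUCTION, and EF_RUNTIME are HNSW tuning attributes, and the values shown are illustrative starting points to check against the Redis vector documentation, not tuned recommendations.

```python
def vector_field_args(expected_vectors: int, dim: int = 1536) -> tuple[str, dict]:
    """Pick FLAT for small caches and HNSW otherwise (assumed cutoff: 5,000)."""
    base = {"TYPE": "FLOAT32", "DIM": dim, "DISTANCE_METRIC": "COSINE"}
    if expected_vectors <= 5_000:
        # Exact brute-force scan: no graph to build, no recall loss.
        return "FLAT", base
    # Approximate graph search; larger values trade memory and build or
    # query time for recall. These numbers are illustrative only.
    hnsw = {**base, "M": 16, "EF_CONSTRUCTION": 200, "EF_RUNTIME": 10}
    return "HNSW", hnsw


algorithm, attributes = vector_field_args(50_000)
print(algorithm)           # HNSW
print(sorted(attributes))  # attribute names passed to the index definition
```

In ensure_index the result could be unpacked as VectorField("embedding", *vector_field_args(50_000)). Switching algorithms on an existing index means dropping and recreating it.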

The distance threshold: Cosine distance in Redis is 1 - cosine_similarity, so 0.15 is tight but practical for "same question, different phrasing." Push it toward 0.25-0.30 for broader matching, but watch for false positives where different questions get each other's answers.
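The threshold arithmetic can be checked by hand with a pure-Python sketch. The two-dimensional vectors below are toy stand-ins for real 1536-dimension embeddings, and cosine_distance and is_cache_hit are illustrative helpers that mirror the comparison in cache_get rather than code from the module.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine_similarity(a, b): 0 = same direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_cache_hit(distance: float, threshold: float = 0.15) -> bool:
    """Same strict comparison cache_get applies against DISTANCE_THRESHOLD."""
    return distance < threshold


stored = [1.0, 0.0]      # toy stand-in for a cached query's embedding
paraphrase = [0.9, 0.1]  # nearly the same direction
looser = [0.8, 0.6]      # cosine similarity of about 0.8 with `stored`

d_close = cosine_distance(stored, paraphrase)
d_far = cosine_distance(stored, looser)
print(round(d_close, 3), is_cache_hit(d_close))  # about 0.006, a hit
print(round(d_far, 3), is_cache_hit(d_far), is_cache_hit(d_far, threshold=0.25))
```

The last line shows the tradeoff: a distance of about 0.2 misses at the default threshold and hits at 0.25, so loosening the threshold admits matches that the tighter setting rejects.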

Step 4: Test It

Add this to the bottom of semantic_cache.py:

if __name__ == "__main__":
    ensure_index()

    queries = [
        "What is the capital of France?",
        "Which city serves as France's capital?",  # close enough to hit cache
        "How do I reverse a string in Python?",    # new topic, cache miss
    ]

    for q in queries:
        answer, from_cache = ask(q)
        label = "CACHE" if from_cache else "LLM  "
        print(f"[{label}] {q}\n       {answer[:100]}\n")

Then run it:

python semantic_cache.py

Verify It Works

Expected output:

[LLM  ] What is the capital of France?
       Paris is the capital of France...

[CACHE] Which city serves as France's capital?
       Paris is the capital of France...

[LLM  ] How do I reverse a string in Python?
       You can reverse a string in Python using slicing...

The second query should be a cache hit, provided its embedding lands within 0.15 cosine distance of the first; the exact answer text will vary between runs. You can confirm stored keys with:

redis-cli -p 6379 KEYS "sem_cache:*"

Or browse them in RedisInsight at http://localhost:8001.

Troubleshooting

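ConnectionError: Error 111 connecting to localhost:6379
Redis Stack isn't running. Check docker ps; if the container is stopped, run docker start redis-stack.

ResponseError: Unknown Index name when calling cache_get before ensure_index
Call ensure_index() once at application startup, not inside the request path. In a FastAPI or Flask app, put it in the startup event or app factory.

Index schema conflict
A prior run created an index with a different schema (a different DIM, for example) under the same name. ensure_index() sees the existing index and leaves it alone, so the stale schema persists silently; calling create_index directly raises ResponseError: Index already exists. Drop it and recreate: redis-cli FT.DROPINDEX sem_cache DD (the DD flag also deletes the indexed documents), then rerun ensure_index().

Cache never hits, even for the same query pasted twice
decode_responses only affects how replies are decoded, so it is not the cause. Check that the stored embedding is float32 with exactly EMBEDDING_DIM values: a blob of the wrong byte length (a float64 array, or a model with a different dimension) is not indexed as a vector. Run redis-cli FT.INFO sem_cache and compare num_docs with the number of cache keys; a nonzero hash_indexing_failures count points to malformed documents. Also confirm the keys start with the sem_cache: prefix the index watches.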

Next Steps

Add TTLs: Call r.expire(key, 86400) after r.hset(...) to auto-expire entries after 24 hours so stale answers don't persist indefinitely.
Hybrid filtering: RediSearch supports pre-filtering by tag or numeric field before the KNN step, so you can scope the cache per user, model, or topic. See the RediSearch query syntax docs.
Swap the embedding model: Replace embed() with any provider (Cohere, Voyage, or a local model via sentence-transformers). Update EMBEDDING_DIM to match the new model's output size and rebuild the index.
Use a maintained abstraction: LangChain ships a RedisSemanticCache class that follows the same pattern if you prefer that over rolling your own.
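As a sketch of the hybrid filtering idea above, the helper below builds a KNN query string that first restricts candidates to one tag value. It assumes you have added a TagField("model") to the index schema and write a matching model field in cache_set; neither exists in the tutorial's module, and escape_tag and filtered_knn_query are hypothetical names.

```python
import re

# Punctuation and whitespace that must be backslash-escaped inside a tag filter.
_TAG_SPECIALS = re.compile(r"([,.<>{}\[\]\"':;!@#$%^&*()\-+=~\s])")


def escape_tag(value: str) -> str:
    """Backslash-escape characters that the query parser treats as syntax."""
    return _TAG_SPECIALS.sub(r"\\\1", value)


def filtered_knn_query(tag_field: str, tag_value: str, k: int = 1) -> str:
    """Build '(@field:{value})=>[KNN k @embedding $vec AS dist]'."""
    tag_filter = f"@{tag_field}:{{{escape_tag(tag_value)}}}"
    return f"({tag_filter})=>[KNN {k} @embedding $vec AS dist]"


print(filtered_knn_query("model", "gpt-4o-mini"))
# (@model:{gpt\-4o\-mini})=>[KNN 1 @embedding $vec AS dist]
```

The returned string would replace the "*=>[KNN 1 ...]" expression passed to Query(...) in cache_get, with the same sort_by, return_fields, and dialect(2) calls, so only entries cached for that model are considered.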
