Add Semantic Caching to Your LLM App with Redis
Build a Python layer that stores LLM responses by embedding and retrieves them by semantic similarity, so paraphrased questions skip the completion call entirely.
What You'll Build
A drop-in Python caching layer that embeds incoming queries, searches Redis for a semantically similar stored response, and only calls the LLM when nothing close enough exists. Identical intent, different wording: cache hit, no completion call (only the much cheaper embedding request).
Prerequisites
- Python 3.10 or newer
- Docker (any runtime; commands below are Linux/macOS shell syntax)
- An OpenAI API key set as OPENAI_API_KEY in your environment
- Basic Redis familiarity helpful but not required
Step 1: Start Redis Stack
Stock Redis doesn't ship with vector search. Redis Stack bundles RediSearch, which adds the FT.* commands and vector indexing. Start it with Docker:
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
Port 8001 is RedisInsight, a browser UI at http://localhost:8001 that's useful for inspecting cache keys. The container has no persistence by default; add -v /your/path:/data if you need durability across restarts.
Step 2: Install Dependencies
pip install "redis>=4.3.0" openai numpy
redis>=4.3.0 is when redis.commands.search (the Python wrapper for FT.*) stabilized. numpy is only needed for the tobytes() call when packing float32 embedding arrays into Redis hashes.
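To make the tobytes() step concrete, here is a small sketch of the byte layout that ends up in the Redis hash. It uses the standard library's array module as a stand-in for numpy (an assumption, so the snippet runs with no extra packages); pack_float32 and unpack_float32 are illustrative names, not part of the tutorial's module.

```python
from array import array

EMBEDDING_DIM = 1536  # output size of text-embedding-3-small, as used in this tutorial


def pack_float32(values: list[float]) -> bytes:
    """Pack floats as contiguous float32 in native byte order,
    the same layout a float32 ndarray produces with tobytes()."""
    return array("f", values).tobytes()


def unpack_float32(blob: bytes) -> list[float]:
    """Inverse of pack_float32: read the bytes back as float32 values."""
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


blob = pack_float32([0.0] * EMBEDDING_DIM)
print(len(blob))  # 4 bytes per float32, so 1536 * 4 = 6144
```

The byte length is worth remembering: a stored embedding whose length is not DIM × 4 does not match the index definition.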
Step 3: Build the Cache Module
Create semantic_cache.py:
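import hashlib
import os

import numpy as np
import redis
from openai import OpenAI
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.query import Query

try:
    # Newer redis-py releases expose this module under a snake_case name.
    from redis.commands.search.index_definition import IndexDefinition, IndexType
except ImportError:
    # Older releases use the camelCase module name.
    from redis.commands.search.indexDefinition import IndexDefinition, IndexType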

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
DISTANCE_THRESHOLD = 0.15  # cosine distance: 0 = identical, 2 = opposite

r = redis.from_url(
    os.getenv("REDIS_URL", "redis://localhost:6379"),
    decode_responses=False,  # required: embedding bytes cannot be UTF-8 decoded
)
openai_client = OpenAI()  # reads OPENAI_API_KEY from environment

INDEX_NAME = "sem_cache"
KEY_PREFIX = "sem_cache:"


def ensure_index() -> None:
    """Create the vector index once; no-op if it already exists."""
    try:
        r.ft(INDEX_NAME).info()
    except redis.exceptions.ResponseError:
        r.ft(INDEX_NAME).create_index(
            fields=[
                TextField("query"),
                TextField("response"),
                VectorField(
                    "embedding",
                    "HNSW",
                    {
                        "TYPE": "FLOAT32",
                        "DIM": EMBEDDING_DIM,
                        "DISTANCE_METRIC": "COSINE",
                    },
                ),
            ],
            definition=IndexDefinition(
                prefix=[KEY_PREFIX], index_type=IndexType.HASH
            ),
        )


def embed(text: str) -> np.ndarray:
    """Return the query embedding as a float32 vector."""
    result = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return np.array(result.data[0].embedding, dtype=np.float32)


def cache_get(vec: np.ndarray) -> str | None:
    """Return the nearest cached response if it is within the threshold."""
    q = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("response", "dist")
        .dialect(2)
    )
    results = r.ft(INDEX_NAME).search(q, query_params={"vec": vec.tobytes()})
    if not results.total:
        return None
    hit = results.docs[0]
    # With decode_responses=False, fields may come back as bytes; handle both.
    dist = float(hit.dist.decode() if isinstance(hit.dist, bytes) else hit.dist)
    if dist < DISTANCE_THRESHOLD:
        return hit.response.decode() if isinstance(hit.response, bytes) else hit.response
    return None


def cache_set(query: str, response: str, vec: np.ndarray) -> None:
    """Store the query, response, and embedding under a key derived from the query text."""
    key = KEY_PREFIX + hashlib.sha256(query.encode()).hexdigest()
    r.hset(
        key,
        mapping={
            "query": query,
            "response": response,
            "embedding": vec.tobytes(),
        },
    )


def ask(query: str) -> tuple[str, bool]:
    """Return (answer, from_cache). Calls the LLM only on a cache miss."""
    vec = embed(query)
    cached = cache_get(vec)
    # Compare against None so an empty cached string still counts as a hit.
    if cached is not None:
        return cached, True
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    response_text = completion.choices[0].message.content
    cache_set(query, response_text, vec)
    return response_text, False

A few things worth calling out:
decode_responses=False is the safe setting here. Embedding vectors are raw float32 bytes; with decode_responses=True the client tries to UTF-8 decode every reply, so any read that returns the embedding field (an HGETALL on a cache key, for example) fails with a decode error. The flip side: with decode_responses=False, fields returned by FT.SEARCH can come back as bytes, including the distance score. That's why cache_get checks for bytes and decodes hit.dist before passing it to float().
HNSW vs FLAT: HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor algorithm, fast at query time with a small accuracy tradeoff. FLAT does exact linear search and is fine up to roughly 5,000 vectors. For a cache that grows, HNSW is the right default.
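As a rough mental model of what FLAT does, the sketch below runs an exact nearest-neighbor search by scanning every stored vector in plain Python. It illustrates the idea only and is not how Redis implements it; flat_knn and the toy three-dimensional vectors are made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the definition used by the COSINE metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def flat_knn(query: list[float], stored: dict[str, list[float]]) -> tuple[str, float]:
    """Exact 1-nearest-neighbor: compare the query against every stored vector."""
    return min(
        ((key, cosine_distance(query, vec)) for key, vec in stored.items()),
        key=lambda pair: pair[1],
    )


stored = {
    "capital-of-france": [1.0, 0.0, 0.0],
    "reverse-a-string": [0.0, 1.0, 0.0],
}
key, dist = flat_knn([0.9, 0.1, 0.0], stored)
print(key)  # capital-of-france
```

The cost of this scan grows linearly with the number of entries. HNSW avoids the full scan by walking a layered graph of neighbors, which is where both its speed and its small accuracy tradeoff come from.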
The distance threshold: Cosine distance in Redis is 1 - cosine_similarity, so 0.15 is tight but practical for "same question, different phrasing." Push it toward 0.25-0.30 for broader matching, but watch for false positives where different questions get each other's answers.
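To get a feel for what a given threshold means, it can help to convert it back to a similarity and an angle. The helper names below are illustrative and not part of the cache module; the arithmetic is just distance = 1 - similarity and angle = arccos(similarity).

```python
import math

DISTANCE_THRESHOLD = 0.15  # same value as in semantic_cache.py


def min_similarity(threshold: float) -> float:
    """Smallest cosine similarity that still counts as a hit."""
    return 1.0 - threshold


def max_angle_degrees(threshold: float) -> float:
    """Largest angle between two embeddings that still counts as a hit."""
    return math.degrees(math.acos(min_similarity(threshold)))


def is_hit(distance: float, threshold: float = DISTANCE_THRESHOLD) -> bool:
    """Same strict comparison that cache_get applies."""
    return distance < threshold


for t in (0.15, 0.25, 0.30):
    print(
        f"threshold={t:.2f}  similarity>{min_similarity(t):.2f}  "
        f"angle<{max_angle_degrees(t):.1f} deg"
    )
```

A stored entry at distance 0.20 is a miss at the default 0.15 but becomes a hit at 0.25, which is exactly the range where false positives start to appear.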
Step 4: Test It
Add this to the bottom of semantic_cache.py:
if __name__ == "__main__":
    ensure_index()
    queries = [
        "What is the capital of France?",
        "Which city serves as France's capital?",  # close enough to hit cache
        "How do I reverse a string in Python?",  # new topic, cache miss
    ]
    for q in queries:
        answer, from_cache = ask(q)
        label = "CACHE" if from_cache else "LLM  "
        print(f"[{label}] {q}\n  {answer[:100]}\n")
python semantic_cache.py
Verify It Works
Expected output:
[LLM ] What is the capital of France?
Paris is the capital of France...
[CACHE] Which city serves as France's capital?
Paris is the capital of France...
[LLM ] How do I reverse a string in Python?
You can reverse a string in Python using slicing...
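The second query is a cache hit when its embedding lands within 0.15 cosine distance of the first. Exact distances depend on the embedding model, so if you see a miss there, raise DISTANCE_THRESHOLD slightly and rerun. You can confirm stored keys with: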
redis-cli -p 6379 KEYS "sem_cache:*"
Or browse them in RedisInsight at http://localhost:8001.
Troubleshooting
ConnectionError: Error 111 connecting to localhost:6379
Redis Stack isn't running. Check docker ps; if the container is stopped, docker start redis-stack.
ResponseError: Unknown Index name when calling cache_get before ensure_index
Call ensure_index() once at application startup, not inside the request path. In a FastAPI or Flask app, put it in the startup event or app factory.
Index schema conflict (ResponseError on FT.CREATE)
A prior run created an index with a different schema under the same name. Drop it and recreate: redis-cli FT.DROPINDEX sem_cache DD (the DD flag also deletes the indexed documents), then rerun ensure_index().
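The same reset can be done from Python. This is a minimal sketch that takes the client as a parameter; reset_index is a hypothetical helper name, while ft(...).dropindex(delete_documents=True) is the redis-py call that corresponds to FT.DROPINDEX with DD.

```python
def reset_index(client, index_name: str = "sem_cache") -> None:
    """Drop the index and delete every hash it indexed (FT.DROPINDEX <name> DD)."""
    client.ft(index_name).dropindex(delete_documents=True)


# Typical use inside semantic_cache.py (needs a running Redis Stack):
#   reset_index(r)
#   ensure_index()
```

Because the DD flag removes the cached entries too, this is a full cache wipe, not just a schema change.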
Cache never hits, even for the same query pasted twice
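Check that entries are actually being indexed. redis-cli FT.INFO sem_cache reports num_docs and hash_indexing_failures; if num_docs stays at 0 or the failure count climbs, the stored embedding most likely isn't EMBEDDING_DIM float32 values (1536 × 4 = 6144 bytes), or the key doesn't start with the sem_cache: prefix. Also keep decode_responses=False on the client so replies that contain binary fields are never UTF-8 decoded.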
Next Steps
- Add TTLs: Call r.expire(key, 86400) after r.hset(...) to auto-expire entries after 24 hours so stale answers don't persist indefinitely.
- Hybrid filtering: RediSearch supports pre-filtering by tag or numeric field before the KNN step, so you can scope the cache per user, model, or topic. See the RediSearch query syntax docs.
- Swap the embedding model: Replace embed() with any provider (Cohere, Voyage, or a local model via sentence-transformers). Update EMBEDDING_DIM to match the new model's output size and rebuild the index.
- LangChain ships a RedisSemanticCache class that follows the same pattern if you prefer a maintained abstraction over rolling your own.
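The first two items can be sketched in a few lines. Everything here is an assumption layered on top of the tutorial: cache_set_with_ttl, escape_tag, and knn_query_for_tag are hypothetical helpers, and the tag filter only works if the index also defines a TagField (called "model" below) and each hash stores that field, which the schema in Step 3 does not do.

```python
import re

DAY_SECONDS = 86_400


def cache_set_with_ttl(client, key: str, mapping: dict, ttl_seconds: int = DAY_SECONDS) -> None:
    """Write the hash, then attach an expiry so the entry ages out on its own."""
    client.hset(key, mapping=mapping)
    client.expire(key, ttl_seconds)


def escape_tag(value: str) -> str:
    """Backslash-escape punctuation and spaces, which are special inside a tag filter."""
    return re.sub(r"([^A-Za-z0-9_])", r"\\\1", value)


def knn_query_for_tag(field: str, value: str) -> str:
    """Build a query that filters by tag first, then runs KNN on the matching entries."""
    return f"(@{field}:{{{escape_tag(value)}}})=>[KNN 1 @embedding $vec AS dist]"


print(knn_query_for_tag("model", "gpt-4o-mini"))
# (@model:{gpt\-4o\-mini})=>[KNN 1 @embedding $vec AS dist]
```

The resulting string would replace the "*=>[KNN ...]" expression passed to Query in cache_get. The two writes in cache_set_with_ttl are separate commands; wrapping them in a client pipeline would send them together if that matters for your workload.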
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.