# Why RAG Needs PII Protection
RAG pipelines are the #1 pattern where PII leaks into LLMs. Documents retrieved from your knowledge base — support tickets, customer records, internal reports — often contain personal data. When those documents are embedded, stored, and retrieved, the PII flows through multiple systems:

- Retrieval results — documents with PII are injected into LLM prompts
- LLM provider logs — your provider sees the full prompt, including retrieved PII
## Security Trade-offs
There is no one-size-fits-all approach to PII in RAG pipelines. The right choice depends on your threat model:

| Approach | Names in vector store | Name-based search | PII at LLM boundary | Complexity |
|---|---|---|---|---|
| Selective redaction (recommended) | Yes | Yes | No (tokenized) | Low |
| Full redaction | No | No — content-based only | No | Low |
| Tokenize with stored mapping | No (tokens only) | Yes (via reverse lookup) | No | High |
### Selective Redaction (Recommended)
Redact contact info (emails, phones, IBANs) at ingestion — keep person names for searchability. At query time, search with the original question (names match), then tokenize the context and question in a single call before the LLM. This is the approach used in all cookbook examples and described below.

### Full Redaction
Redact all PII at ingestion. Strongest privacy — no personal data anywhere — but you lose the ability to search by name; the vector store can only match based on surrounding content.

### Tokenize with Stored Mapping (Advanced)
Tokenize at ingestion and store the mapping. Build a reverse lookup to translate real names in queries into tokens. No PII reaches the vector store, and name-based search still works. See the advanced section below for details.

## Two Protection Layers
### Layer 1: Selective Ingestion Redaction
Redact contact info from documents before embedding and indexing. Names are kept so the vector store can match name-based queries.
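A minimal, self-contained Python sketch of this layer. The regexes and placeholder strings below are illustrative stand-ins for the SDK's redaction call — the real SDK performs entity detection under a policy rather than pattern matching:

```python
import re

# Stand-in patterns for contact info; a real policy covers far more entity types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_contact_info(text: str) -> str:
    """Replace emails and phone numbers with placeholders; leave names intact."""
    text = EMAIL.sub("[EMAIL_ADDRESS]", text)
    text = PHONE.sub("[PHONE_NUMBER]", text)
    return text

doc = "Hans Mueller reported the issue. Reach him at hans@example.com or +49 170 1234567."
safe = redact_contact_info(doc)
# "Hans Mueller" survives for name-based search; the contact info does not.
```

Run this once per document before embedding; the redacted text is what gets indexed.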
Why keep names? At ingestion, person names would be replaced with `[PERSON]`. At query time, names are tokenized to `<Person_1>`. Neither placeholder matches the other — so searching for “Hans Mueller” could never find `[PERSON]` in the vector store. Keeping names at ingestion avoids this and lets users search by name. Contact info (emails, phones) is rarely searched for and should always be redacted.

### Layer 2: Query-Time Tokenization
After retrieval, tokenize the context and question in a single call before they reach the LLM. Then detokenize the response to restore the real data.
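A minimal Python sketch of this layer, with `tokenize`/`detokenize` implemented locally against a fixed name list — an illustrative stand-in, since the real SDK detects entities itself and returns the mapping:

```python
def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Stand-in tokenizer: swap known names for tokens and return the mapping."""
    mapping: dict[str, str] = {}
    for i, name in enumerate(["Hans Mueller"], start=1):
        token = f"<Person_{i}>"
        if name in text:
            mapping[token] = name
            text = text.replace(name, token)
    return text, mapping

def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Local string replacement: restore real values from the mapping."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

# Retrieve with the raw question so names match, then protect both together.
context = "Ticket 42: Hans Mueller cannot log in."
question = "What did Hans Mueller report?"
protected, mapping = tokenize(f"{context}\n---\n{question}")  # one call, one mapping
llm_answer = "<Person_1> reported a login failure."           # simulated LLM output
final = detokenize(llm_answer, mapping)
```

Because context and question go through one call, the same person gets the same token in both, so the LLM can still connect them.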
## Protection Method Comparison
Choose the right protection method for your RAG use case:

| Method | Reversible | Best for | Example output |
|---|---|---|---|
| Redact | No | Ingestion — permanent PII removal | [PERSON], [EMAIL_ADDRESS] |
| Tokenize | Yes | Queries — protect input, restore output | <Person_1>, <Email Address_1> |
| Encrypt | Yes (with key) | Regulated data requiring audit trail | ENC_a8f3b2... |
| Hash | No | Analytics — consistent pseudonymous IDs | HASH_a3f8b9c2d4e5 |
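To make the Hash row concrete — irreversible, but the same input always yields the same pseudonymous ID — here is a sketch using `hashlib`; the salt and `HASH_` prefix are illustrative assumptions, not the SDK's actual format:

```python
import hashlib

def pseudonymize(value: str, salt: str = "tenant-salt") -> str:
    """Irreversible pseudonym: consistent for equal inputs, so analytics
    can join records on it without ever seeing the raw value."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:12]
    return f"HASH_{digest}"

a = pseudonymize("hans@example.com")
b = pseudonymize("hans@example.com")
# a == b across documents; the original email cannot be recovered.
```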
## Advanced: Tokenize with Stored Mapping
For the strongest privacy with full searchability — no PII in the vector store, and name-based search still works — tokenize at ingestion and store the mapping. This is the most complete architecture, but it requires managing a mapping store.

How it works:

- Ingestion: `tokenize()` each document → store the tokenized text in the vector store and store the mapping securely
- Query: build a reverse lookup from the stored mappings; replace real names in the query with their tokens before searching
- LLM: tokenized context + tokenized query → the LLM sees only tokens
- Response: detokenize using the stored mappings
Trade-offs:

- Requires managing a mapping store (encrypted DB, secrets manager)
- Reverse lookup needs exact string matching (partial names may not match)
- More complex than the selective-redaction approach
- But: strongest privacy with full searchability — no PII in the vector store at all
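The query-time reverse lookup can be sketched in a few lines of Python; the in-memory dicts below stand in for the encrypted mapping store:

```python
def build_reverse_lookup(mappings: dict[str, dict[str, str]]) -> dict[str, str]:
    """Invert per-document token->name mappings into one name->token table.
    In production the mappings would come from an encrypted store."""
    lookup: dict[str, str] = {}
    for doc_mapping in mappings.values():
        for token, name in doc_mapping.items():
            lookup[name] = token
    return lookup

def tokenize_query(query: str, lookup: dict[str, str]) -> str:
    """Replace real names in the query with their stored tokens.
    Exact string matching only -- partial names will not match."""
    for name, token in lookup.items():
        query = query.replace(name, token)
    return query

stored = {"doc-1": {"<Person_1>": "Hans Mueller"}}
lookup = build_reverse_lookup(stored)
query = tokenize_query("What did Hans Mueller report?", lookup)
# query == "What did <Person_1> report?" -- safe to embed and send onward
```

Because the vector store was indexed with the same tokens, the tokenized query now matches by "name" without any PII leaving your boundary.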
`detokenize()` is a free local operation — no API call. This means the mapping store is the only extra infrastructure you need to manage.

## Policy Recommendations
Match your compliance policy to your use case:

| Use case | Policy | Region | Key entities detected |
|---|---|---|---|
| General RAG | basic | — | Names, emails, phones, addresses, credit cards |
| EU customer data | gdpr_eu | eu | Names, emails, IBANs, national IDs, DOB, addresses |
| US healthcare | hipaa_us | us | All 18 HIPAA identifiers (SSN, MRN, DOB, etc.) |
| Payment data | pci_dss | — | Credit cards, CVVs, expiration dates |
| Maximum coverage | strict | — | All supported entity types, lowest threshold |
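If you select policies programmatically, a small lookup mirroring the rows above can keep the choice in one place. The dict, function name, and the idea of passing `policy`/`region` as client options are illustrative assumptions, not a confirmed SDK API:

```python
# Hypothetical mapping from use case to client options, following the table above.
POLICY_BY_USE_CASE: dict[str, dict[str, str]] = {
    "general_rag":      {"policy": "basic"},
    "eu_customer_data": {"policy": "gdpr_eu", "region": "eu"},
    "us_healthcare":    {"policy": "hipaa_us", "region": "us"},
    "payment_data":     {"policy": "pci_dss"},
    "maximum_coverage": {"policy": "strict"},
}

def policy_for(use_case: str) -> dict[str, str]:
    """Look up the client options for a use case; raises KeyError if unknown."""
    return POLICY_BY_USE_CASE[use_case]

opts = policy_for("eu_customer_data")
```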
## Performance Tips
- Batch redaction at ingestion — use `blindfold.redact_batch()` to process multiple documents in one API call
- Async processing — use `AsyncBlindfold` for concurrent document processing during ingestion
- Detokenization is free — `detokenize()` is a local string replacement; no API call required
- Cache redacted documents — once documents are redacted and indexed, no further Blindfold calls are needed for retrieval
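The async tip can be sketched with `asyncio`. Here `redact_document` is a local stand-in for an `AsyncBlindfold` call (no network, simulated redaction), showing only the concurrency pattern:

```python
import asyncio

async def redact_document(doc: str) -> str:
    """Stand-in for an async SDK redaction call."""
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return doc.replace("hans@example.com", "[EMAIL_ADDRESS]")

async def ingest(docs: list[str]) -> list[str]:
    # Redact all documents concurrently instead of one at a time.
    return await asyncio.gather(*(redact_document(d) for d in docs))

docs = ["Contact: hans@example.com", "No PII here."]
redacted = asyncio.run(ingest(docs))
```

With real network latency per call, gathering the coroutines lets the waits overlap, so ingestion time stops scaling linearly with document count.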
## Cookbook Examples
Complete, runnable examples for every RAG framework:

- OpenAI + ChromaDB (Python) — selective redaction + search-first tokenization
- OpenAI + ChromaDB (Node.js) — TypeScript OpenAI + ChromaDB RAG pipeline
- LangChain + FAISS (Python) — BlindfoldPIITransformer + retrieve-then-tokenize
- LangChain + FAISS (Node.js) — LangChain.js RAG with inline PII protection
- LlamaIndex (Python) — retrieve-then-tokenize with LlamaIndex
- LlamaIndex (Node.js) — LlamaIndex.TS with a single tokenize call
- GDPR Customer Support (Python) — multi-turn EU support chatbot with the gdpr_eu policy
- GDPR Customer Support (Node.js) — TypeScript multi-turn EU support chatbot
## Strategy Deep-Dives
Standalone examples for each ingestion strategy — compare trade-offs side by side:

- Selective Redact (Python) — keep names, redact contact info; the simplest approach
- Selective Redact (Node.js) — TypeScript version of the selective redact strategy
- Stored Mapping (Python) — tokenize everything, store per-document mappings
- Stored Mapping (Node.js) — TypeScript version of the stored mapping strategy
- Consistent Registry (Python) — same person = same token everywhere; best search quality
- Consistent Registry (Node.js) — TypeScript version of the consistent registry strategy
- Strategy Comparison (Python) — all three strategies side by side with CLI selection
- Strategy Comparison (Node.js) — TypeScript version; all three strategies with CLI selection