What is Synthesis?

Synthesis is a privacy protection method that replaces real sensitive data with realistic fake data generated by the Faker library. The fake data looks authentic but contains no real PII. Example:
Input:  "John Doe lives in New York and works at Microsoft"
Output: "Michael Smith lives in Boston and works at TechCorp"

How It Works

  1. Detection: Blindfold identifies sensitive entities in your text
  2. Generation: For each entity, realistic fake data is generated based on type
  3. Replacement: Real data is replaced with synthetic data
  4. Language Support: Fake data matches the specified language locale
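
The four steps above can be illustrated with a toy, standard-library-only sketch. It is not Blindfold's implementation: the regex detectors, the tiny value pools, and the names `PATTERNS`, `FAKE_VALUES`, `detect`, and `toy_synthesize` are all assumptions made for illustration, and real detection covers many more entity types.

```python
import random
import re

# Hypothetical detectors: one regex per entity type (illustration only).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\+\d{1,3}(?:-\d{3,4}){2,3}"),
}

# Hypothetical per-language pools standing in for a fake-data generator.
FAKE_VALUES = {
    "en": {
        "EMAIL_ADDRESS": ["user1@example.com", "user2@example.org"],
        "PHONE_NUMBER": ["+1-555-0100", "+1-555-0101"],
    },
    "de": {
        "EMAIL_ADDRESS": ["nutzer1@example.de", "nutzer2@example.de"],
        "PHONE_NUMBER": ["+49-555-0100", "+49-555-0101"],
    },
}


def detect(text):
    """Step 1: return sorted (start, end, entity_type) spans found in text."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def toy_synthesize(text, language="en", rng=None):
    """Steps 2-4: pick a fake value per span and splice it into the text."""
    rng = rng or random.Random()
    pools = FAKE_VALUES[language]  # step 4: the language selects the pool
    pieces, cursor = [], 0
    for start, end, entity_type in detect(text):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        pieces.append(text[cursor:start])
        pieces.append(rng.choice(pools[entity_type]))  # steps 2 and 3
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Detecting every span first and then rebuilding the string in one pass keeps a generated value from being re-detected and replaced a second time.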

When to Use Synthesis

Synthesis is ideal when you need to:

1. Generate Test Data

Create realistic test data for development and testing environments.
# Generate test user profiles
template = "Name: John Doe, Email: [email protected], Phone: +1-555-1234"

for i in range(10):
    result = client.synthesize(template, language="en")
    print(result.text)

# Output (examples):
# "Name: Michael Smith, Email: [email protected], Phone: +1-555-9876"
# "Name: Sarah Johnson, Email: [email protected], Phone: +1-555-4567"
# ... 10 unique profiles
Why this matters:
  • Realistic test data without PII
  • Repeatable test scenarios
  • No risk of exposing real user data

2. Demo Environments

Populate demo environments with realistic but fake data.
# Create demo customer data
customer_template = """
Customer: Jane Smith
Email: [email protected]
Location: New York
Company: TechCorp
"""

demo_customer = client.synthesize(customer_template, language="en")
load_into_demo_db(demo_customer.text)
Use cases:
  • Product demos
  • Sales presentations
  • Training environments
  • Screenshots and marketing

3. Realistic Training Data

Create training datasets that look real but contain no actual PII.
# Generate training data for ML models
training_examples = []

for _ in range(1000):
    synthetic = client.synthesize(
        "Patient John Doe, age 45, diagnosed with diabetes",
        language="en"
    )
    training_examples.append(synthetic.text)

# Train model on synthetic data

4. Data Sharing for Testing

Share realistic data with partners or vendors for integration testing.
# Create synthetic data for vendor testing
test_data = client.synthesize(
    production_data_sample,
    language="en"
)

# Safe to share - no real PII
send_to_vendor(test_data.text)

When NOT to Use Synthesis

Synthesis is not suitable when:

1. You Need Original Data Back

Synthesis is irreversible. Use Tokenization instead.
# Bad - can't restore
synthetic = client.synthesize("[email protected]")
# No way to get "[email protected]" back

# Good - use tokenization
protected = client.tokenize("[email protected]")
original = client.detokenize(protected.text, protected.mapping)
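
Why tokenization is reversible while synthesis is not comes down to the mapping that is kept. The following is a minimal sketch of that idea using only the standard library; the token format `<EMAIL_n>`, the regex, and the function names are assumptions for illustration and do not describe how Blindfold builds its tokens.

```python
import re

# Hypothetical email detector, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_tokenize(text):
    """Replace each email with a token and remember the original value."""
    mapping = {}

    def replace(match):
        token = f"<EMAIL_{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token

    return EMAIL.sub(replace, text), mapping


def toy_detokenize(text, mapping):
    """Restore the originals; only possible because the mapping was kept."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Synthesis keeps no such mapping, so once the fake value is written there is nothing to look the original up with.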

2. Users Need to Recognize Their Data

Users won’t recognize synthesized data. Use Masking instead.
# Bad - user won't recognize their card
synthetic = client.synthesize("Card: 4532-7562-9102-3456")
# Output: "Card: 5678-1234-9012-3456" (completely different)

# Good - show last 4 of real card
masked = client.mask("Card: 4532-7562-9102-3456")
# Output: "Card: ***************3456"
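
The masked output shown above keeps the last four characters and hides the rest. A minimal local sketch of that behavior follows; `mask_keep_last` is a hypothetical helper, not part of the Blindfold client, and the real masking rules may differ (for example in how separators are treated).

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide every character except the last `visible` ones."""
    if visible <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


print(mask_keep_last("4532-7562-9102-3456"))
# ***************3456
```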

3. You Need Consistent Identifiers

Each synthesis call generates different data. Use Hashing instead.
# Bad - different each time
synth1 = client.synthesize("[email protected]")  # [email protected]
synth2 = client.synthesize("[email protected]")  # [email protected]

# Good - same hash every time
hash1 = client.hash("[email protected]")  # ID_a3f8b9...
hash2 = client.hash("[email protected]")  # ID_a3f8b9... (same)
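
The property that matters here is determinism: the same input always maps to the same identifier. The sketch below shows that property with a keyed hash (HMAC-SHA256) from the standard library. The choice of HMAC, the lowercase normalization, the truncation length, and the name `stable_id` are assumptions for illustration; only the `ID_` prefix is taken from the example output above.

```python
import hashlib
import hmac


def stable_id(value: str, secret: bytes, prefix: str = "ID_", length: int = 12) -> str:
    """Derive a repeatable identifier from a value using a keyed hash."""
    # Normalizing case and whitespace is a design choice of this sketch,
    # so "A@x.com " and "a@x.com" map to the same identifier.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret, normalized, hashlib.sha256).hexdigest()
    return prefix + digest[:length]
```

Using a secret key makes it harder to recover inputs by hashing guesses, at the cost of having to store that key safely.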

Key Features

Realistic Data

Generated data looks authentic

Multi-Language

Supports 8 languages with locale-specific data

Type-Aware

Generates appropriate data for each entity type

Powered by Faker

Uses the Faker library to generate high-quality fake data

Quick Start

from blindfold import Blindfold

client = Blindfold(api_key="your-api-key")

# Basic synthesis
result = client.synthesize(
    text="John Doe lives in New York and works at Microsoft",
    language="en"
)

print(result.text)
# "Michael Smith lives in Boston and works at TechCorp"
# (example output - will vary)

# Generate multiple variations
template = "Customer: Jane Doe, Email: [email protected]"

for i in range(3):
    result = client.synthesize(template, language="en")
    print(f"{i+1}. {result.text}")

# Output (examples):
# 1. Customer: Sarah Johnson, Email: [email protected]
# 2. Customer: Michael Brown, Email: [email protected]
# 3. Customer: Emily Davis, Email: [email protected]

Supported Languages

Generate locale-specific fake data for different languages:
result = client.synthesize(
    "John Doe from New York",
    language="en"
)
# "Michael Smith from Boston"
Supported Languages:
  • en - English (US)
  • cs - Czech
  • de - German
  • fr - French
  • es - Spanish
  • it - Italian
  • pl - Polish
  • sk - Slovak
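
Because only these eight codes are listed, it can help to validate a language code before calling the API. The helper below is a sketch under that assumption; `SUPPORTED_LANGUAGES` mirrors the list above, while `normalize_language` and its fallback behavior are hypothetical and not part of the Blindfold client.

```python
# Codes and names copied from the supported-languages list above.
SUPPORTED_LANGUAGES = {
    "en": "English (US)",
    "cs": "Czech",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "pl": "Polish",
    "sk": "Slovak",
}


def normalize_language(code: str, default: str = "en") -> str:
    """Reduce tags like 'de-DE' or 'fr_FR' to a supported two-letter code."""
    base = code.strip().lower().replace("_", "-").split("-")[0]
    return base if base in SUPPORTED_LANGUAGES else default
```

Falling back to a default keeps batch jobs running; raising an error instead would be the stricter choice when a wrong locale must not go unnoticed.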

Entity Types and Fake Data

Different entity types generate different kinds of fake data:
Entity Type   | Example Input       | Example Output
--------------|---------------------|----------------------
PERSON        | John Doe            | Michael Smith
EMAIL_ADDRESS | [email protected]   | [email protected]
PHONE_NUMBER  | +1-555-1234         | +1-555-9876
LOCATION      | New York            | Boston
ORGANIZATION  | Microsoft           | TechCorp
CREDIT_CARD   | 4532-7562-9102-3456 | 5678-1234-9012-3456
DATE_TIME     | 2024-01-15          | 2023-11-22
IP_ADDRESS    | 192.168.1.1         | 10.0.0.5
URL           | https://example.com | https://test-site.org
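
Type-aware generation can be pictured as a lookup from entity type to a generator function. The sketch below covers three of the types in the table with the standard library only, to stay dependency-free; Blindfold itself is described as using Faker, so the functions `fake_ip`, `fake_date`, `fake_card`, and `generate` are hypothetical stand-ins, and the value ranges are arbitrary choices.

```python
import ipaddress
import random
from datetime import date, timedelta


def fake_ip(rng):
    # Stay inside 10.0.0.0/8, a private range, so results never name a public host.
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    return str(ipaddress.IPv4Address(base + rng.randrange(1, 2**24 - 1)))


def fake_date(rng):
    # A day between 2020-01-01 and 2023-12-31, formatted as YYYY-MM-DD.
    return (date(2020, 1, 1) + timedelta(days=rng.randrange(0, 1461))).isoformat()


def fake_card(rng):
    # 16 random digits in groups of four; not checksum-valid, format only.
    digits = "".join(str(rng.randrange(10)) for _ in range(16))
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))


GENERATORS = {
    "IP_ADDRESS": fake_ip,
    "DATE_TIME": fake_date,
    "CREDIT_CARD": fake_card,
}


def generate(entity_type, rng=None):
    """Dispatch to the generator registered for this entity type."""
    rng = rng or random.Random()
    try:
        generator = GENERATORS[entity_type]
    except KeyError:
        raise ValueError(f"no generator registered for {entity_type!r}") from None
    return generator(rng)
```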

Common Patterns

Generate Test Users

def generate_test_users(count: int) -> list:
    """Generate realistic test user profiles"""

    template = """
    Name: John Doe
    Email: [email protected]
    Phone: +1-555-1234
    Location: New York
    """

    users = []
    for _ in range(count):
        result = client.synthesize(template, language="en")
        users.append(result.text)

    return users

# Usage
test_users = generate_test_users(100)
# 100 unique, realistic user profiles
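
Randomly generated profiles can repeat, so if uniqueness matters it is safer to enforce it than to assume it. The sketch below wraps any synthesize-like callable and discards duplicates; `generate_unique` and its retry budget are hypothetical, and the callable is passed in so the helper does not depend on a particular client.

```python
from typing import Callable, List


def generate_unique(
    synthesize: Callable[[str], str],
    template: str,
    count: int,
    max_attempts_per_item: int = 5,
) -> List[str]:
    """Collect `count` distinct results, retrying when a duplicate appears."""
    seen = set()
    results = []
    attempts_left = count * max_attempts_per_item
    while len(results) < count and attempts_left > 0:
        attempts_left -= 1
        candidate = synthesize(template)
        if candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
    if len(results) < count:
        raise RuntimeError(
            f"only {len(results)} unique results after exhausting the retry budget"
        )
    return results
```

With the client from the examples above, a call could look like `generate_unique(lambda t: client.synthesize(t, language="en").text, template, 100)`.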

Populate Demo Database

def populate_demo_db(template_data: list):
    """Fill demo database with synthetic data"""

    for template in template_data:
        synthetic = client.synthesize(template, language="en")

        # Parse and insert
        demo_db.insert(parse_profile(synthetic.text))

# Usage
templates = load_production_templates()
populate_demo_db(templates)

Create Training Dataset

def create_training_data(examples: list, count: int):
    """Generate training data from examples"""

    training_set = []

    for example in examples:
        for _ in range(count):
            synthetic = client.synthesize(example, language="en")
            training_set.append(synthetic.text)

    return training_set

# Usage
examples = ["Patient John Doe diagnosed with condition X"]  # add further example sentences here
training_data = create_training_data(examples, 100)
# 100 synthetic examples per template

Common Use Cases

Generate test data for automated test suites:
def test_user_registration():
    # Generate unique test user
    test_user = client.synthesize(
        "Name: John Doe, Email: [email protected]",
        language="en"
    ).text

    # Use in test
    response = api.register_user(test_user)
    assert response.status_code == 200
Benefits: Fresh test data each run, no PII in test environments

Create realistic demo data:
# Generate demo customers
def setup_demo_environment():
    templates = [
        "Enterprise customer: Company X, contact: [email protected]",
        "Small business: Company Y, contact: [email protected]"
    ]

    for template in templates:
        synthetic = client.synthesize(template)
        create_demo_account(synthetic.text)
Benefits: Realistic demos without real customer data

Generate data for performance testing:
def load_test_data_generator(count: int):
    """Generate data for load testing"""
    template = "User: [email protected], Session: abc123"

    test_data = []
    for _ in range(count):
        synthetic = client.synthesize(template)
        test_data.append(synthetic.text)

    return test_data

# Generate 10,000 test records
load_data = load_test_data_generator(10000)
Benefits: Large-scale test data without PII concerns

Create safe data for screenshots and marketing materials:
def prepare_screenshot_data():
    """Generate data for product screenshots"""
    user_data = client.synthesize(
        "User: Jane Doe, Email: [email protected]",
        language="en"
    )

    # Use in screenshot - safe for public release
    return user_data.text
Benefits: No privacy risks in public materials
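
For large batches such as the 10,000-record load-testing example above, issuing the calls one at a time can be slow. One possible approach is a small thread pool, sketched below. Whether the Blindfold client is thread-safe and what rate limits apply are not stated here, so treat both as assumptions to verify; `synthesize_many` is a hypothetical helper that accepts any callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def synthesize_many(
    synthesize: Callable[[str], str],
    template: str,
    count: int,
    max_workers: int = 8,
) -> List[str]:
    """Call synthesize(template) `count` times using a small thread pool."""
    if count <= 0:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order and re-raises a
        # worker's exception when its result is consumed.
        return list(pool.map(synthesize, [template] * count))
```

Keeping `max_workers` low is the conservative default; raise it only after checking how the service responds to concurrent requests.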

Best Practices

1. Use Templates

Create templates for consistent synthetic data:
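# Define templates with sample values. Synthesis replaces detected entities,
# so literal placeholders such as "{name}" would likely not be recognized.
TEMPLATES = {
    'user': "Name: John Doe, Email: john.doe@example.com, Phone: +1-555-1234",
    'company': "Company: TechCorp, Location: New York"
}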

# Generate from templates
def generate_user():
    return client.synthesize(TEMPLATES['user'], language="en")

2. Locale-Specific Data

Use appropriate language for your audience:
# European demo environment
if region == "EU":
    # Generate German data
    demo_data = client.synthesize(template, language="de")
elif region == "US":
    # Generate US data
    demo_data = client.synthesize(template, language="en")

3. Document Synthetic Data Use

Clearly mark synthetic data in your systems:
from datetime import datetime

synthetic_user = {
    'name': result.text,
    'is_synthetic': True,  # Mark as synthetic
    'generated_at': datetime.now()
}
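
If many records need this marker, a small typed wrapper keeps the flag from being forgotten. The following is a sketch, not a Blindfold feature: `SyntheticRecord` and `only_real` are hypothetical names, and storing a UTC timestamp is a choice made here for unambiguous audit trails.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SyntheticRecord:
    """A generated value that always carries its 'synthetic' marker."""
    text: str
    language: str = "en"
    is_synthetic: bool = True
    generated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def only_real(records):
    """Drop anything flagged as synthetic, e.g. before computing analytics."""
    return [r for r in records if not getattr(r, "is_synthetic", False)]
```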

4. Combine with Other Methods

Use synthesis alongside other privacy methods:
# Synthesis for testing
test_data = client.synthesize(template)

# Tokenization for production
prod_data = client.tokenize(real_user_input)

Learn More

Compare with Other Methods