Data Enrichment vs Data Cleansing

These two terms get used interchangeably, but they're fundamentally different operations. Confusing them leads to bad data pipelines and wasted money.

Here's the difference, why the order matters, and how to implement both.

The Core Difference

Data cleansing is subtractive. It removes errors, fixes inconsistencies, and deduplicates records. You're making your existing data more accurate.

Data enrichment is additive. It appends new data points from external sources. You're making your existing data more complete.

Aspect	Data Cleansing	Data Enrichment
Direction	Removes / fixes data	Adds new data
Input	Your existing data	Your existing data + external sources
Output	Cleaner version of same data	Same data + new fields
Goal	Accuracy	Completeness
Example	Fix "Jonh" → "John"	Add job title, company, phone number

Data Cleansing Operations

Cleansing catches and fixes problems in your existing data:

Deduplication

You have the same person in your database three times: once from a form fill, once from a CSV import, once from a CRM sync. Deduplication identifies and merges these records.

$ cut -d',' -f2 contacts.csv | sort | uniq -d
ceo@stripe.com
founder@linear.app
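
Finding duplicates is only half the job; you still have to decide which copy survives. Here is a minimal sketch, assuming a headerless CSV with the email in column 2 and no quoted commas. It keeps, for each email, the row with the most non-empty fields. The function name `keep_most_complete` is made up for this example, and a real merge would likely combine fields across rows rather than pick one winner.

```shell
# keep_most_complete: read headerless CSV on stdin, email in column 2.
# For each (lowercased) email, keep the row with the most non-empty fields.
# Assumes no quoted commas; use a real CSV parser otherwise.
keep_most_complete() {
  awk -F',' '
    {
      key = tolower($2)
      filled = 0
      for (i = 1; i <= NF; i++) if ($i != "") filled++
      if (!(key in best) || filled > score[key]) {
        best[key] = $0
        score[key] = filled
      }
      # Remember first-seen order so output is stable
      if (!(key in seen)) { seen[key] = 1; order[++n] = key }
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  '
}

printf '%s\n' \
  'Jane,jane@example.com,,' \
  'Jane Doe,Jane@Example.com,VP Sales,555-0100' \
  'Sam,sam@example.com,CTO,' | keep_most_complete
# prints:
# Jane Doe,Jane@Example.com,VP Sales,555-0100
# Sam,sam@example.com,CTO,
```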

Format Standardization

Phone numbers stored as (555) 123-4567, 555-123-4567, 5551234567, and +15551234567 are all the same number in different formats. Standardization picks one format and converts everything.
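
As a rough sketch of what that conversion can look like, the function below strips everything but digits and emits one `+1XXXXXXXXXX` form. It assumes US numbers only, and `normalize_phone` is a hypothetical helper name; international data would need a proper phone library.

```shell
# normalize_phone: convert a US phone number to +1XXXXXXXXXX.
# Assumes US numbers only; prints nothing and returns 1 if the digit count is off.
normalize_phone() {
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  case ${#digits} in
    10) printf '+1%s\n' "$digits" ;;
    11) case $digits in
          1*) printf '+%s\n' "$digits" ;;
          *)  return 1 ;;
        esac ;;
    *)  return 1 ;;
  esac
}

for raw in '(555) 123-4567' '555-123-4567' '5551234567' '+15551234567'; do
  normalize_phone "$raw"
done
# prints +15551234567 four times
```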

Typo Correction

Gogle → Google. San Franicsco → San Francisco. Enginnering → Engineering. Automated fuzzy matching catches common typos in company names, cities, and titles.
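
One common way to do fuzzy matching is edit distance against a list of known-good values. The sketch below computes Levenshtein distance in awk and snaps the input to the closest canonical entry when it is within 2 edits. The helper name `closest_match`, the threshold of 2, and the reference lists are all assumptions for illustration; a threshold that loose will produce false matches on short strings.

```shell
# closest_match: print the canonical entry closest to the input string
# if its Levenshtein distance is at most 2; otherwise print the input unchanged.
# Usage: closest_match INPUT CANONICAL...
closest_match() {
  input=$1; shift
  printf '%s\n' "$@" | awk -v s="$input" '
    function min3(a, b, c,    m) { m = a; if (b < m) m = b; if (c < m) m = c; return m }
    # Classic two-row dynamic-programming edit distance
    function lev(a, b,    la, lb, i, j, k, cost, prev, cur) {
      la = length(a); lb = length(b)
      for (j = 0; j <= lb; j++) prev[j] = j
      for (i = 1; i <= la; i++) {
        cur[0] = i
        for (j = 1; j <= lb; j++) {
          cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
          cur[j] = min3(prev[j] + 1, cur[j-1] + 1, prev[j-1] + cost)
        }
        for (k = 0; k <= lb; k++) prev[k] = cur[k]
      }
      return prev[lb]
    }
    {
      d = lev(tolower(s), tolower($0))
      if (NR == 1 || d < best) { best = d; match_name = $0 }
    }
    END { if (NR > 0 && best <= 2) print match_name; else print s }
  '
}

closest_match 'Gogle' 'Google' 'Microsoft' 'Stripe'                    # prints Google
closest_match 'San Franicsco' 'San Francisco' 'San Diego' 'Seattle'    # prints San Francisco
```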

Invalid Data Removal

Email addresses that bounce, phone numbers that don't connect, addresses that don't exist. Validation removes records that look plausible but are actually dead.

$ while IFS= read -r email; do
    enrich email "$email" --validate-only 2>/dev/null && echo "$email"
  done < emails.txt > valid_emails.txt

Null / Empty Field Handling

Decide what to do with incomplete records. Is a contact without a phone number useful? What about a company without an industry? Cleansing defines and enforces your data quality standards.
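
Once you have decided which fields are mandatory, enforcing the rule is mechanical. A minimal sketch, assuming a headerless CSV without quoted commas: `filter_complete` (a hypothetical helper) takes a list of required column numbers and passes through only the rows where all of them are filled in.

```shell
# filter_complete: print only rows whose required columns are all non-empty.
# $1 is a comma-separated list of required column numbers (a policy you choose).
filter_complete() {
  awk -F',' -v req="$1" '
    BEGIN { n = split(req, cols, ",") }
    {
      for (i = 1; i <= n; i++) {
        v = $(cols[i])
        gsub(/^[ \t]+|[ \t]+$/, "", v)   # treat whitespace-only as empty
        if (v == "") next
      }
      print
    }
  '
}

# Columns: name,email,phone,industry. Require name (1) and email (2).
printf '%s\n' \
  'Jane Doe,jane@example.com,,SaaS' \
  ',sam@example.com,555-0100,' \
  'Lee Wong,  ,555-0101,Fintech' | filter_complete 1,2
# prints: Jane Doe,jane@example.com,,SaaS
```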

Data Enrichment Operations

Enrichment adds new information from external sources:

Contact Appending

Start with an email, get back name, title, company, phone, LinkedIn, location. Turn a single identifier into a full profile.

$ enrich email ceo@stripe.com --json | jq '{name: .person.name, title: .person.title, phone: .person.phone}'

Firmographic Appending

Start with a company name or domain, get back industry, headcount, revenue, tech stack, headquarters. Turn a company name into a qualified account.

$ enrich domain stripe.com --json | jq '{industry, headcount, location}'

Technographic Appending

What technologies does a company use? Enrichment can reveal their tech stack — useful for selling developer tools or infrastructure products.

Social Profile Appending

Given a name and company, find their LinkedIn, Twitter, and other professional profiles. Useful for outreach and research.

Why Order Matters: Cleanse First, Then Enrich

This is the single most important thing to understand about data quality:

Always cleanse before you enrich.

Here's why:

1. Don't Waste Credits on Bad Data

If your email list has 20% invalid addresses, enriching without validation means 20% of your credits are wasted on lookups that will return nothing or return data for the wrong person.

# Bad: Enrich everything, waste credits on invalid emails
$ while IFS= read -r email; do enrich email "$email"; done < raw_emails.txt

# Good: Validate first, then enrich valid emails only
$ while IFS= read -r email; do
    enrich email "$email" --validate-only 2>/dev/null && echo "$email"
  done < raw_emails.txt > valid.txt
$ while IFS= read -r email; do enrich email "$email"; done < valid.txt

2. Don't Enrich Duplicates

If "Jane Doe" appears three times in your database with slightly different email addresses, enriching all three wastes 2 credits and creates 3 enriched records for the same person. Deduplicate first.
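
Exact-match deduplication will not catch this case, because the three email addresses differ. One possible approach, sketched below, is to build a looser key from the normalized name plus the email domain and keep the first row per key. This rests on a loud assumption: that the same name at the same domain is the same person, which is usually but not always true. The helper name `dedupe_by_name_domain` is hypothetical.

```shell
# dedupe_by_name_domain: read headerless CSV (name,email) on stdin and keep
# the first row for each (lowercased name, email domain) pair.
# Assumption: same name at the same domain means the same person.
dedupe_by_name_domain() {
  awk -F',' '
    {
      name = tolower($1)
      gsub(/[ \t]+/, " ", name)     # collapse repeated whitespace
      gsub(/^ | $/, "", name)       # trim leading/trailing space
      domain = tolower($2)
      sub(/^.*@/, "", domain)       # keep the part after the last @
      key = name "|" domain
      if (!(key in seen)) { seen[key] = 1; print }
    }
  '
}

printf '%s\n' \
  'Jane Doe,jane.doe@acme.com' \
  'jane doe,jdoe@acme.com' \
  'Jane  Doe,JANE@ACME.COM' \
  'Jane Doe,jane@other.org' | dedupe_by_name_domain
# prints:
# Jane Doe,jane.doe@acme.com
# Jane Doe,jane@other.org
```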

3. Don't Build on a Bad Foundation

Enriching dirty data is like painting over a crumbling wall. The enriched fields might look good, but the foundation is unreliable. When you later discover the base data was wrong, the enriched data becomes suspect too.

4. Standardize Before Matching

Enrichment providers match on your input. If your company names aren't standardized (Google LLC vs Google, Inc. vs Alphabet), match rates will suffer. Clean the input to maximize match rates.
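
A simple version of that cleanup is to lowercase the name, drop punctuation, and strip legal suffixes. The sketch below does that with `tr` and `sed -E`; the suffix list is illustrative rather than exhaustive, and `normalize_company` is a hypothetical helper. It will not map Alphabet to Google, since that is an alias relationship and needs a lookup table rather than string cleanup.

```shell
# normalize_company: lowercase, drop periods and commas, collapse whitespace,
# then strip one trailing legal suffix. The suffix list is illustrative only.
normalize_company() {
  printf '%s\n' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed -E \
      -e 's/[.,]//g' \
      -e 's/[[:space:]]+/ /g' \
      -e 's/^ //' -e 's/ $//' \
      -e 's/ (inc|llc|ltd|corp|co)$//'
}

normalize_company 'Google LLC'      # prints google
normalize_company 'Google, Inc.'    # prints google
normalize_company 'Alphabet'        # prints alphabet
```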

A Practical Data Quality Pipeline

Here's a pipeline that does both, in the right order:

Raw Data
  ↓
[1] Remove exact duplicates
  ↓
[2] Standardize formats (phone, email, names)
  ↓
[3] Validate emails (remove bounces)
  ↓
[4] Fuzzy dedup (merge near-duplicates)
  ↓
[5] Enrich valid, unique records
  ↓
[6] Fill gaps with secondary enrichment
  ↓
Clean, Enriched Data

In code:

#!/bin/bash
# data-quality-pipeline.sh

INPUT="raw_leads.csv"
OUTPUT="enriched_leads.jsonl"

# Steps 1-2: Extract emails, lowercase them, and drop exact duplicates
cut -d',' -f2 "$INPUT" | tr '[:upper:]' '[:lower:]' | sort -u > unique_emails.txt

# Step 3: Filter to valid emails (basic format check only; this does not catch bounces)
grep -E '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' unique_emails.txt > valid_emails.txt

# Optional: Remove free email domains (higher match rates).
# Step 4 (fuzzy dedup) and step 6 (secondary enrichment) are omitted from this script.
grep -v -E '@(gmail|yahoo|hotmail|outlook)\.' valid_emails.txt > professional_emails.txt

# Step 5: Enrich
while IFS= read -r email; do
    enrich email "$email" --json 2>/dev/null
done < professional_emails.txt > "$OUTPUT"

echo "Enriched $(wc -l < "$OUTPUT") records"

When to Re-Enrich

B2B contact data decays at 22-30% per year. People change jobs, companies get acquired, phone numbers change. A record enriched 6 months ago may already be stale.
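
To put a number on "6 months ago", here is a back-of-the-envelope calculation. It assumes decay compounds at a constant annual rate, which is a simplification, so the stale share after m months is 1 - (1 - r)^(m/12). The helper name `stale_fraction` is made up for this example.

```shell
# stale_fraction: estimated share of records gone stale after N months,
# assuming a constant annual decay rate that compounds (a simplification).
# Usage: stale_fraction ANNUAL_RATE MONTHS
stale_fraction() {
  awk -v r="$1" -v m="$2" 'BEGIN { printf "%.1f%%\n", (1 - (1 - r) ^ (m / 12)) * 100 }'
}

stale_fraction 0.22 6    # prints 11.7%
stale_fraction 0.30 6    # prints 16.3%
```

Under that assumption, roughly one record in eight to one in six is already out of date at the six-month mark.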

Re-enrichment schedule:

High-value accounts: Re-enrich quarterly
Active pipeline: Re-enrich monthly
General database: Re-enrich every 6 months
Trigger-based: Re-enrich when a bounce or job change is detected
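
The time-based part of that schedule can be sketched as a filter over a tracking file. This assumes you keep a headerless CSV of `email,tier,last_enriched` with ISO dates. The tier labels (`high_value`, `pipeline`, `general`), the day counts (90, 30, 180), and the helper name `due_for_reenrich` are all hypothetical stand-ins. Date differences are computed with the standard Julian Day Number formula so the script does not depend on GNU or BSD `date` extensions.

```shell
# due_for_reenrich: read CSV (email,tier,last_enriched as YYYY-MM-DD) on stdin
# and print emails whose last enrichment is older than the tier's interval.
# Tier names and day counts are hypothetical labels for the schedule.
# Usage: due_for_reenrich TODAY
due_for_reenrich() {
  awk -F',' -v today="$1" '
    # Julian Day Number for a Gregorian YYYY-MM-DD date
    function jdn(date,    p, y, m, d, a, yy, mm) {
      split(date, p, "-"); y = p[1] + 0; m = p[2] + 0; d = p[3] + 0
      a = int((14 - m) / 12)
      yy = y + 4800 - a
      mm = m + 12 * a - 3
      return d + int((153 * mm + 2) / 5) + 365 * yy \
             + int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
    }
    BEGIN {
      max["high_value"] = 90; max["pipeline"] = 30; max["general"] = 180
      now = jdn(today)
    }
    ($2 in max) && (now - jdn($3) > max[$2]) { print $1 }
  '
}

printf '%s\n' \
  'a@x.com,high_value,2025-03-01' \
  'b@x.com,pipeline,2025-06-15' \
  'c@x.com,general,2024-12-01' | due_for_reenrich 2025-07-01
# prints:
# a@x.com
# c@x.com
```

In practice you would pass `"$(date +%Y-%m-%d)"` as the argument and feed the output into the enrichment loop.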

With enrichcli, cached results are free. Re-enriching a record that hasn't changed costs nothing because the cache returns the same data. Only genuinely new data costs a credit.

Common Mistakes

1. Enriching Without Cleansing

Leads to wasted credits, enriched duplicates, and false confidence in dirty data.

2. Cleansing Without Enriching

Leads to clean but sparse data. You know the email is valid and the name is spelled correctly, but you still don't know the person's title or company size.

3. Enriching Free Email Domains

Enrichment match rates for gmail.com, yahoo.com, and outlook.com addresses are 20-40%. Don't spend credits on them unless you have additional matching data (name + company).

4. Never Re-Enriching

Data decays. At a 22-30% annual decay rate, a "clean" database that hasn't been re-enriched in a year has roughly a quarter of its records stale.

5. Ignoring Match Rate

If your enrichment provider returns results for 40% of your emails, you're either sending bad data in (cleanse first) or using a provider with poor coverage. Good providers hit 60-80%+ on professional emails.
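
Measuring match rate can be as simple as comparing line counts. The sketch below assumes one email per input line and exactly one output line per successful match, with nothing written for misses. Whether your enrichment tool's output actually behaves that way is something to verify first. `match_rate` is a hypothetical helper.

```shell
# match_rate: percentage of input emails that came back with a result.
# Usage: match_rate INPUT_FILE OUTPUT_FILE
# Assumes one email per input line and one record per output line (misses emit nothing).
match_rate() {
  sent=$(grep -c '' "$1")
  matched=$(grep -c '' "$2")
  awk -v s="$sent" -v m="$matched" \
    'BEGIN { if (s == 0) print "n/a"; else printf "%.0f%%\n", m / s * 100 }'
}

# Example call, using the file names from the pipeline script:
# match_rate professional_emails.txt enriched_leads.jsonl
```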

Key Takeaways

Cleansing fixes your existing data. Enrichment adds new data. They're different operations.
Always cleanse before enriching — the order matters.
Don't waste enrichment credits on dirty, duplicate, or invalid data.
Re-enrich periodically because B2B data decays at 22-30% per year.
Build a pipeline that does both, in sequence, automatically.
