Similar Data Finder for Excel: Quickly Locate Matching Records
Finding similar or matching records in Excel is a common task for data cleaning, deduplication, merging data from different sources, and preparing datasets for analysis. This guide covers methods ranging from built-in functions to add-ins and advanced techniques so you can pick the right approach for your dataset size, accuracy needs, and technical comfort.
Why find similar data?
Most datasets contain inconsistencies: typos, different formatting (e.g., “Inc.” vs “Incorporated”), partial matches, or entries split across multiple columns. Identifying records that are identical or similar helps you:
- Remove duplicates and avoid double counting.
- Merge customer records from multiple sources.
- Prepare clean inputs for analytics and machine learning.
- Improve data quality for reporting and compliance.
When you need fuzzy matching: Use fuzzy matching when exact formulas fail — for example, “Jon Smith” vs “John Smith”, “Main St.” vs “Main Street”, or “Acme, Inc” vs “Acme Inc”.
Basic built-in Excel methods
1) Exact matches with MATCH, VLOOKUP/XLOOKUP
- Use XLOOKUP (Excel 365/2021) or VLOOKUP (with FALSE as the last argument) for exact matches across tables.
- Good for normalized datasets where values are identical. Example XLOOKUP:
=XLOOKUP(A2, Sheet2!A:A, Sheet2!B:B, "Not found", 0)
2) Conditional formatting to highlight duplicates
- Home → Conditional Formatting → Highlight Cells Rules → Duplicate Values.
- Quick visual way to spot exact duplicates in one column.
3) COUNTIF / COUNTIFS for duplicate counts
- Use COUNTIF to count occurrences and filter rows with count > 1.
=COUNTIF(A:A, A2)>1
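For readers who want to see the logic behind the COUNTIF flag outside Excel, here is a minimal Python sketch. The function name and sample values are illustrative assumptions; the sketch lowercases values first because COUNTIF itself ignores case.

```python
from collections import Counter

def flag_duplicates(values):
    """Return one boolean per value: True if it occurs more than once."""
    # COUNTIF is case-insensitive, so compare case-folded keys.
    keys = [v.casefold() for v in values]
    counts = Counter(keys)
    return [counts[k] > 1 for k in keys]

# Hypothetical sample column
names = ["Acme Inc", "Beta LLC", "acme inc", "Gamma Co"]
flags = flag_duplicates(names)
```

Each entry in flags plays the role of the TRUE/FALSE result of the formula on the corresponding row.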
Fuzzy matching techniques (for similar — not exact — matches)
1) Fuzzy Lookup add-in (Microsoft)
Microsoft offers a Fuzzy Lookup add-in for older Excel versions and fuzzy matching functionality in Power Query.
- Works on pairs of columns, computes similarity scores, and returns best matches.
- Good for moderate datasets; provides adjustable similarity threshold.
2) Power Query (Get & Transform)
Power Query supports approximate matching for joins (as of recent Excel versions).
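- Use Merge Queries and tick “Use fuzzy matching to perform the merge”; this is a checkbox in the Merge dialog, separate from the Join Kind setting.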
- Configure similarity threshold and transformation table to map common variants (e.g., abbreviations).
- Ideal workflow: load tables into Power Query, perform a fuzzy merge, review matches, and load back into Excel.
3) Levenshtein / Damerau-Levenshtein via VBA or custom functions
- Implement string distance algorithms in VBA to compute edit distances.
- Use distance thresholds to flag likely matches.
- Example pseudo-VBA approach: compute Levenshtein(A,B) and mark pairs with distance <= 2.
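The distance-threshold idea can be sketched in Python as follows; the same row-by-row dynamic programming ports to VBA. This is a plain Levenshtein distance (not Damerau-Levenshtein), and the function names and the default threshold of 2 are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def likely_match(a: str, b: str, max_distance: int = 2) -> bool:
    """Flag a pair when the case-insensitive distance is within the threshold."""
    return levenshtein(a.lower(), b.lower()) <= max_distance
```

With these definitions, likely_match("Jon Smith", "John Smith") is true (distance 1), while likely_match("Main St.", "Main Street") is false, which is why abbreviations are usually better handled by normalization than by edit distance.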
4) Soundex / Metaphone phonetic matching
- Useful for names with spelling variants that sound alike.
- Implement the phonetic code in VBA, and use built-in Power Query transformations to normalize text before matching.
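As an illustration of phonetic matching, the following Python sketch implements the common American Soundex variant (first letter plus three digits). It is a sketch under that assumption, not a reference implementation; other Soundex variants and Metaphone follow different rules.

```python
# Map consonants to Soundex digits; vowels, Y, H and W carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" if no letters."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        if ch not in "HW":
            # Vowels reset the previous code; H and W leave it unchanged,
            # so equal codes separated only by H or W are coded once.
            prev = code
    return "".join(result)[:4].ljust(4, "0")
```

Two names become candidates for a match when their codes are equal, for example soundex("Smith") and soundex("Smyth").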
Practical workflows
Workflow A — Quick deduplication (small, mostly exact)
- Normalize text: TRIM, UPPER/LOWER, remove punctuation.
- Use COUNTIF or Remove Duplicates (Data → Remove Duplicates).
- Review conditional formatting highlights before deletion.
Workflow B — Merge two customer lists (fuzzy)
- Load both tables into Power Query.
- Normalize columns (remove punctuation, expand abbreviations, standardize address components).
- Merge using Fuzzy Match. Set similarity threshold (e.g., 0.8).
- Inspect a sample of matches, adjust threshold or transform steps.
- Load merged table back to Excel and mark verified matches.
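To make the logic of Workflow B concrete, here is a rough Python sketch of a threshold-based fuzzy merge. It uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, which is not the measure Power Query uses, so scores will differ from what Excel reports. The list names and sample values are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())

def fuzzy_merge(left, right, threshold=0.8):
    """For each left value, return (left, best right match or None, score)."""
    right_norm = [(r, normalize(r)) for r in right]
    results = []
    for name in left:
        key = normalize(name)
        best, best_score = None, 0.0
        for original, candidate in right_norm:
            score = SequenceMatcher(None, key, candidate).ratio()
            if score > best_score:
                best, best_score = original, score
        if best_score < threshold:
            best = None  # below the cutoff: treat as unmatched
        results.append((name, best, round(best_score, 2)))
    return results

# Hypothetical customer lists
crm = ["Acme, Inc", "Jon Smith", "Globex Corporation"]
billing = ["Acme Inc", "John Smith", "Initech"]
matches = fuzzy_merge(crm, billing)
```

Here "Acme, Inc" and "Jon Smith" find partners, while "Globex Corporation" stays unmatched. Comparing every pair is quadratic in the number of rows, which is acceptable for a sample but not for large tables.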
Workflow C — Complex fuzzy scoring (custom)
- Create features: normalized text, Soundex codes, token overlap, address numeric comparisons.
- Compute similarity components: Jaccard/token overlap, edit distance, phonetic match.
- Combine into a weighted score and filter matches above a cutoff.
- Optionally use manual verification for borderline scores.
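The weighted-score idea in Workflow C might look like the Python sketch below. The weights, field names ("name", "postal"), and cutoff are assumptions chosen for illustration; the character-level component uses difflib.SequenceMatcher in place of a true edit distance, and the phonetic component is omitted for brevity.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Token overlap: shared words divided by all distinct words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def weighted_score(rec_a, rec_b, weights=(0.4, 0.4, 0.2)):
    """Combine token, character, and postal-code similarity into one score."""
    w_tokens, w_chars, w_postal = weights
    token_sim = jaccard(rec_a["name"], rec_b["name"])
    char_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    postal_sim = 1.0 if rec_a["postal"] == rec_b["postal"] else 0.0
    return w_tokens * token_sim + w_chars * char_sim + w_postal * postal_sim

# Hypothetical records
a = {"name": "John Smith", "postal": "10001"}
b = {"name": "Jon Smith", "postal": "10001"}
c = {"name": "Jane Doe", "postal": "94105"}

CUTOFF = 0.6  # assumed cutoff; tune it against reviewed samples
is_match_ab = weighted_score(a, b) >= CUTOFF
is_match_ac = weighted_score(a, c) >= CUTOFF
```

Keeping each component in its own column (or variable) makes it easier to see why a borderline pair scored the way it did during manual review.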
Example: Fuzzy Merge in Power Query (step summary)
- Data → Get Data → From Table/Range (for both tables).
- In Power Query Editor, apply Transform steps: Trim, Lowercase, Remove Punctuation, Split columns if needed.
- Home → Merge Queries → choose both tables → check “Use fuzzy matching”.
- Click “Fuzzy Matching Options” to set Threshold and transformations.
- Expand the merged columns to get matched fields and similarity scores.
- Filter or tag matches and Close & Load.
Tips to improve match accuracy
- Normalize aggressively: remove punctuation, stop words (e.g., “the”, “co”, “inc”), and standardize abbreviations.
- Tokenize multi-word fields (split into words) and compare token overlap.
- Use numeric anchors where possible — phone numbers, postal codes, or parts of addresses often reduce false positives.
- Start with a higher similarity threshold, then lower it gradually while reviewing results.
- Keep a manual verification step for high-impact merges (billing, legal, customer accounts).
- Record transformations and thresholds so matching can be reproduced.
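The numeric-anchor tip can be sketched as a blocking step: only records that share the same digits in a phone or postal field are compared further. The Python below is a minimal illustration; the field names "id" and "phone" and the sample records are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

def digits_only(value):
    """Strip everything except digits, e.g. "(212) 555-0100" -> "2125550100"."""
    return re.sub(r"\D", "", value)

def candidate_pairs(records, key="phone"):
    """Group records by a numeric anchor and return ID pairs within each group."""
    blocks = defaultdict(list)
    for rec in records:
        anchor = digits_only(rec[key])
        if anchor:  # skip records with no usable digits
            blocks[anchor].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Hypothetical records
records = [
    {"id": 1, "phone": "(212) 555-0100"},
    {"id": 2, "phone": "212-555-0100"},
    {"id": 3, "phone": "415 555 0199"},
]
pairs = candidate_pairs(records)
```

Only the pairs returned here would then go through fuzzy name comparison, which cuts both false positives and the number of comparisons.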
Tools and add-ins comparison
Tool / Method | Best for | Pros | Cons |
---|---|---|---|
XLOOKUP/VLOOKUP | Exact matches | Fast, built-in | Fails on near matches |
Conditional Formatting | Visual duplicate spotting | Quick, easy | Only exact matches |
Power Query Fuzzy Merge | Moderate fuzzy needs | GUI, configurable, reproducible | Can be slow on very large tables |
Microsoft Fuzzy Lookup add-in | Desktop fuzzy matching | Easy setup, similarity scores | Legacy add-in, limited scalability |
VBA Levenshtein/Soundex | Custom fuzzy logic | Flexible, programmable | Requires coding, slower on large data |
External tools (Python/pandas, OpenRefine) | Large-scale or complex | Powerful, scalable | Requires outside tools and skills |
When to move beyond Excel
If datasets exceed a few hundred thousand rows or matching logic becomes complex (multiple weighted fields, machine-learning approaches), consider:
- Python with pandas + recordlinkage or dedupe libraries.
- R with stringdist and fuzzyjoin packages.
- Dedicated data-cleaning tools (OpenRefine, Talend) or a small database with indexing.
Example Excel formulas for normalization
- Trim and lowercase:
=LOWER(TRIM(A2))
- Remove punctuation (nested SUBSTITUTE works for a few characters; Power Query is easier to maintain for longer lists):
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2, ".", ""), ",", ""), "-", "")
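For comparison, the same normalization steps can be chained in a single Python function. This is a sketch only: it strips all ASCII punctuation rather than just the three characters in the formula, and the abbreviation map is a hypothetical example rather than a recommended list.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Illustrative abbreviation map; extend it for your own data.
ABBREVIATIONS = {"st": "street", "inc": "incorporated"}

def normalize(text, abbreviations=None):
    """Remove punctuation, lowercase, collapse spaces, optionally expand words."""
    words = text.translate(PUNCT_TABLE).lower().split()
    if abbreviations:
        words = [abbreviations.get(w, w) for w in words]
    return " ".join(words)
```

For example, normalize("Acme, Inc") and normalize("Acme Inc.") both give "acme inc", and normalize("  Main St. ", ABBREVIATIONS) gives "main street".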
Validation and audit
- Keep an audit column recording original record IDs and matched IDs.
- Sample matches to estimate precision and recall.
- Document thresholds and transformation steps for reproducibility and compliance.
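Estimating precision and recall from a reviewed sample reduces to simple set arithmetic, sketched below in Python. The pair IDs are made-up sample data; "verified" stands for the matches a human reviewer confirmed.

```python
def precision_recall(predicted, actual):
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all true matches."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical (original ID, matched ID) pairs
predicted = {(1, 101), (2, 102), (3, 105), (4, 104)}
verified = {(1, 101), (2, 102), (3, 103), (4, 104), (5, 106)}

precision, recall = precision_recall(predicted, verified)
```

In this sample three of the four predicted pairs are correct (precision 0.75) and three of the five verified pairs were found (recall 0.6). Low precision suggests raising the threshold; low recall suggests lowering it or improving normalization.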
Final notes
A “Similar Data Finder” in Excel can range from simple conditional formatting to sophisticated fuzzy merges using Power Query or custom code. Start with normalization, pick the simplest tool that solves your problem, and add complexity (fuzzy algorithms, phonetic matching, weighted scores) only as needed.