# Fuzzy Duplicate Cleanup for Your Customer List

**Recommendation:** Use **Power Query's Cluster Values + Merge with Fuzzy Matching** feature. It's built into Microsoft 365, handles 8,000 rows easily, and catches variations like "Robert/Bob" and "Acme Inc./Acme, Inc" without add-ins. Use a **similarity threshold of 0.80** for names (catches nicknames) and **0.85–0.90** for companies.

---

## Step-by-Step Walkthrough

**1. Load your data into Power Query**
- Select any cell in A1:D8000
- Go to **Data > Get & Transform Data > From Table/Range**
- Confirm headers, click OK

**2. Normalize text first (critical for accuracy)**
- Select the Name, Company, and Email columns (Ctrl+click)
- Go to **Transform > Format > lowercase**
- Then **Transform > Format > Trim**
- Then **Transform > Format > Clean** (removes hidden characters)

**3. Cluster similar values to create a grouping key**
- Select the **Company** column
- Go to **Add Column > From Text > Cluster Values**
- Name the new column "Company_Cleaned"
- Click **Fuzzy cluster options** > set **Similarity threshold to 0.85**
- Repeat for the **Name** column with threshold **0.80**

*Example:* "Acme Inc." and "Acme, Inc" will both cluster to one value. "Robert Smith" and "Bob Smith" at the same company will cluster together.

**4. Find duplicates using fuzzy Group By**
- Go to **Home > Group By > Advanced**
- Group by **Name_Cleaned** and **Company_Cleaned**
- Check **Use fuzzy matching**
- Set **Similarity threshold: 0.85**
- Add aggregations: Count Rows, and "All Rows" (to preserve original data)
- Click OK

**5. Expand and review**
- Click the expand icon on the "All Rows" column
- Select all original columns
- Filter the Count column for **> 1** to see only duplicate groups

**6. Load results for review**
- Go to **Home > Close & Load To > Table > New Worksheet**

**7. Review before deleting**
- Sort by Company_Cleaned so duplicate groups sit together
- Add a "Keep" column and mark the record with the most complete data (full email, phone, fewer blanks)
- Use `=IF(COUNTIF(...)>1,"REVIEW","")` to flag clusters if needed

---

## Final Tip

**Never delete directly from Power Query output.** Export the flagged duplicates to a separate sheet, manually verify 20–30 random matches first (especially near the 0.80 threshold, since that's where false positives hide), then use the "Keep" column to filter and delete losers. For merging data from duplicates (e.g., one row has phone, another has email), use the expanded table to copy missing fields into the surviving record before deletion.
Find Near-Duplicate Records in Excel with Fuzzy Matching
Tested prompts for fuzzy matching duplicates in Excel, compared across 5 leading AI models.
You have a spreadsheet with customer names, vendor records, or product SKUs, and you know duplicates are hiding in it. The problem is they are not exact duplicates. 'Acme Corp' and 'ACME Corporation' are the same company. 'Jon Smith' and 'John Smith' are probably the same person. Excel's built-in Remove Duplicates only catches exact string matches, so it misses all of this.
Fuzzy matching solves this by scoring how similar two strings are, usually with algorithms like Levenshtein distance or Jaccard similarity. Microsoft ships a Fuzzy Lookup Add-In for Excel, Power Query has a fuzzy merge feature, and you can also offload the work to an AI model that handles messy variations in context, like abbreviations, nicknames, and transliterations.
This page gives you a tested prompt and four model outputs comparing how each AI handles a realistic dirty-data sample. Below, we cover which method fits your situation, the edge cases that break fuzzy matching, and concrete examples you can copy into your own workflow.
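To make the scoring idea concrete, here is a minimal Python sketch of Levenshtein distance, one of the algorithms named above, scaled to a 0 to 1 similarity. The function names are illustrative, and this is not how Excel or Power Query implement their scoring internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Jon Smith", "John Smith"))      # one inserted letter
print(edit_similarity("Jon Smith", "John Smith"))
```

A single inserted letter in a ten-character name still leaves the pair well above a 0.85 threshold, which is why edit distance handles typos well but struggles with nicknames and abbreviations.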
When to use this
Use fuzzy matching when your duplicates stem from human entry variance, legacy system migrations, or data scraped from multiple sources. It shines on free-text fields like names, addresses, and company identifiers where formatting rules were not enforced. If your column was entered by hundreds of people over years, you almost certainly need fuzzy logic, not exact dedupe.
- Cleaning a CRM contact list merged from Salesforce, HubSpot, and a trade show spreadsheet
- Reconciling vendor names across invoices where 'IBM', 'I.B.M.', and 'International Business Machines' all appear
- Deduplicating mailing lists with typos like 'jenifer@gmail.com' vs 'jennifer@gmail.com'
- Matching product SKUs across supplier catalogs with inconsistent hyphenation and padding
- Consolidating survey responses where respondents typed their employer name freehand
When fuzzy matching breaks down
- Exact identifiers like tax IDs, UUIDs, or primary keys. Fuzzy matching here creates false positives that corrupt joins.
- Datasets over ~500,000 rows in Excel. Fuzzy Lookup becomes unusably slow. Move to Python with RapidFuzz or a dedicated tool like Dedupe.io.
- Regulated data where a false merge has legal consequences, like patient records or financial KYC. Use deterministic rules plus human review.
- Multilingual datasets with mixed scripts. Edit-distance algorithms treat 'Москва' and 'Moscow' as completely different even though they refer to the same city.
The prompt we tested
You are an Excel data-cleaning expert helping a user find and resolve near-duplicate (fuzzy-matched) records in their spreadsheet. Based on the user's situation below, provide a clear, step-by-step walkthrough of how to fuzzy match duplicates in Excel, including the best method for their case (Power Query's Merge/Cluster Values feature, the Fuzzy Lookup Add-In from Microsoft, or formulas like TEXTJOIN with LOWER/TRIM and similarity logic). Include exact menu paths, threshold recommendations, and how to review and remove the matches. Rules: Format the response with a short recommendation first, then numbered steps with exact Excel menu paths (e.g., Data > Get & Transform > From Table/Range). Include a suggested similarity threshold (0.80-0.90), a brief example showing what would match, and a final tip for reviewing results before deleting. Keep it under 400 words and use plain language, no fluff. User's situation: I have a customer list in Excel with about 8,000 rows in columns A-D (Name, Email, Company, Phone). There are lots of near-duplicates like 'Robert Smith' vs 'Bob Smith' at the same company, or 'Acme Inc.' vs 'Acme, Inc'. I'm on Microsoft 365 on Windows and need to find and merge these fuzzy duplicates without losing data. Return the complete walkthrough now.
Model comparison
| Model | Quality | Cost | Latency | Tokens |
|---|---|---|---|---|
| Claude Opus 4.7 | 9 | $0.0841 | 18.1s | 492 / 1023 |
| Claude Haiku 4.5 | 9 | $0.0040 | 8.0s | 327 / 728 |
| GPT-5.4 | Not run yet | | | |
| Gemini 2.5 Pro | 8 | $0.0027 | 16.4s | 303 / 462 |
| Grok 4.1 Fast Reasoning | 7 | $0.0003 | 9.8s | 434 / 366 |
Outputs
# Fuzzy Duplicate Resolution Walkthrough

## Recommendation

Use **Power Query's Merge/Cluster Values feature** (built into Microsoft 365). It's fastest for 8,000 rows, requires no add-ins, and lets you review matches before committing.

---

## Step-by-Step Process

### 1. Load Your Data into Power Query
- Select your data (A1:D8001, including headers)
- **Data > Get & Transform > From Table/Range**
- Click **OK** in the dialog

### 2. Access the Cluster & Merge Feature
- In Power Query Editor, select the **Name** column (your primary fuzzy-match field)
- **Home > Cluster & Merge** (or **Transform > Cluster & Merge** if unavailable in your version)
- A clustering dialog opens

### 3. Configure Clustering
- **Similarity Threshold:** Set to **0.85** (85% similarity)
  - 0.90 = stricter (misses "Bob" ↔ "Robert")
  - 0.80 = looser (may over-merge)
- **Language:** Select your data's language
- **Algorithm:** Keep default (Levenshtein distance)

### 4. Review Proposed Matches
Power Query shows a table of similar values grouped together:
- *Example:* "Robert Smith" | "Bob Smith" | "R. Smith" grouped as one cluster
- Review each cluster in the preview pane
- **Uncheck any incorrect groupings** to exclude them
- Click **OK** to apply

### 5. Select the Replacement Value
For each cluster, choose which variant to keep (e.g., "Robert Smith" replaces all variants). Select your preferred version from each group.

### 6. Complete the Merge
- Power Query replaces all variants with your chosen value
- Click **Close & Load** to return the cleaned data to Excel

### 7. Repeat for Other Columns (Optional)
- If needed, repeat steps 2–6 for **Company** (e.g., "Acme Inc." vs "Acme, Inc.")
- Phone and Email typically need exact-match deduplication instead

### 8. Remove True Duplicates
- Select all data
- **Data > Remove Duplicates** (or **Data > Data Tools > Remove Duplicates**)
- Check columns A–D
- Click **OK**

---

## Example Match Results

| Threshold | Matches | Notes |
|-----------|---------|-------|
| 0.85 | "Robert Smith" ↔ "Bob Smith" | Same company, similar enough |
| 0.85 | "Acme Inc." ↔ "Acme, Inc." | Punctuation variation |
| 0.85 | "John" ↔ "Jon" | ❌ Won't match (too different) |

---

## Final Review Tip

**Before deleting:** Export cleaned data to a new sheet, compare row counts with the original (note how many rows merged), and spot-check 10 random clusters manually. Keep a backup of your original file.
For your situation on Microsoft 365, the best built-in tool is **Power Query's Fuzzy Merge**. It's powerful, requires no extra installation, and can match records based on the similarity of text in multiple columns like Name and Company.
Here is the step-by-step walkthrough:
1. **Load Data into Power Query:**
* First, format your customer list as an official Excel Table (**Home > Format as Table**).
* Click anywhere inside your table, then go to the menu **Data > Get & Transform Data > From Table/Range**. This opens the Power Query Editor.
2. **Add a Unique ID Column:**
* In the Power Query Editor, go to the **Add Column** tab and click **Index Column**. This adds a unique ID to each row, which is essential for identifying the original records.
3. **Perform the Fuzzy Merge:**
* On the **Home** tab, click the dropdown for **Merge Queries** and select **Merge Queries as New**. This creates a separate results table, protecting your original data.
* In the Merge window, select your table name in both the top and bottom dropdowns to match the list against itself.
4. **Configure the Fuzzy Match:**
* Select the columns you want to match on. Hold the **Ctrl** key and click the **Company** column header, then the **Name** column header in *both* tables. The order must be the same.
* Check the box at the bottom: **Use fuzzy matching to perform the merge**.
* Set the **Similarity threshold** to **0.85**. This is a good starting point (1.0 is an exact match).
* *Example:* A threshold of 0.85 would easily match "Acme Inc." with "Acme, Inc." and is likely to catch variations like "Robert Smith" and "Bob Smith".
* Click **OK**.
5. **Expand and Review the Matches:**
* A new column appears with expandable table data. Click the expand icon (two arrows) in its header
**Recommendation: Use Power Query's fuzzy merge (built into M365) – ideal for 8k rows, handles multi-column fuzzy matching like Name+Company without add-ins.**

1. Select your data (A1:D8001 incl. headers), go to **Data > From Table/Range** to load into Power Query Editor.
2. In Power Query, right-click your table > **Duplicate** (creates Table2 copy). Close & Load Table1 only.
3. Reopen Table1 in PQ: **Home > Transform Data**. Go to **Home > Merge Queries > Merge Queries as New**.
4. Select Name (A) in Table1, Ctrl+click Company (C), then Table2's Name+Company. Check **Use fuzzy matching**, set **Similarity threshold: 0.85** (catches 'Robert Smith' + 'Acme Inc.' vs 'Bob Smith' + 'Acme, Inc.').
5. Expand merged column (click icon), select only matching rows (e.g., Table2.Name, Table2.Email). Filter to keep fuzzy matches (non-null).
6. **Home > Close & Load** to sheet. Review matches (e.g., side-by-side columns highlight fuzzy pairs).
7. Manually merge data (e.g., prioritize Email/Phone), flag/delete duplicates via **Data > Remove Duplicates** or filter.

**Example match at 0.85:** 'Robert Smith' + 'Acme Inc.' fuzzy-matches 'Bob Smith' + 'Acme, Inc.' (name ~0.82 sim, company ~0.95).

**Tip:** Sort/filter results by similarity score column (add via Custom column: `Similarity([Merged])`), spot-check 10-20 pairs before bulk delete to avoid false positives.

(248 words)
What makes these work
01 Normalize before matching
Lowercase everything, strip punctuation, and trim whitespace before running fuzzy comparison. This alone eliminates 30-50% of false negatives. Store the normalized version in a helper column and match on that.
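If you prepare the helper column outside Excel, the normalization described here can be sketched in a few lines of Python. The `normalize` name is illustrative, and removing punctuation outright (rather than replacing it with a space) is an assumption that suits values like 'I.B.M.' but would glue together a value like 'Acme,Inc'.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    # Drop anything that is not a letter, digit, underscore, or whitespace.
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    # split() with no arguments trims the ends and collapses inner spaces.
    return " ".join(no_punctuation.split())


for raw in ["  Acme, Inc. ", "ACME Inc", "I.B.M."]:
    print(repr(raw), "->", repr(normalize(raw)))
```

Keep the raw value next to the normalized one so you can always trace a match back to what was originally typed.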
02 Set a similarity threshold
A 0.8 threshold, the default in Power Query's fuzzy merge, is often too permissive for short strings and too strict for long ones. Tune it per column: 0.85 for names, 0.75 for addresses, 0.9 for SKUs. Test on a sample and eyeball the borderline matches.
03 Use token-based scoring for multi-word fields
For company names and addresses, token set ratio beats raw edit distance. 'Acme Inc' and 'Inc Acme' score high on token methods and low on Levenshtein. Power Query's fuzzy merge uses Jaccard which handles this well.
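The difference between the two approaches shows up clearly on reordered words. The sketch below compares a word-level Jaccard score with a character-level score; it uses Python's `difflib.SequenceMatcher` as a stand-in for character-based comparison, which is an assumption for illustration and is neither Levenshtein distance nor Power Query's own implementation.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in 0..1 (sensitive to word order)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Shared words divided by all distinct words (ignores word order)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


pair = ("Acme Inc", "Inc Acme")
print("character-level:", char_ratio(*pair))
print("token-level:    ", token_jaccard(*pair))
```

The token score treats the two strings as identical because they contain the same words, while the character score drops because the letters no longer line up.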
04 Keep a human in the loop on borderline matches
Export matches with confidence between 0.7 and 0.9 to a review column. These are where the real judgment calls live. Auto-merge above 0.9, auto-reject below 0.7, review the middle.
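The three-way split can be expressed as a small routing function. This is a sketch using the 0.7 and 0.9 cutoffs from the paragraph above; the function name and the sample scores are made up for illustration.

```python
def triage(score: float, merge_above: float = 0.9, reject_below: float = 0.7) -> str:
    """Route a scored candidate pair to merge, reject, or manual review."""
    if score > merge_above:
        return "auto-merge"
    if score < reject_below:
        return "auto-reject"
    return "review"  # everything from 0.7 to 0.9 inclusive goes to a person


# Hypothetical scored pairs, e.g. exported from a fuzzy merge.
scored_pairs = [
    ("Acme Inc.", "Acme, Inc", 0.95),
    ("Robert Smith", "Bob Smith", 0.82),
    ("Acme Inc.", "Beta Corp", 0.31),
]
for left, right, score in scored_pairs:
    print(f"{left!r} vs {right!r}: {triage(score)}")
```

Scores that land exactly on a cutoff fall into the review bucket, which errs on the side of a human check.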
More example scenarios
Column A contains: 'Robert Smith', 'Bob Smith', 'Rob Smith', 'Roberta Smith', 'R. Smith', 'Robert J. Smith'. Column B contains matching company names: 'Acme', 'Acme Inc', 'Acme', 'Beta Corp', 'Acme Inc.', 'Acme Incorporated'. Identify which rows likely represent the same person and explain your reasoning.
Rows 1, 2, 3, 5, and 6 likely represent the same person (Robert Smith at Acme), with 'Bob' and 'Rob' being standard nicknames for Robert, 'R. Smith' matching by initial and shared employer, and 'Robert J. Smith' adding a middle initial. Row 4 (Roberta Smith at Beta Corp) is a different person despite the similar first name.
Accounts payable has these vendor strings: 'AT&T', 'AT and T', 'A.T.&.T.', 'ATT Mobility', 'AT&T Inc.', 'American Telephone'. Group them into canonical entities and flag ambiguous cases.
Canonical group 'AT&T Inc.': 'AT&T', 'AT and T', 'A.T.&.T.', 'AT&T Inc.'. Separate entity: 'ATT Mobility' is a subsidiary and should stay distinct for cost-center accounting. Ambiguous: 'American Telephone' could refer to AT&T historically but is too generic to auto-merge. Flag for manual review.
Supplier A lists: 'WDG-1001-BLK', 'WDG1001BLK', 'wdg-1001-black'. Supplier B lists: 'WDG_1001_BK', 'Widget 1001 Black'. Match these to a single master SKU.
All five entries map to master SKU 'WDG-1001-BLK' (Widget model 1001, color Black). Normalization rules applied: uppercase, strip separators (- _ space), expand color abbreviations (BK=BLK=Black). Confidence: high. Recommend storing both the raw supplier string and the normalized master SKU for audit traceability.
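The normalization rules in that answer can be sketched in Python as follows. The color table comes from the example itself; the `PRODUCT_CODES` table and the assumption that every SKU has a letters, digits, letters layout are hypothetical and would need to match your own catalog.

```python
import re

COLOR_CODES = {"BK": "BLK", "BLK": "BLK", "BLACK": "BLK"}  # from the example
PRODUCT_CODES = {"WDG": "WDG", "WIDGET": "WDG"}  # hypothetical lookup table

# Assumed layout: product letters, model digits, color letters,
# with optional '-', '_' or space separators in between.
SKU_PATTERN = re.compile(r"([A-Z]+)[-_ ]*(\d+)[-_ ]*([A-Z]+)")


def master_sku(raw: str):
    """Return the canonical SKU, or None if the string cannot be mapped."""
    match = SKU_PATTERN.fullmatch(raw.strip().upper())
    if match is None:
        return None
    product, model, color = match.groups()
    if product not in PRODUCT_CODES or color not in COLOR_CODES:
        return None
    return f"{PRODUCT_CODES[product]}-{model}-{COLOR_CODES[color]}"


for raw in ["WDG-1001-BLK", "WDG1001BLK", "wdg-1001-black",
            "WDG_1001_BK", "Widget 1001 Black"]:
    print(raw, "->", master_sku(raw))
```

Returning `None` for anything that does not fit the pattern keeps unknown formats out of the master list until someone has looked at them.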
Mailing list entries: 'jen.doe@gmai.com', 'jen.doe@gmail.com', 'jendoe@gmail.com', 'jennifer.doe@gmail.com', 'jen.doe@gmail.co'. Identify duplicates and likely typos.
'jen.doe@gmai.com' and 'jen.doe@gmail.co' are almost certainly typos of 'jen.doe@gmail.com' (missing 'l' and 'm' respectively). 'jendoe@gmail.com' and 'jennifer.doe@gmail.com' may be the same person but cannot be confirmed from email alone. Recommend merging the typo variants and flagging the other two for verification via a separate identifier.
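Domain typos like these can be caught with a short Python sketch built on the standard library's `difflib.get_close_matches`. The `KNOWN_DOMAINS` list and the 0.85 cutoff are assumptions for illustration; the function deliberately leaves the part before the `@` alone, since differences there cannot be confirmed from the email itself.

```python
from difflib import get_close_matches

KNOWN_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com"]  # hypothetical list


def suggest_domain_fix(email: str, cutoff: float = 0.85):
    """Return a corrected address if the domain looks like a typo, else None."""
    local, _, domain = email.lower().rpartition("@")
    if not local or domain in KNOWN_DOMAINS:
        return None  # malformed address, or the domain is already fine
    close = get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{close[0]}" if close else None


for address in ["jen.doe@gmai.com", "jen.doe@gmail.co", "jen.doe@gmail.com"]:
    print(address, "->", suggest_domain_fix(address))
```

Treat the result as a suggestion to review, not an automatic rewrite, because a near-miss domain can also be a legitimate one.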
Common mistakes to avoid
Trusting the default threshold
A 0.8 threshold will merge 'Smith' with 'Smyth' and 'Ltd' with 'Ltd.', but it will miss 'IBM' vs 'International Business Machines'. Always test against known duplicates before trusting bulk output.
Matching on a single column
Names alone are noisy. Match on name plus email domain, or name plus ZIP code, to disambiguate. 'John Smith' in Ohio and 'John Smith' in Texas are almost certainly different people.
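One way to sketch this is to require an exact match on the second field before the fuzzy name score is even considered. The record layout, the ZIP values, and the use of `difflib.SequenceMatcher` for scoring are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def is_probable_duplicate(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> bool:
    """Fuzzy-match the name only when the ZIP codes agree exactly."""
    if rec_a["zip"] != rec_b["zip"]:
        return False  # different location, so treat as different people
    score = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    return score >= name_threshold


# Hypothetical records.
ohio = {"name": "John Smith", "zip": "43004"}
texas = {"name": "John Smith", "zip": "75001"}
ohio_typo = {"name": "Jon Smith", "zip": "43004"}

print(is_probable_duplicate(ohio, texas))      # identical names, different ZIP
print(is_probable_duplicate(ohio, ohio_typo))  # near-identical names, same ZIP
```

The exact check on the second column acts as a cheap filter, so the fuzzy comparison only runs on pairs that are plausible in the first place.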
Deleting the source rows immediately
Never hard-delete until you have a backup of the original sheet and a mapping table showing which rows were merged into which master record. Fuzzy merges are reversible only if you kept the audit trail.
Ignoring non-ASCII characters
'José' and 'Jose' score as different under strict comparison. Apply Unicode normalization (NFD then strip combining marks) before fuzzy matching, or your Latin American and European records will stay split.
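In Python, the NFD-then-strip approach looks roughly like this, using the standard `unicodedata` module. The `strip_accents` name is illustrative. Letters that have no decomposed form, such as 'ø' or 'ß', pass through unchanged and would need a separate mapping.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented letters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ["José", "Müller", "Jose"]:
    print(name, "->", strip_accents(name))
```

Run this before lowercasing and punctuation cleanup so that every later step sees the same base letters.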
Using fuzzy logic for structured IDs
Fuzzy matching on SSNs, account numbers, or order IDs is actively harmful. A one-digit typo in a structured ID is still a different record, and fuzzy merging can collapse legitimately distinct accounts.
Frequently asked questions
Does Excel have a built-in fuzzy match function?
Not natively in the core formula set. Microsoft offers the free Fuzzy Lookup Add-In for Excel (Windows only), and Power Query includes a fuzzy merge option under Merge Queries. There is no FUZZYMATCH worksheet function.
How do I install the Fuzzy Lookup Add-In?
Download it from Microsoft's Download Center (search 'Fuzzy Lookup Add-In for Excel'), close Excel, run the installer, then reopen Excel. A new 'Fuzzy Lookup' tab appears in the ribbon. It only works on Windows, not Mac or Excel for the web.
What similarity threshold should I use?
Start at 0.85 and adjust based on results. For short strings like names, go higher (0.9) because a one-character change has a big proportional impact. For long addresses or descriptions, 0.75 may be appropriate. Always validate against a labeled sample.
Can I fuzzy match between two different sheets or tables?
Yes. Both the Fuzzy Lookup Add-In and Power Query's fuzzy merge are designed for two-table matching. Convert your ranges to Excel Tables first, then point the tool at the left table, right table, and join columns.
How is fuzzy matching different from VLOOKUP with wildcards?
VLOOKUP wildcards only handle patterns you define in advance ('Acme*' matches anything starting with Acme). Fuzzy matching scores similarity across the full string without requiring a pattern, so it catches typos, transpositions, and abbreviations you did not anticipate.
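The contrast can be illustrated outside Excel with a short Python sketch. Here `fnmatch` stands in for wildcard lookup and `difflib.SequenceMatcher` stands in for a fuzzy scorer; both are stand-ins chosen for illustration, not the mechanisms Excel uses.

```python
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, value: str) -> bool:
    """Case-insensitive wildcard test, similar in spirit to 'Acme*'."""
    return fnmatchcase(value.lower(), pattern.lower())


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in 0..1 computed over the full strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(wildcard_match("Acme*", "Acme Corporation"))  # fits the pattern
print(wildcard_match("Acme*", "Amce Corp"))         # transposed letters break it
print(fuzzy_score("Acme Corp", "Amce Corp"))        # still scores as similar
```

The wildcard only succeeds when the value fits the pattern you wrote in advance, while the similarity score still rates the transposed spelling as a close match.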
Can AI models like GPT-4 or Claude do fuzzy matching better than Excel?
For small-to-medium datasets (under a few thousand rows), yes. LLMs understand that 'Bob' and 'Robert' are the same name and that 'IBM' equals 'International Business Machines', which pure edit-distance algorithms cannot. For large datasets, use algorithmic tools first then have an LLM review the borderline cases.