You pull a report on your top accounts. The number looks off. You dig in, and there it is: the same company listed three times. "Acme Corp", "Acme Corporation", "ACME Corp Ltd". Three records. One customer. Your pipeline has been wrong for months.
This is the problem entity resolution solves. Not glamorous. Not the kind of thing that ends up in a product demo. But if your company runs on data, it's probably the thing causing the most silent damage right now.
What an entity actually is
An entity is a real-world thing: a company, a person, a location, a product. In your database, that thing might exist as dozens of records, each one entered slightly differently by different people at different times.
Your sales team adds "Microsoft UK" to the CRM. Finance has "Microsoft Ltd" in their system. The partner team refers to them as "MSFT". Same company. Three records. None of them linked.
Entity resolution is the process of finding those records, deciding they refer to the same real-world entity, and linking or merging them. That's it. Simple idea. Brutal to do well.
Why it's harder than it sounds
The obvious cases are easy. "IBM" and "IBM Corp". You can write a rule for that. The problem is the cases that look different but are the same, and the cases that look the same but aren't.
Take names. "J. Smith" and "John Smith" could be the same person or forty different people. "Barclays", "Barclays Bank PLC", and "Barclays Capital" can refer to different entities inside the same corporate group, and whether you should merge them depends on what you're reporting on. A purely rule-based system tends to either over-merge or under-merge. Both are expensive mistakes.
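To make the over-merge/under-merge trade-off concrete, here is a minimal sketch of the kind of rule people usually write first. The suffix list and the function name `normalise_name` are assumptions for illustration, not a recommended rule set.

```python
import re

# Hypothetical suffix list for illustration; a real one would be far longer.
LEGAL_SUFFIXES = {"corp", "corporation", "ltd", "limited", "plc", "inc"}

def normalise_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

# The easy cases collapse to one key, as intended.
acme_keys = {normalise_name(n) for n in ["Acme Corp", "Acme Corporation", "ACME Corp Ltd"]}

# Under-merge: the rule cannot tell that these might be the same person.
smith_keys = {normalise_name(n) for n in ["J. Smith", "John Smith"]}

# Over-merge risk: two unrelated companies that share a trading name
# (hypothetical examples) end up with the same key.
clash_keys = {normalise_name(n) for n in ["Acme Ltd", "Acme Inc"]}
```

Here `acme_keys` and `clash_keys` each contain a single key, while `smith_keys` contains two. Loosening the rule fixes one problem and worsens the other, which is why rules alone rarely hold up.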
Then there's the data quality problem. Typos. Abbreviations nobody documented. Address fields used for whatever the person felt like typing that day. One system uses country codes, another uses full country names, a third has a free-text field that contains "UK", "United Kingdom", "Great Britain", and one entry that just says "London".
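A small sketch of what cleaning up that country field might look like. The alias table and `normalise_country` are hypothetical. The point is that values you can't map should stay visibly unresolved rather than be guessed.

```python
from typing import Optional

# Hypothetical alias table; a real project would maintain a much fuller mapping.
COUNTRY_ALIASES = {
    "gb": "GB",
    "uk": "GB",
    "united kingdom": "GB",
    "great britain": "GB",
}

def normalise_country(raw: str) -> Optional[str]:
    """Return an ISO 3166-1 alpha-2 code, or None when the value is not recognised."""
    key = " ".join(raw.strip().lower().split())  # trim and collapse whitespace
    return COUNTRY_ALIASES.get(key)

values = ["UK", "United Kingdom", " great  britain ", "London"]
resolved = [normalise_country(v) for v in values]
# "London" is a city, not a country, so it comes back as None
# and can be routed to a review queue instead of being guessed.
```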
Real entity resolution has to handle all of it. It uses a combination of exact matching, fuzzy string matching, phonetic matching, and statistical scoring to build a confidence level for each potential link. High confidence: merge. Middling confidence: flag for review. Low confidence: keep the records separate. Where you set the thresholds depends on your use case and your tolerance for error.
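As a rough illustration of blending those signals, the sketch below combines an exact check, a fuzzy ratio from Python's standard `difflib`, and a Soundex phonetic code into one score. The 0.7/0.3 weights and the 0.9/0.6 thresholds are assumptions picked for the example, not tuned values. Soundex is intended for single English surnames, so treat the phonetic signal as weak on anything else.

```python
from difflib import SequenceMatcher

_SOUNDEX_GROUPS = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6")]
_CODES = {ch: digit for letters, digit in _SOUNDEX_GROUPS for ch in letters}

def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]

def name_confidence(a: str, b: str) -> float:
    """Blend exact, fuzzy, and phonetic evidence into a 0..1 score."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if a_norm == b_norm:
        return 1.0
    fuzzy = SequenceMatcher(None, a_norm, b_norm).ratio()
    phonetic = 1.0 if soundex(a_norm) == soundex(b_norm) else 0.0
    return 0.7 * fuzzy + 0.3 * phonetic  # assumed weights

def decide(score: float, merge_at: float = 0.9, review_at: float = 0.6) -> str:
    """Map a score onto three bands: merge, review, or distinct."""
    if score >= merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "distinct"

print(decide(name_confidence("Smith", "Smyth")))  # sounds alike, spelt differently
print(decide(name_confidence("Smith", "Jones")))  # little in common
```

With these assumed settings, "Smith" and "Smyth" land in the review band rather than being merged automatically, and "Smith" and "Jones" are kept apart.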
The three steps
Most entity resolution systems work in three stages, roughly:
- 01 Blocking. You can't compare every record against every other record: n records produce n(n-1)/2 pairs, so 100,000 records already mean roughly 5 billion comparisons. Blocking narrows the candidate pairs down to the ones worth examining. Group records by postcode, industry, or name prefix. Compare within groups, not across the whole dataset.
- 02 Comparison. For each candidate pair, compute similarity scores across multiple fields. Name similarity, address overlap, phone number match, email domain, founding date. Different fields get different weights depending on how discriminating they are.
- 03 Classification. Based on the combined score, decide: same entity, different entity, or uncertain. Uncertain pairs either go to a review queue or get classified by a machine learning model trained on your historical decisions.
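The three stages above can be sketched in a few dozen lines. Everything here is an illustrative assumption: the sample records, the choice of outward postcode as the blocking key, the field weights, and the thresholds. A real system would use more fields and handle missing values more carefully.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records for illustration.
records = [
    {"id": 1, "name": "Acme Corp", "postcode": "EC1A 1BB", "domain": "acme.com"},
    {"id": 2, "name": "Acme Corporation", "postcode": "EC1A 1BB", "domain": "acme.com"},
    {"id": 3, "name": "ACME Corp Ltd", "postcode": "EC1A 1BB", "domain": ""},
    {"id": 4, "name": "Apex Metals", "postcode": "EC1A 4HD", "domain": "apexmetals.example"},
    {"id": 5, "name": "Northwind Traders", "postcode": "M1 1AE", "domain": "northwind.example"},
]

# Stage 1: blocking. Only records sharing a key are ever compared.
def block_key(rec):
    return rec["postcode"].split()[0]  # outward postcode, e.g. "EC1A"

def candidate_pairs(recs):
    blocks = defaultdict(list)
    for rec in recs:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)

# Stage 2: comparison. Weighted per-field similarity (assumed weights).
WEIGHTS = {"name": 0.6, "domain": 0.4}

def similarity(a, b):
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # A missing domain counts as no evidence; that is a simplification.
    domain = 1.0 if a["domain"] and a["domain"] == b["domain"] else 0.0
    return WEIGHTS["name"] * name + WEIGHTS["domain"] * domain

# Stage 3: classification into three bands (assumed thresholds).
def classify(score, match_at=0.8, review_at=0.4):
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "distinct"

all_pairs = len(records) * (len(records) - 1) // 2  # n(n-1)/2 without blocking
pairs = list(candidate_pairs(records))
decisions = {(a["id"], b["id"]): classify(similarity(a, b)) for a, b in pairs}

print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
for pair, decision in decisions.items():
    print(pair, decision)
```

On five records the saving is small. The same grouping is what keeps the pair count manageable once the dataset runs to hundreds of thousands of rows. The cost is that two records for the same entity with different blocking keys are never compared, so the key has to be chosen with care.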
After classification comes the merge step: deciding what the canonical record looks like. Which name do you keep? Which address? This sounds trivial. It isn't, especially when the conflicting records come from systems with different levels of trustworthiness.
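One common way to handle this is a survivorship rule: for each field, take the value from the most trusted source that actually has one. The sketch below assumes a hypothetical trust ranking and invented field values. It shows one possible policy, not the only one.

```python
# Hypothetical trust ranking: a higher number wins when sources disagree.
SOURCE_TRUST = {"finance": 3, "crm": 2, "partner": 1}

def build_canonical(cluster, fields=("name", "address", "phone")):
    """For each field, keep the value from the most trusted source that has one."""
    ranked = sorted(cluster, key=lambda r: SOURCE_TRUST.get(r["source"], 0), reverse=True)
    canonical = {}
    for field_name in fields:
        for rec in ranked:
            value = rec.get(field_name)
            if value:  # skip None and empty strings
                canonical[field_name] = value
                canonical[field_name + "_source"] = rec["source"]  # keep provenance
                break
    return canonical

# Invented values for illustration only.
cluster = [
    {"source": "crm", "name": "Microsoft UK", "address": "1 Example Street", "phone": "00000 000000"},
    {"source": "finance", "name": "Microsoft Ltd", "address": "", "phone": None},
    {"source": "partner", "name": "MSFT", "address": None, "phone": None},
]

canonical = build_canonical(cluster)
# The name comes from finance (most trusted); address and phone fall back to the CRM.
```

Recording which source each surviving value came from makes the decision auditable later.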
What actually breaks without it
Operations teams feel this more than anyone. A few specific examples of what goes wrong:
- Revenue reporting is wrong. You're counting the same customer twice, or splitting their spend across three records. Your top-10 list is fictional.
- Outreach goes sideways. Sales contacts the same company from three different rep territories. The company gets three separate proposals. They notice.
- Segmentation breaks. You try to pull all customers in the financial services sector. They're filed under "Finance", "Financial Services", "FS", and "Banking", depending on who entered them. Your filter catches one label and misses the rest.
- Compliance becomes a problem. GDPR right-to-erasure requests require you to find all data related to a person. If their records are fragmented across five systems, you'll miss some. That's a risk you don't want.
- Scoring and prioritisation go wrong. You're ranking leads by revenue potential, but the scoring model is looking at incomplete records. Your best prospect looks like a small account because their data is split.
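The revenue problem is easy to see with a few made-up numbers. In the sketch below, the invoice amounts and the `entity_of` mapping (standing in for the output of a resolution step) are invented for illustration.

```python
from collections import Counter

# Invented invoice totals keyed by the name as it was typed.
invoices = [
    ("Acme Corp", 40_000),
    ("Acme Corporation", 35_000),
    ("ACME Corp Ltd", 30_000),
    ("Globex", 60_000),
    ("Initech", 50_000),
]

# Assumed output of an entity resolution step: raw name -> canonical entity.
entity_of = {
    "Acme Corp": "acme",
    "Acme Corporation": "acme",
    "ACME Corp Ltd": "acme",
    "Globex": "globex",
    "Initech": "initech",
}

raw_revenue = Counter()
resolved_revenue = Counter()
for name, amount in invoices:
    raw_revenue[name] += amount
    resolved_revenue[entity_of[name]] += amount

top_raw = raw_revenue.most_common(1)[0]            # ("Globex", 60000)
top_resolved = resolved_revenue.most_common(1)[0]  # ("acme", 105000)
```

On the raw names, the biggest customer looks like a mid-sized one split three ways. Once the records are linked, it moves to the top of the list.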
Who actually needs this
The honest answer: any organisation that collects data about real-world entities from more than one source. Which is most of them.
The cases where it matters most tend to share a few characteristics. Multiple data entry points with no enforced standards. Data that's been accumulated over years without cleanup. A merger or acquisition that brought in a whole second dataset with its own conventions. A CRM, an ERP, and a marketing platform that were never designed to talk to each other.
Government and public sector organisations are a particularly acute case. Datasets collected by different agencies over different time periods, with different formats, sometimes without unique identifiers. A citizen might appear in a housing database, a benefits system, and a health record under three slightly different name spellings and two different addresses. Linking those records correctly has real consequences.
Financial services too. Know-your-customer processes require a complete picture of a client across all products and relationships. If the same legal entity appears under five names across your systems, your compliance team has a problem.
What good resolution looks like in practice
A well-resolved dataset has a few properties. Every distinct real-world entity has exactly one canonical record. Related records that have been merged are traceable, so you can still see the originals. Confidence scores are preserved so you know which merges were clean and which were judgement calls. And critically, the system can handle new data coming in without requiring a full re-run every time.
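Those properties map fairly directly onto a data structure. The sketch below is one possible shape, with hypothetical class and field names: one canonical entity, links back to every original record, the confidence recorded at merge time, and an `attach` method so a new record can be linked without reprocessing everything.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceLink:
    source: str        # system the original record came from
    record_id: str     # that system's own identifier, so the original stays traceable
    confidence: float  # score recorded when the link was made

@dataclass
class CanonicalEntity:
    entity_id: str
    name: str
    links: List[SourceLink] = field(default_factory=list)

    def attach(self, source: str, record_id: str, confidence: float) -> None:
        """Link one more source record without re-running the whole resolution."""
        self.links.append(SourceLink(source, record_id, confidence))

    def judgement_calls(self, below: float = 0.9) -> List[SourceLink]:
        """Links whose confidence fell under the (assumed) clean-merge threshold."""
        return [link for link in self.links if link.confidence < below]

# Invented identifiers and scores for illustration.
acme = CanonicalEntity(entity_id="E-0001", name="Acme Corporation")
acme.attach("crm", "crm-123", 0.97)
acme.attach("finance", "fin-88", 0.95)
acme.attach("marketing", "mkt-5", 0.72)  # arrived later, linked incrementally

uncertain = acme.judgement_calls()  # just the marketing link
```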
The output isn't just a cleaner database. It's a database your team can actually use. Reports run against it produce numbers you can trust. Segmentation produces segments that reflect reality. When your operations lead pulls a list of top accounts, it matches the list your sales team would give you from memory.
Entity resolution is foundational to data merging. If you're combining data from two or more sources, you're doing entity resolution whether you've named it that or not. The question is whether you're doing it deliberately or just hoping the records line up.
Build it or buy it
There are open-source libraries for entity resolution: dedupe (the Python library behind the hosted Dedupe.io service), Splink, and others. They work. They also require a data engineer who knows what they're doing, time to configure and train, and ongoing maintenance as your data evolves.
For most organisations, the question isn't really build vs. buy. It's: do we have the in-house capacity to do this properly, or do we need someone who's done it before? Getting entity resolution wrong is worse than not doing it. A confidently wrong merge corrupts downstream decisions in ways that are hard to detect.
If you're looking at a data merging project and entity resolution is part of it, the most important thing is to be honest about the complexity upfront. Define what "same entity" means for your use case. Agree on how you'll handle uncertain matches before you start. And make sure whoever is doing the work has dealt with messy, real-world data before. Not just clean toy datasets.