How is disambiguation done?

Disambiguation is the process of looking for clues that give you enough confidence to match records together from different sources. Rarely is a single clue enough to give you sufficient confidence. Moreover, some clues matter more than others, which is why they are sometimes assigned numeric weights. Add the numeric weights together and you get a confidence score, with higher scores typically meaning a higher probability that two records represent the same person.
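A minimal sketch of this weighted-scoring idea, assuming a hand-picked table of clue weights (the clue names and weight values below are illustrative, not taken from any real matching system):

```python
# Hypothetical clue weights: positive values support a match,
# negative values count as conflicting evidence.
CLUE_WEIGHTS = {
    "same_last_name": 1.0,
    "same_first_name": 1.0,
    "same_middle_initial": 0.5,
    "same_credential": 0.5,        # e.g. both records say "MD"
    "same_institution": 2.0,
    "conflicting_specialty": -2.0,
    "conflicting_med_school": -3.0,
}

def confidence_score(clues):
    """Sum the weights of every clue observed for a pair of records."""
    return sum(CLUE_WEIGHTS[c] for c in clues)

# Name + middle initial + shared institution: 1.0 + 1.0 + 0.5 + 2.0
print(confidence_score(["same_last_name", "same_first_name",
                        "same_middle_initial", "same_institution"]))  # 4.5
```

Real systems tune these weights against labeled data rather than setting them by hand, but the additive structure is the same.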

The key to disambiguation is to look for expanded context. If John Smith is a medical doctor, then there is an improved chance (but not a certainty) that Dr. John Smith and John Smith MD are the same person. Middle initials help too. Dr. John Charles Smith and Dr. John C. Smith have a greater chance of being the same person. Not a guarantee, just an improved chance. Disambiguation is all about probability.

We can also look at where Dr. Smith works. If Dr. John C. Smith and Dr. John Charles Smith both work at the Dana-Farber Cancer Institute, our confidence level goes up even more.

If one of our Dr. Smiths is board certified in anesthesiology, and the other Dr. Smith is writing about radiation oncology, our match confidence is reduced. If one Dr. Smith graduated from UC Davis School of Medicine and the other went to the Duke University School of Medicine, it’s likely we don’t have a match at all.
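The Dr. Smith example above can be sketched as a clue-extraction step: compare the fields of two records and emit named clues, some supporting the match and some counting against it. The record fields and comparison rules here are invented for illustration:

```python
# Two hypothetical records for "Dr. Smith" (fields invented for illustration).
rec_a = {"first": "John", "middle": "C.", "last": "Smith",
         "institution": "Dana-Farber Cancer Institute",
         "specialty": "anesthesiology"}
rec_b = {"first": "John", "middle": "Charles", "last": "Smith",
         "institution": "Dana-Farber Cancer Institute",
         "specialty": "radiation oncology"}

def extract_clues(a, b):
    """Turn field comparisons into named clues, positive or negative."""
    clues = []
    if a["last"] == b["last"] and a["first"] == b["first"]:
        clues.append("same_full_name")
    if a["middle"][0] == b["middle"][0]:        # "C." matches "Charles"
        clues.append("same_middle_initial")
    if a["institution"] == b["institution"]:
        clues.append("same_institution")
    if a["specialty"] != b["specialty"]:
        clues.append("conflicting_specialty")   # reduces the final score
    return clues

print(extract_clues(rec_a, rec_b))
# ['same_full_name', 'same_middle_initial', 'same_institution', 'conflicting_specialty']
```

Feeding these clues into a weighted sum gives one number per record pair, with the specialty conflict pulling the score back down.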

You can easily see that, depending on the nature of the datasets you are trying to match, you can get almost infinitely clever in the clues you examine and the weights you assign to them. Moreover, there are often clues in the nature of the dataset itself. If one of your datasets comes from a source known to be precise and rigorous, that dataset may be treated as more authoritative than all the others to which it is being matched.

Finally, it’s important to keep in mind that what we are discussing here is computerized matching of datasets. Ultimately, however, the most any computer software can do is assign scores to records, and match those that exceed your pre-assigned confidence threshold and kick out those that don’t. And what happens to those records that don’t make the cut? Typically, it’s a manual, human review process to look for more subtle clues. And in the toughest cases, it means sending a confirmatory email or picking up the phone.
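The routing described above, auto-match the high scorers, reject the low ones, and send the middle band to human review, can be sketched with two thresholds. The threshold values here are placeholders; a real system would calibrate them against known-good matches:

```python
def route(score, match_threshold=4.0, review_threshold=2.0):
    """Route a scored record pair: auto-match, manual review, or no match.

    Thresholds are illustrative; in practice they are tuned so that
    auto-matches have an acceptably low false-positive rate.
    """
    if score >= match_threshold:
        return "auto-match"
    if score >= review_threshold:
        return "manual-review"   # a human looks for subtler clues
    return "no-match"

print(route(4.5))   # auto-match
print(route(2.5))   # manual-review
print(route(-1.0))  # no-match
```

The width of the review band is a cost decision: narrowing it saves reviewer time but risks bad auto-matches, while widening it trades money for accuracy.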

Quality data disambiguation is complicated, difficult to fully automate, and thus expensive. But once the hard work of disambiguation is done, it becomes very easy to start reaping the benefits of the many new insights that invariably result from uniting formerly freestanding datasets. Most of the unique answers and insights that Medmeme can deliver are the end result of this complex and laborious process known as disambiguation.