What is disambiguation?

The power of the Medmeme database comes in part from the large amount of data it has aggregated. It also comes from the proprietary data it has collected that is available nowhere else. Medmeme could have stopped there and had a valuable information product. There are many successful data companies whose primary value derives from offering “the most stuff in one place,” and there is no doubt real value in doing just that.

Medmeme, however, goes a step further by connecting its various data sources into a seamless whole. Only after disparate data sources are connected can you begin to easily extract high-value information, analytics and actionable insights.

Typically, different data sources are connected by first finding unique identifiers that are common across the datasets. Imagine that you have three databases containing company information. If each dataset contains company domain names, you’re all set. All you need to do is match the datasets on domain names, which function as what is called a “unique key”. A unique key is a unique identifier, much like a Social Security number. As it happens, domain names are unique to companies, so they are indeed unique keys. There is only one “acme.com” domain name in the world, and it can belong to only one company. If you have two different datasets with company records and both contain records with “acme.com,” you can confidently match them and merge the information, because it is highly probable they are the same company. So matching disparate datasets is easy … until it’s not.
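
As a rough illustration of matching on a unique key, here is a minimal Python sketch. The three datasets, their field names and the merge_on_key helper are all hypothetical, made up for this example; the sketch assumes every record carries a domain and that lowercasing is enough to normalize it.

```python
from collections import defaultdict


def merge_on_key(datasets, key="domain"):
    """Merge records from several datasets that share the same unique key.

    Hypothetical helper for illustration only.
    """
    merged = defaultdict(dict)
    for dataset in datasets:
        for record in dataset:
            # Normalize the key so "ACME.com" and "acme.com" match.
            normalized = record[key].strip().lower()
            merged[normalized].update(record)
            # Store the normalized form of the key in the merged record.
            merged[normalized][key] = normalized
    return dict(merged)


# Made-up sample data: three sources describing the same company.
crm = [{"domain": "acme.com", "name": "Acme Corp"}]
billing = [{"domain": "ACME.com", "revenue_usd": 1200000}]
support = [{"domain": "acme.com", "open_tickets": 3}]

companies = merge_on_key([crm, billing, support])
print(companies["acme.com"])
```

Because the domain is unique, the three records collapse into a single company record with no guesswork about whether they belong together.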

Medmeme lives in the world of not-so-easy matching. That’s because Medmeme data is organized around medical science researchers, in other words, people. In the area where Medmeme operates, there are no unique keys available. Medmeme has to match on the names of individual researchers. Sure, matching on names is simple. If two datasets both have the name John Smith, it’s easy for a computer to match those two records. But here’s the rub: John Smith is a fairly common name. How do you know the two John Smith records you are matching together are really the same John Smith? Welcome to the world of disambiguation.
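
To make the problem concrete, here is a small Python sketch of what goes wrong when a name is treated as if it were a unique key. The records, field names and the affiliation tie-breaker are assumptions invented for this example; they are not a description of how Medmeme actually disambiguates researchers.

```python
from collections import defaultdict

# Made-up records: two different researchers who share a name.
publications = [
    {"author": "John Smith", "affiliation": "Boston General", "field": "cardiology"},
    {"author": "John Smith", "affiliation": "Leeds Institute", "field": "oncology"},
]
speakers = [
    {"author": "John Smith", "affiliation": "Boston General", "event": "Cardiology Congress"},
]


def candidates_by_name(records):
    """Group records by normalized author name (the naive 'key')."""
    groups = defaultdict(list)
    for record in records:
        groups[record["author"].strip().lower()].append(record)
    return dict(groups)


def same_person_guess(a, b):
    """Naive tie-breaker: same name AND same affiliation.

    A heuristic assumed for illustration; real disambiguation needs
    far more evidence than a single extra field.
    """
    return (
        a["author"].strip().lower() == b["author"].strip().lower()
        and a["affiliation"].strip().lower() == b["affiliation"].strip().lower()
    )


groups = candidates_by_name(publications)
# The name alone returns both John Smiths for the one speaker record.
print(len(groups["john smith"]))

# Adding a second piece of evidence narrows the match to one candidate.
matches = [p for p in publications if same_person_guess(p, speakers[0])]
print(matches)
```

The name alone cannot say which John Smith gave the talk. Only the extra evidence, here a shared affiliation, narrows the candidates, and even that is a guess rather than a guarantee.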