Data normalization

We have previously discussed the challenging process of disambiguation and its importance in developing a high-value database that easily yields trustworthy and consistent results. Yet despite all the complexity and work involved in disambiguation, it only gets you part of the way to clean and useful data. The other critical process is called normalization.

Normalization is the process of standardizing data so that the computer treats equivalent entries as the same thing and processes them the same way. Surprisingly, some things that are very easy for humans to do are very hard for computers. Consider the following list:

  • United States
  • U.S.
  • US of A
  • USA
  • America
  • United States of America

To you, every item on this list represents the same thing: the name of a specific country. Easy, right? Well, for a computer, despite all its computational power, each item on the list is different and distinct. Ask for counts of published research reports by country, for example, and you’ll get a separate count for each one of these entries. And when you consider that you’ll also get a separate count for every misspelling of United States, you can start to appreciate how messy data can slow you down and even lead to errors.
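
To make the problem concrete, here is a minimal Python sketch (the report list, including the “Untied States” typo, is invented for illustration) showing how a naive count-by-country splits one country across many tallies:

```python
from collections import Counter

# Raw "country" values as they might appear in a reports table
# (this list, including the "Untied States" typo, is invented for illustration).
raw_countries = [
    "United States", "U.S.", "US of A", "USA",
    "America", "United States of America", "Untied States",
]

# A naive count-by-country treats every distinct string as a separate country.
print(Counter(raw_countries))
# Counter({'United States': 1, 'U.S.': 1, 'US of A': 1, ...})
# The true United States total is scattered across seven separate entries.
```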

The answer is simple, though the work is hard: clean up those country names in order to yield a single, consistent and dependable result. Interestingly, the first step in this process is to determine which spelling of the country name is the right one. These choices become what is called an authority file.
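
In code, an authority file can be as simple as a table holding the one preferred form per country. The sketch below keys it by ISO 3166-1 alpha-2 code, which is one reasonable choice of identifier, not a requirement:

```python
# A toy authority file: exactly one canonical spelling per country,
# keyed here by ISO 3166-1 alpha-2 code (one possible choice of key).
AUTHORITY_FILE = {
    "US": "United States",
    "GB": "United Kingdom",
    "DE": "Germany",
    "JP": "Japan",
}
```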

From there, you need to identify every other variation that exists in the database and conform it to your authority file. For smaller databases, this work is sometimes done by hand, but for larger databases, an automated solution is required. The typical approach is brute force: a database is built containing every known misspelling, with the correct spelling (from the authority file) right next to it. The computer then goes off and conducts a giant find-and-replace exercise.
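
Here is a minimal sketch of that brute-force pass, assuming the variant-to-canonical mapping fits in a Python dictionary; in practice it would live in a database table and be far larger:

```python
# Every known variant paired with the correct spelling from the authority file.
# These entries are illustrative, not an exhaustive real-world mapping.
COUNTRY_MAP = {
    "U.S.": "United States",
    "US of A": "United States",
    "USA": "United States",
    "America": "United States",
    "United States of America": "United States",
    "Untied States": "United States",
}

def normalize_country(raw: str) -> str:
    """Replace a known variant with its canonical form; unknown values pass through."""
    return COUNTRY_MAP.get(raw.strip(), raw.strip())

print(normalize_country("US of A"))  # -> "United States"
print(normalize_country("Uruguay"))  # -> "Uruguay" (not in the map, left unchanged)
```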

There are lots of nifty automated ways to identify problem entries in a database. But it still takes a human to examine the record and determine that “U.S/A” in the database equates to “United States” in the authority file. Once a misspelling has been mapped to the correct spelling, it never needs to be looked at by a human again. However, the ability of humans to create new variations and typos is seemingly infinite, so the work of normalization is never finished.
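
One common way to surface problem entries automatically (a sketch of the general idea, not any particular vendor’s method) is fuzzy string matching against the authority file. Python’s standard difflib can flag close candidates, but a human still has to confirm each mapping, and some variants are too mangled for it to catch at all:

```python
from difflib import get_close_matches

# Canonical names from the authority file (illustrative subset).
AUTHORITY = ["United States", "United Kingdom", "Germany", "Japan"]

def suggest_matches(raw: str) -> list[str]:
    """Return likely authority-file matches for an unrecognised value.

    This only narrows the search; a human still confirms each mapping.
    """
    return get_close_matches(raw, AUTHORITY, n=3, cutoff=0.6)

print(suggest_matches("Untied States"))  # -> ['United States']  (an easy catch)
print(suggest_matches("U.S/A"))          # -> []  (too dissimilar; a human has to map this one)
```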

Keep in mind, too, that the process of normalization is required for almost every field in a database. And some of these fields are much more vital than country name. Consider the importance of standardizing drug names, for example. Think also about the importance of normalizing numeric fields. What if there is a field for price information in the database and some of the entries are in euros while others are in renminbi? If the database hasn’t been normalized to reflect a single currency, your ability to query the data and analyze the results is severely compromised.
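
A minimal sketch of currency normalization, assuming each price row carries a currency code and using placeholder exchange rates (not real market rates); a real pipeline would also record the date of each rate:

```python
# Placeholder exchange rates into a single reporting currency (USD); not real market rates.
RATES_TO_USD = {"USD": 1.00, "EUR": 1.10, "CNY": 0.14}

def to_usd(amount: float, currency: str) -> float:
    """Convert a price into USD so every row is expressed in one currency."""
    return round(amount * RATES_TO_USD[currency], 2)

# Mixed-currency prices as they might arrive in the raw data.
prices = [(100.0, "EUR"), (100.0, "CNY"), (100.0, "USD")]

normalized = [to_usd(amount, currency) for amount, currency in prices]
print(normalized)  # [110.0, 14.0, 100.0] -- now a single query can compare or sum them
```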

As you can see, data normalization is about much more than making a database look tidy. It goes right to the heart of how useful a database will be. Indeed, normalization is one of the biggest challenges facing data scientists today. A New York Times article estimates that data scientists spend 50% to 80% of their time acting as so-called “data janitors,” scrubbing and cleaning data – normalizing it – so that it can be usefully analyzed.

Good databases don’t just happen. They are almost always the result of complex, demanding and time-consuming work to make the data clean enough to produce dependable results. It’s work that Medmeme does every day – work that makes your work easier.