In the AI world, having data is not sufficient. For AI models to return the best results, extensive work needs to be done to prepare the data. The following steps are needed with any data set:
Gather
Whether from a website, a file, an existing database, or a data feed, the data needs to be collected and ingested. The extraction of data from the source will differ for each format, but once the data has been loaded, we can move on to cleaning it.
Gathering data is rarely a one-time event. A frequency must be defined for gathering new data as it becomes available.
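As a minimal sketch of the gather step (the source formats and field names here are illustrative assumptions), ingestion can dispatch on source type so that the same records arrive in a common shape regardless of where they came from:

```python
import csv
import json
from io import StringIO

def ingest(source_type, raw_text):
    """Parse raw text from a source into a list of record dicts.

    Supports CSV files and JSON feeds; other formats (web pages,
    database dumps) would add their own branches.
    """
    if source_type == "csv":
        return list(csv.DictReader(StringIO(raw_text)))
    if source_type == "json":
        return json.loads(raw_text)
    raise ValueError(f"unsupported source type: {source_type}")

# The same two records arriving in two different formats.
csv_feed = "id,name\n1,Acme\n2,Globex\n"
json_feed = '[{"id": "1", "name": "Acme"}, {"id": "2", "name": "Globex"}]'

print(ingest("csv", csv_feed) == ingest("json", json_feed))  # → True
```

Once every source lands in the same record shape, the cleaning step below only has to be written once.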
Clean
When cleansing the data, it needs to be analyzed to make sure we have the highest quality of data. Cleaning can include:
- Identifying duplicates from the feed
- Determining if the data already exists in our database
- Ensuring required fields are populated
- Matching the data to the expected field type and format
- Value validation to make sure the field contains values within expected ranges
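The checks above can be sketched in Python as a single pass over a feed (the field names, required fields, and price range are assumptions for illustration, not a prescribed schema):

```python
def clean(records, existing_ids):
    """Split raw records into accepted and rejected sets.

    The checks mirror the list above: required fields, duplicates
    within the feed, records already in our database, expected
    field type, and expected value range.
    """
    seen, good, bad = set(), [], []
    for rec in records:
        rid = rec.get("id")
        if rid is None or rec.get("price") is None:      # required fields
            bad.append((rec, "missing required field"))
        elif rid in seen:                                # duplicate in feed
            bad.append((rec, "duplicate in feed"))
        elif rid in existing_ids:                        # already stored
            bad.append((rec, "already in database"))
        else:
            try:
                price = float(rec["price"])              # expected type
            except ValueError:
                bad.append((rec, "price is not numeric"))
                continue
            if not 0 <= price <= 10_000:                 # expected range
                bad.append((rec, "price out of range"))
            else:
                seen.add(rid)
                good.append(rec)
    return good, bad

feed = [
    {"id": "1", "price": "9.99"},
    {"id": "1", "price": "9.99"},   # duplicate in the feed
    {"id": "2", "price": "oops"},   # wrong type
    {"id": "3", "price": None},     # missing required value
]
good, bad = clean(feed, existing_ids={"4"})
print(len(good), len(bad))  # → 1 3
```

Keeping the rejection reason alongside each bad record makes the later follow-up with the data source much easier.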
The closer validation happens to the ingestion of the raw data, the sooner a clean, standardized data set is available to everyone. If data from the source is treated as “good” when it actually contains errors, validation is pushed onto the users of the data. They are then responsible for adding their own rules to find the clean records, which means each user may end up with a different data set for their models.
Not all records that need to be cleansed pose a problem. Sometimes the fix can be done by users if there are formatting issues. Missing data may be extracted or calculated from other fields. Other records may fall outside the expected ranges, but again, a calculation using fields from the database may be able to correct the value.
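For example, a missing or inconsistent total can often be recomputed from other fields on the same record (the field names below are hypothetical):

```python
def repair(record):
    """Fill a missing or incorrect total from quantity * unit_price.

    Illustrates the point above: some "bad" records are recoverable
    by calculation rather than being discarded.
    """
    qty, unit = record.get("quantity"), record.get("unit_price")
    expected = qty * unit if qty is not None and unit is not None else None
    if expected is not None and record.get("total") != expected:
        record["total"] = expected
    return record

rec = {"quantity": 3, "unit_price": 2.5, "total": None}
print(repair(rec)["total"])  # → 7.5
```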
Same but different
How can data be the same, but different? In the world of numbers, the data is the data: the numbers represent the digits and nothing more. With text, each data point reflects the user’s own way of interacting with the language, for example through:
- Capitalization
- Punctuation
- Abbreviations
- Misspelling
The table below provides examples of how each issue can be corrected.

| Issue | Correction |
| --- | --- |
| Capitalization | Convert data to a single case for validation |
| Punctuation | Remove punctuation before use |
| Abbreviations | For known abbreviations, substitute the equivalent value; the user maintains a tracking table to add more as new abbreviations come in |
| Misspelling | Hardest to change; needs a custom algorithm to make a correction, or is left as entered |
For all of these examples, the database would contain an additional field for the normalized text, so all data users see the same data set.
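A sketch of that normalization step (the abbreviation table here is a small illustrative sample that a user would extend over time):

```python
import string

# User-maintained tracking table of known abbreviations;
# grows as new abbreviations come in from the feed.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "dr": "drive"}

def normalize(text):
    """Lowercase, strip punctuation, and expand known abbreviations.

    Misspellings are left as entered; correcting them would need a
    custom algorithm (e.g. edit-distance matching against a
    dictionary of expected values).
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [ABBREVIATIONS.get(w, w) for w in cleaned.split()]
    return " ".join(words)

# The same address entered two different ways normalizes to one value.
print(normalize("123 Main St.") == normalize("123 MAIN STREET"))  # → True
```

The original entry stays untouched; only the extra normalized field is populated, so every data user compares against the same standardized value.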
Upon cleaning the data, users can then move the validated and corrected records into their database.
Bad records
When data has been labeled as bad, it can either be discarded or set aside for validation by the user. Users would follow up with the vendor for corrected data, or they could track the data to identify systemic issues resulting in bad records. In fact, a record may be clean while the rules applied to it need changing. The world is messy and changing, and all processes need to account for change.
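One way to make that tracking concrete is to tally rejected records by the rule that caught them, so systemic issues show up as a spike under a single rule (the rule names here are hypothetical):

```python
from collections import Counter

def triage(bad_records):
    """Count rejections per rule so systemic issues stand out.

    A spike under one rule may mean the vendor changed the feed,
    or that the rule itself needs updating.
    """
    return Counter(reason for _, reason in bad_records)

quarantine = [
    ({"id": "7"}, "price out of range"),
    ({"id": "8"}, "price out of range"),
    ({"id": "9"}, "missing required field"),
]
print(triage(quarantine).most_common(1))  # → [('price out of range', 2)]
```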
Model Benefits
Data is messy, but spending time to clean it results in the best model performance. More data leads to better models, whether through a greater volume of records or through more data fields to build models on.
Another benefit of ensuring clean data at ingest is that a universal database is available to all models. Models can then predict across the entire spectrum of records, rather than only the records that have their required fields populated.
In Summary
In the AI world, the focus is on the output and how it helps people move forward in a faster and more informed way. The reality is that a great model exists because of the extensive data cleaning required to reach a standardized record set. The best models are built on a large volume of records and a large volume of parameters.