In the AI world, having data is not sufficient. For AI models to return the best results, extensive work needs to be done to prepare the data. The following steps are needed with any data set:
Gather
Whether from a website, a file, an existing database, or a data feed, the data needs to be collected and ingested. The extraction of data from the source will differ for each format, but once the data has been loaded, we can move on to cleaning it.
Gathering data is rarely a one-time event. A frequency must be defined for gathering new data as it becomes available.
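As a minimal sketch of the gather step (the source formats and field names here are illustrative assumptions), ingestion can dispatch on source type so that the same records arrive in a common shape regardless of where they came from:

```python
import csv
import json
from io import StringIO

def ingest(source_type, raw_text):
    """Parse raw text from a source into a list of record dicts.

    Supports CSV files and JSON feeds; other formats (web pages,
    database dumps) would add their own branches.
    """
    if source_type == "csv":
        return list(csv.DictReader(StringIO(raw_text)))
    if source_type == "json":
        return json.loads(raw_text)
    raise ValueError(f"unsupported source type: {source_type}")

# The same two records arriving in two different formats.
csv_feed = "id,name\n1,Acme\n2,Globex\n"
json_feed = '[{"id": "1", "name": "Acme"}, {"id": "2", "name": "Globex"}]'

print(ingest("csv", csv_feed) == ingest("json", json_feed))  # → True
```

Once every source lands in the same record shape, the cleaning step below only has to be written once.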
Clean
When cleansing the data, it needs to be analyzed to make sure we have the highest quality of data. Cleaning can include:
- Identifying duplicates from the feed
- Determining if the data already exists in our database
- Ensuring required fields are populated
- Matching the data to the expected field type and format
- Value validation to make sure the field contains values within expected ranges
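The checks above can be sketched in Python as a single pass over a feed (the field names, required fields, and price range are assumptions for illustration, not a prescribed schema):

```python
def clean(records, existing_ids):
    """Split raw records into accepted and rejected sets.

    The checks mirror the list above: required fields, duplicates
    within the feed, records already in our database, expected
    field type, and expected value range.
    """
    seen, good, bad = set(), [], []
    for rec in records:
        rid = rec.get("id")
        if rid is None or rec.get("price") is None:      # required fields
            bad.append((rec, "missing required field"))
        elif rid in seen:                                # duplicate in feed
            bad.append((rec, "duplicate in feed"))
        elif rid in existing_ids:                        # already stored
            bad.append((rec, "already in database"))
        else:
            try:
                price = float(rec["price"])              # expected type
            except ValueError:
                bad.append((rec, "price is not numeric"))
                continue
            if not 0 <= price <= 10_000:                 # expected range
                bad.append((rec, "price out of range"))
            else:
                seen.add(rid)
                good.append(rec)
    return good, bad

feed = [
    {"id": "1", "price": "9.99"},
    {"id": "1", "price": "9.99"},   # duplicate in the feed
    {"id": "2", "price": "oops"},   # wrong type
    {"id": "3", "price": None},     # missing required value
]
good, bad = clean(feed, existing_ids={"4"})
print(len(good), len(bad))  # → 1 3
```

Keeping the rejection reason alongside each bad record makes the later follow-up with the data source much easier.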
The closer validation happens to the ingestion of the raw data, the sooner a clean, standardized data set is available to everyone. If data from the source is treated as “good” when it actually contains errors, validation is pushed onto the users of the data. They are then responsible for adding their own rules to find the clean records, which means each user may end up with a different data set for their models.
Not all records that need to be cleansed pose a problem. Sometimes the fix can be done by users if there are formatting issues. Missing data may be extracted or calculated from other fields. Other records may fall outside the expected ranges, but again, a calculation using fields from the database may be able to correct the value.
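For example, a missing or inconsistent total can often be recomputed from other fields on the same record (the field names below are hypothetical):

```python
def repair(record):
    """Fill a missing or incorrect total from quantity * unit_price.

    Illustrates the point above: some "bad" records are recoverable
    by calculation rather than being discarded.
    """
    qty, unit = record.get("quantity"), record.get("unit_price")
    expected = qty * unit if qty is not None and unit is not None else None
    if expected is not None and record.get("total") != expected:
        record["total"] = expected
    return record

rec = {"quantity": 3, "unit_price": 2.5, "total": None}
print(repair(rec)["total"])  # → 7.5
```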
Same but different
How can data be the same, but different? In the world of numbers, the data is the data: the numbers represent the digits and nothing more. With text, each data point reflects the user’s own way of interacting with the language, for example through:
- Capitalization
- Punctuation
- Abbreviations
- Misspelling
The table below provides examples of how each issue can be corrected.

| Issue | Correction |
| --- | --- |
| Capitalization | Convert data to a single case for validation |
| Punctuation | Remove punctuation before use |
| Abbreviations | For known abbreviations, substitute the equivalent value; the user maintains a tracking table to add more as new abbreviations come in |
| Misspelling | Hardest to change; needs a custom algorithm to make a correction, or is left as entered |
For all of these examples, the database would contain an additional field for the normalized text, so all data users see the same data set.
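A sketch of that normalization step (the abbreviation table here is a small illustrative sample that a user would extend over time):

```python
import string

# User-maintained tracking table of known abbreviations;
# grows as new abbreviations come in from the feed.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "dr": "drive"}

def normalize(text):
    """Lowercase, strip punctuation, and expand known abbreviations.

    Misspellings are left as entered; correcting them would need a
    custom algorithm (e.g. edit-distance matching against a
    dictionary of expected values).
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [ABBREVIATIONS.get(w, w) for w in cleaned.split()]
    return " ".join(words)

# The same address entered two different ways normalizes to one value.
print(normalize("123 Main St.") == normalize("123 MAIN STREET"))  # → True
```

The original entry stays untouched; only the extra normalized field is populated, so every data user compares against the same standardized value.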
Upon cleaning the data, users can then move the validated and corrected records into their database.
Bad records
When data has been labeled as bad, it can either be discarded or set aside for validation by the user. Users would follow up with the vendor for corrected data, or they could track the data to identify systemic issues resulting in bad records. In fact, a record may be clean while the rules applied to it need changing. The world is messy and changing, and all processes need to account for change.
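One way to make that tracking concrete is to tally rejected records by the rule that caught them, so systemic issues show up as a spike under a single rule (the rule names here are hypothetical):

```python
from collections import Counter

def triage(bad_records):
    """Count rejections per rule so systemic issues stand out.

    A spike under one rule may mean the vendor changed the feed,
    or that the rule itself needs updating.
    """
    return Counter(reason for _, reason in bad_records)

quarantine = [
    ({"id": "7"}, "price out of range"),
    ({"id": "8"}, "price out of range"),
    ({"id": "9"}, "missing required field"),
]
print(triage(quarantine).most_common(1))  # → [('price out of range', 2)]
```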
Model Benefits
Data is messy, but spending time to clean it results in the best model performance. More data leads to better models, whether through a greater volume of records or through more data fields to build models on.
Another benefit of ensuring clean data at ingest is that a universal database is available to all models. Models can then predict across the entire spectrum of records, rather than only the records that have their required fields populated.
In Summary
In the AI world, the focus is on the output and how it helps people move forward in a faster and more informed way. The reality is that a great model exists because of the extensive data cleaning required to reach a standardized record set. The best models are built on a large volume of records and a large volume of parameters.