Quality data is the grail we're all looking for. It lets us build the most accurate models, top-notch analytics and the best reports. Yet we waste so much time cleaning data, not knowing when to stop, because we don't really know what quality data is or how to reach the level of completion where we can stop cleaning and start focusing on other tasks.
At Hoxn, we know that quality data beats fancy algorithms, and that cleaning data should be neither painful nor time-consuming. To be efficient with data cleaning, we have set up a list of items to check in order to ensure quality data.
What truly is quality data?
Data is information. You can have two different kinds of bad data: the noisy data (think of the grainy TV where you can see some of the content but not everything) and the false data (a wrong label, misleading categories, and so on). Quality data is when you can extract good and true information out of your data.
It will always depend on your use-case. For example, take a raw dataset about housing prices: if my objective is to get some statistics on home prices in a particular city, the only columns I need to ensure quality for are the geographical information and the price. I don't need to clean additional variables that don't help my analysis. Keep it simple: don't waste time on unnecessary cleaning, otherwise you'll end up cleaning the whole dataset just for a quick analysis.
The literature reminds us of these six main quality pillars:
Validity: does your data fit its business rules, constraints and syntax? I'm talking about the format, the type, the range of values it should take, and so on. Basically, if your data comes with a data dictionary containing metadata, does it match? Are your dates actually dates? Are your prices numbers with decimal values? How many times have I had to convert types because the data was saved in the wrong one... How many times have I found negative prices? Things as obvious as those happen every day, and validity should be the first pillar on the way to quality data.
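A minimal sketch of such a validity pass with pandas; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical housing data with two validity problems planted on purpose.
df = pd.DataFrame({
    "sale_date": ["2023-01-15", "2023-02-30", "2023-03-01"],  # Feb 30 is impossible
    "price": [250000.0, -5000.0, 310000.0],                   # one negative price
})

# Type check: invalid dates become NaT instead of crashing the pipeline.
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
invalid_dates = df["sale_date"].isna().sum()

# Range check: prices should never be negative.
invalid_prices = (df["price"] < 0).sum()
```

Counting the violations first, before fixing anything, also gives you a quick quality report on the raw file.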
Accuracy: does your data truly represent what it says it does? This is the trickiest part and often the one that takes the most time. Your data comes from a source, and that source might have a specific process for collecting it. For example, your dataset can have a date column (ah.. date formats ❤) that you think is MM/DD/YYYY, but this specific source fills it in DD/MM/YYYY because that's what they're used to (for whatever reason). If you're looking at a study on the first ten days of a specific month, you have absolutely no way to tell which format it is; the only way is to study the source more than the data itself, and understand its processes. For example, you can study another dataset from the same source and find a date that proves it is exclusively one of the two formats, and then infer it on your first dataset. This is why accuracy is the most time-consuming and hypothesis-driven pillar.
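The "find a date that proves the format" trick can be sketched as a tiny heuristic: if any value's first component exceeds 12, the column has to be day-first. (The reverse is not true: the absence of such a value proves nothing, which is exactly why accuracy needs hypotheses.)

```python
# Illustrative values from a hypothetical second dataset of the same source.
dates = ["03/04/2023", "15/04/2023", "07/04/2023"]

def looks_day_first(values):
    """True if any first component > 12, which proves DD/MM/YYYY.

    A False result is inconclusive: all days may simply be <= 12.
    """
    return any(int(v.split("/")[0]) > 12 for v in values)

day_first = looks_day_first(dates)  # "15/04/2023" settles it here
```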
Completeness: this one is quite straightforward: how much missing data do you have in your dataset? Missing data is one of the most studied data cleaning challenges in research, and we'll write a dedicated article about it. However, do not be misled by missing data: the true meaning of completeness is whether the required data is known. If some unnecessary data is unknown, the completeness level shouldn't be affected.
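That distinction is easy to encode: measure completeness on the required columns only. A small sketch, with illustrative column names:

```python
import pandas as pd

# Hypothetical pricing dataset; "notes" is not needed for the analysis.
df = pd.DataFrame({
    "city":  ["Paris", "Lyon", None],
    "price": [450000, 320000, 290000],
    "notes": [None, None, "sea view"],  # unnecessary: its gaps shouldn't count
})

required = ["city", "price"]
missing = df[required].isna().sum().sum()
completeness = 1 - missing / df[required].size  # 5/6 here, not 3/9
```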
Uniqueness: duplicated records can ruin your analysis, lead to overfitting models, and produce blurry (and wrong) reports. However, it also depends on your use-case: not all datasets and analyses require eradicating duplicates, so be aware that duplicates can be either good or bad information depending on your data. This is why we don't talk about duplicates but about uniqueness: nothing should be recorded more than once with respect to what your dataset "identifies".
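In practice this means deduplicating on the key your dataset identifies, not on whole rows. A sketch, where `listing_id` is a hypothetical key:

```python
import pandas as pd

# One listing recorded twice; uniqueness is defined on listing_id.
df = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "price":      [250000, 250000, 310000],
})

n_duplicates = df.duplicated(subset=["listing_id"]).sum()
deduplicated = df.drop_duplicates(subset=["listing_id"])
```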
Consistency: your data should be treated the same way when it looks the same. It's as simple as that. I've seen many datasets with some columns one-hot encoded (either a 0 or a 1) and, right next to them, others holding a boolean (either true or false). Consistency means applying the same kind of transformation and categorization to similar variables. It also goes a bit further and asks you to be consistent across multiple datasets (and files). This is why, at Hoxn, all our data follows the same naming convention across the whole website: you can download the datasets, merge them together, and apply the same functions to the same kinds of columns, because every dataset is consistent with every other dataset here.
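The 0/1-next-to-true/false situation can be fixed by picking one representation for all similar flag columns; a sketch with made-up column names:

```python
import pandas as pd

# Two flag columns encoded two different ways, side by side.
df = pd.DataFrame({
    "has_garden": [0, 1, 1],          # one-hot style
    "has_pool":   [True, False, True],  # boolean style
})

# One shared representation for every flag column.
flag_cols = ["has_garden", "has_pool"]
df[flag_cols] = df[flag_cols].astype(bool)
```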
Uniformity: a derivative of consistency, and the two complete each other. Uniformity aims at standardizing naming conventions, units of measure, and so on. This is also what we've set up at Hoxn: all the datasets respect the relevant ISO conventions in order to be as plug-and-play as possible with your programming or analysis tools.
Now you're a theoretical expert on data cleaning, congratulations! 🔥🎉 But hey, let's dive into more practical stuff, shall we? Let's look at the raw dataset you've just downloaded from another place than here (obviously, everything's clean here 😎) and make it shiny!
And maybe you'll want to send us a message with a cleaned dataset you'd like to expose here on Hoxn? If so, please feel free to do so!
How to reach quality data?
Reaching quality data shouldn't be painful; we have to agree on that right now. We should spend our time working on pristine data instead of spending days making it pristine. But hey, not all datasets are shiny like diamonds right away, and they need to be shaped in order to extract the most information out of them.
Here at Hoxn, we've all been working in data-science-related jobs for many years, and we've all converged on the same 8-step process. It might scare you, but it's actually pretty simple and intuitive.
First of all, you should know that only a handful of datasets need all 8 checkboxes crossed; we usually have between 2 and 6 items to perform carefully. So please be reassured: this list is just a kind of memo that, once you've read it once or maybe twice, you'll follow intuitively. Maybe not in the order I've provided, but it doesn't matter; you'll find your own order and process pretty much by yourself.
This order is the one I like, but it is entirely up to you to find your own. I'm writing it down as I would have loved to read it before I started my career in data, because it would have saved me tremendous amounts of time and a lot of stressful moments!
Format handling: the root of many of your problems. Your data might come from a huge variety of sources and files, and each of them has a specific handling protocol: Excel documents aren't opened the same way as a REST API response or AWS-cloud-stored documents. Format handling also covers all the file specificities such as the separator, the header that is sometimes missing... and the encoding! Encoding is the most annoying thing still haunting us, along with date formats, and should be handled with care.
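Separator, missing header and encoding all show up in a single `read_csv` call; here is a sketch on a simulated raw export (the bytes stand in for a real file):

```python
import io
import pandas as pd

# A semicolon-separated, headerless file with Latin-1 accents, simulating
# a raw open-data export.
raw = "Orléans;310000\nNîmes;250000\n".encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw),
    sep=";",               # not the default comma
    header=None,           # the header is missing from the file
    names=["city", "price"],
    encoding="latin-1",    # reading as UTF-8 would mangle the accents
)
```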
Irrelevant information removal: as mentioned in the Uniqueness pillar, datasets often contain irrelevant information that needs to be removed in order to reduce or eliminate bias in the data. Duplicated data is often part of the information we need to erase: when it is not needed, it creates tremendous bias. The Validity pillar makes its way here as well, with the removal of negative prices and the like.
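Both removals fit in two lines; a sketch on illustrative data:

```python
import pandas as pd

# One exact duplicate row and one invalid (negative) price, both planted.
df = pd.DataFrame({
    "city":  ["Paris", "Paris", "Lyon", "Nice"],
    "price": [450000, 450000, 320000, -100],
})

df = df.drop_duplicates()       # Uniqueness: erase the repeated record
df = df[df["price"] >= 0]       # Validity: drop the negative price
```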
Harmonization: once the right columns are selected and we're fine with our current dataset, we want to rename everything to standard grounds, with the naming convention of your choice (camelCase, snake_case, etc.). Doing so regularly on all of your datasets will make you swift when writing functions and later analyses, because you already know all your country code columns are named country_code and your country name columns are country_name. Harmonization also includes applying the ISO norms to columns, such as ISO 8601 dates, WGS 84 geolocations, and so on.
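A sketch of both steps, snake_case renaming plus ISO 8601 dates, on made-up column names (here we assume the source is known to use MM/DD/YYYY):

```python
import pandas as pd

df = pd.DataFrame({
    "Country Code": ["FR", "DE"],
    "Sale Date":    ["03/15/2023", "04/02/2023"],  # assumed MM/DD/YYYY source
})

# Naming convention: everything to snake_case.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# ISO 8601: parse with the source's format, re-emit as YYYY-MM-DD.
df["sale_date"] = (
    pd.to_datetime(df["sale_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
)
```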
Consistency assessment: a pure application of the eponymous pillar and the Validity one. We need to clean up the consistency between column types and prepare them for our future pipelines, models and analytics, where two columns of the same type with different behaviors will inevitably break the processes.
Integrity check: now that we've checked column consistency, we can go deeper into the data and ensure that all the values in a column have the same format, either technical or business. For example, we might want to make sure that the weight units "Kg." and "kg" are set to the same value, or that for a country with 5-digit postal codes, the leading zero of the first thousand postal codes hasn't been removed by the encoding at the source, and so on.
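Both examples from this step, sketched on illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "weight_unit": ["Kg.", "kg", "KG"],
    "postal_code": [1000, 75001, 6000],  # 01000 and 06000 lost their zero
})

# "Kg.", "kg" and "KG" all become the same value.
df["weight_unit"] = df["weight_unit"].str.lower().str.rstrip(".")

# Restore the leading zeros a numeric cast stripped from 5-digit codes.
df["postal_code"] = df["postal_code"].astype(str).str.zfill(5)
```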
Outlier removal: this part is about detecting and handling outliers that we want to filter out of the data because their oversized values will mislead all our calculations. It is a tedious one: removing outliers can make it easier to tune your model and get good graphical visualizations of the data, but on the other hand it can lead to underfitting and the risk of missing out on very important information. As we mentioned earlier: data is information, and any data you filter out is information you remove.
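One common detection technique is the 1.5 × IQR rule (one option among many, and the threshold is a convention, not a law). Flagging first and dropping second keeps the decision explicit:

```python
import pandas as pd

# Illustrative prices with one oversized value planted.
prices = pd.Series([250000, 260000, 270000, 280000, 5000000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# A point is flagged if it falls outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
is_outlier = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
kept = prices[~is_outlier]
```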
Missing data handling: another kind of tricky data that can either help you or completely ruin your dataset with a lot of biased data injection. Should we delete the full row of data? Should we replace the missing value with suggested or inferred values? This will be the most thoughtful part of your data cleaning, and you should really make it depend on your use-case.
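The two options from the questions above, side by side on illustrative data (the median is just one possible suggested value):

```python
import pandas as pd

df = pd.DataFrame({"price": [250000.0, None, 310000.0, 290000.0]})

# Option 1: delete the full row of data.
dropped = df.dropna(subset=["price"])

# Option 2: replace the missing value with a suggested one (here the median).
imputed = df.fillna({"price": df["price"].median()})
```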
Manual corrections: unfortunately, as the name suggests, this is the least automatable of the 8 items. Sometimes the source of your data might have made a mistake; it's pretty common with open data, where some fields are filled in by hand, and you can spot these if you take a good look at the data. When you do, well... fix it by looking for the true value somewhere you can find it. For example, in our GDPR dataset, the source got the date wrong several times, and we carefully corrected it in order to deliver the best quality data we could offer.
All the datasets you can find on Hoxn have been through this whole process, every single one of them, every single time. Not all of them needed the full 8 items crossed: at the time of writing, the free datasets have around 3 crossed items, and the paid datasets between 4 and 7. This distinction helps us define their pricing: if fewer than 3 items are checked, they're free no matter what; otherwise, they carry a price for the work done on the quality and maintenance of the data.
You'll see that in no time, when cleaning your datasets, you'll naturally go through this list and check every item. When you've done so, feel free to say your data is now quality data, and you can use this checklist as a milestone to stop cleaning data and start extracting information. When the checklist is done, the real analysis starts!
If you have any comments, thoughts or questions about the current article or ideas about new ones, please feel free to leave us a message!