The athlete and the coach. The candidate and the campaign manager. The actor and the producer. In each pairing there is a darling who is the face of the work, and an unsung hero behind the scenes doing the incredibly heavy lifting.
The unsung hero of data analytics is the data engineer, who makes the work of the data scientist, statistician, or analyst so much easier. The role usually sits behind the scenes, sharp but with little visibility and little credit, supporting the analyst and making the task of extracting insights out of data so much more productive.
In this series of 3 blog posts, I discuss the important skills of a quality data engineer, and how this oft-overlooked role is the linchpin in any data analytics program worth its salt. Even if you don’t have this exact role, if you are missing any of the core competencies, your data analytics will keep bumping into limits. Find the whole series here:
In part 1, we discussed the important skills of the data engineer to enable data science. Then we drilled into 3 powerful constructs a data engineer builds to aid statisticians and data scientists (semantic layers, data marts, analytical data sets). Now we’ll go deeper into Data Hygiene, and 4 core pillars of usable data.
At a former employer, we had a motto that data should be:
This was a noble and difficult goal, and a lot of attention was given to these 4 pillars.
Data should be free of defects. Examples:
- IP Addresses are in the form nnn.nnn.nnn.nnn. If data comes in with IP Addresses that do not conform, that data should be reported on and steps taken to find the root cause of the error and make appropriate fixes. This relies heavily on Regular Expressions (RegEx) for checking syntax.
- There should not be referential integrity issues between tables. If you have a column for a customer ID but that customer ID doesn’t exist in the customer table, there’s an issue that needs to be investigated.
- Columns with discrete possible values must not contain outliers. If a column can only contain the values A, B, C, and D, and an E shows up in the data, there is an issue.
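The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production framework, and the field names (`ip`, `customer_id`, `status`) are invented for the example:

```python
import re

# Syntax check: IP addresses must match the nnn.nnn.nnn.nnn form,
# with each octet in the 0-255 range.
IP_PATTERN = re.compile(
    r"^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
    r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$"
)

def invalid_ips(rows):
    """Return rows whose 'ip' field fails the syntax check."""
    return [r for r in rows if not IP_PATTERN.match(r["ip"])]

def orphaned_customer_ids(orders, customers):
    """Return orders whose customer ID has no matching customer row."""
    known = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known]

def out_of_domain(rows, column, allowed):
    """Return rows whose column value falls outside the allowed set."""
    return [r for r in rows if r[column] not in allowed]

orders = [{"ip": "10.0.0.1", "customer_id": 7, "status": "A"},
          {"ip": "999.1.2.3", "customer_id": 8, "status": "E"}]
customers = [{"customer_id": 7}]

print(invalid_ips(orders))                                    # the bad-IP row
print(orphaned_customer_ids(orders, customers))               # the orphan row
print(out_of_domain(orders, "status", {"A", "B", "C", "D"}))  # the outlier row
```

Each function returns the offending rows so they can be written to a report for root-cause investigation rather than silently dropped.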
The process of automating simple checks on data types, values, and expected columns is quick and pays dividends, and while chasing down root causes takes more time, it is pivotal to maintaining usable data you can trust.
Flat out, data should be correct. Sales must be accurate, page views must be within expectations, cost of goods must align within tolerances of the selling price. Business decisions depend on correct data. I’ve seen examples where the credit card number accidentally ended up in the selling price during transmission of the data. You can imagine how a 16-digit number in the price column can skew sales for a given day tremendously, causing some strange-looking results!
Your process should catch these aberrations before reporting is delivered to executives and business line managers, or else the analytics team is going to be very embarrassed. One technique I’ve used is to fit a bell curve to predict expected values for the day being processed. If the value falls outside something like a 95% confidence interval around the expected value, an email is generated and reporting is put on hold until someone can determine if there is an issue. So if sales are expected to be $100,000 for the day ± $25,000 (with 95% confidence), and sales end up being $60,000, someone needs to check into it.
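A minimal sketch of that kind of check, assuming a simple normal model over recent daily values (the history figures are invented for illustration):

```python
import statistics

def daily_value_check(history, today, z=1.96):
    """Flag today's value if it falls outside the ~95% band implied by history."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    low, high = mean - z * sd, mean + z * sd
    return low <= today <= high, (low, high)

# Recent daily sales, in dollars (hypothetical figures).
history = [98_000, 102_000, 95_000, 105_000, 100_000]

ok, band = daily_value_check(history, today=60_000)
if not ok:
    # In practice: send the alert email and put reporting on hold.
    print(f"Value outside expected band {band}; hold reporting and alert someone.")
```

A real implementation would likely model seasonality (day of week, holidays) rather than a flat mean, but the hold-and-alert pattern is the same.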
To ask any question of any data at any time, you have to have all the data. This is tricky and not always practical, but it should be the starting philosophy. A good example of this is clickstream data. Bots and spiders are typically omitted from data feeds since this data has little value, so it might be tempting to filter all of this activity completely out of analytics datasets. But of course there was one time we were asked to do a study on website load, meaning we needed to include every last page load, bot or not. While this data could come from Apache server logs, which are vertically comprehensive (i.e., all the server calls), they are too horizontally limited (i.e., not all the data elements) to really analyze the load. Luckily, we already had data coming in from an analytics provider that contained comprehensive page activity including bots and spiders, and as a practice we saved all that raw data, only filtering bots and spiders at a view level. So the analysis became pretty simple, even though this type of question was never anticipated.
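The keep-everything-raw, filter-at-the-view pattern can be sketched like this. The `is_bot` flag and field names are hypothetical; in practice the filter would live in a database view rather than application code:

```python
# Raw clickstream data keeps every hit, bot or not.
raw_hits = [
    {"page": "/home", "is_bot": False},
    {"page": "/home", "is_bot": True},   # spider traffic stays in the raw data
    {"page": "/cart", "is_bot": False},
]

def human_traffic_view(hits):
    """The default analyst view: bots and spiders filtered out."""
    return [h for h in hits if not h["is_bot"]]

def load_study_view(hits):
    """For a server-load study, every last page load counts, bot or not."""
    return list(hits)

print(len(human_traffic_view(raw_hits)))  # 2: humans only
print(len(load_study_view(raw_hits)))     # 3: everything
```

Because filtering happens at read time, an unanticipated question like the load study needs no backfill; the raw data was there all along.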
By complete, I mean all the data is there as expected. While missing data occurs and there must be strategies to replace it with default values, reporting on missing data and taking action to rectify instances where this occurs is still required. It’s another type of monitoring and troubleshooting that takes time, but it contributes to a good data hygiene practice and the ability to trust and use data without fear.
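A small sketch of that fill-with-defaults-but-report pattern; the column names and default values are invented for illustration:

```python
# Hypothetical per-column defaults for missing values.
DEFAULTS = {"region": "UNKNOWN", "quantity": 0}

def fill_and_report(rows):
    """Replace missing values with defaults, but report every occurrence
    so the root cause can be chased down later."""
    report = []
    for i, row in enumerate(rows):
        for col, default in DEFAULTS.items():
            if row.get(col) is None:
                report.append((i, col))
                row[col] = default
    return rows, report

rows = [{"region": "EMEA", "quantity": 3},
        {"region": None, "quantity": None}]
filled, report = fill_and_report(rows)
print(report)     # each (row index, column) that needed a default
print(filled[1])  # the row after defaults were applied
```

The important part is the report: filling in defaults keeps downstream jobs running, but only the report lets you fix the upstream cause.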
All the great data in the world doesn’t do a lot of good if it takes forever to extract value from that data. An engineer’s role is to make sure that queries can run efficiently and that everyone can run their queries and get results back in a timely manner. Whether through indexing, partitioning, collecting statistics, MPP, map/reduce, etc., the engineer is responsible for putting the right strategies in place to pull the data in a timely manner.
One table I had to deal with recently had billions of rows of data and initially took forever to query. Through techniques like pseudo-columnar design, table indexing, and good old fashioned people skills like mentoring and sharing query patterns, we were able to pull out insights efficiently.
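As a small illustration of one of those levers, here is a sketch using SQLite (an assumption; the billion-row table lived on a very different platform) showing how adding an index changes a lookup from a full-table scan to an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (visitor_id INTEGER, page TEXT)")
conn.executemany("INSERT INTO pageviews VALUES (?, ?)",
                 [(i % 1000, f"/page/{i % 50}") for i in range(100_000)])

query = "SELECT COUNT(*) FROM pageviews WHERE visitor_id = 42"

# Without an index, the plan shows a full-table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before)

conn.execute("CREATE INDEX idx_visitor ON pageviews (visitor_id)")

# With the index in place, the plan uses idx_visitor instead of scanning.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after)
```

The same idea scales up: on an MPP or Hadoop platform the mechanisms differ (partitioning, statistics, map/reduce), but the engineer's job is still to make the common query patterns cheap.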
If your organization is missing some of the key skills or steps needed to make the most of your data, contact us and let’s begin talking about modernizing your data analytics.