The athlete and the coach. The candidate and the campaign manager. The actor and the producer. In each pairing there is a darling who is the face of the work, and an unsung hero behind the scenes doing the heavy lifting.
The unsung hero of data analytics is the data engineer, who makes the work of the data scientist / statistician / analyst so much easier. This role usually sits behind the scenes, a little geeky, with little visibility and little credit, yet it supports the analyst and makes the task of extracting insights from data far more productive.
In my next 3 blog posts I’ll discuss the important skills of a quality data engineer, and how this oft-overlooked role is the linchpin in any data analytics program worth its salt. Even if you don’t have this exact role, if you are missing any of the core competencies, your data analytics will keep bumping into limits. Find the whole series here:
Tag: Data Engineer: Unsung Hero of Data Analytics
The engineer and the analyst really are two different, specialized skill sets. I’ve listed some of the skills required for each role. This is not a comprehensive list, but it provides some flavor of what’s needed:
Data Scientist / Statistician / Analyst
- Descriptive statistics
- Regression models
- Test/control
- Business insights
Data Engineer
- Data modeling
- Data Harmonization and Governance Implementation
- ETL / ELT
- Query Performance
- Building out semantic layers, data marts, and analytical data sets
- Data Hygiene
The skills of the data scientist/analyst are well known, but many organizations suffer because they are missing some or all of the skills of the data engineer. To start, let’s delve into the first four skills for the data engineer. Building out semantic layers, data marts, and analytical data sets, and data hygiene, are so important that we will save each of them for its own article.
Data Modeling
Data modeling includes building a representation of business data and the strategies and architecture for how data is stored. I like the Teradata philosophy of ‘ask any question of any data at any time’. This is very powerful and influences how I like to model data. Common practice is to build out a logical and/or conceptual data model, that is, how the data is viewed by the business without the constraints of the source or target systems. The physical model is the practical implementation of the logical model, taking source data with all its anomalies (recall, source systems are designed to support applications, not analytics) onto whatever platform is collecting and storing the data. The physical model accommodates strengths and weaknesses of the analytics platform, such as indexing and partitioning.
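To make this concrete, here is a minimal sketch (in Python, with SQLite standing in for the analytics platform) of the same two entities expressed first as a logical model and then as one possible physical implementation. The entities and column names are illustrative assumptions, not a prescription.

```python
# A minimal sketch: a logical "Customer places Order" model as plain dataclasses,
# and one possible physical implementation on an analytics platform (SQLite here
# purely for illustration).
from dataclasses import dataclass
from datetime import date
import sqlite3

# --- Logical model: how the business views the data, platform-agnostic ---
@dataclass
class Customer:
    customer_id: int
    name: str
    segment: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # relationship: an order belongs to one customer
    order_date: date
    amount: float

# --- Physical model: the same entities on a specific platform, with
#     platform-level choices such as indexing added for query performance ---
ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    segment     TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,   -- stored as ISO-8601 text on this platform
    amount      REAL NOT NULL
);
-- physical-design decision: index the common join/filter column
CREATE INDEX idx_orders_customer ON orders (customer_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```

The logical model stays stable as the business’s view; the DDL is where platform-specific trade-offs show up.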
Data Harmonization and Governance Implementation
When there are multiple application systems from which data is being sourced, they tend to name things differently. For example, different applications may call a customer by many names: customer, client, party, individual, etc. Then there are data elements that have a common name in many applications but actually mean something subtly different in each. A great example of this is the concept of a campaign. I’ve seen ‘campaign’ mean anything from a collection of related marketing events all the way down to a single marketing event.
Data harmonization is the process of defining data elements and then mapping those elements from source systems, regardless of how each system names them, into the data models under names that are agreed to by business stakeholders, ensuring an ‘apples-to-apples’ comparison. Data governance, at least in this narrow context, is an agreed dictionary of terminology. In the example of campaign, the business may define a campaign as something like, “a set of marketing activities that support a specific business event with a defined budget and time period.” Of course, definitions will vary by company, but the point is to have an agreed definition and then make sure the data models support that definition.
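Here is a minimal sketch of what that mapping can look like in practice, assuming two hypothetical source extracts (a CRM and a web-analytics tool) and an illustrative canonical schema:

```python
# A minimal sketch of harmonization; the source column names and the canonical
# names are assumptions for illustration, not a standard.
import pandas as pd

# Each source names the "customer" concept differently.
crm = pd.DataFrame({"client_id": [1, 2], "client_name": ["Acme", "Globex"]})
web = pd.DataFrame({"party_key": [2, 3], "party_nm": ["Globex", "Initech"]})

# Governance artifact: an agreed mapping from source-specific names to the
# canonical names defined in the data dictionary.
MAPPINGS = {
    "crm": {"client_id": "customer_id", "client_name": "customer_name"},
    "web": {"party_key": "customer_id", "party_nm": "customer_name"},
}

def harmonize(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source columns to the canonical model and tag the origin."""
    out = df.rename(columns=MAPPINGS[source])
    out["source_system"] = source
    return out

customers = pd.concat(
    [harmonize(crm, "crm"), harmonize(web, "web")], ignore_index=True
)
print(customers)
```

The mapping table is itself a governance artifact: once stakeholders agree on the canonical names, the pipeline simply enforces them.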
Of course, data governance is a much broader topic, addressing assurance, security, access, etc. The data engineer must be cognizant of all these pillars of data governance and work to make sure the data conforms to business and industry standards. A good example is tokenizing credit card numbers to mitigate harm from a data breach. If you are not familiar with tokenizing data, it’s where sensitive data is replaced by a surrogate value, such as a keyed hash, that cannot be reverse-engineered, while the actual value is stored elsewhere in a secured location.
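As an illustration of that idea, here is a minimal sketch using a keyed hash as the token and an in-memory dictionary standing in for the secured vault; in a real deployment the key and the vault would live in dedicated, hardened systems.

```python
# A minimal tokenization sketch (assumptions: HMAC-SHA256 as the token function,
# an in-memory dict as the "secured vault").
import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-store-in-a-key-manager"  # assumption: managed secret
vault: dict[str, str] = {}  # stand-in for the secured token vault

def tokenize(card_number: str) -> str:
    """Replace a card number with a token that cannot be reversed without the vault."""
    token = hmac.new(SECRET_KEY, card_number.encode(), hashlib.sha256).hexdigest()
    vault[token] = card_number          # real value kept only in the secure store
    return token

def detokenize(token: str) -> str:
    """Authorized systems look the original value up in the vault."""
    return vault[token]

t = tokenize("4111111111111111")
print(t)                 # safe to land in the analytics platform
print(detokenize(t))     # only possible with access to the vault
```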
ETL / ELT
Once the data is modeled and conforms to the definitions in the data dictionary, the next activity is to bring the data from the source applications into the physical data model. This is typically done via tools that support Extract, Transform, and Load (ETL). Another variation is Extract, Load, and Transform (ELT), where the data is loaded first and transformed inside the target platform.
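To show the shape of the pattern, here is a minimal ETL sketch, assuming a hypothetical orders extract and using SQLite as the target purely for illustration; a production pipeline would use a dedicated ETL/ELT tool against the organization’s actual physical model.

```python
# A minimal ETL sketch: extract raw rows, transform them to the physical model,
# load them into the target. File layout and column names are assumptions.
import csv
import io
import sqlite3

# Extract: raw rows as they come out of the source application.
source_extract = io.StringIO(
    "order_id,client_id,order_dt,amt\n"
    "101,1,2024-01-05,19.99\n"
    "102,2,2024-01-06,250.00\n"
)
raw_rows = list(csv.DictReader(source_extract))

# Transform: cast types and conform source names to the physical model.
conformed = [
    (int(r["order_id"]), int(r["client_id"]), r["order_dt"], float(r["amt"]))
    for r in raw_rows
]

# Load: write conformed rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", conformed)
conn.commit()
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```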
There are numerous software tools that perform this function well. A good one should support ‘change data capture’ as well as full data replacement. And data can be landed in various formats: unstructured data, as you might find in Hadoop; document-oriented data in NoSQL databases such as MongoDB; semi-structured data; relational models (2NF, 3NF, etc.), as is typical of most applications; or star schema data models.
One of the particular challenges of ETL that the engineer has to solve is the performance of the ETL jobs. If the business needs KPIs by Monday morning at 8:00, ETL cannot afford to take days to process. With some datasets, especially in digital and IoT, care must be taken to engineer the ETL to meet required service levels.
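One common way to hit those service levels is incremental, change-data-capture style loading with a high-water mark, so each run only processes rows that changed since the last run. The sketch below illustrates the idea; the table and column names are assumptions.

```python
# A minimal high-water-mark (CDC-style) incremental load sketch on SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_events (event_id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
CREATE TABLE target_events (event_id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_loaded TEXT);
""")
conn.executemany(
    "INSERT INTO source_events VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-01-02"), (3, "c", "2024-01-03")],
)
conn.execute("INSERT INTO etl_watermark VALUES ('source_events', '2024-01-01')")

def incremental_load(conn: sqlite3.Connection) -> int:
    """Copy only rows newer than the stored watermark, then advance it."""
    (watermark,) = conn.execute(
        "SELECT last_loaded FROM etl_watermark WHERE table_name = 'source_events'"
    ).fetchone()
    changed = conn.execute(
        "SELECT event_id, payload, updated_at FROM source_events WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT OR REPLACE INTO target_events VALUES (?, ?, ?)", changed)
    if changed:
        conn.execute(
            "UPDATE etl_watermark SET last_loaded = ? WHERE table_name = 'source_events'",
            (max(r[2] for r in changed),),
        )
    conn.commit()
    return len(changed)

print(incremental_load(conn))  # processes 2 changed rows instead of a full reload
```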
Query Performance
All the great data in the world doesn’t do a lot of good if it takes forever to extract value from it. An engineer’s role is to make sure that queries run efficiently and that everyone can run their queries and get results back in a timely manner. Whether through indexing, partitioning, collecting statistics, MPP, map/reduce, etc., the engineer is responsible for putting the right strategies in place to pull the data quickly.
One table I had to deal with recently had billions of rows of data and initially took forever to query. Through techniques like pseudo-columnar design and table indexing, plus good old-fashioned mentoring and sharing of query patterns, we were able to pull out insights efficiently.
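As a small illustration of one of those levers, the sketch below adds an index and collects statistics on a SQLite table (chosen purely for illustration; the table and column names are assumptions) and shows the query plan changing from a full scan to an index search.

```python
# A minimal indexing sketch: compare the query plan before and after adding an
# index on the common filter column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, i % 1000, float(i)) for i in range(100_000)],
)

query = "SELECT SUM(amount) FROM events WHERE customer_id = ?"

# Before: the planner has no choice but to scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

# Physical-design change: index the filter column and collect statistics.
conn.execute("CREATE INDEX idx_events_customer ON events (customer_id)")
conn.execute("ANALYZE")

# After: the plan now searches the index instead of scanning the table.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
```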
Next Up: Building out semantic layers, data marts, and analytical data sets
Don’t miss next week as we go deeper into this pivotal topic. As organizations are flooded with the incredible amounts of data available to the modern digital enterprise, they quickly find major limitations in processes that have worked for years. “Build a big dataset and query it” no longer holds up.
If your organization is missing some of the key skills or steps needed to make the most of your data, contact us and let’s begin talking about modernizing your data analytics.