Data Engineer: Unsung Hero of Data Analytics

The athlete and the coach. The Candidate and the campaign manager. The actor and the producer. In each there is a darling who is the face of the work, and the unsung hero behind the scenes doing the incredibly heavy lifting.

The unsung hero of data analytics is the data engineer that makes the work of the data scientist / statistician / analyst so much easier. This role usually sits behinds the scenes, usually geeky, with little visibility and little credit but supports the analyst making the task of extracting insights out of data so much more productive.

In a series of 3 blog posts, I discuss the important skills of a quality data engineer, and how this oft-overlooked role is the linchpin in any data analytics program worth its salt. Even if you don’t have this exact role, if you are missing any of the core competencies, your data analytics will keep bumping into limits. Find the whole series here:

Tag: Data Engineer: Unsung Hero of Data Analytics

In part 1, we discussed the important skills of the data engineer to enable data science. This week, we’ll drill into 3 powerful constructs of a data engineer builds to aid statisticians and data scientists: Building out semantic layers, data marts, and analytical data sets. Let me start with a few definitions:

Analytical Dataset (ADS)

Used by statisticians and data scientists, pre-staged datasets have variables (if you are a statistician) or features (if you are a data scientist) for a business entity and can greatly reduce the time to build regression models. Rather than have the statistician spend countless hours accumulating data, a pre-engineered ADS can derive common features from diverse data sources and centralize them into a file or table. Of course, new features still have to be derived but why repeat work when there are typically common features that are inputs to most models?

In the retail world, common features would be sales, AOV (average order value), loyalty membership, age, family composition, digital engagement, etc. There are typically dozens of standard features that contribute to most models. If these features can be staged prior to model development, the job of the data scientist becomes much less engineering intensive. Focus then shifts to non-standard features that may be more predictive to produce better models.

Data Marts

A Star Schema Example SqlPac | CC BY-SA 3.0

Data Warehousing has been highly influenced by the writings of Ralph Kimball. The concept of star schemas with facts and dimensions has become a standard architecture for analytics and reporting. This enables powerful easy to use drag-and-drop BI tools.

But they don’t just magically happen, they have to be modeled and engineered with high demands on meeting timely service levels. Personally, I like technology that can logically transform base data in third normal form into star schemas using views. That way data is not being moved and copied all over the place.

When I first was introduced to data warehousing, data marts were king. I was asked by the project leader what types of questions should be asked from the data. He would then build out the star schema so those question can be answered. Unfortunately, I told him that I couldn’t list all the questions that might be asked since one question leads to new questions.

Data marts are very powerful and have their role to play but are limited by their nature into new data exploration. However the vast majority of analysis can be accomplished with carefully engineered star schema designs.

Semantic Layers

One thing I found to be very successful was the concept of a semantic layer. The base layer of data in any business can be very complex. If simplification, governance, and access can be controlled through views over the base layer, views called semantic layers, a wider audience of analysts can learn to query the data themselves, especially those with technical skills. For example, a sales transaction can consistent of dozens of tables with foreign keys linking to tables with other foreign keys. If views can be created over those myriad of tables, pulling together a layer with a more complete view without sacrificing too much performance, the simplified semantic data schema can be much more easily understood and queried by a broader audience.

In addition, governance can be applied to the views through business logic. For example, in the retail world, a sale is not recognized until ‘ownership’ of a product is transferred from the retailer to the customer. Backorders, cancels, etc. are not sales. But a typical sales table will have price, quantity, and a status field. A semantic view of the data can contain price, quantity, and status, even though the official data has not logged a recognized sale. Additionally, there can be a new column for sales that will populate based on transactions that reach a certain status. Therefore the analyst does not have to embed those business rules into all their queries since the business rule is already embedded in the data of the semantic layer view of the base data. Simple querying on this sales column is sufficient.

Another advantage of semantic layers is access and security. Within banking and credit cards, there may be data elements such as cardholder name and address and delinquency status. The base layer may store all this data. By using views in a semantic layer, some analysts may have access rights to the name and address only without access to delinquency status. Others may have full access. With semantic layer views, there can be a basic view into the cardholder table that is available to all analysts but an advanced view with delinquency status for authorized individuals only. All this then can be accomplished without moving data all around and the issues that come from that effort.

A note about performance: while I was skeptical about the use of views and how they would perform, a good use of MPP (massively parallel processing) databases can handle most of the performance issues. However some views still have to be materialized. What I mean by that is that data is copied from the base tables into a simplified governed table using a predefined semantic layer view as in:

INSERT INTO myMaterializedTable
SELECT * FROM myMaterializedTableView

A materialized table is commonly a delta of changes that are applied to the table but it should be able to be re-materialized as needed from the base data, it’s simply a ‘view’ that’s been staged into a physical table, to gain the performance advantages over semantic views.

Don’t miss next week as we go deeper into Data Hygiene, and 4 core pillars of usable data.

As organizations today are flooded with the incredible amounts of data available to the modern digital organization, they quickly find the need to ensure it’s usable, and provides a full picture.

If your organization is missing some of the key skills or steps needed in making the most out of your data, contact us and lets begin to talk about modernizing your data analytics.

Featured image: WOCinTech | CC BY 2.0