June 26, 2024 • 4 min read

Key data engineering skills to elevate your data science projects

Written by Yandi Andriamasy

This article targets data scientists who have to work without a data engineering team; not every analytics team can afford to hire one.

As a data scientist, you focus on delivering a working machine learning model, so you might be tempted to write quick and dirty code for your data preparation pipeline.

However, this is a bad idea. In this article, I will help you by:

  1. Telling a story that illustrates why writing quick and dirty code is a bad idea.
  2. Presenting key data engineering best practices.
  3. Showing that applying these practices is easy and fast with a codebase example.

😬 A quick story: why disrespecting data engineering best practices is a bad idea

Before we begin, let me share what inspired this article. It's the story of Bob, a data scientist.

Initially, Bob was tasked with creating a machine learning proof of concept. In order to deliver fast, he wrote quick and dirty code for his dataset preparation pipeline.

Ultimately, as he had to include more tables, he faced the issue described by C. Mayer in "The Art of Clean Code": on a small project, quick and dirty code is faster to write and maintain, but as the codebase grows, maintenance time explodes. Thoughtfully written, clean code, by contrast, remains manageable.

Figure: for projects with few lines of code, quick and dirty coding is faster than writing thoughtful, clean code; as the lines of code increase, implementation time for quick and dirty code grows exponentially, while it remains steady for clean code.
Illustration from the book The Art of Clean Code by Christian Mayer

Although his code had become difficult to read and maintain, Bob's ML model succeeded in real-world conditions. As a result, the business team quickly requested that the project be industrialized.

But, this is where things got complicated.

In particular, the run (or MLOps) team struggled with Bob’s code, and it was too risky to let them take over the project completely. As a result, Bob had to mentor and support them for a whole year, on top of his new project. Bob wished he had collaborated with a data engineering team.

📚 Data engineering good practices

🔧 Modularize your code

Modularizing your code is a basic software engineering principle that also applies to data engineering. Always create separate blocks of code for each data source category.

For example, use functions in Python or common table expressions in SQL.

Here is an example of a monolithic processing script.
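The original snippet is not reproduced here, so below is a minimal sketch of what such a monolithic script might look like (the file paths and column names are hypothetical):

```python
import pandas as pd

# Everything happens in one long script: loading, cleaning and joining
# for both sources are interleaved in a single flow.
df1 = pd.read_csv("data_source1.csv")
df1["date"] = pd.to_datetime(df1["date"], errors="coerce")
df1 = df1.dropna(subset=["date"])

df2 = pd.read_csv("data_source2.csv")
df2["amount"] = df2["amount"].fillna(0)

df1 = df1[df1["status"] == "active"]
df2 = df2.groupby("customer_id", as_index=False)["amount"].sum()

result = df1.merge(df2, on="customer_id", how="left")
result.to_csv("prepared_dataset.csv", index=False)
```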

Imagine having 20 data sources with complex transformations. It can quickly become confusing and hard to maintain.

Side note: You cannot quickly see if the processing of df1 depends on df2 (or vice versa).

Instead, you should modularize your code by creating functions to handle the processing:
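Here is a hedged sketch of the same pipeline, modularized (again, the file paths and column names are only illustrative):

```python
import pandas as pd


def process_data_source1(path: str = "data_source1.csv") -> pd.DataFrame:
    """Load and clean the first data source."""
    df1 = pd.read_csv(path)
    df1["date"] = pd.to_datetime(df1["date"], errors="coerce")
    df1 = df1.dropna(subset=["date"])
    return df1[df1["status"] == "active"]


def process_data_source2(path: str = "data_source2.csv") -> pd.DataFrame:
    """Load and aggregate the second data source."""
    df2 = pd.read_csv(path)
    df2["amount"] = df2["amount"].fillna(0)
    return df2.groupby("customer_id", as_index=False)["amount"].sum()


def build_dataset() -> pd.DataFrame:
    """Combine the two independently processed sources."""
    df1 = process_data_source1()
    df2 = process_data_source2()
    return df1.merge(df2, on="customer_id", how="left")


if __name__ == "__main__":
    build_dataset().to_csv("prepared_dataset.csv", index=False)
```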

Now, you can see that the processing steps for data_source1 and data_source2 are completely independent.

This is a step forward towards respecting two guiding principles from the Zen of Python:

“Readability counts.”
“Beautiful is better than ugly.”

In data engineering, these principles are key.

You can even dig deeper and use the Ruff linter to boost your code quality.

🔎 Monitor data quality

Monitoring your data quality is crucial because:

  • One day, the source data will change unexpectedly. Your data preparation pipeline will fail. Debugging will take time because the bug is not in the code itself.
  • An even worse situation can occur. Your script might run and succeed but produce a dataset with incorrect data. For example, a column with room temperatures might mix Celsius and Fahrenheit without you noticing.

With this in mind, data quality checks that run automatically at each step of your pipeline will serve as a safety net against these issues.

To guide you in writing your data quality checks, you can lean on the six data quality check dimensions presented in this article: completeness, uniqueness, validity, consistency, accuracy, and integrity.
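As a minimal illustration (plain pandas, with hypothetical column names), a few of these checks could look like this:

```python
import pandas as pd


def check_runs_quality(runs: pd.DataFrame) -> None:
    """Fail fast with a clear error when a data quality rule is violated."""
    # Completeness: no missing run dates.
    if runs["date"].isna().any():
        raise ValueError("Completeness check failed: missing values in runs.date")
    # Uniqueness: one row per run.
    if runs["run_id"].duplicated().any():
        raise ValueError("Uniqueness check failed: duplicated run_id values")
    # Validity: temperatures must fall within a plausible Celsius range.
    out_of_range = ~runs["temperature"].between(-30, 50)
    if out_of_range.any():
        raise ValueError(
            f"Validity check failed: {out_of_range.sum()} temperatures outside [-30, 50] °C"
        )
```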

Furthermore, many technical solutions exist in data engineering for writing your data quality checks, such as Great Expectations or Soda.

💻 How to implement data engineering good practices easily using the medallion architecture

🥇 What is the medallion architecture

The medallion architecture is a data engineering design pattern that allows you to organize your data processing in different layers:

Figure: the medallion architecture. Bronze layer: integration of raw data. Silver layer: simple transformations (cleaning, filtering). Gold layer: enrichment of the data, making it ready for consumption.

This design pattern helps you follow the essential data engineering best practices presented above:

  • The processing structure is clarified by a logical framework: data quality improves incrementally as you progress through the layers, and within each layer, data shares the same level of cleanliness.
  • Data quality checks are easier to implement during transitions between layers.

An implementation example using Object-Oriented Programming

⚠️ There are many ways to implement the medallion architecture in data engineering. This is an example using Python Object-Oriented Programming.

Business context

You joined the R&D team of a running app company. They want to predict a runner's marathon time using their past runs. You will create the machine learning model for this.

Medallion architecture layers

For simplicity, all source data are stored locally as .csv files. In real life, the data would be on a cloud provider, and adapting the code would be straightforward.

The source file (a.k.a. our bronze layer) is a table of runs with columns such as the run date, temperature (in degrees), duration, and location.

In our silver layer, we will:

  • modify the runs.date column for consistent formatting.
  • convert the runs.temperature column to degree Celsius.
  • filter out the runs that lasted less than 5 minutes.
  • remove the runs.location column.
  • filter out runners who never ran a marathon.

Then, in the gold layer, we will aggregate the data into the following table:

  • runner_performances: a table with each runner’s last performance on 5K, 10K, Half-marathon, Marathon

To demonstrate this, let’s move on to the code.

First, let’s write an abstract class. In this class, we will define the boilerplate steps within a layer:

  • read the source data.
  • perform a data quality check on it.
  • process it.
  • perform a data quality check on the processed table.
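The full code lives in the repository mentioned below; here is a simplified sketch of what such an abstract class could look like (method names are illustrative, not necessarily those of the actual code):

```python
from abc import ABC, abstractmethod

import pandas as pd


class ProcessedTable(ABC):
    """Boilerplate shared by every layer: read, check, process, check."""

    def __init__(self, source_path: str) -> None:
        self.source_path = source_path

    def read_source(self) -> pd.DataFrame:
        """Read the source data (here, a local .csv file)."""
        return pd.read_csv(self.source_path)

    @abstractmethod
    def check_source(self, df: pd.DataFrame) -> None:
        """Data quality checks on the source data."""

    @abstractmethod
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        """Layer-specific transformations."""

    @abstractmethod
    def check_processed(self, df: pd.DataFrame) -> None:
        """Data quality checks on the processed table."""

    def run(self) -> pd.DataFrame:
        """Execute the four boilerplate steps in order."""
        df = self.read_source()
        self.check_source(df)
        processed = self.process(df)
        self.check_processed(processed)
        return processed
```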

Next, we will write the silver layer and gold layer classes, which will inherit from the ProcessedTable abstract class. To keep the article easy to read, I have not included their full code here; the details are available on this GitHub repository.
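To give a rough idea of what a concrete layer looks like, continuing the sketch above (again, this is a simplified illustration rather than the repository's actual code, and column names such as duration_minutes and temperature_unit are assumptions), a silver-layer class could implement the transformations listed earlier along these lines:

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to degrees Celsius (the source data mixes units)."""
    return (value - 32) * 5 / 9 if unit == "F" else value


class SilverRuns(ProcessedTable):
    """Silver layer: clean and filter the raw runs table."""

    def check_source(self, df: pd.DataFrame) -> None:
        # Completeness: every run needs a date and a temperature.
        if df[["date", "temperature"]].isna().any().any():
            raise ValueError("Missing values in runs.date or runs.temperature")

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # Consistent date formatting.
        df["date"] = pd.to_datetime(df["date"], errors="raise")
        # Convert temperatures to degrees Celsius.
        df["temperature"] = [
            to_celsius(v, u) for v, u in zip(df["temperature"], df["temperature_unit"])
        ]
        # Filter out runs shorter than 5 minutes and drop unused columns.
        df = df[df["duration_minutes"] >= 5]
        return df.drop(columns=["location", "temperature_unit"])

    def check_processed(self, df: pd.DataFrame) -> None:
        # Validity: all temperatures should now be plausible Celsius values.
        if not df["temperature"].between(-30, 50).all():
            raise ValueError("Temperatures outside the plausible Celsius range")
```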

Processing the data

Let’s see an example of data processing.

Sample of input data:

Input data sample.

When we run the data processing pipeline, we get this clear error:

Error message in data processing pipeline

The error message helps us understand that the temperature in the first row (12.13°B) is not in the expected format. So, we need to contact the team responsible for this data to understand the issue: is it an error in the data, or an edge case that the data pipeline needs to handle?

Now, let’s suppose that the other team fixed the data issue, and let’s rerun our pipeline:

This time, our data engineering pipeline succeeds.

As a result, here are the sample tables obtained :

Silver table: runs
Gold table: runner_performances

What we learned

  1. Structure your code: Writing clean, modular code is a crucial data engineering skill to master because it makes your data preparation pipeline easier to manage and scale.
  2. Monitor Data Quality: Implement data quality checks at each step to ensure data integrity and catch issues early.
  3. Medallion Architecture: This architecture helps organize data processing and maintain quality. We saw an example of this using Python.

There are many more data engineering good practices out there that could fit your needs. This article, however, presented only the most essential ones to help a data scientist with their data preparation pipeline.

If you are looking for professional guidance from data experts, don't hesitate to contact us!

This article was written by

Yandi Andriamasy