
What to Log? From Python ETL Pipelines!

A detailed log structure for ETL pipelines!
As part of my work, I have been converting some of my ETL jobs from a traditional tool-based framework into a Python framework, and I came across a few challenges along the way: orchestration, code management, logging, version control, and so on. A few of these require little effort when you develop on a tool, but in hand-written Python ETLs they are quite a challenge. And that brings us to one such challenge, our topic today: logging!
Also, instead of dwelling on the 'how' (there are already many good articles on that), our main focus today is the 'what'.
Glossary:
- Introduction
- ETL Skeleton
- Structure of Log
- Sample ETL job with our log structure
- A few more improvements we can make to our log
- Sample Code location
Introduction: When we design and develop ETL pipelines on tools, we focus on components like sources, target objects, transformation logic, and a few supporting tasks to handle the pipelines. These tools also generate logs for all actively running jobs, which can be monitored through their internal interfaces. In short, developing and maintaining logs is not a challenging activity in a tool, whereas in Python it is a separate activity that needs to be handled. In this post we will discuss the basic skeleton of ETL jobs, sketch a rough idea of the details we can record from any pipeline, then structure those details into our ETL code, and finally develop a sample scenario with the logs recorded.
ETL Skeleton: As we already know, there are different kinds of ETL jobs: merge/upsert processes, staging loads, SCD Type 2 loads, delta jobs, direct insert loads, and so on. All of these jobs share a very basic structure (shown below in Python): a main function that calls the modules in the pipeline, namely an extract module, a transformation module, and a load module. We can use this skeleton to identify what our ETL log could look like.
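The original snippet was lost in this copy, so here is a minimal sketch of that skeleton; the function names and placeholder bodies are assumptions, not the post's exact code.

```python
# Minimal ETL skeleton (illustrative sketch; names are assumptions).

def extract():
    """Read rows from the source system."""
    return [{"id": 1}, {"id": 2}]  # placeholder source read

def transform(rows):
    """Apply the transformation logic."""
    return [{**r, "processed": True} for r in rows]

def load(rows):
    """Write the transformed rows to the target."""
    print(f"loaded {len(rows)} rows")  # placeholder target write

def main():
    load(transform(extract()))

if __name__ == "__main__":
    main()
```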
Structure of Log: Now that we have outlined the blueprint of the ETL job, let's list a rough idea of the details we can track from a job.
A list of details we can log, for example:
- job/pipeline name and run environment
- start and end timestamps for the job and for each phase
- row counts at each phase (extracted, transformed, loaded)
- the status of each step (started, completed, failed)
- error messages and stack traces on failure
Now let's integrate the above details into our ETL skeleton and see what our log structure could look like!
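Here is a sketch of the skeleton with those details wired in using Python's standard logging module; the message wording and the log file name etl_job.log are my assumptions, not the post's exact code.

```python
import logging

# Send log records to a file with timestamps (file name is an assumption).
logging.basicConfig(
    filename="etl_job.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("sample_etl")

def extract():
    logger.info("extract started")
    rows = [{"id": 1}, {"id": 2}]  # placeholder source read
    logger.info("extract completed, source row count=%d", len(rows))
    return rows

def transform(rows):
    logger.info("transform started")
    out = [{**r, "processed": True} for r in rows]
    logger.info("transform completed, row count=%d", len(out))
    return out

def load(rows):
    logger.info("load started")
    logger.info("load completed, target row count=%d", len(rows))

def main():
    logger.info("job started")
    load(transform(extract()))
    logger.info("job completed")

if __name__ == "__main__":
    main()
```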
If you look at the sample job below, I have created a config file to maintain all the job-level details. By importing that file into your code, you can make use of its values across different parts of the code, including for logging.
etlConfig.ini:
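The file's contents were lost in this copy; a plausible layout, with section and key names that are my assumptions, might be:

```ini
; Hypothetical job-level configuration (section/key names are assumptions)
[JOB]
job_name = sample_etl
environment = dev

[LOG]
log_file = etl_job.log
log_level = INFO

[SOURCE]
source_name = orders_csv
source_path = /data/in/orders.csv

[TARGET]
target_name = orders_table
target_schema = staging
```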
Sample ETL job with our log structure:
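The full sample lives in the GitHub repository linked below; as a hedged stand-in, the sketch here reads etlConfig.ini with configparser and logs each phase of the job, including status, row counts, elapsed time, and failures.

```python
import configparser
import logging
import time

# Pull job-level details from the config file shown above.
config = configparser.ConfigParser()
config.read("etlConfig.ini")

logging.basicConfig(
    filename=config["LOG"]["log_file"],
    level=config["LOG"]["log_level"],
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(config["JOB"]["job_name"])

def extract():
    logger.info("extract started: %s (%s)",
                config["SOURCE"]["source_name"],
                config["SOURCE"]["source_path"])
    rows = [{"id": 1}, {"id": 2}]  # placeholder source read
    logger.info("extract completed, source row count=%d", len(rows))
    return rows

def transform(rows):
    logger.info("transform started")
    out = [{**r, "processed": True} for r in rows]
    logger.info("transform completed, row count=%d", len(out))
    return out

def load(rows):
    logger.info("load started: %s.%s",
                config["TARGET"]["target_schema"],
                config["TARGET"]["target_name"])
    logger.info("load completed, target row count=%d", len(rows))

def main():
    start = time.time()
    logger.info("job %s started (env=%s)",
                config["JOB"]["job_name"],
                config["JOB"]["environment"])
    try:
        load(transform(extract()))
        logger.info("job succeeded, elapsed=%.2fs", time.time() - start)
    except Exception:
        logger.exception("job failed")  # records the stack trace
        raise

if __name__ == "__main__":
    main()
```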
A few more improvements we can make to our log:
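The post's own list of improvements isn't preserved here; one natural candidate (my suggestion, not necessarily the author's) is rotating the log file so repeated runs don't grow one unbounded file:

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("sample_etl")
logger.setLevel(logging.INFO)

# Cap each log file at ~1 MB and keep the five most recent files.
handler = RotatingFileHandler("etl_job.log", maxBytes=1_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```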
Sample Code location:
GitHub – Shivakoreddi/ETL-Log-Structure
Thank you for reading this post. Future work will cover other challenges in developing Python ETL jobs, such as orchestration, code management, and version control.
