Here you will find the less flashy but more technical projects I have been involved in. I'll provide as much detail as possible, but due to NDAs and HIPAA regulations, some specifics are limited. (Screenshots and illustrations coming soon)
Metadata-Driven ETL Engine
SQL, C#, Dimensional Data Modeling, ETL Architecture
This project has followed me throughout most of my professional career; its versatility has made it a very handy tool in my belt.
When faced with the challenges of:
1.) Being in a small department and the only ETL architect, and
2.) Having a metric ton of data to consume on a daily basis from hundreds of tables
I needed to come up with a solution that could take most of the manual work out of the <Source System> to <Data Lake> to <Data Warehouse> steps. Since this would involve programmatically generating a lot of ETL code, I needed to know as much about the source systems' table designs as possible, and what better way to gather that than to pull the metadata directly? Further, what better way to store that metadata than in a series of SQL tables that naturally extend the INFORMATION_SCHEMA? This provided a straightforward way of programmatically extracting source tables and loading the necessary data into the data lake. From there, a series of scripts parsed the SQL code that transformed the data into the target data warehouse, which meant the inter-dependencies between these tables could be inferred and handled automatically during the target system load.
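The metadata-to-code step above can be sketched in miniature. Everything here (class, schema, and column names) is illustrative, not the engine's actual code; the idea is simply that once column metadata has been captured from INFORMATION_SCHEMA, the extract SQL writes itself:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: given column metadata captured from
// INFORMATION_SCHEMA.COLUMNS, generate the SELECT statement used to
// extract a source table into the data lake.
public class ExtractGenerator {

    // One row of captured metadata: schema, table, column, ordinal position.
    public record ColumnMeta(String schema, String table, String column, int ordinal) {}

    public static String buildExtractSql(List<ColumnMeta> columns) {
        if (columns.isEmpty()) throw new IllegalArgumentException("no column metadata");
        // Preserve the source table's column order via ORDINAL_POSITION.
        String columnList = columns.stream()
                .sorted(Comparator.comparingInt(ColumnMeta::ordinal))
                .map(ColumnMeta::column)
                .collect(Collectors.joining(", "));
        ColumnMeta first = columns.get(0);
        return "SELECT " + columnList + " FROM " + first.schema() + "." + first.table();
    }
}
```

In the real engine this generation step would of course cover typing, incremental predicates, and target DDL as well; the point is that every extract is derived from stored metadata rather than written by hand.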
The final result was a very straightforward, hierarchical approach to loading the data warehouse — from the source system all the way down to the individual column — that took into consideration source system availability, data reconciliation, CDC requirements, and object dependencies before loading the required target model. The beauty of the programmatic approach was that any single attribute could be tweaked by simply updating a record in the “ETL Control” database and letting the objects rebuild themselves. For example, if a certain dimension table was set to be Type I and a change in requirements demanded Type II, flipping a flag in the control database would rebuild all necessary tables and regenerate the code that populates the table accordingly. Best of all, since all changes in the data were captured at the data lake level, a shift from Type I to Type II could be applied retroactively: all of the historic changes to each dimension were immediately available as of the start of the data lake.
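The dependency handling described above amounts to a topological sort of the table graph. A minimal sketch, assuming the inter-dependencies have already been inferred from the parsed transformation SQL (all table names hypothetical):

```java
import java.util.*;

// Sketch: given each warehouse table and the set of tables it reads from,
// produce a load order in which every table loads only after its sources
// (Kahn's algorithm for topological sorting).
public class LoadOrderer {

    public static List<String> loadOrder(Map<String, Set<String>> deps) {
        Map<String, Integer> remaining = new HashMap<>();   // unloaded-source count
        Map<String, List<String>> dependents = new HashMap<>();
        for (var e : deps.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String src : e.getValue()) {
                dependents.computeIfAbsent(src, k -> new ArrayList<>()).add(e.getKey());
                remaining.putIfAbsent(src, 0);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((t, n) -> { if (n == 0) ready.add(t); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String t = ready.poll();
            order.add(t);
            for (String d : dependents.getOrDefault(t, List.of()))
                if (remaining.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (order.size() != remaining.size())
            throw new IllegalStateException("circular dependency between tables");
        return order;
    }
}
```

The same ordering is what lets a fixed failure resume mid-stream: only the failed table and its downstream dependents are still waiting.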
C#, SQL, Log Parsing
All of the above is great, of course, but without some means of pulling out the most critical information when it is needed most, and of letting the developer actively monitor the health of all processes, it would be easy to get lost in the sheer amount of output being produced.
So I created a (very simple) C# form that kept tabs on every table and each of the steps it went through during a regular load. Failures would jump right out at you, and a simple double-click would show the exact error message captured from the process. Since everything was driven by inter-dependencies, as soon as an error was resolved, everything would continue on as normal, without the need to rerun or clean up tables from a bad sync-up.
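The monitor's core logic can be approximated like this (a sketch in Java rather than the original C# form, with illustrative names): each step writes a status row, and the dashboard surfaces only the failures together with their captured error messages.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the monitor's failure view: filter the step log
// down to failed steps, keyed by "table.step" for drill-down.
public class LoadMonitor {

    public enum Status { PENDING, RUNNING, SUCCEEDED, FAILED }

    // One logged step: which table, which step, its status, and any error.
    public record StepLog(String table, String step, Status status, String error) {}

    public static Map<String, String> failures(List<StepLog> log) {
        return log.stream()
                .filter(s -> s.status() == Status.FAILED)
                .collect(Collectors.toMap(s -> s.table() + "." + s.step(), StepLog::error));
    }
}
```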
mETL – The Final Form
SQL, Java, .NET, cron, Azure, AWS CLI, etc.
Metadata-driven ETL…mETL…or “metal,” as I call it. It’s catchy, right? Through many incarnations of the above approach, I started to home in on the most universally sound means of implementation. What began as a strictly “SQL Server + SQL Agent jobs” combination, controlled by some .NET applications and coordinated through the Windows Task Scheduler, eventually became a DB-agnostic solution: connections for the ETL Control database are set in the implementation’s config files, and the rest is handled by a series of Java applications that run as daemons in a perpetual loop, constantly on the lookout for which tasks need to be run and reporting back to the control database as needed. The leap from a Windows implementation to a Linux/Unix environment was absolutely painless.
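The daemon pattern described above, reduced to its skeleton. The two interfaces here are assumptions standing in for the real control-database access and task execution, not the engine's actual API:

```java
// Hypothetical sketch: a perpetual loop that asks the control database for
// runnable tasks, executes them, and reports results back.
public class EtlDaemon {

    public interface ControlDb {
        java.util.List<String> claimRunnableTasks();          // tasks whose dependencies are satisfied
        void reportResult(String task, boolean success, String message);
    }

    public interface TaskRunner {
        void run(String task) throws Exception;
    }

    private volatile boolean running = true;

    public void stop() { running = false; }

    public void loop(ControlDb db, TaskRunner runner, long pollMillis) {
        while (running) {
            for (String task : db.claimRunnableTasks()) {
                try {
                    runner.run(task);
                    db.reportResult(task, true, "ok");
                } catch (Exception e) {
                    // Failures are recorded, not fatal: dependents simply wait.
                    db.reportResult(task, false, e.getMessage());
                }
            }
            try {
                Thread.sleep(pollMillis);   // be gentle on the control database
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Because all state lives in the control database, any number of these daemons can run on any host, which is what made the Windows-to-Linux move painless.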
This final step made startup much easier in whatever environment I came across. Whether it was a Windows or Linux shop, whether the target system was Greenplum, Netezza, or any of the cloud providers, and whether the preferred control database was Postgres, MySQL, MSSQL, Azure SQL, or Amazon Redshift, the changes required to the base code were minimal and the engine could be up and running that same day.
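As an illustration of the "config files" portion, here is a hypothetical properties file; every key name is invented for this sketch rather than taken from the engine, but it shows the shape of the idea: swapping the JDBC driver, URL, and dialect is the only change needed to point the engine at a different platform.

```properties
# Control-database connection (illustrative key names)
etl.control.jdbc.driver=org.postgresql.Driver
etl.control.jdbc.url=jdbc:postgresql://control-host:5432/etl_control
etl.control.jdbc.user=etl_engine

# How often the daemons poll for runnable tasks
etl.control.poll.seconds=30

# Target platform dialect, e.g. greenplum, netezza, mssql, redshift
etl.target.dialect=redshift
```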