You know the scene: sitting there in front of my computer at 11pm in a meeting, trying to square out the puzzle of data issues that we discovered after trying out our ETL from our client’s old system into our brand new one.

My kids showed up 1 hr before, to say good night and wish me luck with the work in front of us.

I then recalled my “old days” at Intel when I was working in the master data team, and how there was so many strucured processes for handling data. The structure was back then slowering, but it was also reassuring.

This other scene may be familiar to you as well: you get a lot of files, ready to import, but that still need some sanitization here and there, which end up multiplying complexity of the whole process.

This article describes some key learnings I got by reflecting on how our ETL imports can be transformed into a robust process that anticipates and prevents data integrity issues when taking massive pieces of data into systems.

There are three perspectives to see these key learnings from: (1) Processes, (2) Tools and (3) Roles

Processes

Here we discuss those set of steps that add value to the whole job to be done:

  1. Data Extraction
  2. Sanitization
  3. Standard Import File Generation and cross-check
  4. Taking the B4 Picture and set expectations
  5. Pre-import Checks and fixes
  6. Import Execution
  7. Post-Import Checks and fixes
  8. Taking the After Picture and planing around it.

Each of these processes have inputs and outputs and a set of tools that aid in their execution.

Tools

As part of the key learnings we identify in the process there are these tools that are instrumental in the execution of each process; especially in providing visibility on each process to anticipate issues and prevent them.

  1. Golden File: This is the data as the customer or data provider signs off for. This golden file is the representation of what the customer wants us to import. It is set as readonly and used later as reference for cross-check verification. (XLSX)
  2. Sanitization/Working File: This begins by being a copy of the golden file, that we used to modify as needed to meet integriry specifications. (XSLX)
  3. Standard Import File: Once the data is properly sanitized, one or more standard import files (CSV) are generated to be imported. These are checked first for cross-check integrity and pre-import integrity.
  4. Cross-check Tools: Once target a standard import file is generated it is compared to the golden source to verify nothing has been set off in tems of key pieces of data and relationships during the sanitization process.
  5. Integrity Check Scripts: These are automated scripts that take a picture of the target system in terms of data and process integrity, so that we can analyze the import process in terms of before and after execution of the imports.
  6. Pre-Import Scripts: These are automated scripts that are executed against candidate target standard import files that validate the content to be imported doesnt break the target system’s integrity.
  7. Import Execution Scripts: The data import process is automated to the max extent possible taking CSV files into the database.
  8. Post-Import Scripts: These scripts check for the result of import operations.

Roles

 

Finally we can look at these key learnings from the roles perspective. These are roles that encompass ownership of different areas of the whole process.

  1. Data Provider: This is the person or team in charge of extracting, preparing and supplying the data to be imported.
  2. Customer Relationship: This is the single point of contact owning communications with customer/data suppliers.
  3. Process and Data Steward: This role is played by one or more individuals in charge of providing orientation to the rest of the team in keeping process and data integrity.
  4. Engineering and Execution Team: This is the group of individuals, including people in the other roles, who take care of actually importing the data to the target syste,/

 

0Shares
Total Page Visits: 1540 - Today Page Visits: 3
Last modified: March 4, 2022