In the process of migrating data from one system to another, there are two primary areas that need to be audited. The first is the data itself. Nobody wants to populate a system with suspect data. The second is the transformation logic. Nobody wants to take data that might be useful and corrupt it by mistranslating its content or values. The overall process is one I refer to as ETVL (Extract, Transform, Validate, Load).
In practice, I should probably refer to it as SESATSYAVL: Scan, Extract, Scan Again, Transform, Scan Yet Again, Validate, Load. The Scans are rudimentary checks to see if the data "looks OK." The Scans and the Validation both use a reference database, augmented with each successful migration, that captures the characteristics of the overall target dataset, such as the average value of a sales order from a "Gold" customer. Values that fall outside historic limits are a signal to look at both the data source and the transformation logic before continuing.
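A minimal sketch of what such a Scan step might look like, assuming the reference database boils down to a history of batch statistics from prior migrations. The names (`scan_check`, `gold_orders`, `reference_stats`) and the three-sigma tolerance are illustrative, not the author's actual implementation:

```python
# Illustrative "Scan" check: compare a batch statistic against limits
# derived from historic values recorded in a reference database.
from statistics import mean, stdev

def scan_check(values, reference, k=3.0):
    """Flag the batch if its average falls outside k standard
    deviations of the historic averages in `reference`."""
    hist_mean = mean(reference)
    hist_sd = stdev(reference)
    batch_avg = mean(values)
    lo, hi = hist_mean - k * hist_sd, hist_mean + k * hist_sd
    return lo <= batch_avg <= hi, batch_avg, (lo, hi)

# Historic average Gold-customer order values from past migrations:
reference_stats = [510.0, 495.0, 502.0, 508.0, 499.0]
# Current extract is off by an order of magnitude -- a classic sign of
# a transformation bug (e.g., cents loaded as dollars):
gold_orders = [5100.0, 4950.0, 5020.0]
ok, avg, limits = scan_check(gold_orders, reference_stats)
if not ok:
    print(f"Scan failed: batch average {avg:.2f} outside {limits}")
```

The same idea works as a SQL query against the reference tables; the point is only that the check is cheap, runs before and after each stage, and halts the pipeline rather than silently loading.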
The process is time and resource intensive, and you can wager without risk that it will be under-funded in terms of both dollars and manpower. However, the cost of addressing a coding defect in a deployed system is going to be far less than correcting a data error in a migration process.
Thanks, John. This is really helpful. I like how you break the process down, especially the extra checks before and after each step. You’re right: catching issues early is much cheaper than fixing bad data after a migration. Appreciate you sharing your approach! I’m curious about two things:
- What size of data are you usually working with?
- What tool(s) do you use to run the ETVL / SESATSYAVL process you described? Thank you again for sharing your experience with us.
Data volume - Many projects are fairly small, but the big ones include state lottery systems, international bank mergers, and telco “outside plant” provisioning.
Tools used are often basic ones - SQL, Microsoft Access, Excel, Informatica. Other tools are custom-developed. For example, migration from one lottery vendor to another happens overnight. It’s a one-shot affair with substantial penalties for a screw-up and heavy-duty security. The reference data used to highlight unusual spikes in data values was relational/SQL.
An unexpected issue was the degree of inconsistency in some legacy databases. In the telco outside-plant work, a core data structure was the Line Information Database (LIDB). There should be one record for each service line managed by the company. Within the enterprise, there were five different "master copies" of the LIDB, each with a slightly different format depending on whether it was supporting billing, engineering, outside plant service, inside plant service (central offices), or dedicated networks. The LIDB could be used to tell us the total number of configured circuits. One would expect the number of circuits in each of the LIDBs to vary slightly, since circuits were constantly being activated or deactivated, but that "slight variance" turned out to be over 26,000 records.
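A reconciliation like the one that surfaced this discrepancy can be sketched in a few lines. This is a hypothetical illustration, not the project's actual tooling; the copy names, counts, and the churn tolerance are all made up for the example:

```python
# Illustrative cross-copy reconciliation: compare record counts across
# the "master copies" of a dataset and flag any spread that exceeds
# the variance expected from normal activation/deactivation churn.
def reconcile_counts(counts, tolerance):
    """Return (ok, spread) where spread is the max-min record count."""
    spread = max(counts.values()) - min(counts.values())
    return spread <= tolerance, spread

lidb_counts = {
    "billing": 1_250_400,
    "engineering": 1_248_900,
    "outside_plant": 1_251_100,
    "inside_plant": 1_249_700,
    "dedicated_networks": 1_224_800,  # the outlier copy
}
ok, spread = reconcile_counts(lidb_counts, tolerance=2_000)
if not ok:
    print(f"LIDB copies disagree by {spread:,} records")
```

Counting records is the crudest possible check, which is exactly why it belongs in the Scan stage: it costs almost nothing and caught a 26,000-record disagreement before any transformation logic ran.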
We should have expected this. Our client had grown significantly in recent years by acquiring smaller phone companies (there are over 2,200 of these), and the data acquisition process was somewhat ad hoc. Still, it was an interesting way to start the project.
Data Migration is a really rewarding niche in the world of IT if you want your work to be of practical value, but you need to expect that it will remain under-funded by about 80 percent in both time and money.
Ingest > validate & transform > validate
This article was really insightful, thank you for sharing!
My pleasure! Thank you for your comment and feedback; glad it was insightful 💐🙏
Amazing content as usual!
Thank you so much for your feedback and comment 💐😊🙏
Thanks for sharing
My pleasure thank you for your comment
Thanks for sharing.
I like the terms WAP, TAP.
Thank you for your comment 😊🙏
Thank you! This is a very good and helpful article!
Thank you for your comment. My pleasure glad you like it 💐🙏