« Ignoring Data Quality on an SAP Migration - What are the pitfalls of ignoring best-practice? | Main | Evaluating data migration service providers »
Tuesday
Jul292008

What type of DQ platform to select for data migration projects

Hi, I am a data analyst for a large utilities company in the US and we're about to start a major data consolidation project.

Our company has opted to deliver the project internally and the architecture team have chosen a scripted approach to data migration.

My question is really surrounding data quality. Myself and 3 colleagues are responsible for carrying out data analysis of the old systems to identify things like mappings to the target app and DQ cleansing issues etc.

We have worked on the systems individually for several years but now we need to merge and link them together I'm interested in getting your opinion on what type of data quality technology would suit this type of activity.

The data is mostly plant infrastructure, service histories, work order type information so no name and address information.

Reader Comments (2)

The Concerns - Keys
Leaving the generic solution aside for others to comment on one of your main issues is going to be primary key matching between systems. Keys are often fairly well controlled within a single system, either by the database engine itself enforcing referential links or by "widows and orphans" already being a serious focus of existing data quality measures in business as usual if not. However, attempt to link multiple systems holding similar data and suddenly the referential model and levels of care break down. Unless there is an existing interface that has already made such a connection an issue bringing the systems together will be virgin DQ territory. So far, so obvious; I am sure the questioner already knew that.

The generic technique in a DQ tool is to stage the data in the tool and then perform a comparison. So that you would be able to bring the information from the tables that should be equivalent into the system and examine the linkage there. This sounds good. We can load our tables, select the candidate keys that are identical and run a comparison, sometimes by setting up a small data flow to do so and sometimes by just selecting an out of the box join test tool (depending on tool). Unfortunately there are at least three flies in the ointment:


The size of the data set
Compound keys
No Common key at all

Size of data set – When the data set is large we may have trouble actually loading it for profiling. A two million row set of customer addresses is not going to fit easily into most tools. What you do next depends on the objective. If you are using DQ as a forensic tool only you will be able to define a subset such as keys that begin with “P” on both sides, see what happens when testing a join between them and then extrapolate through the wider set. E.g. If there are 20 unmatched records in that subset there will then be 20 x 26 (very crudely for the purposes of example only) in the wider set. If however you are looking to solve the problem in the DQ tool itself many will start to struggle as their purpose is often diagnostic rather than curative and the data will overwhelm them.
Compound keys – Beware of “out of the box” join test tools. These often carry the caveat that they only work on a single column. The solution is then to concatenate the columns, sometimes outside of the tool within a database preparation area and compare the concatenated data rather than the original.


No common key - If there is no common key at all and you must for instance compare customer records and make a decision on whether they are equal without a giant clue like CUST190989 being the key on both systems, then you will need a fairly heavyweight tool with comparison and matching capabilities. Such tools are able to conjecture that “H. Smith of 2 Regents Park Road”, “Harry Smith of 2 Regents Park Road” and “Harold Smith of 2 Regents Park Road” are the same person (or not as there could easily be two H Smith’s in the same house …father and son …Henry and Harry!)

Selecting Appropriate Solutions

The point, now that we finally reach it, is that the size of the data set, whether you are in “forensic” or “fix” mode and the type of problem may have a considerable influence on your choice of tool.

Presented with a very large data set with compound keys and a desire to fix I would probably just use SQL outer joins and trust to the DB engine to cope with capacity.


Presented with no common key and a reasonable data set I would see using a tool as a necessity
on investigation as hand matching strings and playing with sound based matching (Smith – Smythe) is not ideal in a database engine, though
it must be added that both SQL Server and Oracle have Soundex() functions these days.


Presented with no common key and vast data set I would possibly nibble my way through, using the “names that begin with P” technique, though of course that is fraught with danger in some circumstances… “Cristopoles” or “Kristopoles”?
Perhaps I would use another field such as a date to provide or reinforce the groupings. The technique in both cases is to build the output table gradually and defensively instead of running at it
in a single pass. Matches of which we are unsure can sometimes be left for hand finishing. Then again, once we have the problem identified from the DQ tool we may be better using something with heavier processing capabilities
on a massive data set situation, such as ETL or a database when we move into solutions mode.

Having undermined the reputation of DQ tools on joins I should perhaps repair it to some extent by explaining that the visually prepared data flows available in some tools are a) self documenting and b) an excellent communications tool in walking a business executive through the steps you have taken in the problem so that they also may better understand it. SQL is all very well, but it often takes an extra swatch of PowerPoint slides to explain what you are doing
for your sponsors.

Summary

DQ tools are good at identifying and characterising problems. This is their primary role.
DQ tools struggle on some solution scenarios. However, solution is often their secondary role.
Some DQ tools are visual and explanatory of the process that you are following. This can be a powerful tool for discussion.
Some large problems can be solved in a DT tool in chunks, but not all.
There is no disgrace in using SQL or ETL on the bulk once a DQ tool has identified a problem

That's only one aspect of key matching discussed. I leave it to my peers
to comment on profiling generally and which tools might be best in which
scenarios. I hope this helps.
Regards

John Platten

Principal Consultant

Vivamex Limited
www.sapdatamigration.co.uk

Sun, October 19 | Unregistered CommenterJohn Platten

Before talking about tools let me brifly address the data consolidation challenge.

Data consolidation projects are usually extremely difficult, because the data among consolidated systems overlap. While most tools focus on simple duplicates and subject matching across systems, the main challenge is posed by numerous conflicts between the specific data attributes across systems. This is especially true for historical data and data for state-dependent entitites. Deciding which data elements to trust is never trivial and inevitably backfires at the data quality.

The traditional approach is to setup a winner-loser matrix indicating which source data element is picked up first. For instance, date of birth will be taken from System A if present, from System B otherwise, and from System C if it is missing from both A and B. This approach rarely works because it assumes that data on System A is always correct – a laughable assumption.

To mitigate the problem, the winner-loser matrix is usually transformed into a complex conditional winner-loser hierarchy. Now we take the date of birth from System A for all males born after 1956 in California, except if that date of birth is January 1, 1970, in which case we take it from System B, unless the record on System B is marked as edited by John Doe who was fired for playing games on the computer while doing data entry, in which case we pull it from Spreadsheet C…

At some point the winner-loser hierarchy is so complex that nobody really understands it. It becomes impossible to manage even for simple indicative data elements, and reaches mind-boggling complexity for historical data stacks, especially event and state-transition histories. An even more serious issue in this model is that we absolutely cannot cleanse the data before conversion, because it is impossible to determine which data elements will really be used.

The correct approach to data consolidation is to view it in a similar light as data cleansing! We start with a comprehensive set of tests, comparing the data across all sources. We now have a full list of discrepancies. These data inconsistencies are conceptually very similar to the errors found by the data quality rules in the process of data quality assessment. While some of these discrepancies may be legitimate, without proper care they will most likely turn into true data errors after consolidation.

The next step is to analyze the discrepancies and look for patterns. Suppose we conclude that any time values of a certain attribute in Systems A and B coincide, they can be trusted (regardless of the values in database C). We can mark those values as trusted and eliminate the discrepancies from the list. We can also make corrections to the mismatching values on C. For each group of discrepancies, we make an individual decision using a variety of conditions. Every time a decision is made, good data must be marked as trusted and bad data can be corrected. With this technique, we decompose the list of all discrepancies into a set of simple groups and derive a straightforward solution for each group. With every step, we move closer to the ultimate data quality objective.

When the number of data sources is large (more than three), the effort to compare data among all sources increases exponentially. We end up with a rather complex algorithm to pick up data marked as “trusted” from all sources. A simpler (though conceptually identical) technique is to start by selecting for each target data element a primary data source (PDS), which will be used in conversion. Then we design a comprehensive set of tests, comparing the selected primary data sources against other data, and use these additional data sources for “data cleansing.”

In a general case, different databases will serve as PDSs for different data elements. Even for a given data element, different PDSs can be elected for different subject populations and time periods. Once data in each PDS is validated and cleansed, the data consolidation is performed by data conversion algorithms from various PDSs.

This is it in the nutshell. Of course I oversimplified the approach. In reality, consolidating an event history (for example consolidation employment and payroll histories for company employees) takes a lot of effort, experience, and creativity. I have done half dozen large scale consolidation projects in my time and they always required greatest effort. So, do not underestimate the beast!

Now, to the tools questions. There is some mileage you will get out of the existing tools. Of course you should use a tool for matching and de-duplication of subjects across systems. You should also use a data profiling tool. There are severla on the market and they mostly have similar functionality. Your organization might already have one. If not, select the one your team likes to use the most. It will not make dramatic difference which one you go with.

But beyond matching, de-duplication, and column profiling, there is not much on the tool market. There are no good tools for comprehensive data qualtiy assessment and rule-driven data consolidation/cleansing. I always used my own tools and custom coding. Of course, custom coding is challenging as you need to build good mechanisms to audit trail the data changes and to keep track of the data lineage.

If you have any further questions, please, do not hesitate to e-mail me directly.

Arkady Maydanchik
Data Quality Group LLC
www.dataqualitygroup.com
arkadym@dataqualitygroup.com

Sun, October 19 | Unregistered CommenterArkady Maydanchik
Member Account Required
You must have a member account on this website in order to post comments. Log in to your account to enable posting.