« Data migration project checklist - a template for more effective data migration planning | Main | What is landscape analysis and why is it important to your data migration? (Part 1 - Overview) »
Thursday
Nov272008

How to create a data quality rules management repository (Part 1 - What are data quality rules?)

In this two part series we will look at the importance of creating a data quality rules process in your data migration project and how to create an accessible online application for storing, reporting and coordinating these rules.

If your project does not have a structured approach to discovering, measuring and resolving data quality rules then the likelihood is you will become one of the 4 out of 5 migrations that fail or suffer severe delays.

TIP: For more Data Quality Tutorials check out Data Quality Pro, our sister site with the largest collection of hands-on tutorials, data quality guides and support for Data Quality on the internet. 

What are data quality rules?

 Every system needs checks and balances to run smoothly.

Information systems are no different. There are literally thousands of data quality rules in existence throughout even the most modest of businesses.

These rules dictate how the information should be stored and handled in order to maintain the current business operations.

Sadly, these rules are very rarely recorded.

If you walked into any organisation and asked for a definitive set of rules that are enforced on their data you would mostly find some scant design documentation and if you're lucky one or two domain experts who can still decipher the complex application rules that govern the data.

Data quality rules provide two main components that help us ensure our legacy data is fit enough to survive the arduous journey ahead.

  • Data quality measurement
  • Improvement activities

Data Quality Measurement

A data quality measure simply gives us a metric for gauging the health of our data.

For example, if we were migrating hospital records we could create a data quality rule that states there should be no duplicate patient information.

An example metric could be 127 patients were found to be duplicated out of a list of 10,000.

Data Quality Improvement Activities

The second part of our data quality rule is designed to help us manage any defects in the data.

In our previous example the migration may well fail if we leave those customers in as duplicates so our improvement activities would define the steps needed to improve the data.

Tip: Check out Data Quality Rules Process for Data Migration, featuring John Morris 

Why are data quality rules so beneficial?

 Like all great ideas, data quality rules are very simple.

But unfortunately they are very rarely adopted in a structured manner and this causes one of the main causes of project failure in most migrations.

A lot of projects will perform data profiling or testing (an important element of data quality rules management) but not in a unified, coordinated manner.

Projects are often reactive in nature, only when issues are found during the ETL process or in the final load testing for example will they document issues and attempt to resolve them.

When implemented correctly, the process of discovering and managing data quality rules should take centre stage on your migration.

This pivotal role of data quality rules makes them so important and useful to the project.

They are the "glue" that binds the business sponsors and stakeholders to the project and ensures they take ownership of the issues the data presents.

Data can rarely be "cleansed" or improved without significant business input so the business must take an active role in the data quality rules process.

 

How do we create data quality rules?

 

The key to creating useful and accurate data quality rules is to get the right people into a workshop.

Taking a "techie" approach by hammering away with profiling tools without consulting the right business domain experts will just give you a whole lot of wonderful charts and statistics that are meaningless.

Together, data quality analysis and business experts make a wonderful partnership so plan your data quality rules workshops in advance, bring the right level of intelligence and analysis into the sessions and define the rules that are important to your project.

Start with legacy rules

There is a temptation to launch into defining the rules that will map our data to the target platform.

This is a mistake, we need to first define the rules that govern our legacy environment to ensure local consistency.

We examine the local quality of our data based on some basic conceptual and logical data models that define what data items we need for the migration.

For example, if we define a basic customer to service relationship rule that is in evidence in our legacy business, how many occurrences are there across our systems that breaks this rule?

Even though our data model may theoretically prevent this kind of defect from occurring, in reality our knowledge workers are adept at conjuring up weird and wonderful ways to massage the data into a defective rule.

Legacy system consolidation is a key challenge. If you think there are defective rules in a single system, just wait until you throw multiple databases into the mix!

Once again, define your logical model that has been agreed with the business as "the way we do business".

Create the necessary data quality rules that reflect that thinking and then examine the data across all the legacy systems to see where rules are broken.

A recent example I found was where a data quality rule stipulated that a company account should only have one promotional offer per year.

That rule was well and truly broken when we consolidated all of the different customer databases together and found that in one instance a single company had been duplicated 48 times, meaning they had received over 200 discounted orders for the year!

So, defining data quality rules are not only sound practice for data migration projects, they will save you money in your ongoing data quality activities too.

As we uncover our rules and discover their defects, we need to record the activities required to either mitigate or eliminate the problems.

Once again, the business has to make these decisions in unison with the project team, it is not our data, the data stakeholders and sponsors need to determine what level of resource and funding they can provide in order to bring the data up to a satisfactory level.

Wherever possible, aim to fix the problem at source as opposed to a "quick-fix" in the migration technology itself. This way you get benefits back into the business and simplify the migration logic.

Legacy to target rules

Okay, we've got our legacy environment in order with a healthy set of rules and well managed data, now we need to see how fit the data is to migrate and support our new business environment.

The key here is to look not just at the target schema but the target business model.

Will the new business services place increased pressure on our legacy data to perform new functions?

Your data could be fit for purpose in the old world but chronically unfit for driving automated processes in the new world.

So create rules that not only uncovers gaps between the legacy and target models but also the target business functions.

Go agile

This is not a waterfall process, it requires an iterative approach, repeatedly learning and adapting as you uncover more rules and the target environment begins to take shape.

The key is not to stop and wait for things like the target schema, get started on day 1 of your project and do what you can.

Iterative "sprints" are useful in this respect. By creating time-boxed discovery activities you get the project used to delivering benefits early and often, don't wait until testing to uncover these issues.

 

Data Quality Rules Templates

 

There is a far more detailed explanation of Data Quality Rules in John Morris' Practical Data Migration (see our online bookstore).

John has kindly supplied us with several templates that we will use in the next part of our data quality rules series where we show you how to build an online application for managing the entire data quality rules process.

To access the templates, just go to the documents area of our download centre and you will see the "Data Quality Rules Template" pack.

In addition, John has recently created a free online course that takes the student through the data quality rules process in more detail and also provides education on many other data migration topics and techniques: click here for details.

 

Next Tutorial

 

Stay tuned for the next part of the series which will provide details of how to take the templates provided and create a complete online application that your project can utilise to form a central management facility for the entire data quality rules process. 

Useful Links

You may find the following of use when implementing or learning about data quality rules in a data migration project:

Do you currently manage data quality rules on your project? What approach do you use? Is there an area you wish to improve upon?

Please add your comments below.

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments (2)

Hi!
Attention for data quality is always good (while preparing and/or executing a Data migration).
In my data migration solutions however I never start from the -in my opinion- 'theoretical' point of view suggested in this article.
The Data migration systems I build (greatly supported by the dedicated Migration workbench, program-generator and specific modules) allways consists of 2 parts:
- data quality assurance
- actual migration.
The actual migration for here speaks for itself.

The standard within the developed systems is, that no data containing errors will be passed to the next phase.
There are working mechanisms to adjust requirements/quality so in the end up to 100% data can be passed in 100% (technical and mostly also functional) quality to the actual migration.

Lets zoom into the data quality part:

Since I design the DM systems as batch system, the first step to be taken is to download the data into flat files (fixed, delimited, xml though I don't prefer the latter because of the amount of non-information overhead).
The selection of tables to be downloaded is roughly derived from the project plan (which -parts of the - system are to be replaced).
Whenever 1 column (field) from the source table is needed, the whole table is selected (this prevents adjustments later on).
The definition of the files is fed into the workbench.

Every table is scanned based upon its definition. In this way a clear sight upon the usage of the various columns is created (field width usage, field filling, code fields etc).
Based upon the retreived information the description in the workbench is adjusted.

Next step is to generate programs based upon the data description.
Within the quality assurance part there are 3 kinds of programs:
- check basic quality
- check consistency within system (part: eg. all tables related to customer, all tables related to article)
- Check consistency over system(part)s.
Especially the check-basic-quality programs are largely parameter-driven, so we don't need program adjustments for every change.

Every program in the QA part consists of 2 logical sections (in this sequence so every quality adjustment will be followed by the quality check):
- quality improvement
- quality check
Both contain generic mechanisms and possibilities for tailormade adjustments.

On starting up, the improvement part will be empty.

As you can see, I prepare for quality improvement.
In general I prefer to make quality improvement part of the solution mainly because:
- in this way I keep it all in 1 hand
- last-minute-arisen errors from any known kind will be corrected.
- there is no need for (however it could be done) adjusting the data in the source system: adjusting it there could always cause problems, some system owners prefer to leave the data in the source system as it is: it works now, why take any risk?

From the moment the mapping rules start shaping, there will be additional requirements to the source data. They kan easily be built into the quality checks.
Part of the additional requirements will arise from testruns when the converted data has been loaded into the target system.

For all errors and adjustments, reporting is being done (if desired so).

By setting the system up in this way
- there is no need for a study on beforehand.
- results from the scan can be used to discuss the content of the source environmen.
- there is no need for additional work in the source environment
- last-minute occurences of known lypes of problems will also be corrected.

It saves a lot of time especially on the side of the owner of the source system, who can now concentrate on the specification of correction- and quality rules.

Since complete tables will be used, within the workbench we can see which columns will be 'left behind' thus giving a clear picture of the coverage of the mapping rules.

Any questions or remarks? Just let me know!

Fri, November 28 | Registered CommenterRuud Koorneef

I think I have to defend the DQR process against the charge that it is "Theoretical". I titled the book on the subject "Practical Data Migration" for a reason - because it is based on my 10 plus years working as a Data Migration consultant on some of the biggest projects in the UK. DQR are not an alternative to using a work bench, rather the use of a workbench or any other software toolset is a subset of the activities covered within a DQR approach.

Perhaps it is easier to see what I mean if we think of some of the things that a workbench does NOT do:

It cannot perform reality checks - that is the difference between what is in the data base and what is out there in the real world (eg the data says I have 200 customers but I know I only have 152).

The more sophisticated tools we now use are capable of inferring business rules but no one would trust them without back checking rules with users who really understand the business. A Workbench can not find business domain experts, much less ask them their opinion.

And when we find errors, the workbench may not be able to determine the business rules to fix them and will not be able to decide where they should be fixed (or if they should be fixed at all).

For all of these purposes we need people processes that turns our technology from being theoretically interesting into practically useful.

Now we can either embed the software seamlessly in the process (as I prefer) or we can have a clear hand off between the IT folk and the business chaps. The problem with the latter approach is that inevitably a responsibility gap will open up between the two sides and the migration spirals into a series of mutual recriminations and failed dependencies.

As an aside Ruud, don't be too quick to dismiss fixing data in the legacy. For any data quality issue there are four generic solutions - fix in legacy, fix in the migration tool, fix in the target, don't fix. (There is a fifth approach - allow to fail then re-enter by hand, but that only works where there are small numbers of instances and is best where the cost of writing a coding fix outweighs the cost of data entry).

Each approach has its merits. Fixing in the legacy has the benefit of adding value to the business before the new system is implemented but has a potential resource impact on the business. Fixing in the migration tool has the least business impact but has the costs of coding, testing and potential faults at run time. Fixing in the target allows migration timeline objectives to be met (which may be mandatory for legal or contractual reasons) but is a dirty approach that may mean bad data enters the new system. Not fixing at all is the cheapest (at least in the short term) but definitely leaves bad data in the new system. On anything other than the simplest migration you will use all four approaches.

To come to a judicious decision each case needs to be considered on its merits. The DQR process, by bringing together the technical, the data owner and the business domain experts to jointly make the decision is the best guarantee that the most appropriate decision is made. Our job as data migration consultants is to facilitate the best decisions being made, not forcing a preference onto the process. And the best mechanism I have found for getting a considered judgement made, is the DQR approach.

Sat, January 10 | Unregistered CommenterJohny Morris
Member Account Required
You must have a member account on this website in order to post comments. Log in to your account to enable posting.