Monday
May192008
How can I estimate how long an ETL project will take?
Monday, May 19 This question seems to be a recurring theme, several new members have posted the same issue:
"I need to estimate how long it will take me to perform extract transfer and load
on some data.
Can anyone point me at some resources - or even better, give me some
advice? "
5 Comments | | in
Estimation 
Reader Comments (5)
Hi,
First of all you should tell us about the source and the target of your data.
Second. You should elaborate on what information you would like to migrate from source to the target. e.g. Would it be only contacts information, accounts etc.
Then we can estimate how much time it will take you to perform extract transfer and load on some data.
Regards,
Sergei
Apatar, Inc
Commonly the Load is the slowest part of the ETL process, and often this is to the extent that the E & T times are unimportant. You may have an age to get the next set ready before the L "hopper" empties of the rows it was first fed...
"Extract" is ideally a fairly logic free "dumb" dump so there is no real reason why it should be slow.
"Transform" can be carried out on specialist platforms and hardware. If it gets too slow you can easily "cheat" by using a fast procedural language for the tricky bits or by analysing which parts are slow and unrolling those execution loops or using other performance boosting techniques. (If you don't know how to: read Code Complete by Steve McConnell, Microsoft Press). Put another way - the transform is 100% in your hands to make go faster. So I'm sure you will find a way!
"Load" requires the target system to take on the data. This is not so easy. The data must integrate properly in a manner that supports business processes adequately. In many complex systems you will be forced to use an Application Programming Interface (API). When you feed the API a row of data it typically does not inject it to a single table or use a simple SQL update; it often runs a full procedural code script with many branches and possibilities that ensure checks and balances are performed, data is validated and only legal rows are loaded, dependent items are set up and a rollback of these side effects is possible if the row itself needs to be rejected. With every row potentially triggering a full program execution in this manner it is unsurprising the execution rate can be very slow. The alternative of injecting data directly ino the database is only rarely available as the load scripts are there for a reason and trying to replicate them externally is to second guess the API's designers who are probably far more expert in the target system than we could reasonably hope to be as migration specialists.
So... To know how fast/slow your process will be I suggest you ask the target system vendor as in all likelihood theirs are the controlling numbers in the ETL.
Be prepared for a surprise. Some individual row loads can be counted in seconds or even minutes.
Regards
John Platten
Principal Consultant
Vivamex Limited
John.Platten@vivamex.com
or via my DMPro profile
Can I respectfully disagree with John's comment above regarding the extract. Often, extract can be pain-free, if you have a nice, well documented source system and good quality data or a good ERP connector.
However, this isn't the case in many situations - including mainframes, undocumented extensions to ERP systems, anything that is non-relational - and so I would caution against planning for 'E' to be pain-free without up-front due diligence on the source system archtiecture.
The question was about timing specifically and I wanted to get quickly onto the load and why business rules processing often makes it the problem step to watch for in go-live execution planning. I've seen sums come out as weeks which can be a real deal breaker or require special handing and trickle across tools such as Celona.
I had meant to say the extract is "often less significant in timing terms", but Nathan is right to correct me for wandering instead into an implcation that the extract would not be complex.
To cover that step briefly, now I appear to have stepped on it:
Extracting the tables as they stand is often quoted as an idealised strategy for the simple reason that transforming on the tranformation stage is in the hands of the migration team whereas complex scripts on the extract step are more often in the hands of expert legacy resource, and this has serious implications for waiting times on rework when issues are found during testing, especially if the extract authors are not fully seconded to the migration team.
BUT this is only an idealised principle and Nathan is quite right to say this should be checked for problematic siutations on a per project basis. The real world is not an idealised case!
My apologies for generalising horribly in one area (complexity) while answering a question on another (timing). It's an easy pit to fall into on an expert forum as opposed to a specific project.
John
I rarely quote timings on an ETL process until I've actually seen and processed a full data set. There are a tremendous amount of variables that impact each step, particularly if you are looking at large, aged, and/or disparate data sets.
In many cases, understanding of the legacy data may be incomplete or incorrect; once the data is extracted, the method of transmission can severely impact your estimate; how likely you are to encounter exceptions during the transform process and the intended resolution to those exceptions is another key factor. It is likely that some exceptions will need to be resolved on the legacy system, which will require re-initiation of the entire process. You can really only guess at these until you have at least done significant analysis on the legacy system.
If the customer is insisting on timings before you have had a chance to fully evaluation the environment, you should caution them on the significant risks involved. Should they persist, you will need to pad any estimates substantially to deal with a variety of "unknowns".