on 06-26-2011 2:26 AM
Hi Experts,
I am new to Data Services. My requirement is to extract data from 3 different source systems (flat files, Oracle, and mainframes), consolidate the data, cleanse it, and send the files to the target systems.
Can anyone please let me know the initial steps from a high level? For example, do I first need to create 3 different datastores, one for each legacy system?
Thanks for your help in advance.
Naga.
From a high level:
1) Obtain access (logins & passwords) to your three sources of data. Make datastores pointing to them.
2) Pull your sources of data into one unified place, wherever you're going to find it convenient to profile and analyze the data (see the staging-table sketch below). Personally, I like SQL Server, because I'm most adept at writing T-SQL, but it doesn't matter -- MySQL, Oracle, Access, whatever.
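To picture step 2, here's a minimal sketch of a staging table in SQL Server. The table and column names are hypothetical -- shape them to whatever your actual sources look like:

    -- Hypothetical staging table: one row per customer record, tagged
    -- with the system it came from so you can compare sources later.
    CREATE TABLE stg_customer (
        source_system   varchar(20)   NOT NULL,  -- e.g. 'ORACLE', 'DB2', 'FILE'
        customer_id     varchar(50)   NOT NULL,
        customer_name   varchar(200)  NULL,
        load_timestamp  datetime      NOT NULL DEFAULT GETDATE()
    );

One table per source entity, everything landed as-is, is usually enough for profiling; save the consolidation for after you know the data.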
And now the real work begins. Put Data Services down, pick up your query and/or reporting tool of choice, and start looking at the data to develop your "real" specification. Unless you're blessed with a specification from a hot-shot business analyst, developed in full knowledge of the source data and based on a thorough, picky, detailed analysis thereof, it's likely that your spec is really a wishlist. That's fine -- you need to know what they want -- but if they want fresh mangoes every day at the South Pole, it might be a challenge. Meaning: you need to figure out if, or how well, the data supports the desires.

Say they want a "100% accurate customer list." That's impossible. One of their customers might be moving and getting a sex change operation today, and there's no way to know that today's Bob of Kentucky will be Roberta of Tennessee next week. You can make their data better; you can integrate a lot of it. But where it gets interesting is with the edge cases, where they'll need to decide how much they want to pay to get from 99.78% accurate to 99.83% accurate.

DO NOT START ETL CODING UNTIL YOU'VE THOROUGHLY ANALYZED THE DATA and found the edge cases: the "whoops," the "well, we didn't think about that," the "oh, we didn't think there would ever be more than one of those," the "oh, gee, we figured they always had at least one of those." It's sort of like the role of mise en place in cooking -- you don't want to be caught with your garlic unminced just when your onions have gotten soft, do you? If your current spec says things like "pull their home address," and nobody's checked how "home addresses" are identified as such, whether everybody has at least one, whether there can be multiples, and, when there are multiples, how to pick the "primary" one, then you've just got a wishlist.
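To make that concrete: a profiling query like the one below (against the hypothetical stg_customer and stg_address staging tables) surfaces exactly those edge cases -- customers with zero home addresses, or with several:

    -- Hypothetical table/column names: find customers whose count of
    -- 'home' addresses is anything other than exactly one.
    SELECT c.customer_id,
           COUNT(a.address_id) AS home_address_count
    FROM   stg_customer c
    LEFT JOIN stg_address a
           ON  a.customer_id  = c.customer_id
           AND a.address_type = 'HOME'
    GROUP BY c.customer_id
    HAVING COUNT(a.address_id) <> 1;

Every row that query returns is a decision your business users need to make before you write a single transform.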
Once you actually know your data and what the hell you're doing, then do the ETL coding, which will be fairly easy if you've done the mise en place correctly. The hard work of ETL comes before the ETL.
Best wishes,
Jeff Prenevost
Thanks Jeff for your input.
They want to extract all the data from those 3 legacy systems and stage it in SQL Server. To do this, do I create one job with one workflow, create 3 source datastores (one for Oracle, one for DB2, and one for the files), use a Query transform, and create 1 target datastore (with the database being SQL Server)?
And for profiling and data cleansing, do I create 2 jobs, one for profiling and the other for data cleansing?
Please advise.
Thanks,
Naga
Yes, you'll need four datastores, as you've outlined.
As for how many jobs, workflows, and dataflows: it doesn't matter all that much during development. Generally speaking, during dev you're likely to have one "test" job where you put one unit of code at a time, working with it till the unit is right. As you get each hunk right (typically, dataflows), you assemble them into a sequence, based on dependencies, inside one or more workflows, which in turn can be nested and sequenced. At the end, you may end up with one job, "Daily Job", in which you put all your code that needs to run daily (typically). Or you may end up with three jobs. Or four. It all depends on the data and business needs, and the application of some common sense.
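For what it's worth, a cheap sanity check while you're proving out each unit is a simple reconciliation query against your staging area (table name hypothetical, per the sketch earlier in this thread):

    -- Row counts per source system, to compare against what each
    -- dataflow actually read -- a quick "did I lose anything?" check.
    SELECT source_system, COUNT(*) AS row_count
    FROM   stg_customer
    GROUP BY source_system;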
You're encouraged to keep your units of work relatively small, simple, and discrete; don't try to write dataflows with hundreds of transforms stretching all over the place. You can, yes, but you'll find them hard to handle. But all that's pretty easy stuff, Naga, and I suggest you're focusing on unimportant matters. 99% of the work is going to involve dealing with the data itself, profiling it, and writing your spec, aka the "ETL mapping document", unless you're lucky enough to have someone competent write it for you.
Best wishes,
Jeff Prenevost