Sunday, October 04, 2009

Data Cleansing Tutorial : What is Data Cleansing

I'm reading: Data Cleansing Tutorial : What is Data CleansingTweet this !
I have worked on data migration projects where I have been a part of data cleansing activities using SSIS. One of the reader of my blog, requested me to post some tutorials on Data Cleansing. So based on my experiences, I would like to answer a few questions on Data Cleansing. The point of interest is "What is Data Cleansing" and then the next question is "How can Data Cleansing be commenced and concluded" ?

When is Data Cleansing required ?

Generally Data Cleansing is required in a systems integration flavor of project. For example when different systems which might be logically related like back-office and front-office systems, or non-related systems like different business application that might be independent in its own respect but functioning under a common umbrella. In both these examples one thing is common, these systems operate differently and are subject to integration.

What is Data Cleansing ?

Data is the above mentioned systems are not always as expected, or the data is not of the quality that is required to be fed into new system. For example, there can be OLTP applications or front-office systems where data is not fed in the precise manner due to operator error or limited system design. So these data issues needs to be fixed, mostly in the host or intermediate system which is mostly effective to the requirements of the new system where data of a particular quality level is required.

How to do Data Cleansing, using SQL Server in specific ?

There are different ways of doing Data Cleansing, and there is not particular theory or formula to it, in the best of my knowledge. But one thing that applies universally to any data cleansing exercise is, the data should be profiled thoroughly. Thorough data profiling is directly proportional to level of data cleansing. Mostly the hard part is data profiling, than data cleansing. It takes more business knowledge combined with technical analysis of the data for profiling. Once the details about what to fix in the data is catalogued, it's generally not that hard to fix those issues using ETL tools like SSIS. SSIS comes with a dedicated task for Data Profiling and a dedicated tool for viewing the profiled data.

3 comments:

Alexander said...

I found this very interesting and helped me understand about data cleansing.

Ashley said...

I only know that data cleansing is the act of detecting and correcting corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.

After reading this article i understood why we are using it.

Thank you for sharing a nice post.

Chris Johnson said...

great post. i read these every week. Keep up the good work.

Related Posts with Thumbnails