Wednesday, June 10, 2009

Data Quality

In 2003, The Data Warehousing Institute estimated that bad data quality leads to a whopping loss of approximately $600 billion annually.

Data Quality Improvement is the set of processes and technologies involved in ensuring that data values conform to business requirements and acceptance criteria.
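To make that concrete, here is a minimal sketch in Python of what checking one record against acceptance criteria might look like. The field names and rules (a required customer_id, a well-formed email, a reasonable age) are our own made-up examples, not a standard:

# A minimal sketch of checking one record against acceptance criteria.
# Field names and rules are hypothetical examples.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_record(record):
    """Return a list of acceptance-criteria violations for one record."""
    violations = []
    if not record.get("customer_id"):
        violations.append("customer_id is missing")
    if not EMAIL_RE.match(record.get("email", "")):
        violations.append("email is not well formed")
    age = record.get("age")
    if age is not None and not (0 < age < 120):
        violations.append("age is outside the reasonable range")
    return violations

print(check_record({"customer_id": "C001", "email": "a@b.com", "age": 200}))
# ['age is outside the reasonable range']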

Common causes of poor data quality include:
Legacy systems and data: Legacy systems may or may not have validations built into them, and they tend to accumulate redundant data, composite keys, and referential integrity issues.
Application evolution: Applications evolve over time, and data entry operations and client- and server-side validations are often overlooked along the way, resulting in bad data.
System work-arounds: More often than not, quick fixes and supposedly temporary measures are deployed to meet deadlines or get around technology limitations.
Time decay: Even the best systems cannot stand the test of time; what better example than the Y2K bug? Data quality deteriorates with age.
Lack of common data standards: Companies do not always invest the time and resources to create best practices, standards, and checklists. Even simple measures, such as universal naming conventions, can improve data quality.
Data entry issues: Data entry is the top reason for poor data quality. If data entry is performed by customers or other web users, junk and misplaced information is almost guaranteed. Even internal data entry suffers because of free-text ‘remarks’ or ‘comments’ fields.

So how can this data be cleaned up?
DIY: Do it yourself by digging into the databases, forms, applications, and so on. This is rarely the best choice, but it is a beginner’s step that can lead you to a roadmap for data cleansing.

Invest in data cleansing tools: Outsource the job to those who do it best. Yes, now we are talking business. Invest time in identifying suitable tools on the market. Here are some:
· Informatica PowerCenter
· Trillium Software
· Business Objects Data Integrator
· DataFlux

So, going forward, how do you prevent rather than cure? Here are some ideas we have gathered.

Data Profiling – Analyze the data for correctness, completeness, uniqueness, consistency, and reasonability. This should be done in order: column profiling first, then dependency profiling, and finally redundancy profiling.
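As an illustration, here is a rough Python sketch of the first step, column profiling; the sample state column is made up. It reports completeness, uniqueness, and the most frequent values, which is where inconsistencies such as ‘NY’ vs. ‘ny’ first show up:

# A rough sketch of column profiling: completeness, uniqueness, and
# the most frequent values of one column. The sample data is made up.
from collections import Counter

def profile_column(values):
    total = len(values)
    non_null = [v for v in values if v not in (None, "", "N/A")]
    distinct = set(non_null)
    return {
        "total": total,
        "completeness": len(non_null) / total if total else 0.0,
        "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
        "top_values": Counter(non_null).most_common(3),
    }

states = ["NY", "NY", "CA", "", "TX", "N/A", "ny", "CA"]
print(profile_column(states))
# completeness is 0.75 (6 of 8 values present), and 'ny' shows up as a
# distinct value alongside 'NY' -- a consistency problem worth flagging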
Data Cleansing – Detect and correct corrupt or inaccurate records in a record set, table, or database. Common methods include parsing, data transformation, duplicate elimination, and various statistical methods.
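For instance, a bare-bones Python sketch of two of those methods, standardization (a simple form of parsing and transformation) followed by duplicate elimination, might look like this; the records and matching rules are our own, not from any particular tool:

# A bare-bones sketch of two cleansing steps: standardizing values and
# then eliminating duplicates. Records and rules are made-up examples.

def standardize(record):
    """Normalize casing and trim whitespace so duplicates can be matched."""
    return {
        "name": record["name"].strip().title(),
        "state": record["state"].strip().upper(),
    }

def deduplicate(records):
    """Keep the first occurrence of each standardized record."""
    seen, unique = set(), []
    for rec in map(standardize, records):
        key = (rec["name"], rec["state"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"name": " john smith ", "state": "ny"},
    {"name": "John Smith", "state": "NY"},
    {"name": "Jane Doe", "state": "CA"},
]
print(deduplicate(rows))
# [{'name': 'John Smith', 'state': 'NY'}, {'name': 'Jane Doe', 'state': 'CA'}]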
Data Defect Prevention – Set up a data governance group that takes control of and responsibility for the various databases and introduces and enforces data quality rules. The group should conduct regular audits and data cleansing programs, and should also take charge of training data entry and other personnel.


Data Quality: Authored by Vivek and Devi. Cleansed by Vai :-)