Friday 13 September 2013

Proactive Approach For Improved Data Quality In Data Warehousing

Ever since data warehousing began to be used as a facilitator for strategic decision making, the importance of the quality of the underlying data has grown manifold. Data quality issues are much like software quality issues: both can sabotage a project at any stage.

This being my first article ever, it is more thinking out loud than a definitive set of steps. In subsequent articles I will discuss data quality issues in more depth.

1. Data collection process:

Many organizations depend on the ETL tools available in the market to make their transactional data ready for OLAP. These tools would be much more effective if the data coming from the day-to-day operational systems had valid content, so data quality checks should be applied right from the data collection process.

For example, consider feedback collection, where users write ad-hoc answers to open-ended questions. To ensure that valid feedback is registered, techniques ranging from parsing the feedback text for keywords to complex text mining algorithms are employed. More efficient data quality checking at this stage offloads the data quality burden from the subsequent stages of the DW project.
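As a minimal illustration of the keyword-parsing end of that spectrum, here is a sketch in Python; the keyword list, word-count threshold and hit count are invented for the example and would have to come from the actual questionnaire.

```python
import re

# Hypothetical keywords we expect genuine product feedback to mention.
EXPECTED_KEYWORDS = {"delivery", "price", "quality", "support", "usability"}

def looks_like_valid_feedback(text, min_words=5, min_hits=1):
    """Cheap validity check: enough words and at least one expected keyword."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return False
    hits = sum(1 for w in words if w in EXPECTED_KEYWORDS)
    return hits >= min_hits

print(looks_like_valid_feedback("asdf asdf asdf"))                                     # False
print(looks_like_valid_feedback("The delivery was late but support helped quickly."))  # True
```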

In my view there are several separate ways of looking at data collection. One is to distinguish implicit data collection from explicit data collection. For example, data collected at the server, proxy or client level for tracking users' browsing behavior has to be treated separately when preparing it for mining, in comparison to data collected through data entry forms.

However, proactive steps to ensure that valid content gets into the databases are useful in either case. In explicit collection this could be string pattern matching, such as validating the email address format and refusing to let the form be submitted otherwise; in implicit collection we need to distinguish actual user clicks from a bot or scraping program clicking links on our web pages automatically.
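A hedged sketch of both kinds of check is given below, assuming a deliberately simple email pattern and an obviously incomplete list of bot markers; a production system would use stricter validation and better bot detection.

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")              # deliberately simple pattern
KNOWN_BOT_MARKERS = ("bot", "spider", "crawler", "scraper")       # incomplete, illustrative list

def is_plausible_email(value):
    """Explicit collection: block form submission if the email field is malformed."""
    return bool(EMAIL_RE.match(value.strip()))

def is_probable_bot(user_agent):
    """Implicit collection: flag clickstream records from obvious crawlers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

print(is_plausible_email("jane.doe@example.com"))                           # True
print(is_plausible_email("not-an-email"))                                   # False
print(is_probable_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))   # True
```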

2. Data cleansing process:

Data cleansing is a difficult process due to the sheer size of the source data. It is not easy to pick out the badly behaving records from a collection of a few terabytes of data. The techniques used here range from fuzzy matching and custom de-duplication algorithms to script-based custom transforms.
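As one small example of the fuzzy-matching end of that range, the sketch below uses Python's standard difflib to flag near-duplicate customer names; the similarity threshold and the sample names are arbitrary choices for illustration, and a real de-duplication pass would work on much larger, blocked subsets of the data rather than comparing every pair.

```python
from difflib import SequenceMatcher

def find_near_duplicates(names, threshold=0.9):
    """Return (name_a, name_b, ratio) triples whose similarity meets the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            ratio = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if ratio >= threshold:
                pairs.append((names[i], names[j], round(ratio, 2)))
    return pairs

customers = ["Acme Corporation", "ACME Corp.", "Acme Corporatoin", "Globex Inc"]
# Print suspected duplicate pairs for manual review.
for a, b, ratio in find_near_duplicates(customers, threshold=0.85):
    print(a, "|", b, "|", ratio)
```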

The best approach is to study the source data model and build basic rules for checking data quality. This can also be done iteratively. In many cases clients do not provide data upfront, only the data model with some trial data. The BA and the domain expert can, in mutual consultation, come up with rules for how the actual data should look. These rules may not be very detailed, but that is fine for a first iteration. As the understanding of the source data model evolves, so can the data quality rules. (This might sound almost heavenly to anyone who has been part of even a single data warehousing project, but it is an approach worth trying.)
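To make the idea of iterative, rule-based checks a little more concrete, here is a sketch of how such rules might be written down and evaluated against trial data; the column names and the rules themselves are invented for illustration and would in practice come from the BA and the domain expert.

```python
# Each rule pairs a human-readable description with a predicate over one row.
# The rules and column names are illustrative, not taken from a real source model.
RULES = [
    ("order_amount must be positive",    lambda r: r["order_amount"] > 0),
    ("country_code must be 2 letters",   lambda r: len(r["country_code"]) == 2),
    ("ship_date not before order_date",  lambda r: r["ship_date"] >= r["order_date"]),
]

def check_rows(rows):
    """Return a list of (row_index, failed_rule_description) pairs."""
    failures = []
    for i, row in enumerate(rows):
        for description, predicate in RULES:
            if not predicate(row):
                failures.append((i, description))
    return failures

trial_data = [
    {"order_amount": 120.0, "country_code": "IN",  "order_date": "2013-09-01", "ship_date": "2013-09-03"},
    {"order_amount": -5.0,  "country_code": "IND", "order_date": "2013-09-02", "ship_date": "2013-09-01"},
]
for index, rule in check_rows(trial_data):
    print(f"row {index}: {rule}")
```

As the understanding of the source model grows, new rules can simply be appended to the list, which keeps the iterative refinement cheap.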

Please note that this is different from data profiling tools, which run on the source data itself. Here we are analyzing the metadata and the project requirements in order to specify the data quality.

Generally, building these rules requires sound knowledge of the industry concerned as well as a consistent, in-sync data dictionary. The worst part is that once the rules are built, the data modeling team also has to carry out the actual verification of the data against them manually. Being cumbersome and error prone, this process might compromise data quality. We will discuss how this can be reduced and possibly automated in the next article.




Source: http://ezinearticles.com/?Proactive-Approach-For-Improved-Data-Quality-In-Data-Warehousing&id=829164
