A few years ago I was configuring a test to compare data transformation and loading across a variety of target platforms. Essentially, I wanted to assess the comparative performance of different data management schemes (open source relational databases, enterprise relational databases, columnar data stores, and other NoSQL-style schemes). To do this, I had two constraints to overcome. The first was the need for a data set massive enough to really push the envelope in evaluating different aspects of performance. The second was a little subtler: I needed the data set to exhibit the kinds of data errors and inconsistencies that simulate a real-life scenario.
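As an illustration, here is a minimal sketch (in Python, with hypothetical field names, error rates, and file sizes of my own choosing) of one way to generate a large synthetic file that deliberately injects missing values, inconsistent codes, and near-duplicate records into the data being loaded:

```python
import csv
import random

# Hypothetical sketch: generate a synthetic customer file and deliberately
# inject common data quality problems (missing values, inconsistent codes,
# near-duplicate records) so the load tests exercise realistic error handling.

FIRST_NAMES = ["John", "Jon", "Mary", "Maria", "Robert", "Bob"]
STATES = ["NY", "N.Y.", "New York", "CA", "Calif.", "TX"]  # inconsistent representations

def make_record(i):
    rec = {
        "customer_id": i,
        "first_name": random.choice(FIRST_NAMES),
        "state": random.choice(STATES),
        "phone": f"{random.randint(200, 999)}-555-{random.randint(1000, 9999)}",
    }
    if random.random() < 0.02:           # ~2% of records get a missing phone
        rec["phone"] = ""
    if random.random() < 0.05:           # ~5% get a different phone format
        rec["phone"] = rec["phone"].replace("-", "")
    return rec

def generate(path, n=1_000_000, dup_rate=0.01):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["customer_id", "first_name", "state", "phone"]
        )
        writer.writeheader()
        for i in range(n):
            rec = make_record(i)
            writer.writerow(rec)
            # Emit an approximate duplicate ~1% of the time to simulate
            # duplicate entry in the source systems
            if random.random() < dup_rate:
                dup = dict(rec)
                dup["first_name"] = dup["first_name"].upper()
                writer.writerow(dup)

if __name__ == "__main__":
    generate("synthetic_customers.csv", n=1_000_000)
```

The row count and error rates here are placeholders; for a real benchmark you would scale n up by several orders of magnitude and tune the injection rates to match the error profile you expect in production data.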
Come on, seriously?
Poor data quality and Hurricane Irene
Just to prove to a content aggregator that I am serious about social media, I am putting this code: inside a blog post.
I frequently monitor the price of my books on Amazon, and I noticed that this afternoon the Practitioner’s Guide to Data Quality Improvement was selling for $33.45, a discount of 44%. If you have been waiting to buy the book, now is a good time; this is the lowest price I have seen so far.
Here are some aspects I have tried to cover in the book:
- Building a business case for instituting a data quality program;
- Assessing levels of data quality maturity;
- Guidelines and techniques for evaluating data quality and identifying metrics tied to the achievement of business objectives;
- Techniques for measuring, reporting, and taking action based on those metrics; and
- Policies and processes for using data quality tools and technologies to drive data quality improvement.
At the recent DGIQ (Data Governance and Information Quality) conference, I had the opportunity to chat with Ian Rowlands, Senior Director of Strategy at ASG, about historical trends in computing. In particular, we discussed how concepts such as centralization and distribution have come into and then gone out of vogue, and I pointed out that the new trend toward the “cloud” is essentially a reboot of the old concept of time-sharing on a mainframe.