<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Practitioner&#039;s Guide to Data Quality Improvement</title>
	<atom:link href="http://dataqualitybook.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://dataqualitybook.com</link>
	<description></description>
	<lastBuildDate>Tue, 04 Jun 2013 15:19:02 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
		<item>
		<title>Big Data, Sensors, and Data Integration as Part of the Machinery</title>
		<link>http://dataqualitybook.com/?p=383</link>
		<comments>http://dataqualitybook.com/?p=383#comments</comments>
		<pubDate>Tue, 04 Jun 2013 15:19:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Integration]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=383</guid>
		<description><![CDATA[Despite my clear understanding that the world’s data volumes are growing by leaps and bounds, I sometimes wonder whether the information management industry’s hyperfocus on unstructured data is a bit over the top. Yes, I know that social media channels such as Twitter, LinkedIn, and Facebook are pushing mounds of what we [...]]]></description>
			<content:encoded><![CDATA[<p>Despite my clear understanding that the world’s data volumes are growing by leaps and bounds, I sometimes wonder whether the information management industry’s hyperfocus on unstructured data is a bit over the top. Yes, I know that social media channels such as Twitter, LinkedIn, and Facebook are pushing mounds of what we want to believe is valuable content that can be mined for targeted marketing, upselling, and cross-selling. But when you actually sit down and <em>read </em>a series of tweets, for example, you might notice a few things. First of all, a lot of the activity is not original, but is merely a repeat of something someone else said. Second, the ability to follow a thread based on hashtags is limited by the absence of metadata; the same tag may be used for any number of concepts, and presuming they can be converged is somewhat naïve. Third, much of the content is formulaic and even automatically generated as part of a corporate initiative designed to maintain a social media presence, even at the expense of publishing anything with significant content.<span id="more-383"></span></p>
<p>On the other hand, I am a proponent of big data and big data analytics, so these comments might seem somewhat contrarian. However, I have a continued fascination with what I think will be the most relevant sources of big data in the near- to long-term future: sensors. Actually, the relevance of machine-generated data from a broad network of interacting nodes is not new, especially in the world of computer networking (hint: think about how email actually works). But more and more things are being outfitted with sensors, to the point where there are mounds of devices always generating streams of information that can be subjected to analysis.</p>
<p>And yet the data integration challenges remain, particularly if you are relying on a single landing pad or staging area for lots and lots of data. As the number of devices and sensors generating data increases, there is a corresponding need for aligning data integration and transformation within individual devices. As devices are connected together, the ability to embed data transformations at strategic points across the network can not only reduce a computation bottleneck at the ultimate target destination, it can also optimize the computation as a result of data distribution and task parallelization.</p>
<p>I see two specific values in Informatica’s announcement of Vibe. First, because transformations and integration directives developed on top of Vibe in one environment can be deployed to any other platform running Vibe, you have effectively defined a standard for development and implementation. It allows you to develop within a controlled environment but deploy anywhere. Second, if I understand correctly, Vibe has a small footprint that allows it to be embedded. Informatica has embedded it into some applications via OEM relationships, and it powers most of its existing products. The roadmap includes shrinking the footprint even more for devices and sensors. This addresses the expectation for the active network, in which computations and transformations can be layered into the interconnectivity of devices.</p>
<p>Last, if you consider the various dynamic topologies of these interconnections, you begin to see how embeddability really can add value. For example, smart devices generating location data can sync up in self-organizing networks and perform transformations as aggregated statistics are sent to mobile towers. Road sensors can go beyond transmitting and begin to incorporate logic in relation to aggregated traffic data. Device topologies can also be connected with cloud applications that manage device profiles. There are many different examples, and it is clear that the product roadmap is intended to accommodate a wide variety of sensor-based big data applications.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=383</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Criteria for a Data Replication Solution</title>
		<link>http://dataqualitybook.com/?p=379</link>
		<comments>http://dataqualitybook.com/?p=379#comments</comments>
		<pubDate>Tue, 30 Apr 2013 10:35:11 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=379</guid>
		<description><![CDATA[In my last post we looked at the environmental drivers for the assessment criteria for a data replication solution, including environmental complexity, the need for application availability, the need to accommodate different types of systems and models, the growing volumes of data, and going beyond a point-to-point set of solutions. As I suggested, these frame [...]]]></description>
			<content:encoded><![CDATA[<p>In my last post we looked at the environmental drivers for the assessment criteria for a data replication solution, including environmental complexity, the need for application availability, the need to accommodate different types of systems and models, the growing volumes of data, and going beyond a point-to-point set of solutions. As I suggested, these frame the dimensions by which one might scope a data replication solution, and in conversations with both Ash Parikh and Terry Simonds from Informatica (<a href="http://bit.ly/10Ati8l">here is the third installment of that conversation</a>), we shared some thoughts about how they are approaching these dimensions in ways that reduce costs, speed delivery, and limit risk:<span id="more-379"></span></p>
<ul>
<li><strong>Ease of implementation</strong> – Instead of defaulting to working within the framework of complex systems that are managed via command-line operation and parameter-based scripts, look for a data replication solution that offers simple configurability via GUI-based tools and reusable components that can be deployed directly as services. This approach reduces the effort for programming and configuring scripts, both speeding delivery and reducing costs.</li>
<li><strong>Non-intrusiveness</strong> – The process for initial synchronization of a replica should connect to the sources and extract data rapidly without introducing any kind of performance drag on the source system. Thereafter, using a data replication technology based on log-based change data capture for continuous incremental delivery of data minimizes the degree of intrusiveness into the production environment. With a nonintrusive data replication solution, the work necessary for maintaining a consistent set of replicas is amortized over time and, once initially configured, places a relatively small demand on resources.</li>
<li><strong>Heterogeneity</strong> – This is critical to enable a seamless range of data availability. As I have noted in the last two posts, there is bound to be a wide variety of hardware, software, database, and data models that need to be made available, so a desirable feature of a data replication solution is broad support for heterogeneous systems.</li>
<li><strong>Scalability</strong> – Many solutions can be scaled with enough application of elbow grease. However, automating the capabilities that make a solution scalable (such as automatic determination of optimal methods of data loading, automated parallelization, and deployment across commodity components) reduces effort and decreases costs.</li>
<li><strong>End-to-end interoperability</strong> – Lastly, there is a growing recognition that data integration in general is becoming more of a fundamental infrastructure requirement (as opposed to a supporting technology on a project-by-project basis). Replication itself should support the full spectrum of data availability, from a standardized set of methods for accessing data sources to standards for data delivery, and be part of a holistic strategy for data integration. Look for vendors whose data replication solutions are not dissociated from an end-to-end approach to data integration.</li>
</ul>
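<p>To make the non-intrusiveness criterion a bit more concrete, here is a minimal Python sketch of how a replica might be kept current from a change log after its initial synchronization. Everything in it is an assumption made for illustration: the <code>ChangeRecord</code> layout, the <code>apply_changes</code> function, and the in-memory dictionary standing in for the replica are hypothetical and do not reflect any particular vendor’s product.</p>

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One entry read from a hypothetical source change log."""
    sequence: int                          # position of the entry in the log
    operation: str                         # "insert", "update", or "delete"
    key: str                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # new row image; None for deletes


def apply_changes(replica: Dict[str, Dict[str, Any]],
                  changes: Iterable[ChangeRecord],
                  last_applied: int) -> int:
    """Apply log entries newer than last_applied, in log order.

    Returns the new high-water mark so the next run can resume from it.
    """
    for change in sorted(changes, key=lambda c: c.sequence):
        if last_applied >= change.sequence:
            continue  # already applied on an earlier run
        if change.operation == "delete":
            replica.pop(change.key, None)
        elif change.operation in ("insert", "update"):
            replica[change.key] = dict(change.row or {})
        else:
            raise ValueError(f"unknown operation: {change.operation!r}")
        last_applied = change.sequence
    return last_applied


# Illustrative data only.
replica = {"c1": {"name": "Ada"}}
log = [
    ChangeRecord(1, "insert", "c2", {"name": "Grace"}),
    ChangeRecord(2, "update", "c1", {"name": "Ada L."}),
    ChangeRecord(3, "delete", "c2"),
]
mark = apply_changes(replica, log, last_applied=0)
```

<p>Because the sketch reads only the log and tracks a high-water mark, the source tables are never queried after the initial load, and replaying the same log entries leaves the replica unchanged.</p>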
<p>While data replication has long been deployed to ensure predictable performance for geographically dispersed environments or as part of a general business continuity strategy for continuous availability, these criteria also address newer usage scenarios such as data warehouse population and continuous refresh and synchronization of data to ensure consistency across different operational environments, as well as supporting master data management and data federation and virtualization. Therefore, keeping these criteria in mind for evaluation will help decision-makers determine the solutions that best meet their holistic operational and analytical data availability requirements.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=379</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Characteristics Driving a Data Replication Solution</title>
		<link>http://dataqualitybook.com/?p=377</link>
		<comments>http://dataqualitybook.com/?p=377#comments</comments>
		<pubDate>Sat, 27 Apr 2013 10:32:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[cdc]]></category>
		<category><![CDATA[change data capture]]></category>
		<category><![CDATA[data integration]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[replication]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=377</guid>
		<description><![CDATA[In my last post, we discussed two (presumably) complementary business drivers for instituting a standard enterprise-wide strategy for data availability: the desire to absorb massive amounts of data for analytical purposes (AKA “big data”) while simultaneously enabling accessibility to internal data stored across a variety of different siloed systems that have evolved organically over the [...]]]></description>
			<content:encoded><![CDATA[<p>In my last post, we discussed two (presumably) complementary business drivers for instituting a standard enterprise-wide strategy for data availability: the desire to absorb massive amounts of data for analytical purposes (AKA “big data”) while simultaneously enabling accessibility to internal data stored across a variety of different siloed systems that have evolved organically over the years. Yet while the desire for decreasing the latency for data access, often to the point of what is fuzzily referred to as “real-time,” drives the expectation for immediate accessibility to all data sets, it is valuable to take a step backward and consider the characteristics of the environment that need to be effectively addressed:<span id="more-377"></span></p>
<ul>
<li><strong>Complexity of the <em>de facto</em> environment for implementation</strong>: Siloed application development carries along a silo mentality for operations and maintenance, in which the patterns associated with system management reflect the idiosyncrasies of the original approaches to deployment. For example, older systems may be configured using command-line requests and parameter-based scripts, with little or no oversight for ensuring consistency. Instituting data replication within this type of organizational model takes additional time and carries increased costs.</li>
<li><strong>Maintaining high availability for production applications</strong>: This is a recurring theme in any environment in which a fundamental capability needs to be modernized or improved while it remains in production. Companies cannot afford to take their systems down for months at a time while they are augmented with new functionality.</li>
<li><strong>Variety of data systems and data representation</strong>: There are few environments that have completely standardized along a particular hardware and software vendor for data management, and over a 30-40 year time frame, there are differences in the models, approaches, and even sophistication of the different data subsystems. Data replication applications that are limited to a small coterie of vendor approaches pose a risk to maintaining data availability.</li>
<li><strong>Scaling to accommodate large data volumes</strong>: Growing data volumes for analytics create a common performance roadblock that replication is supposed to alleviate. However, you don’t want to manually engineer the necessary scalability (including manually parallelizing database access and distributing data) into your implementation.</li>
<li><strong>The need for interoperability</strong>: Any modern data integration application must not only account for the islands of data that exist across the organization but also avoid forcing developers to create new interfaces for delivering accessed data. Replication solutions must contribute to interoperability.</li>
</ul>
<p>These characteristics influence the definition of different criteria for a data replication solution, and we will examine those criteria in my next entry. <a href="http://bit.ly/17ngS7A">But you can learn more by listening to the second part of my conversation with Terry Simonds at Informatica.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=377</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ensuring that Data Availability Meets the Business Needs</title>
		<link>http://dataqualitybook.com/?p=375</link>
		<comments>http://dataqualitybook.com/?p=375#comments</comments>
		<pubDate>Wed, 24 Apr 2013 10:32:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[cdc]]></category>
		<category><![CDATA[change data capture]]></category>
		<category><![CDATA[data integration]]></category>
		<category><![CDATA[data replication]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=375</guid>
		<description><![CDATA[Almost everywhere you look these days, there is talk about big data, big data analytics, and the value of massive data volumes, and underscoring the demand for exploiting big data is the need to manage big data. This will be critical when dovetailing the desire for instituting analytical systems and addressing real-time needs for operational [...]]]></description>
			<content:encoded><![CDATA[<p>Almost everywhere you look these days, there is talk about big data, big data analytics, and the value of massive data volumes, and underscoring the demand for exploiting big data is the need to <em>manage </em>big data. This will be critical when dovetailing the desire for instituting analytical systems and addressing real-time needs for operational decision-making. Whether your company is looking to streamline supply chain management and inventory control, or deriving insight for enhancing customer experiences using numerous data streams linked with existing customer profiles, the best advantage comes from enabling the integration of analytics with operational systems in real time, or at least within the window of a defined (typically short) time frame.<span id="more-375"></span></p>
<p>At the same time, even with all the talk about big data analytics and the potential value of analyzing massive volumes of data from a variety of external sources, there is the risk that we lose sight of some of the more challenging aspects of data accessibility and management that plague existing infrastructures that have grown over time through organic siloed application development. The most prevalent manifestation is often described as “islands of data” in which data systems associated with various and sundry applications (including transaction processing systems, operations management systems, and analytical environments) all struggle to live together and satisfy the needs of the data consumers (figuratively speaking, of course). The greater the degree of systemic variety and isolation, the greater the costs to manage heterogeneous access, the more complexity in system integration, and the greater the risk of not being able to deliver actionable information to the right individual when it is needed.</p>
<p>Whether we are forward-facing and looking to scale up to absorb many streams carrying massive amounts of data, or backward-facing and looking to enable systemic interoperability and information delivery in a timely manner, there are some similar criteria for managing the data demand: provide predictability in the timeliness of data delivery, provide a level of trust in the consistency of the data, operate using a standardized mechanism that has limited impact on existing production systems, reduce the number of point-to-point solutions, and be scalable in relation to both data size and variety, among others.</p>
<p>There are a number of technical approaches. One example is stream processing, which attempts to incorporate filtering, business rules, and triggers within the information flow network to help manage real-time events. Another, data federation and virtualization, looks at smoothing the differences across heterogeneous systems while embedded caching helps improve access speed. These are both valuable techniques, but there is another technique that is not only regularly used in production to address long-standing demands for rapid data accessibility with consistency, but can also easily satisfy the criteria I identified in the previous paragraph. Data replication enables high-speed data access and, when coupled with trickle feeds and change data capture, retains a level of consistency with the original source that engenders a level of trust in the data.</p>
<p>To hear more about this topic, check out <a href="http://bit.ly/10NGIep">this conversation I had with Terry Simonds at Informatica</a>.</p>
<p>In my next two blog entries, I will look at two aspects of data replication. First we will drill further into understanding the capabilities worth evaluating when considering a data replication solution, and then we will contrast potential pitfalls that one must look out for when considering replication solutions.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=375</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Managing Information Consistency and Trust During System Migrations and Data Migrations</title>
		<link>http://dataqualitybook.com/?p=371</link>
		<comments>http://dataqualitybook.com/?p=371#comments</comments>
		<pubDate>Thu, 29 Nov 2012 21:12:25 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Metadata]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=371</guid>
		<description><![CDATA[If you have been following this series of articles about data validation and testing, you will (hopefully) come to the conclusion that there is a healthy number of scenarios in which large volumes of data are being moved (using a variety of methods), and in each of these scenarios, the choices made in developing a [...]]]></description>
			<content:encoded><![CDATA[<p>If you have been following this series of articles about data validation and testing, you will (hopefully) come to the conclusion that there is a healthy number of scenarios in which large volumes of data are being moved (using a variety of methods), and in each of these scenarios, the choices made in developing a framework for data movement can introduce errors. One of our discussions (both in the article and in <a href="http://vip.informatica.com/?elqPURLPage=10470&amp;BK=DVOLOSHINSERIES-DL">discussions with Informatica’s Ash Parikh</a>) focused on <a href="http://dataqualitybook.com/?p=362">data integration testing for production data sets</a>, while another centered on <a href="http://dataqualitybook.com/?p=368">verification of existing extraction/transformation/loading methods for data integration</a> (you can listen to that conversation also).</p>
<p>In practice, though, both of these cases are specific instances of a more general notion of <em>migration</em>. There are basically two kinds of migrations: data migrations and system migrations. A data migration involves moving the data from one environment to another similar environment, while a system migration involves transitioning from one instance of an application to what is likely a completely different application.</p>
<p><span id="more-371"></span>An example of the first type of migration occurs when two businesses merge, and both are running the same underlying ERP software. As the corporate merger proceeds, the ERP functionality will be combined as well, and this implies that the data in one instance of the ERP environment will be migrated into the surviving ERP environment. An example of the second type of migration occurs in a similar scenario, except that the merging companies are running completely different ERP systems. In this case, the data from the system to be retired needs to be extracted and migrated into the new system – essentially migrating off of the old system and onto the new system.</p>
<p>These are not the only situations for either type of migration. In general, enlightened enterprises often take the opportunity to review the existing infrastructure and seek ways to renovate the environment in anticipation of future needs. Hardware renovations require one set of skills, and consolidation of the application environment requires additional care in ensuring that the system migration is verified. That being said, in most cases of application renovation, the data sets from the existing systems need to be migrated. And as most migrations focus on the <strong>system</strong> and less on the data, the migration process may be prone to introducing errors if not properly monitored.</p>
<p>In fact, the need for comprehensive inspection of the validity of the data is even greater for migration situations, specifically because of the precision necessary to ensure business continuity. By definition, almost any migration situation involves the need to maintain the integrity of existing systems that are being retired while enabling new ones. Transactions logged into the legacy system that are not properly migrated to the new system pose a serious risk to the business. Consider this famous situation in 2001 when athletic shoe manufacturer Nike rolled out a new supply chain system while the previous one was slated for retirement. At the time (<a href="http://www.informationweek.com/i2-says-you-too-nike/6505338">as reported in Information Week of March 1, 2001</a>), “<em>Some orders were placed twice, by the old and new systems, and the new system let orders for new shoe models fall through the cracks.</em>” Over-ordering of some shoe models and under-ordering of others forced Nike to make some types of shoes at the last minute and ship them via expensive air freight instead of the typical, more cost-effective means. Reading between the lines, one can infer a serious data validation issue related to data and system migration. By the way, this issue resulted in a sales projection shortfall of $80-$100 million, as well as a 25% drop in the value of Nike’s stock, a significant real example of a severe business impact directly related to data validity.</p>
<p>Managing consistency and accuracy of both data and system migrations cannot be left to chance. As we have seen in our previous articles, manual data review for the purpose of validation is tedious, sleep-inducing, and generally prone to error. The alternative is to employ automated methods for managing consistency of migrated data. When the migrations involve copies of data, they can be validated using direct comparisons of source and target data sets. In more complex system migrations, some data transformations may have been introduced; in this case, use automated validation and verification tools that can be augmented with the same business rules to ensure that the transformations were applied in the right way.</p>
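<p>As a rough illustration of a direct source-to-target comparison, the following Python sketch reports rows that are missing from the target, rows that appear only in the target, and rows whose values differ. It assumes both data sets fit in memory and share a primary key; the function name <code>compare_data_sets</code> and the sample order rows are hypothetical and not drawn from any specific tool.</p>

```python
from typing import Any, Dict, List, NamedTuple

Row = Dict[str, Any]


class ComparisonResult(NamedTuple):
    missing_in_target: List[str]     # keys present in source only
    unexpected_in_target: List[str]  # keys present in target only
    mismatched: List[str]            # keys in both whose rows differ


def compare_data_sets(source: Dict[str, Row],
                      target: Dict[str, Row]) -> ComparisonResult:
    """Compare two keyed data sets row by row."""
    source_keys = set(source)
    target_keys = set(target)
    shared = source_keys.intersection(target_keys)
    return ComparisonResult(
        missing_in_target=sorted(source_keys - target_keys),
        unexpected_in_target=sorted(target_keys - source_keys),
        mismatched=sorted(k for k in shared if source[k] != target[k]),
    )


# Illustrative data only: orders keyed by a made-up order id.
source = {
    "o1": {"sku": "shoe-a", "qty": 10},
    "o2": {"sku": "shoe-b", "qty": 5},
    "o3": {"sku": "shoe-c", "qty": 7},
}
target = {
    "o1": {"sku": "shoe-a", "qty": 20},   # quantity changed in flight
    "o2": {"sku": "shoe-b", "qty": 5},
    "o4": {"sku": "shoe-d", "qty": 1},    # never existed in the source
}
result = compare_data_sets(source, target)
```

<p>Where transformations were introduced during the migration, the same comparison could be run after applying the documented business rules to the source rows, so that the expected and actual target values are compared rather than the raw source.</p>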
<p>It is also worth reviewing three key points suggested in this series of articles. First, while the degree of maturity in software testing has increased over the years, we are still at an early stage of maturity when it comes to data testing. Second, there are many situations that would benefit from introducing best practices for data validation and testing: reconciling production data assets, extract/transform/load processes from a variety of sources into a data warehouse or set of data marts, as well as the subject of this article, migrations. Third, and most important, manual attempts at data validation are going to be decreasingly effective as data variety expands and volumes grow. The conclusion is that employing automated data validation and verification in concert with good metadata management along with best practices and disciplines for process oversight will result in increased levels of trust for data across the enterprise.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=371</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ETL Verification: Do Your ETL Processes Do What You Think They Do?</title>
		<link>http://dataqualitybook.com/?p=368</link>
		<comments>http://dataqualitybook.com/?p=368#comments</comments>
		<pubDate>Thu, 01 Nov 2012 10:01:59 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[data validation]]></category>
		<category><![CDATA[data validation option]]></category>
		<category><![CDATA[ETL validation]]></category>
		<category><![CDATA[ETL verification]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=368</guid>
		<description><![CDATA[What is now generally referred to as “data integration” is a set of disciplines that have evolved from the methods used for populating the data systems powering business intelligence: extracting data from one or more operational systems, their transfer to a staging area for cleansing, consolidation, transformations, and reorganization in preparation for loading into the [...]]]></description>
			<content:encoded><![CDATA[<p>What is now generally referred to as “data integration” is a set of disciplines that have evolved from the methods used for populating the data systems powering business intelligence: extracting data from one or more operational systems, transferring it to a staging area for cleansing, consolidation, transformation, and reorganization, and then loading it into the target data warehouse. This process is usually referred to as ETL: extraction, transformation, and loading.</p>
<p>In the early days of data warehousing, the ETL scripts were, as one might politely say, “hand-crafted.” More colloquially, each script was custom-coded in relation to the originating source, the transformation tasks to be applied, and then the consolidation, integration, and loading. And despite the evolution of rule-driven and metadata-driven ETL tools that automate the development of ETL scripts, much time has been spent writing (and rewriting) data integration scripts to extract data from different sources, apply transformations, and then load the results into a target data warehouse or an analytical appliance.<span id="more-368"></span></p>
<p>As long as the data sources remain static, the existing ETL scripts are sufficient for loading data into a target system, and as long as sufficient testing has been applied to those scripts, they are probably trustworthy. But it is rare to find an environment without ongoing, if not continuous, change, and as the origins, volumes, and variety of source data sets grow, so does the complexity of the transformations. This can lead to an unpleasant situation: long-trusted ETL processes incrementally begin to generate unexpected results.</p>
<p>It is worth considering some of the reasons that long-time production ETL processes generate unexpected results. In our consulting practice, we have seen numerous environmental changes that have ramifications for existing ETL processes, such as:</p>
<ul>
<li>Changes in source data structures – An update to a data set changes the data type of one or more data elements, adds or removes data elements from the table, or realigns the relational structure, but the ETL scripts are not modified to accommodate the change in the source. In one example, one data element’s data type was changed to accommodate a data exchange requirement, which inadvertently caused one of the data transformations to execute incorrectly.</li>
<li>Changes in source data semantics – Adjustments to the meaning of source data elements should have implications for interpretation of the value, and consequently, the transformations applied, but this is not communicated to the ETL development team. These types of changes often lie hidden for some time until an incorrect value leads to questions about data warehouse results.</li>
<li>Changes in reference data sets – Addition or removal of reference data values may have implications for existing transformations or may require the creation of new code for proper transformation. We have seen numerous if-then-else and case statements in code that are triggered off hard-coded reference data values. Unless the programmer is notified, there would be no way to determine that an update to the code is required.</li>
<li>Introduction of unsanctioned or ungoverned steps in the production flow – Often, in order to expedite the delivery of a new report or analysis, developers bypass the standard system development lifecycle stages and introduce data dependencies that are not recognized as part of a production process. Because of this, failures in the materialization of the dependent data sets are not flagged.</li>
</ul>
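<p>The first and third items in the list above lend themselves to a simple pre-flight check. The following is a minimal sketch in Python, not a description of any particular ETL product: the <code>Column</code> type, the function names, and the sample column names are all hypothetical. It compares the structure the ETL scripts were written against with the structure the source currently exposes, and flags reference values that the hard-coded transformation logic does not know about.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical description of one source column."""
    name: str
    dtype: str


def schema_drift(expected, actual):
    """Return readable differences between the expected and actual column lists."""
    exp = {c.name: c.dtype for c in expected}
    act = {c.name: c.dtype for c in actual}
    issues = []
    for name in sorted(exp.keys() - act.keys()):
        issues.append(f"removed column: {name}")
    for name in sorted(act.keys() - exp.keys()):
        issues.append(f"added column: {name}")
    for name in sorted(exp.keys() & act.keys()):
        if exp[name] != act[name]:
            issues.append(f"type changed: {name} {exp[name]} -> {act[name]}")
    return issues


def unknown_reference_values(values, known_codes):
    """Return values that the hard-coded transformation logic does not cover."""
    return sorted(set(values) - set(known_codes))


if __name__ == "__main__":
    expected = [Column("cust_id", "int"), Column("state", "char(2)"), Column("balance", "decimal")]
    actual = [Column("cust_id", "varchar"), Column("state", "char(2)"), Column("region", "char(3)")]
    for issue in schema_drift(expected, actual):
        print(issue)
    print(unknown_reference_values(["A", "B", "X"], known_codes={"A", "B", "C"}))
```

<p>Running a check like this before each load turns a silent structural or reference data change into an explicit alert for the ETL development team.</p>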
<p>This last bullet item is the most prevalent, and perhaps the most insidious. For example, in one organization, we saw hand-extracted desktop spreadsheets loaded into a collaboration and sharing tool, from which the file was then downloaded, transformed, and then loaded into a data warehouse. In another example, a sequence of process steps involved dropping transformed files in target directories that were to be “swept” by another process and loaded into a target data mart. However, if one of the early steps failed, the previous day’s files were not removed from the target directory, and the sweep process picked up those same files and loaded them for a second time into the data mart. In both of these cases, failures in some part of an ungoverned process could ultimately impact the transformation and loading of data into the data warehouse.</p>
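<p>The “sweep” failure described above can be guarded against with a freshness check. This is a minimal sketch under stated assumptions: the directory layout, the file names, and the idea of recording a run start time are illustrative and were not part of the client environment. The sweep step only accepts files written since the current run began and reports anything older as stale.</p>

```python
import os
import tempfile
import time
from pathlib import Path


def files_safe_to_load(target_dir, run_started_at):
    """Split files into those written during this run and stale leftovers."""
    fresh, stale = [], []
    for path in sorted(Path(target_dir).iterdir()):
        if not path.is_file():
            continue
        if path.stat().st_mtime >= run_started_at:
            fresh.append(path.name)
        else:
            stale.append(path.name)
    return fresh, stale


def demo():
    """Simulate one leftover file from yesterday and one file from today's run."""
    with tempfile.TemporaryDirectory() as directory:
        now = time.time()
        old_file = Path(directory) / "orders_old.csv"
        new_file = Path(directory) / "orders_new.csv"
        old_file.write_text("leftover from a failed run")
        new_file.write_text("written by the current run")
        os.utime(old_file, (now - 86400, now - 86400))  # pretend it is a day old
        os.utime(new_file, (now, now))
        return files_safe_to_load(directory, run_started_at=now - 60)


if __name__ == "__main__":
    fresh, stale = demo()
    print("load:", fresh)
    print("flag as stale:", stale)
```

<p>Modification times are only one possible signal; a run identifier embedded in the file name or a manifest file would serve the same purpose.</p>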
<p>The challenge, though, is that after the ETL scripts are tested and put into production, they are typically trusted to do what they were originally designed to do. So instead of continuous rigorous testing, verification of the ETL results is limited to sampling, parity checks (such as reasonableness with respect to aggregate sums or averages), or even manual review and scanning of target values.</p>
<p>Yet the types of scenarios listed earlier occur with some frequency. Therefore there is a need for continuous monitoring of existing ETL activities to ensure that over time there is no degradation in ETL fidelity. But similar to the situation described in my previous discussion of production data reconciliation, parity checks, spot checks, and manual review are going to be insufficient and patently non-scalable, especially as data volumes grow and the variety of originating sources expands.</p>
<p>This suggests the need for a more comprehensive approach to ETL validation and verification that is:</p>
<ul>
<li>Comprehensive in its ability to provide validation of the entire spectrum of data sets conveyed from the source to the target;</li>
<li>Scalable to enable completeness of validation of all the data values;</li>
<li>Automatable so that the validation is not dependent on allocation of staff resources for manual review; and</li>
<li>Reflective of business rules, especially when there are transformations directly embedded in the ETL processes.</li>
</ul>
<p>But mostly, this automated solution must be integrated within a stewardship/governance framework that alerts data practitioners and ETL developers when a violation of expectations is identified. This is especially effective when the data rules used for validation can be applied at different stages of the ETL process flow, since that enables the data stewards to determine the point at which a flawed value was introduced, allowing them to isolate the root cause of the unexpected results. The result is the ability to eliminate the source of the problem, instead of relying on downstream data corrections that will ultimately lead to inconsistencies with the original sources.</p>
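<p>Applying the same rules at successive stages is what makes root cause isolation possible. The sketch below is illustrative only; the stage names, rule names, and record layout are assumptions. It evaluates every rule against a snapshot of the same record at each stage and reports the earliest stage at which a violation appears.</p>

```python
def first_failing_stage(record_by_stage, rules):
    """Return (stage, failed_rule_names) for the earliest violation, or None.

    record_by_stage: ordered list of (stage_name, record_dict) snapshots.
    rules: dict mapping a rule name to a predicate that returns True when valid.
    """
    for stage, record in record_by_stage:
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            return stage, failed
    return None


# Hypothetical business rules for a payment record.
RULES = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "currency_is_three_letters": lambda r: len(r["currency"]) == 3,
}

if __name__ == "__main__":
    snapshots = [
        ("extract", {"amount": 120.0, "currency": "USD"}),
        ("transform", {"amount": -120.0, "currency": "USD"}),
        ("load", {"amount": -120.0, "currency": "USD"}),
    ]
    print(first_failing_stage(snapshots, RULES))
```

<p>In this example the value is valid at extraction and invalid from the transformation stage onward, so the steward knows to examine the transformation logic instead of correcting the value downstream.</p>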
<p>In essence, an approach that automates ETL verification via validation of defined business rules will enable a level of trust when ETL processes generate the appropriate results, yet provide immediate alerts and feedback when changes in the environment allow errors and flaws to sneak into the target systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=368</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>iOS6, Apple Maps, and the Biggest Data Quality Story This Year</title>
		<link>http://dataqualitybook.com/?p=366</link>
		<comments>http://dataqualitybook.com/?p=366#comments</comments>
		<pubDate>Fri, 28 Sep 2012 14:11:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=366</guid>
		<description><![CDATA[As anticipated, as part of Apple’s recent release of iOS6, the incumbent Google Maps application was replaced by Apple’s homegrown version. Excitement has quickly degenerated into disappointment (at best) and anger (at worst) over the flaws in Apple’s version. And as of this morning, a quick scan at Google News reported almost 2000 articles reflecting [...]]]></description>
			<content:encoded><![CDATA[<p>As anticipated, as part of Apple’s recent release of iOS6, the incumbent Google Maps application was replaced by Apple’s homegrown version. Excitement has quickly degenerated into disappointment (at best) and anger (at worst) over the flaws in Apple’s version. And as of this morning, a quick scan of Google News reported almost 2000 articles reflecting Apple’s <em>mea culpa </em>culminating with a <a href="http://www.usatoday.com/tech/story/2012/09/28/apple-ceo-apologizes-for-maps-flaws/57850850/1">personal message from CEO Tim Cook</a> stating:</p>
<p><em>&#8220;At Apple, we strive to make world-class products that deliver the best experience possible to our customers. With the launch of our new Maps last week, we fell short on this commitment. We are extremely sorry for the frustration this has caused our customers and we are doing everything we can to make Maps better.&#8221;</em></p>
<p>It looks like quite a firestorm over what is basically a data quality issue&#8230;<span id="more-366"></span></p>
<p>I was a devoted fan of the Google Maps application, and after taking a quick look at the Apple app this morning, I was surprised to find that I lived practically next door to the “Hampshire-Langley Shopping Center” (actually it is about 6 miles away), that Our Lady of Good Counsel High School was up the street (it moved from Wheaton to Olney in 2006), that the Torah School is next to the 7-11 (actually, it is miles away as well), and that in fact many of my familiar businesses just don’t show up at all.</p>
<p>The outrage over the flaws in the app is quite telling about our dependence on quality information, especially in the context of location. I have written many pieces about the value of location, and the fact that the iPhone Maps app is basically unusable highlights the degree to which poor data quality has significant impacts:</p>
<ul>
<li>Wasted time and rework in finding alternatives to the app to get the right locations and directions</li>
<li>Significant brand risk and degeneration</li>
<li>Warranty costs to scramble to fix the app</li>
</ul>
<p>Here is a big one: when iOS 6 was released, Apple’s stock price hovered around $700 per share; on September 27, the stock closed near $675 per share. According to my back-of-the-envelope calculation, this corresponds to a loss in Apple’s market capitalization of approximately $23 billion. Is this all attributable to the Maps debacle? Probably not all of it, but I believe some of the value of the company evaporated as a result of this source of poor data quality.</p>
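<p>The arithmetic behind that estimate can be laid out explicitly. The post does not state the share count used, so the sketch below works backward from the three figures that are given; the resulting share count is an inference from those figures, not a number taken from Apple’s filings.</p>

```python
# Figures stated in the post.
price_at_release = 700.0   # approximate share price when iOS 6 was released
price_sept_27 = 675.0      # approximate closing price on September 27
stated_cap_loss = 23e9     # approximate loss in market capitalization

# Derived quantities.
drop_per_share = price_at_release - price_sept_27
pct_drop = drop_per_share / price_at_release * 100

# Share count implied by the stated figures (an inference, not a reported number).
implied_shares = stated_cap_loss / drop_per_share

print(f"drop per share: ${drop_per_share:.2f}")
print(f"percentage drop: {pct_drop:.2f}%")
print(f"implied shares outstanding: {implied_shares:,.0f}")
```

<p>In other words, a $25 decline is a drop of a little under 4 percent, and a $23 billion loss at that decline implies a share count a little under one billion.</p>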
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=366</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using Data Integration Testing for Reconciling Production Data Assets</title>
		<link>http://dataqualitybook.com/?p=362</link>
		<comments>http://dataqualitybook.com/?p=362#comments</comments>
		<pubDate>Fri, 14 Sep 2012 13:41:51 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[data consistency]]></category>
		<category><![CDATA[data integration]]></category>
		<category><![CDATA[data migration]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=362</guid>
		<description><![CDATA[In my last post, we started to discuss the need for fundamental processes and tools for institutionalizing data testing. While the software development practice has embraced testing as a critical gating factor for the release of newly developed capabilities, this testing often centers on functionality, sometimes to the exclusion of a broad-based survey of the [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://dataqualitybook.com/?p=358">last post</a>, we started to discuss the need for fundamental processes and tools for institutionalizing data testing. While the software development practice has embraced testing as a critical gating factor for the release of newly developed capabilities, this testing often centers on functionality, sometimes to the exclusion of a broad-based survey of the underlying data asset to ensure that values did not (or would not) incorrectly change as a result.</p>
<p>In fact, the need for testing existing production data assets goes beyond the scope of newly developed software. Modifications are constantly applied within an organization – acquired applications are upgraded, internal operating environments are enhanced and updated, additional functionality is turned on and deployed, hardware systems are swapped out and in, and internal processes may change. Yet there are limitations in effectively verifying that interoperable components that create, touch, or modify data are not impacted. The challenge of maintaining consistency across the application infrastructure can be daunting, let alone assuring consistency in the information results.<span id="more-362"></span></p>
<p>This presents an opportunity to consider the first of our use cases for data integration testing: assuring consistent performance with respect to existing production data assets.</p>
<p>The core theme is straightforward: defining a set of rules that specify consistency between two copies of what should be the same thing. As a simple example, your organization decides to transition from one vendor’s database environment to a different vendor’s database system. This requires a data migration, involving an extraction of the data from the current environment and loading that same data into the new environment. It is a no-brainer to expect that the data in the new target should be identical to the data in the soon-to-be-deprecated source.</p>
<p>But even in places that are aware of the need to do data testing, the practices in place are often uncoordinated and rely on manual efforts, data sampling, or customized SQL queries and extracts for comparison. The biggest risks of the uncoordinated approach fall into three categories:</p>
<ul>
<li>Resource-constrained – These customized methods are expensive to assemble, requiring dedicated staff to pull together the right queries and extracts. Yet the completeness of the tasks is limited when the data sizes are large (or increasingly, massive) and the resources used for testing (like desktop spreadsheets) have artificial limitations. The result is that the testing itself is unlikely to be thorough, and may miss both general and edge-case expectations.</li>
<li>Non-repeatable – When tests are specific to a single instance, there is no ability to rely on existing corporate knowledge that might imply curious dependencies. The result is a need for rediscovery of all latent business rules for validity, and as a result there is little or no reuse of existing testing protocols.</li>
<li>Ungoverned – The absence of a general repeatable methodology not only leaves the door open for the introduction of errors while testing; its lack of a logging capacity to track what tests are performed and the corresponding results (and associated remediation tasks) also exposes the migration to errors that often stay hidden until much later, when the impacts are much more insidious. No audit trail means that for any future issues, the data practitioners must basically start from scratch to figure out where the issue was originally introduced.</li>
</ul>
<p>The two biggest issues, in my opinion, are the need for scalability and coverage, and the need for oversight and governance. Both of these can be addressed through data testing automation.</p>
<p>The first step involves accepting the idea that <strong>all</strong> of the data needs to be reviewed for a production data migration or exchange. Sampling is just not enough, despite what the statisticians tell you. Therefore, defining a set of rules for consistency between the source and the target should be a practical task, which may be simplified if the data models are also identical. However, most migrations also allow for some innovation in the underlying models to address any modernization or reengineering implied by the transition to a new environment. That means adding rules for the structural changes that still verify a consistent move. The increasing size of production data sets means that all corresponding data instances must be compared in a scalable way, and while the monotony of manual queries and review may lull your analysts into a mind-numbing stupor, rule-based automated testing will look at all the data and reduce the manual review to discrepancies that can be effectively linked to defined rules.</p>
<p>The second step builds on the first to address oversight and governance, largely manifested in traceability of applied rules, logging of results, versioning of defined rules to monitor newly discovered consistency issues that must be checked, and history (as well as trending for reviewing consecutive iterations of the data movement, if necessary). Providing an audit trail that tracks consistency and validity by specific business rule establishes a record that can later be used to isolate latent data issues as they crop up.</p>
<p>A database migration is just one use case; there are many other variations on the same theme, such as data movement from source systems into operational data stores, movements into alternate analytics platforms (such as Hadoop), or data exchanges. Defining the consistency rules should be straightforward (largely a question of verifying exact equality), but may include some levels of complexity when there are data format or type changes.</p>
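<p>For the exact-equality case, a full comparison does not require eyeballing anything. The following minimal sketch assumes that both data sets fit in memory as lists of dictionaries and share a unique key column; the function and column names are hypothetical. It hashes each row and reports keys that are missing, unexpected, or mismatched in the target.</p>

```python
import hashlib


def row_digest(row, columns):
    """Hash the listed column values of one row into a comparable digest."""
    joined = "\x1f".join(str(row[c]) for c in columns)  # unit separator between fields
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key, columns):
    """Compare every source row with its target counterpart by key."""
    src = {r[key]: row_digest(r, columns) for r in source_rows}
    tgt = {r[key]: row_digest(r, columns) for r in target_rows}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


if __name__ == "__main__":
    source = [
        {"id": 1, "name": "Ann", "balance": "10.00"},
        {"id": 2, "name": "Bob", "balance": "25.50"},
        {"id": 3, "name": "Cy", "balance": "0.00"},
    ]
    target = [
        {"id": 1, "name": "Ann", "balance": "10.00"},
        {"id": 2, "name": "Bob", "balance": "25.05"},
        {"id": 4, "name": "Di", "balance": "7.00"},
    ]
    print(reconcile(source, target, key="id", columns=["name", "balance"]))
```

<p>Because values are converted to strings before hashing, a format or type change between the two platforms (for example, <code>10.0</code> versus <code>10.00</code>) will surface as a mismatch, which is where an explicit normalization rule would need to be added.</p>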
<p>Lastly, remember that this is production data – the data you run your business on. You don’t want to mess with that! Defined business rules and automated data validation can be used to identify unexpected production data results, and those same rules can help a data analyst find the root cause of the inconsistencies and remediate them.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=362</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Best Practices for Data Integration Testing Series &#8211; Instituting Good Practices for Data Testing</title>
		<link>http://dataqualitybook.com/?p=358</link>
		<comments>http://dataqualitybook.com/?p=358#comments</comments>
		<pubDate>Fri, 03 Aug 2012 19:48:12 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Metadata]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=358</guid>
		<description><![CDATA[I have been asked by folks at Informatica to share some thoughts about best practices for data integration, and this is the first of a series on data testing. It is rare, if not impossible, to develop software that is completely free of errors and bugs. Early in my career as a software engineer, I [...]]]></description>
			<content:encoded><![CDATA[<p>I have been asked by folks at Informatica to share some thoughts about best practices for data integration, and this is the first of a series on data testing.</p>
<p>It is rare, if not impossible, to develop software that is completely free of errors and bugs. Early in my career as a software engineer, I spent a significant amount of time on “bug duty” – the task of looking at the list of reported product errors and evaluating them one-by-one to try to identify the cause of the bug and then come up with a plan for correcting the program so that the application error is eliminated. And the software development process is one that, over time, has been the subject of significant scrutiny in relation to product quality assurance.</p>
<p>In fact, the state of software quality and testing is quite mature. Well-defined processes have been accepted as general best practices, and there are organized methods for evaluating software quality methodology capabilities and maturity. Yet when all applications are a combination of programs applied to input data to generate output information, it is curious that the testing practices for data integration and sharing remain largely ungoverned manual procedures.<span id="more-358"></span></p>
<p>As we move into a more data-centric world, the fact that many organizations are still relying on manual data reviews or spot-checking is puzzling at best, and risky at worst. Manual data testing takes many different forms, such as eyeballing (scanning through record sets looking for anomalies), hand-coded SQL scripts to look for specific known issues, dumping data into desktop spreadsheets and doing file diffs, or even coding scripts and programs for each data set to do validation. These artifacts reflect an absence of process and methodology; they are not reusable, may not provide complete verification, are not auditable, and worse yet, may themselves be prone to errors.</p>
<p>As we hear more and more about newer technologies for big data, analytical appliances, data repurposing, or text analytics (to name a few), it is uncommon for an enterprise to have an organized framework for assuring that the data assets modified or created by <em>any</em> process are consistent with end-user expectations, or even that the data was not inadvertently modified as a result of a process failure somewhere along the line. More to the point: we need to develop mature processes and tools for automated, repeatable and auditable data testing that can be applied in general practice.</p>
<p>To tell the truth, that is what motivated my transition from being a software engineer to being a data practitioner: the desire to formulate methods and tools for automatically testing data validity. The entire premise of <a href="http://www.amazon.com/gp/product/0124558402/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0124558402&amp;linkCode=as2&amp;tag=wwwknowledgei-20">my first book on data quality</a> was predicated on the idea that one could define sets of business rules for data that could automatically be integrated into a rules engine for continuous data quality assurance.</p>
<p>Today, many tools provide some level of this capability, but it is valuable to consider the same ideas in a number of specific use cases for data reuse, sharing, exchange, or integration (which I will explore in upcoming notes over the next few months). In particular, consider the specific application of three key practices that can supplement and enhance your organization’s level of trust in shared and repurposed data while driving operational aspects of data stewardship and governance:</p>
<ul>
<li><strong><em>Alignment of enterprise data rules:</em></strong> Application developers only look at the specifics of their program’s transformations of input to output within the confines of the immediate application requirements, but don’t consider holistic expectations for downstream data use. Yet there must be some assurance to the consumers of the application’s output that the data has not been warped in any way, shape, or form. This is particularly true for extract/transform/load (ETL) processes, in which there must be some verification that the transformations were applied appropriately. To enable any kind of data testing, there is a need for defining and managing business data rules that characterize end-user expectations for integrated data.</li>
<li><strong><em>Instituting data controls:</em></strong> One cannot verify that data sets remain consistent and valid without introducing data controls at various points along the data integration pathways. Data controls can examine consistency and validity at different levels of granularity. An example of a coarse-grained control compares the number of records in a data file before and after the data integration task, while more finely-grained controls will verify that the business data rules are applied correctly at the record or even the data value level.</li>
<li><strong><em>Automated testing:</em></strong> This third practice is automating the data controls for validation. As the volumes of data absorbed, manipulated, integrated, and shared across the enterprise grow, eyeballing and manual reviews of (pseudo-)randomly selected data instances will prove to be non-scalable and unsatisfactory.</li>
</ul>
<p>Employing our first practice of defined business data rules within our third practice using an automated data testing harness enables the broad deployment of consistent data controls for verification of expectations as well as identifying and reporting on any violations.</p>
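<p>The difference between coarse-grained and fine-grained controls is easy to see in a small example. This is a sketch only, with hypothetical rule and field names: the record count control passes because no records were lost, while the record-level control catches a value that violates a business data rule.</p>

```python
def record_count_control(before, after):
    """Coarse-grained control: the task must not gain or lose records."""
    return len(before) == len(after)


def record_level_control(records, rules):
    """Fine-grained control: return (record_index, rule_name) for each violation."""
    violations = []
    for index, record in enumerate(records):
        for name, check in rules.items():
            if not check(record):
                violations.append((index, name))
    return violations


# Hypothetical business data rules characterizing end-user expectations.
RULES = {
    "state_is_two_letters": lambda r: len(r["state"]) == 2,
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}

if __name__ == "__main__":
    before = [{"state": "MD", "age": 41}, {"state": "VA", "age": 67}]
    after = [{"state": "MD", "age": 41}, {"state": "Virginia", "age": 67}]
    print(record_count_control(before, after))
    print(record_level_control(after, RULES))
```

<p>The count check alone would have reported success here, which is why the coarse-grained control is a useful first gate but not a substitute for rule-based validation of every record.</p>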
<p>There are basically two abstract use cases that can benefit from automated data testing. The <em>first</em> is when input data sets are transformed into output data sets, and verification is required to ensure that the transformations were correctly applied, such as when ETL processes are updated or modified, or when new rules are added to a data integration process. The <em>second</em> is where the data sets need to remain identical, such as the data migrations necessary as legacy applications are retired and replaced with new systems.</p>
<p>A relatively good overview is provided <a href="http://vip.informatica.com/?elqPURLPage=10368&amp;BK=DVOAUGRPLY-DL">in this recorded webinar</a> from <a href="http://www.informatica.com/us/products/enterprise-data-integration/powercenter/options/data-validation-option/">Informatica</a>. In addition, as part of this series, I will be participating with Informatica in a number of short webinars over the next months to discuss some of the different use cases and environments in which automated data testing can provide benefits for data validation and verification, especially as heterogeneous data volumes increase in size and complexity.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=358</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Response to &#8220;Eight Problems with Big Data&#8221;</title>
		<link>http://dataqualitybook.com/?p=352</link>
		<comments>http://dataqualitybook.com/?p=352#comments</comments>
		<pubDate>Thu, 26 Apr 2012 14:11:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Business Impacts]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[KDD]]></category>
		<category><![CDATA[pregnancy]]></category>
		<category><![CDATA[pregnant]]></category>
		<category><![CDATA[target]]></category>

		<guid isPermaLink="false">http://dataqualitybook.com/?p=352</guid>
		<description><![CDATA[After reading Jay Stanley’s ACLU article on “Eight Problems with Big Data,” it is worth reflecting on what could be construed as a fear-mongering indictment of the use of big data analytics and the implication that big data analytics and its implementation of data mining algorithms are tantamount to all-out invasion of privacy. What is [...]]]></description>
			<content:encoded><![CDATA[<p>After reading Jay Stanley’s ACLU article on “<a href="https://www.aclu.org/blog/technology-and-liberty/eight-problems-big-data">Eight Problems with Big Data</a>,” it is worth reflecting on what could be construed as a fear-mongering indictment of the use of big data analytics and the implication that big data analytics and its implementation of data mining algorithms are tantamount to all-out invasion of privacy. What is interesting, though, is the presumption that privacy advocates have been “grappling” with data mining since “not long after 9/11,” yet data mining was already quite a mature discipline by that point in time, as was the general use of customer data for marketing, sales, and other business purposes. Raising an alarm about “big data” and “data mining” today is akin to shutting the barn door decades after the horses have bolted.<span id="more-352"></span></p>
<p>However, to be fair it is worth reviewing some of the author’s points, all of which I am directly quoting from the article:</p>
<ol>
<li><strong>1.    </strong><strong>“It incentivizes more collection of data and longer retention of it. If any and all data sets might turn out to prove useful for discovering some obscure but valuable correlation, you might as well collect it and hold on to it. In long run, the more useful big data proves to be, the stronger this incentivizing effect will be – but in the short run it almost doesn’t matter; the current buzz over the idea is enough to do the trick.”</strong></li>
</ol>
<p>Of course, there is the attitude that more collection of data and longer retention might turn out to be of value, yet without a crystal ball to tell you exactly <em>what </em>data is going to be useful to you in the future, you’d have to store <em>all the data</em>. A case in point: one of our (social services) customers expressed regret that they only began to track certain pieces of information long after their systems were initially put in place. The missing data created scenarios in which benefits might be provided to people who did not truly qualify, because the gap allowed people to game the system. But because the agency could not have known, when the application was deployed, that it would need that data, those information gaps remain.</p>
<p>And even if organizations do feel incentivized to collect and retain data, few of these organizations have ironed out the gaps in their own process maturity to be effective at extracting much value. In the foreseeable future, the costs and efforts for capturing, managing, archiving, and ultimately, trying to <em>even find the data they need</em> are going to far exceed the value most businesses can derive.</p>
<ol>
<li><strong>2.    </strong><strong>“When you combine someone’s personal information with vast external data sets, you create new facts about that person (such as the fact that they’re pregnant, or are showing early signs of Parkinson’s disease, or are unconsciously drawn toward products that are colored red or purple). And when it comes to such facts, a person a) might not want the data owner to know b) might not want anyone to know c) might not even know themselves. The fact is, humans like to control what other people do and do not know about them – that’s the core of what privacy is, and data mining threatens to violate that principle.”</strong></li>
</ol>
<p>I feel compelled to push back on this one on a number of fronts. First of all, data analysis does not “create facts” – the discovery that a person exhibits purchasing behaviors that are consistent with pregnancy does not make that person pregnant (at least according to what I learned in biology class). Granted, though, there may be situations in which people will mistake analytical results for facts; however, that can happen with any inference, such as guessing that a celebrity is pregnant based on the exposure of her bump in the tabloids.</p>
<p>On the second front, most interactions (business or otherwise) involve an exchange of information. People who want to control what other people know and do not know about them must be aware of this fact. At the same time, each of us derives some value from the interaction. Instead of suggesting that people need to control information, a better approach might be to suggest that those people must weigh the value they get from the interaction against the cost of exposing the data.</p>
<p>For example, when you search for information using a search engine, you benefit by getting access to the information you were looking for. On the other hand, the search engine has to capture what you are looking for in order to help you, and the search histories help refine the algorithms and make the results more precise. So you get some information in return for providing some information. It is up to you to decide whether you got a good deal or not.</p>
<p>And again, in some situations big data analysis improves the availability of data that can help inform a decision about what is (or is not) a fact. For example, large-scale analysis of clinical health care information can help a provider reach a more precise (and hopefully more accurate) diagnosis, and can support effectiveness research that identifies the treatment choices that have been most successful under similar circumstances.</p>
<ol>
<li><strong>3.    </strong><strong>“Many (perhaps most) people are not aware of how much information is being collected (for example, that stores are tracking their purchases over time), let alone how it is being used (scrutinized for insights into their lives).  The fact that Target goes to considerable trouble to hide its knowledge from its customers tells you all you need to know on that front.”</strong></li>
</ol>
<p>Actually, stores and many other businesses have been collecting information for decades, and much of that data is willingly offered because the information provider expects to get some benefit in return. Supermarket loyalty cards, frequent flyer programs, product registration forms, even that drawing you entered when you filled out an entry form and dropped it in the box at the mall – each of these presents some opportunity for value in return for information. And when that data is used for insights that are beneficial, many people don’t have a problem with it. For example, my kids’ English teachers suggest looking up books they liked on Amazon and scanning the other suggested titles to find books that they might enjoy.</p>
<p>In many cases, the data is offered without any expectation – consider those who willingly allowed ratings companies to monitor their television viewing choices, or who regularly respond to form or telephone surveys. The point is that if you willingly offer information without any constraints or limitations, don’t be surprised when that data is used.</p>
<ol>
<li><strong>4.    </strong><strong>“Big data can further tilt the playing field toward big institutions and away from individuals. In economic terms, it accentuates the information asymmetries of big companies over other economic actors and allows for people to be manipulated. If a store can gain insight into just how badly I want to buy something, just how much I can afford to pay for it, just how knowledgeable I am about the marketplace, or the best way to scare me into buying it, it can extract the maximum profit from me.”</strong></li>
</ol>
<p>All sales processes are driven by the need for the sales person to influence the decision of the buyer, and therefore you could extrapolate that in every sales situation the buyer is being “manipulated.” The use of big data analytics does not change the core sales process, it merely informs the salesperson. In addition, the suggestion that a buyer is subject to manipulation is actually somewhat insulting to the buyer, who is also getting something out of the transaction, not just forking over the money.</p>
<p>If a company has a lot of insight but provides bad products, poor customer service, or limited warranty protection for their products or services, customers will still go to other places to gets the things they want. On the other hand, if a company uses customer insight to provide better products and services, stock the kinds of products the customers want, and engages the customer in a relationship, customers will go there, even without the data mining.</p>
<p>Lastly, companies are usually in the business of generating profits, so it is disingenuous to fault them for wanting to extract the maximum profit. In fact, analyzing customer sensitivity to product pricing might show that <em>lowering</em> certain product prices would increase sales volume, leading to greater profits. In this case they are “extracting the maximum profit,” but not necessarily through increased prices.</p>
<ol>
<li><strong>5.    </strong><strong>“It holds the potential to accentuate power differentials among individuals in society by amplifying existing advantages and disadvantages. Those who are savvy and well educated may get improved treatment from companies and government – while those who are poor, underprivileged, and perhaps already have some strikes against them in life (such as a criminal record) will be easily identified, and treated worse. In that way data mining may increase social stratification.”</strong></li>
</ol>
<p>It is not clear what big data has to do with this; even without data mining, those who are savvy and well educated may get improved treatment as a result of their savviness and education – they may make themselves better informed, or ask better questions. Those with a criminal record won’t be able to hide that either, since it is certainly legal to perform a criminal background check under defined situations. Again, that has nothing to do with big data and has a lot to do with the Bill of Rights, in retrospect (“habeas corpus,” anyone?).</p>
<ol>
<li><strong>6.    </strong><strong>“Data mining can be used for so-called “risk analysis” in ways that treat people unfairly and often capriciously – for example, by insurance companies or banks to approve or deny applications. Credit card companies sometimes <a href="http://abcnews.go.com/GMA/TheLaw/gma-answers-credit-card-companies-financially-profiling-customers/story?id=6747461&amp;singlePage=true">lower a customer’s credit limit</a> based on the repayment history of the <em>other customers</em> of stores where a person shops. Such “behavioral scoring” is a form of economic guilt-by-association based on making statistical inferences about a person that go far beyond anything that person can control or be aware of.”</strong></li>
</ol>
<p>An organization has a fiduciary responsibility to attempt to limit its risk, especially when it comes to offering credit or lending money. The models for risk analysis do (and probably should) take behavioral characteristics into account. For example, parachuting out of airplanes probably affects your life insurance premium. And you don’t need data mining to figure that out, but rather the standard statistical and probability analysis that actuaries have been doing for a long time.</p>
<p>Identifying the characteristics of a pool of individuals that increase the risk of defaulting on credit payments is a way of (1) protecting the corporate self-interest and also (2) protecting the interests of those who do not default on their payments but whose interest rates are raised to cover loss and fraud.</p>
<p>Let’s put this concept into the perspective of the recent worldwide financial crisis that was, at its core, driven by unscrupulous practices in lending money to individuals who essentially could not afford to pay it back. One of the root causes? Simplified somewhat, subprime lending to individuals with low documentation or no documentation on their applications, coupled with the obfuscation of the real risks of default as the mortgage pools were continually mashed together, reconstructed, and re-rated by the ratings agencies, eventually led to the 2008-2009 blowup. Perhaps in this case a little <em>more</em> data might have helped?</p>
<ol>
<li><strong>7.    </strong><strong>“Its use by law enforcement raises even sharper issues – and when our national security agencies start using it to try to spot terrorists, those stakes can get even more serious. We know too little about how our security agencies are using Big Data, but such approaches have been discussed since the days of the <a href="http://www.aclu.org/technology-and-liberty/data-mining">Total Information Awareness</a> program and before – and there is strong evidence that it’s being used by the NSA to sift through the vast volumes of communications that agency collects.  The threat here is that people will be tagged and suffer adverse consequences without due process, the ability to fight back, or even knowledge that they have been discriminated against. The threat of bad effects is magnified by the fact that data mining is <a href="http://www.nap.edu/catalog.php?record_id=12452">so ineffective</a> at spotting true terrorists.”</strong></li>
</ol>
<p>I recall reading some papers after 9/11 showing that analysis of data about the hijackers revealed a high degree of relationship among them, implying that had we done the analysis, they could have been stopped. I actually looked back at some writing by author Malcolm Gladwell from early 2003 that refers to “creeping determinism,” a term he notes was coined three decades earlier by Baruch Fischhoff, to refer to “the sense that grows upon us, in retrospect, that what has happened was actually inevitable.” To some degree, the fear of big data emanates from the thought that government agencies are able to predict who is or who is not a terrorist.</p>
<p>First of all, if this is true, then where were all the analysts on September 10<sup>th</sup>? The gap in terrorism prediction is probably a byproduct of the sheer volume of data. You have so much data that it is hard to know where to look. In the past, I have suggested that as opposed to the common metaphor of looking for a needle in a haystack, it is akin to looking for a <em>specific</em> needle in a huge field filled with stacks of needles.</p>
<p>I will grant the author his point here, but the failure lies not with the data but rather with the people and the elected representatives they support, who rushed through legislation such as the PATRIOT Act that effectively abdicated the right to privacy. The widespread support for yielding liberties as a way to save the country from terrorism is probably what Benjamin Franklin had in mind when he said “<em>Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety.</em>”</p>
<p>On the other hand, there are situations in which security agencies have used their data analysis to spot risky situations and true terrorists. The problem for people like you and me is that since the catastrophes are prevented, we’ll never know…</p>
<ol>
<li><strong>8.    </strong><strong>“Over time such consequences will lead to chilling effects, as people become more reluctant to engage in any behaviors that will put them under the macroscope (more about that in a future post).”</strong></li>
</ol>
<p>Actually, perhaps when people realize the degree to which they have willingly compromised their own private information for little return, they <span style="text-decoration: underline;">should</span> become more reluctant to engage in those behaviors.</p>
<p>In summary, while many of the concerns are valid, they are orthogonal to the questions of using big data and data mining. The risks lie not in the bigness of the data, but rather in the responsibility and accountability of those using the data. Are the companies using data mining ensuring that the source data sets are of measurable quality that is suitable for the analyses? Do the consumers of the results of the data mining believe that the results are “factual” or that they are informative? Are they drawing correct or incorrect conclusions? Do they have the corporate maturity to integrate the analytical results to benefit the consumer communities? Do they publish their data use practices? Do they allow for individuals to opt out?</p>
]]></content:encoded>
			<wfw:commentRss>http://dataqualitybook.com/?feed=rss2&#038;p=352</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
