Predictive analytics tells you what will happen; prescriptive analytics tells you what to do about it.
Decision Support and Analytics has traditionally addressed Descriptive Analytics and Predictive Analytics. Jeff Bertolucci highlights prescriptive analytics, a domain founded on the methods of Operations Research and called by IBM “the final phase” in business analytics.
In an article posted to Information Age on 18 February 2013, Teradata CTO Stephen Brobst highlights the schism between traditional Decision Support and the new-age Big Data camp, noting that at a recent Stanford University very-large-database conference “The Hadoop guys were saying, ‘relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce.’” (Swabey, 2013). The mention of Hadoop by name is telling, and the technology’s ascent has been striking: it progressed from initial release to core services in multinational platforms in less than six years (Hadoop Releases, 2013) and now represents the lion’s share of the commercial Big Data marketplace. Fanatical zeal aside, should it be the sole platform for knowledge management and creation?
Much is made of the dimensions by which we assign special treatment to “Big Data”. These facets are popularly known as “The Three V’s”, which Gartner defines as “high-volume, high-velocity and high-variety information assets”. Additional V’s are sometimes added to suit the audience, including Veracity (What is big data?, 2013), Variability, and Value (Fan, 2013). In the December 2012 issue of ACM SIGKDD Explorations, Wei Fan and Albert Bifet explore the current and future state of Big Data. They allude to signals that technology adoption has outpaced the technical ecosystem’s ability to put it in proper perspective, and they list seven factors they consider controversial (Fan, 2012):
There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again […]
Big Data may be a hype to sell Hadoop based computing systems. Hadoop is not always the best tool […]
In real time analytics, data may be changing. In that case, what it is important is not the size of the data, it is its recency […]
Claims to accuracy are misleading […]
Bigger data are not always better data. It depends if the data is noisy or not, and if it is representative of what we are looking for […]
…[Is it] ethical that people can be analyzed without knowing it […]
Limited access to Big Data creates new digital divides […]
Further supporting Fan and Bifet’s arguments, Stephen Brobst notes, “A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new. What’s new is the velocity of change in the structure of data.” (Swabey, 2013).
Google (noticeably silent in the Big Data marketplace) abandoned the batch processing approach underlying Hadoop in favor of a real-time, service-based processing architecture originally called Dremel and outlined in a paper from Google Research (Melnik, 2010). Google’s BigQuery cloud service, used extensively at Google internally, takes a different tack that “builds on ideas from web search and parallel DBMSs”, both core competencies for the company. In a January 2013 consortium organized by IBM and Arizona State University, Dr. K. Selcuk Candan (Candan, 2013) highlights six key outcomes which may be summarized as a need for better data fusion, data analysis algorithms, data models, scalable architectures, and real-time analysis. While several vendors are visibly out front with custom Hadoop builds for real-time analysis, two non-Hadoop projects, S4 in the Apache Incubator and the production-ready Storm (http://storm-project.net/), show promise as general-purpose parallel computing engines.
While the Apache Hadoop project has staged an impressive entrance, broken through the Relational and OLAP paradigms, and shown the viability of open source software, I intend to keep an eye on the companies that have avoided the hype, such as Google (Regalado, 2013), and to observe as the market polarizes into those who need real-time analysis and those who never needed it.
Candan, K. Selcuk. (2013, June 25). Hunting for the Value Gaps in Data Management, Services, and Analytics. Retrieved from http://wp.sigmod.org/?p=904
In “HOW SECRECY CAN DISTORT DATA” (http://www.newyorker.com/online/blogs/elements/2013/06/the-problem-with-secret-information.html), David Berreby cites two studies that posit that an individual will rate classified information, on average, with 15% more credibility than non-classified information. I find the article and the studies cited to be naive in their approach to supporting the notion that adding a classification label lends some inherent credibility to information when judged by legitimate professionals.
The methodologies of these studies don’t exactly lend themselves to authoritative results. Recruiting individuals from the Intelligence services for comparison might be slightly more informative, but even within those groups the ability to discern credibility (and the responsibility to make that judgement) runs a very broad spectrum. Further, comparing classified and unclassified sources is probably not meaningful: decisions in _any_ field are made from multiple lines of evidence, so a bias, should it actually exist, would likely not be a factor in real-world decisions. I would be very interested to see a study performed with a more valid population and a measure inserted to test whether these biases actually influenced any decision in a meaningful way.
Anyone know of any studies analyzing these factors?
Analytics, Data Mining, and Big Data architectures and supporting practices by Don Krapohl