Most statisticians have probably heard about “Big Data”, sometimes in combination with the term “Predictive Analytics”. But what do these terms actually stand for? Do they represent something fundamentally new, or are they just new variations of concepts already known from disciplines like “decision support systems”, “data mining”, and “business intelligence”?
Three examples of applications
Three successful applications of “Big Data” and “Predictive Analytics” are often mentioned in the literature: Google Translate, the PriceStats project, and Google Flu Trends.
Google Translate is a breakthrough within the discipline of automatic translation. For at least half a century, highly competent researchers have tried to create computer programs for translation between natural languages, without succeeding. These attempts were mainly built on linguistic theories in combination with artificial intelligence. Google Translate uses quite a different approach, based on statistical analyses of documents that have been translated by professional translators. The translations made by Google Translate are far from perfect, but they often make sense and are useful in situations where you do not have the time or money to hire professional translators. For details about the methods and algorithms used by Google Translate, search Google for “google translate, methodology” or read https://en.wikipedia.org/wiki/Google_Translate.
Figure 1. Inflation in the United States. PriceStats Index vs Official CPI. From .
The PriceStats project, described in , uses price data from the Internet to produce an alternative or complement to the traditional survey-based Consumer Price Index (CPI), for which field workers collect price data from physical shops. The new, Internet-based method is both faster and less costly, and it has turned out to provide results that are consistent with those of the traditional CPI; see Figure 1.
Google Flu Trends uses statistical analyses of Google searches concerning flu symptoms in order to predict the spread of influenza. Google Flu Trends was able to predict the spread about one month faster than traditional epidemiological methods. However, after the algorithms had been in use for some time, they began to provide results of lower quality than before. For details about the method and the quality problems, see https://en.wikipedia.org/wiki/Google_Flu_Trends.
Concepts and definitions
We shall now discuss the concepts of “Big Data” and “Predictive Analytics”. “Big Data” refers to the data sets used and how they are managed. “Predictive Analytics” is a class of methods for statistical analysis that are often used in combination with “Big Data”.
Gartner, an American research and advisory firm, has over the years delivered a number of slightly different definitions of the concept of “Big Data”. All of these definitions include the so-called 3V (volume, velocity, and variety), a triple first introduced in 2001 in an article by Doug Laney on data management.
A typical definition of “Big Data”, based on different formulations from Gartner, could be as follows:
Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight, decision-making, and process optimisation.
This definition combines three major aspects:
WHAT? Contents and characteristics:
“high-volume, high-velocity and high-variety information assets” (3V)
HOW? Methods and tools:
“cost-effective, innovative forms of information processing”
(data collection, data management, analysis and presentation)
WHY? Purpose:
“enhanced insight and decision making”
The definition reflects the typical feedback loop of data collection, processing, analysis, decision, action, and evaluation, illustrated in Figure 2. This loop is known from earlier phases in the development of decision support systems, including Business Intelligence (BI). However, in the context of Big Data, several steps in this process model are enriched and need to be re-interpreted to some extent.
Figure 2. Collecting, processing, and using information for analytical purposes. From .
The meaning of “the 3V” in the definition of “Big Data” can be described as follows:
Volume: Large sets of relevant and useful data can often be obtained from the Internet and from operational systems like business systems, public administrative systems, and systems for monitoring and supervision of real-world systems like traffic control. New sources and data collection methods are often used, for example social media and sensors (Internet of Things). The data may be stored in traditional databases and data warehouses before they are analysed, but streaming data may also be analysed “on the fly” in order to give feedback to the control of the processes that generate them.
Velocity: Data are often generated more or less continuously (streaming data) by events occurring with high frequency, and they have to be handled at high speed; see Figure 3.
Variety: There are many different types of data that need to be managed and analysed in combination with each other: more or less structured data, free text data, pictures and photos, sound, videos, etc. The management of these data requires new types of databases in addition to traditional SQL databases.
Streaming data may be used “on the fly” without first being stored in a database. This may speed up the analyses so that they can be used “in real time” as well, for example for controlling and optimizing an ongoing process by means of feedback from the analyses; see Figure 3.
Figure 3. Streaming data vs stored data.
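The difference between analysing streaming data “on the fly” and analysing stored data can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sensor readings, the class name RunningStats, and the alert threshold of three standard deviations are hypothetical and not taken from any system discussed in this article. The streaming part uses Welford's online algorithm, which updates the mean and variance one value at a time without keeping the data.

```python
from statistics import mean, pvariance


class RunningStats:
    """Welford's online algorithm: mean and variance updated per value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the values seen so far.
        return self._m2 / self.n if self.n else 0.0


def sensor_stream():
    """Hypothetical sensor readings arriving one by one."""
    yield from [20.1, 20.4, 19.8, 21.0, 20.7, 35.2, 20.3]


# Streaming: each reading is processed on arrival and is never stored.
stats = RunningStats()
alerts = []
for reading in sensor_stream():
    # Flag a reading that deviates strongly from what has been seen so far.
    # The warm-up of 5 readings and the factor 3 are arbitrary assumptions.
    if stats.n >= 5 and abs(reading - stats.mean) > 3 * stats.variance ** 0.5:
        alerts.append(reading)
    stats.update(reading)

# Stored: all readings are saved first and then analysed as a batch.
stored = list(sensor_stream())
batch_mean, batch_var = mean(stored), pvariance(stored)
```

Both routes arrive at the same mean and variance, but only the streaming route can raise an alert at the moment a deviating reading arrives, which is what feedback to an ongoing process requires.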
The data and the databases may be more or less structured. So-called relational databases, or SQL databases, have been a dominating standard for several decades now, and data in such databases are typically highly formatted and well structured. However, with increasing use of free-text data and other less structured and more heterogeneous data, often captured from the Internet, it has become necessary to develop new types of databases, sometimes called “NoSQL databases”, where “NoSQL” stands for “Not Only SQL”. As indicated by this term, a data warehouse built for analysis and decision-making may make use of both traditional, well-structured SQL databases, and other types of databases, suitable for the analysis of less structured data.
One may distinguish between “found data” and “made data”. In traditional statistical surveys and experiments, the data collection and subsequent processes are typically designed by the users, researchers, and statisticians who order and execute the surveys and experiments. Data are thus “made” in a purpose-driven and controlled way. In contrast, when using data from existing sources, for example administrative registers, business systems, process monitoring systems, or social media, the researcher has little or no control over the generation of data or over how the data quality is ensured. The researcher has to accept the “found” data as they are, and may possibly complement them with additional checks and auxiliary processes to make the best of the situation.
In most cases, data obtained from different sources are stored in a well-organized database or data warehouse, containing both the data themselves and metadata describing the data. The metadata may originate from both design processes and operational processes, for example measurement and observation processes. The design processes will generate definitions of concepts and measurement procedures. Measurement and observation processes may generate metadata about non-response and other errors, which may cause uncertainties and quality problems. When the data used for analytical purposes by one organisation emanate from databases and data collection procedures in other organisations, it is important to acquire not only the data from the other organisations, but also documentation and metadata.
The metadata may also include so-called paradata, data about the processes which may be used for monitoring on-going processes and possibly adjusting them dynamically, “on the fly”, for achieving better process performance in terms of quality and efficiency.
The term “predictive analytics” is used for a class of statistical methods that are often used for discovering and analysing statistical relationships in “big data”. The methods may be grouped into two main categories:
Methods based on correlation and regression, for example linear regression, discrete choice models, logistic regression, multinomial logistic regression, probit regression, time series models, survival or duration analysis, classification and regression trees, multivariate adaptive regression splines
Methods based on machine learning and pattern recognition, for example neural networks, multilayer perceptron (MLP), radial basis functions, naïve Bayes for performing classification tasks, k-nearest neighbours, geospatial predictive modelling
Despite the word “predictive” in “predictive analytics”, the methods are also used for purposes other than making predictions and prognoses. Some common application areas are:
Risk assessments in banking and insurance businesses
Estimation of the potentials of different customer categories in marketing
Estimation of security risks
Some typical characteristics of methods and applications are:
Capture relationships among many factors to allow assessment of risks or potentials associated with particular sets of conditions
Provide a predictive score (probability) for each individual (customer, employee, patient, product, vehicle, machine, component, …) in marketing, banking, insurance, security risks, fraud detection, health care, pharmaceuticals, manufacturing, law enforcement, ...
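The idea of a predictive score per individual can be made concrete with a minimal Python sketch of logistic regression, one of the regression methods listed above. Everything in it is an assumption made for illustration: the two features (number of late payments and debt ratio), the tiny training set, the learning rate, and the function names are hypothetical, and a real application would use far more data and an established statistical library.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression by plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Difference between predicted probability and observed outcome.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predictive_score(x, weights, bias):
    """Estimated probability of the outcome for one individual."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical customers: (number of late payments, debt ratio) -> default?
rows = [(0, 0.1), (1, 0.2), (0, 0.3), (2, 0.5),
        (3, 0.4), (4, 0.8), (1, 0.6), (5, 0.9)]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

weights, bias = fit_logistic(rows, labels)
low_risk_score = predictive_score((0, 0.1), weights, bias)
high_risk_score = predictive_score((5, 0.9), weights, bias)
```

The model captures a statistical relationship between the factors and the outcome and turns it into a score for each individual. It says nothing about why late payments and defaults go together.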
What is new?
“Big Data” and “Predictive Analytics” may be seen as the latest step in a development that started decades ago and that includes:
Decision Support Systems (DSS)
Business Intelligence (BI)
All these methods of collecting, processing, and using analytical information may be described by means of the basic process model in Figure 2 above.
What is new with “Big Data” is, among other things, what is characterised by the three Vs: the data volumes, the velocity with which data are generated and must be managed, and the diversity of data types, often in combination with each other. The Internet, streaming data, and data-generating sensors (the Internet of Things) have created radically new conditions. New methods and tools for data management are needed.
Another important novelty is the decoupling of analyses from domain-specific theories and models. A fundamental thesis within traditional statistics is that statistical relationships as such, for example correlations between variables, cannot prove any causal relationships unless they are combined with domain-specific theories and models. However, the successful applications of “big data” and “predictive analytics” indicate that, at least sometimes, it is possible to obtain very useful knowledge without necessarily building on causal relationships.
What is new in the different steps in the process model?
Data sources and data collection methods:
“Found data” rather than “made data”
Available data sources and available data (from the web, administrative registers, operational systems, including process control systems and sensor data) rather than
probability-based sample surveys (with growing costs, measurement problems, and problems with huge and biased non-response)
controlled experiments in laboratory-like environments
Data storage, data processing, and quality control of data:
Streaming data and new types of databases rather than traditional databases alone
Using macro-editing (selective editing, or significance editing) to optimise the resources used for identifying and correcting suspicious data and for replacing missing data with imputed values, rather than traditional massive, costly, and time-consuming data editing based on rules and expert judgements (Nordbotten)
Data analysis, problem solution, and decision-making:
Using correlation and regression analyses and other statistical relationships in the data, rather than theory-based testing of hypotheses, for generating problem solutions and decisions
Self-learning systems learning from experiences and corrections by experts rather than systems based on domain theories and decision rules formulated by experts
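The selective editing mentioned in the list above can be sketched in a few lines of Python. This is a simplified illustration and not the method of any particular statistical office: the score function (design weight times the deviation between the reported and an expected value), the field names, the records, and the review budget are all assumptions introduced here. The expected value could, for example, be the value reported in the previous period.

```python
def selective_editing(records, review_budget):
    """Return the ids of the records most worth reviewing manually.

    Records are ranked by their potential impact on the estimated total,
    and only the `review_budget` highest-ranked ones are selected.
    """
    scored = []
    for rec in records:
        # Potential impact: design weight times deviation from expectation.
        impact = rec["weight"] * abs(rec["reported"] - rec["expected"])
        scored.append((impact, rec["id"]))
    scored.sort(reverse=True)
    return [rec_id for _, rec_id in scored[:review_budget]]


# Hypothetical survey records.
records = [
    {"id": "A", "reported": 105.0, "expected": 100.0, "weight": 10},
    {"id": "B", "reported": 9000.0, "expected": 900.0, "weight": 2},
    {"id": "C", "reported": 52.0, "expected": 50.0, "weight": 40},
    {"id": "D", "reported": 300.0, "expected": 310.0, "weight": 1},
]

to_review = selective_editing(records, review_budget=2)
```

Record B, which looks like a unit error, dominates the ranking, while small deviations in low-weight records are left for automatic treatment. This is how the approach concentrates scarce expert time on the data that matter most for the final estimates.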
But certain basics are still necessary…
Big Data does not eliminate the need for certain basics, such as
data governance and data quality,
data modelling, data architecture, and data management
All of the typical steps necessary to transform raw data into insights still have to happen. Now they may just happen in different ways and at different times in the process.
Big Data vs Business Intelligence
Big Data and Business Intelligence have a lot in common, but
Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
Data mining can be seen as a part of the concept of Big Data and Predictive Analytics. Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence and statistics.
Data mining extracts previously unknown patterns such as groups of data (cluster analysis), unusual data (anomaly detection), and data associations and dependencies. These patterns can then be used in further analysis or, for example, in machine learning and predictive analytics.
The terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
In the 1960s, statisticians used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analysing data without an a-priori hypothesis.
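Why statisticians regarded data dredging as bad practice can be demonstrated with a small simulation, sketched here in Python. The setting is entirely artificial and assumed for illustration: a target variable and 500 candidate variables are all pure random noise, with only 10 observations each, so any correlation found is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so that the run is reproducible
n_obs, n_candidates = 10, 500

# The target and all candidate variables are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
correlations = [
    pearson([rng.gauss(0, 1) for _ in range(n_obs)], target)
    for _ in range(n_candidates)
]

# "Dredging": keep only the candidate that correlates best with the target.
strongest = max(correlations, key=abs)
print(f"strongest correlation among {n_candidates} noise variables: {strongest:.2f}")
```

With so few observations and so many candidates, the strongest correlation is typically large even though no real relationship exists. A pattern found this way is at best a hypothesis to be tested on new data.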
Thomas Kuhn introduced the concepts of “paradigm shift” and “scientific revolution” for describing revolutionary changes in scientific theories, changes which may also affect our worldview. Examples are when the worldview with the earth in the centre was replaced by a worldview with the sun in the centre, or when Einstein replaced Newton’s mechanics with new theories, which we will accept as true until they have been falsified as well. (It seems that all theories sooner or later will turn out to be false or incomplete.)
One may ask whether “Big Data” and “Predictive Analytics” really represent a paradigm shift. Some scientists think so, both those who favour using statistical methods of analysis that are not linked to domain-specific theories, and those who condemn such usage, including prominent statisticians like Gary King at Harvard University and domain theorists like the linguist Noam Chomsky.
Some of those who are actively engaged in the development of the new analytical methods, which are not necessarily linked to domain-specific theories, seem to have been struck by a certain amount of hubris, for example the AI giant Peter Norvig and the journal editor and debater Chris Anderson. The latter exclaims:
"All models are wrong, but some are useful." So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models … seemed to be able to consistently, if imperfectly, explain the world around us. Until now. …
This is a world where massive amounts of data and applied mathematics replace every other tool ... Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. …
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. …
There is now a better way. We can stop looking for models. We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. …
There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?
The linguist Noam Chomsky objects that no real knowledge about a topic can be attained without well-founded, domain-specific theories and models. “Statistical data dredging” may at most be useful for practical purposes in certain cases, but it does not contribute to real understanding. The statistician Gary King adds that one cannot be sure of anything unless one knows which variables are important for a problem and is able to measure these variables reliably. Moreover, all conditions vary over time.
More examples of critical points and arguments, not least regarding quality issues, are discussed in , , and , where there are also further references to articles dealing with these issues.
It is controversial whether “Big Data and Predictive Analytics” implies a paradigm shift and, if so, whether it is a desirable one. Both enthusiasts and critics seem to agree that the new methods may be useful, even if they are not (yet) properly understood, and even if they do not reflect how human beings think. Maybe one should rather regard “Big Data and Predictive Analytics” as a so-called “disruptive innovation”, a radical change of methods and tools. We have witnessed many such disruptive changes during the last century, for example the refrigerator replacing ice distributors, digital technology replacing mechanics, and digital mass media and the Internet replacing paper media, gramophone records, video films, and the traditional distribution of these media.
The new paradigm, based on Big Data and Predictive Analytics, produces better and more useful solutions to some important problems.
But the new paradigm does not enlighten our understanding of the mechanisms behind the problems and the solutions.
But do we have to understand? Yes and no.
Consider a patient with some medical problems, showing certain symptoms. A first priority is to cure the patient, or at least to improve the patient’s situation. A doctor would use the symptoms in combination with medical theory and test results in order to arrive at a diagnosis and a treatment.
According to the new paradigm, a piece of software might find statistical relationships between data in a database on symptoms, test results, expert diagnoses, treatments, and outcomes, which could be used for generating even better results than those achieved by doctors.
However, there would still be a need to develop a better understanding of diseases and treatments, based on a better understanding of processes in the human body.
The scientific battles around “Big Data and Predictive Analytics” can be seen as another phase in the battle between empiricists and rationalists, between those who believe that human knowledge is primarily based on empirical observations and those who emphasise the importance of reason and theories. The unprecedented successes in the natural sciences during the last few centuries are in fact based on a synthesis of these approaches: theories are built from available observations and are then tested by controlled experiments.
Maybe we shall get a new synthesis between “dumb” statistical analyses, based on enormous sets of empirical data, and future theories, which will make us understand these statistical methods better, why they may work, and which pitfalls they are associated with. This development may certainly also contribute to the emergence of more realistic and empirically well founded theories within different scientific disciplines.
There is reason to recall the wise words of Turing, who suggests that the question “Can machines think?” be replaced by the question “Can machines do what we, as thinking entities, can do?” Maybe computer-supported systems loaded with “big data” can help us achieve new results by using methods that are fundamentally different from the way we, as human beings, are used to thinking, modelling, analysing, reasoning, and deriving new knowledge.
The issues treated in this article are covered more extensively in , , , , .
Wikipedia, Google Translate, Wikipedia (2017-02-04), https://en.wikipedia.org/wiki/Google_Translate
AAPOR Big Data Task Force, AAPOR Report on Big Data (2015), http://www.aapor.org/AAPOR_Main/media/MainSiteFiles/images/BigDataTaskForceReport_FINAL_2_12_15_b.pdf
Wikipedia, Google Flu Trends, Wikipedia (2017-02-04), https://en.wikipedia.org/wiki/Google_Flu_Trends.
T. Kuhn, The structure of scientific revolutions, University of Chicago Press (1962), http://projektintegracija.pravo.hr/_download/repository/Kuhn_Structure_of_Scientific_Revolutions.pdf
P. Norvig, Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning, Significance, August 2012, pp 30-33, http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2012.00590.x/epdf
K. Gold, Norvig vs. Chomsky and the Fight for the Future of AI, Tor.com (2011)
C. Anderson, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, Wired Magazine (2008-06-23), http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory
B. Sundgren, The concept of information, Pro Libera Scio (2015).
B. Sundgren, Big Data and predictive analytics - a scientific paradigm shift? Pro Libera Scio (2016).
A. Turing, Computing machinery and intelligence, Mind (1950), pp 433-460
B. Evelson, BI and Big Data: Same or Different - An Approach To Converge The Worlds of Big Data And BI, Boris Evelson’s Blogs (2015-03-27), Information Management, http://www.information-management.com/blogs/Business-Intelligence-Big-Data-Same-Different-10026730-1.html
American Statistical Association and Royal Statistical Society, Big Data, Significance Special Issue (2012), Wiley Online Library, http://onlinelibrary.wiley.com/doi/10.1111/sign.2012.9.issue-4/issuetoc
C.K. Ogden et al., The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbols, Harcourt Brace (1923, 1956), http://editura.mttlc.ro/carti/55_Charles_Ogden_The_Meaning_of_Meaning_volume_one.pdf
S. Nordbotten, Editing and Imputation by Means of Neural Networks. Statistical Journal of the UN/ECE, 13 (1996). Free downloading from www.nordbotten.com.