The Big Data Revolution Part 1


Twenty-five years ago, data on economic activity and on human behavior in general was relatively limited. As Eric Schmidt, the former CEO of Google, said in 2010: “There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days, and the pace is increasing.”

What could explain this dramatic change in data gathering within such a short period of time? One obvious reason is the growth of the internet. Practically everything on the internet is recorded. If you read a newspaper online, watch videos, or track your personal finances, your behavior is recorded. Any search on Google or Bing is permanently registered, as is every click we make and everything we buy. Individual behavior is also captured by text messages, cell phones, and geo-location devices. Finally, there are scanner data, employment records, and health records, all of which form part of the data footprint we now leave behind us.

A similar evolution has happened in business activity, as companies have shifted their daily operations to the internet. It is now possible to compile rich datasets of sales contacts, hiring practices, and physical shipments of goods. The same phenomenon is occurring in the public sector, in the ability to access and analyze tax filings, social insurance programs, government expenditures, and regulatory activities. What enables all of this is the increased capacity to store, aggregate, and combine data, and to perform deep analyses of it. For example, for less than $600, an individual can purchase a disk drive with the capacity to store all of the world’s music.

What is the novelty of all this? The short answer is that data is now generated at an increasingly fast pace and covers many areas that were previously inaccessible. We are surrounded by expanding amounts of data, produced on the one hand by millions of networked sensors embedded in devices such as mobile phones and automobiles, constantly sensing, creating, processing, and communicating data, and on the other hand by ordinary people with smartphones, operating within social networking sites, who continuously generate multimedia and other data.

image credit: The Economist

A definition of Big Data

What does “Big Data” mean? The consulting firm McKinsey published an extensive report on big data in 2011, entitled “Big data: The next frontier for innovation, competition, and productivity”, which defined big data as: “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

The subjectivity of McKinsey’s definition is intentional: it allows big data to be approached as a moving target, adapting as the size of the datasets considered “big data” keeps expanding. Another factor is that dataset sizes vary considerably between sectors. As such, big data today can range from a few dozen terabytes to multiple petabytes (thousands of terabytes), depending on the sector.

Another interesting definition of big data is offered by researchers at the University of St Andrews, who define it as: “The storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” The main elements revolve around volume, velocity, and variety.

Big Data and privacy

Big data might be quite frightening to the population in general, who tend to perceive these gigantic compilations of information as highly intrusive of their privacy. The dystopian novel Nineteen Eighty-Four, written by George Orwell in 1949, revolves around a society subjected to omnipresent government surveillance, embodied by the unsettling figure of its anonymous dictator, Big Brother. As Anne Devineaux investigated in her journalistic piece “Big Data, Big Brother and the death of privacy in the digital age”, big data is having a tremendous impact on individual privacy, and that impact can feel extremely uncanny to the general public.

The disclosures published by The Guardian and The Washington Post on June 6, 2013 about the U.S. government’s massive spying program, PRISM, awakened us to the fact that big data is eradicating everything we thought we knew about privacy. The news worsened as large companies holding terabytes of customer data came under fire for supplying the National Security Agency (NSA) with information about their customers. While those companies denied direct involvement in the program, millions of people became suspicious about the whereabouts of their data. These events sparked heated debates, with calls for companies that hold any amount of customer data to take immediate steps to reassure customers that their data is safe and private.

Big Data and the Economy

Taken from a different angle, the impact of big data on private commerce and national economies is also being assessed. According to the aforementioned 2011 McKinsey report, “Big data: The next frontier for innovation, competition, and productivity”, big data can benefit the population in general:

“Our research finds that data can create significant value for the world economy, enhancing the productivity and competitiveness of companies and the public sector and creating substantial economic surplus for consumers.”

How to Analyze Big Data

An important question regarding big data is how to analyze it. According to the McKinsey report referred to previously, it is now possible to obtain valuable insights from the data using a variety of available software tools. Further, the ability to generate, communicate, share, and access data has been revolutionized by the increasing number of people, devices, and sensors that are now connected by digital networks. The techniques that can be used to analyze datasets are varied and all draw on statistics and computer science (particularly machine learning). Some of them are listed below; a minimal sketch of one of them, cluster analysis, follows the list:

  • A/B testing
  • Association rule learning
  • Classification
  • Cluster analysis
  • Crowdsourcing
  • Data fusion and data integration
  • Data mining
  • Ensemble learning
  • Machine learning
  • Neural networks
  • Natural language processing (NLP)
  • Genetic algorithms
  • Visualization
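To make one of these techniques concrete, here is a minimal cluster analysis sketch using k-means. It assumes scikit-learn and NumPy are installed, and the dataset is synthetic, invented purely for illustration:

```python
# Minimal k-means cluster analysis sketch (synthetic data, illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three artificial "segments" of points in two dimensions.
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# Partition the observations into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)  # one centroid per discovered cluster
print(kmeans.labels_[:10])      # cluster assignments of the first ten points
```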

There is a growing number of technologies to aggregate, manipulate, manage, and analyze big data. The McKinsey report details some of the more prominent of these, cautioning the reader that more technologies continue to be developed to support big data techniques. Some of the examples given were (a toy MapReduce-style word count follows the list):

  • BigTable (the inspiration for HBase)
  • Business intelligence (BI)
  • Cassandra
  • Cloud computing
  • Data mart
  • Data warehouse
  • Dynamo
  • Google File System
  • Hadoop
  • Mashup
  • Metadata
  • MapReduce
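Of these, MapReduce is probably the easiest to illustrate. The sketch below is a toy word count written in plain Python; a real framework such as Hadoop would distribute the map and reduce phases across a cluster, so this only shows the programming model:

```python
# Toy word count in the MapReduce style (single machine, illustrative only).
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group the intermediate pairs by key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big insights", "data beats opinion"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# -> {'big': 2, 'data': 2, 'insights': 1, 'beats': 1, 'opinion': 1}
```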

Visualization

Presenting information in such a way that people can consume it effectively is a key challenge that needs to be met if analyzing data is to lead to concrete action. For this reason, there is currently a tremendous amount of research and innovation in the field of visualization, i.e., techniques and technologies used for creating images, diagrams, or animations to communicate, understand, and improve the results of big data analyses. Some of these are:

Tag cloud
The tag cloud has been around for quite a long time. It visualizes the text of a report as a weighted visual list, in which the words that appear most frequently are rendered larger and the words that appear less frequently smaller. The frequency-to-size weighting behind it is sketched below.
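As a rough illustration, here is a minimal sketch of that weighting in Python, mapping word frequencies to font sizes; the text and the size range are arbitrary choices, and rendering is left to whatever plotting library one prefers:

```python
# Minimal tag-cloud weighting: word frequency mapped linearly to font size.
from collections import Counter

text = "big data big analysis data data value"
counts = Counter(text.lower().split())

min_size, max_size = 10, 48  # font sizes in points (arbitrary)
most_frequent = max(counts.values())
sizes = {word: min_size + (max_size - min_size) * n / most_frequent
         for word, n in counts.items()}
print(sizes)  # 'data' gets the largest size, rarer words smaller ones
```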

Clustergram
A clustergram is a visualization technique used in cluster analysis to display how the individual members of a dataset are assigned to clusters as the number of clusters increases. The choice of the number of clusters is an important parameter in cluster analysis; a minimal sketch of the idea follows the figure credit below.

Clustergram (image source: McKinsey)
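A minimal clustergram-style computation might look like the following sketch, which reruns k-means for an increasing number of clusters and records the mean value of each cluster’s members, the quantity usually plotted on the clustergram’s vertical axis. It assumes scikit-learn and NumPy, and uses synthetic data:

```python
# Clustergram-style sketch: track cluster means as the number of clusters grows.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))  # synthetic observations

for k in range(1, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # Mean value of the members of each cluster: the clustergram's y-axis.
    cluster_means = [data[labels == c].mean() for c in range(k)]
    print(k, np.round(cluster_means, 3))
```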

History flow
History flow is a visualization technique that charts the evolution of a document as it is edited by multiple contributing authors. Time appears on the horizontal axis, while contributions to the text are on the vertical axis; each author has a different color code, and the vertical length of a bar indicates the amount of text written by that author. By visualizing the history of a document in this manner, various insights easily emerge, as the sketch after the figure credit suggests.

History flow (image source: McKinsey)
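A rough approximation of this layout can be produced with matplotlib’s stacked area plot, as in the sketch below; the per-author revision data is invented for illustration:

```python
# History-flow-style sketch: per-author text volume stacked over revisions.
import matplotlib.pyplot as plt

revisions = [1, 2, 3, 4, 5]
# Characters contributed by each author that survive in each revision.
alice = [120, 150, 150, 90, 90]
bob = [0, 60, 200, 200, 260]
carol = [0, 0, 0, 80, 120]

fig, ax = plt.subplots()
ax.stackplot(revisions, alice, bob, carol, labels=["alice", "bob", "carol"])
ax.set_xlabel("revision (time)")
ax.set_ylabel("amount of text per author")
ax.legend(loc="upper left")
plt.show()
```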


Spatial Information Flow
Another visualization technique is one that depicts spatial information flows. The example shown in McKinsey’s report is entitled the New York Talk Exchange. It shows the amount of Internet Protocol (IP) data flowing between New York and cities around the world. The size of the glow on a particular city location corresponds to the amount of IP traffic flowing between that place and New York City; the greater the glow, the larger the flow. This visualization allows us to determine quickly which cities are most closely connected to New York in terms of their communications volume.
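In the same spirit, a minimal sketch of such a spatial-flow view in Python might plot cities at their coordinates with marker size proportional to traffic; all of the figures below are invented for illustration:

```python
# Spatial-information-flow sketch: marker size encodes traffic with New York.
import matplotlib.pyplot as plt

cities = {  # city: (longitude, latitude, illustrative traffic volume)
    "London": (-0.13, 51.51, 950),
    "Tokyo": (139.69, 35.69, 620),
    "Sao Paulo": (-46.63, -23.55, 480),
    "Lagos": (3.38, 6.52, 150),
}

fig, ax = plt.subplots()
for name, (lon, lat, traffic) in cities.items():
    ax.scatter(lon, lat, s=traffic, alpha=0.5)  # size plays the role of the "glow"
    ax.annotate(name, (lon, lat))
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("IP traffic exchanged with New York (illustrative)")
plt.show()
```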

McKinsey’s report suggested that big data could be used to create value across sectors of the global economy. It predicted that society in general was on the brink of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data. McKinsey’s analysts aimed to answer the question: “Why should this be the case now? Haven’t data always been part of the impact of information and communication technology?”

The conclusion of their research suggested that “the scale and scope of changes that big data are bringing about are at an inflection point, set to expand greatly, as a series of technology trends accelerate and converge. We are already seeing visible changes in the economic landscape as a result of this convergence.”

The Big Data Revolution – part two