My first technology post (in fact, my first post of any kind) for a while. As in the past, I’ve decided to commit to my blog the thoughts whirling around my head that I don’t want to lose, and am interested to share with others that might find them useful. Views, of course, are my own and not necessarily those of IBM.
I’ve recently been developing a paper for use inside IBM on the topic of Big Data in the context of Financial Services. I have been working with Big Data technologies in a variety of contexts for the past year or so, and the paper has been a good opportunity not only to explore the topic with my peers, but also to take stock of what I have learned in that time. Whilst the paper is an IBM-specific view, in the process I have been refining my own point of view, and that is what I’ve decided to record here as a series of observations that I’ve made in this time.
Thanks to Mark for his additional review and comments.
What’s in a name?
As technicians we are naturally wont to try and find the absolute meaning of any given piece of terminology, which means that when terms like “Big Data” or “Cloud” come along, a lot of time is spent deciding what the “true” meaning really is. Published definitions of Big Data vary, generally tend to be at a high level, and reflect the wider strategy of the organisation publishing them. For example, the IBM web site defines Big Data in the context of the increasingly connected and instrumented world, in alignment with the Smarter Planet agenda:
“Everyday, we create 2.5 quintillion bytes of data–so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and from cell phone GPS signals to name a few. This data is big data.”
A cursory look on Wikipedia yields a less applied definition as follows:
“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.”
I could go on but, suffice to say, trying to tie Big Data down too firmly is clearly not helpful. What is interesting is examining some of the definitions of the term that I have heard myself in a variety of fora, such as:
- Social media analytics
- Hadoop and MapReduce
- Stream analytics and complex event processing
- Unstructured data
- Data gathered from smart energy meters
It is tempting in such circumstances to critique each example for accuracy and completeness against a chosen definition, but in the end I have concluded that Big Data is all of the above, and many more things besides. This leads me to my first conclusion:
Big Data as a term is deliberately open to interpretation to accommodate a variety of possible lenses through which to view it, and the many and varied definitions reflect this variety.
Noticeable traits of Big Data scenarios
The format and structure of the data are not constrained to those of traditional business data models
One of the key themes of Big Data is the removal of traditional constraints around the type of data that can be leveraged in support of the business. Taking a Hadoop-type environment as an example, a key advantage is that data of any kind can be harnessed quickly from its raw format, without the need for a full scale data modelling exercise.
It is important to position how some of the Big Data technologies fit with the traditional data warehouse approach. One clear difference is the nature of how the data is stored and made available for analysis. Traditional data warehouses store data in well-defined structures to support Online Analytical Processing (OLAP) in the context of business intelligence initiatives. Typically a data warehousing project involves significant analysis to determine the business data structures into which the data is to be loaded for consumption in this way.
In a Big Data scenario, the source data is typically accessed in its raw format e.g. log files, audio, text. There can be a number of reasons for this, ranging from the sheer volume of data that would make traditional handling inefficient and costly to the uncertainty of the requirements and primitive nature of the data which would render a traditional data modelling exercise extremely difficult. Furthermore, the rapidly changing nature of Big Data sources, the business pressures of time to market and agility and the fact that we are only just starting to understand the possibilities also means a traditional approach is unlikely to be effective.
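To make this “schema-on-read” point concrete, here is a minimal sketch in Python (the log lines and field positions are invented for illustration, and plain Python stands in for a real Hadoop job) of the MapReduce style of processing: counts are derived straight from raw log text, with no up-front data modelling exercise.

```python
from collections import Counter

# Raw, unmodelled log lines -- no schema has been defined up front.
raw_logs = [
    "2012-03-01 10:02:11 INFO user=alice action=login",
    "2012-03-01 10:02:15 ERROR user=bob action=transfer",
    "2012-03-01 10:03:02 INFO user=alice action=logout",
]

# "Map" phase: pull the severity token straight out of the raw text.
severities = (line.split()[2] for line in raw_logs)

# "Reduce" phase: aggregate the emitted values into counts.
severity_counts = Counter(severities)

print(severity_counts["INFO"])   # 2
print(severity_counts["ERROR"])  # 1
```

The structure is interpreted only at the moment of analysis; if tomorrow’s question concerns users rather than severities, the same raw data is simply read differently.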
Data may be sourced from a variety of sources inside and outside the enterprise, including the public internet.
Another key point is that, from a data ownership perspective, it may not just be about you any more. The “Big” in Big Data may refer to size, but it may equally refer to scope, i.e. bigger than one organisation alone. It may of course refer simply to sources within an enterprise that have not been put together before, for example analysis of call centre records combined with an existing data warehouse. Social media analytics of the public internet is a good example where data beyond the “four walls” can be integrated with business-as-usual processes to improve performance.
The data itself may be analysed either in a static data store or as a continually changing data flow.
As discussed previously, Big Data embraces a multitude of interpretations, one of which is the concept of “Big” indicating speed of data movement, or at least that the underlying data set may otherwise be fluid and/or with a temporal element to the business use case.
Again, the field of social media analytics offers a good example, wherein we are harnessing a constantly varying source of data. This in turn may be coupled with a fluid stream of business queries — for example, measuring the impact of recently-launched or enhanced marketing campaigns. This is a good example of a varying data set where the analysis occurs on a static, point-in-time snapshot of the data — data “at rest”.
In Financial Markets, algorithmic trading is a well-known example where “Big” refers to the velocity of change, and the demand for fast response times. In this scenario, the data is analysed “in motion” as a continuous stream, with the Big Data tools providing the capability to spot potentially valuable patterns that indicate particular circumstances are occurring, in this case triggering an order automatically at the right time.
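As a toy illustration of analysis “in motion” (the prices, window size and threshold are entirely hypothetical, and this is a teaching sketch rather than a real trading strategy), the code below watches a stream of ticks and flags the points where a pattern of interest occurs, the way a streaming engine would evaluate each event as it arrives rather than querying a store at rest:

```python
from collections import deque

def detect_spike(prices, window=3, threshold=0.02):
    """Flag indices where a price exceeds the start of a sliding
    window by more than `threshold` -- a stand-in for the kind of
    pattern a stream-processing engine evaluates continuously."""
    recent = deque(maxlen=window)   # only the window is kept in memory
    signals = []
    for i, price in enumerate(prices):
        if len(recent) == recent.maxlen and price > recent[0] * (1 + threshold):
            signals.append(i)       # an order could be triggered here
        recent.append(price)
    return signals

ticks = [100.0, 100.1, 100.2, 103.0, 100.3]
print(detect_spike(ticks))  # [3]
```

The key design point is that the full history is never stored or re-queried: state is bounded to the sliding window, which is what makes the “in motion” approach viable at high velocity.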
Requirements for applications in the environment are often fluid and evolutionary.
As discussed above, to a large degree this is unsurprising given the emerging nature of the subject area. Technology-led exploration fosters an increased appreciation of “the art of the possible”, and technologies such as Hadoop are very amenable to agile, rapid experimentation; indeed, one of the key value propositions of Hadoop is the ability to get started quickly and cost effectively, and the agility of the environment.
The ability for technology to handle Big Data in solving business problems removes some of the traditional IT constraints on thinking, and this naturally tends towards an exploratory approach to innovation with analytics. The flexibility inherent in IT tools such as Hadoop enables new degrees of business innovation, potential for value creation, and differentiated products and services. Factor into this the highly competitive and market-driven nature of consumer-facing fields such as retail and consumer finance, and this is a recipe for an ever-changing set of requirements.
“Big” is a subjective measure and specific to the context in question.
“Big” is very much in the eye of the beholder. Earlier in this post I talked about the variety of definitions for the term Big Data, and largely this stems from the use of this inherently subjective word. “Big” to a business analyst at a bank may mean too many rows for their standard spreadsheet to handle. On the other hand, “Big” to a data-centric organisation such as Google means something entirely different.
Another definition of “Big” is not as a measure as such, but as an indicator of being “outside of conventional bounds”, for example drawing in data from social media or third-party organisations. In this sense “Big” becomes synonymous with “uncharted” and possibly “hard to manage” within the confines of the traditional enterprise scope.
Having concluded that there are many possible perspectives on Big Data, one can nonetheless identify an emerging set of recurring attributes of a Big Data environment when dropping down a level of detail to examine the technical requirements.
Business scenarios for Big Data
It is interesting to note that the terminology itself is inherently technical, which instinctively leads a lot of the current thinking into the world of implementation technology. This naturally leads to a “bottom up” view of the problem space: here is what a particular technology allows you to do, now think how you can apply that capability to your business and see what fits. From a technologist’s perspective this is exciting because one can see the possibilities, and it naturally enables an entrepreneurial approach to IT. This can, however, end up becoming the archetypal technical solution looking for a problem.
It is also notable that there is no one obvious place to start in terms of a business problem space addressed by Big Data. A few are emerging, for example those associated with social media analytics (marketing and campaign management, product development and so on), but in many cases the Big Data thought is something one goes armed with when the top-down analysis and requirements gathering begins, rather than a precise piece that fits a specific problem. For example, there is not the same well-defined link as exists between a “single view of the customer” type of business problem and a master data solution.
It is that new art of the possible, and the suspension of judgement on what can be done, that is the real benefit of the Big Data thought from a top-down business perspective.
Whilst there is a growing family of technology pieces in the Big Data solution story, you may not realise you have a Big Data business problem until you get there.