Big Data – what’s the Big Idea?

My first technology post (in fact, post of any kind) for a while. As in the past, I’ve decided to commit to my blog the thoughts whirling around my head that I don’t want to lose, and that I’m interested to share with others who might find them useful. Views, of course, are my own and not necessarily those of IBM.

I’ve recently been developing a paper for use inside IBM on the topic of Big Data in the context of Financial Services. I have been working with Big Data technologies in a variety of contexts for the past year or so, and the paper has been a good opportunity not only to explore the topic with my peers, but also to take stock of what I have learned in that time. Whilst the paper is an IBM-specific view, in the process I have been refining my own point of view, and that is what I’ve decided to record here as a series of observations that I’ve made in this time.

Thanks to Mark for his additional review and comments.

What’s in a name?

As technicians we are naturally wont to try to find the absolute meaning of any given piece of terminology, which means that when terms like “Big Data” or “Cloud” come along, a lot of time is spent deciding what the “true” meaning really is. Published definitions of Big Data vary, generally tend to be at a high level, and reflect the wider strategy of the organisation publishing them. For example, the IBM web site defines Big Data in the context of the increasingly connected and instrumented world, in alignment with the Smarter Planet agenda:

“Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and from cell phone GPS signals to name a few. This data is big data.”

A cursory look at Wikipedia yields a more general definition:

“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.”

I could go on, but suffice it to say that trying to tie Big Data down too firmly is clearly not helpful. What is more interesting is to examine some of the definitions of the term that I have heard myself in a variety of fora, such as:

  • Social media analytics
  • Hadoop and MapReduce
  • Stream analytics and complex event processing
  • Unstructured data
  • Data gathered from smart energy meters

It is tempting in such circumstances to critique each example for accuracy and completeness against a chosen definition, but in the end I have concluded that Big Data is all of the above, and many more things besides. This leads me to my first conclusion:

Big Data as a term is deliberately open to interpretation to accommodate a variety of possible lenses through which to view it, and the many and varied definitions reflect this variety.

Noticeable traits of Big Data scenarios

The format and structure of the data are not constrained to those of traditional business data models

One of the key themes of Big Data is the removal of traditional constraints around the type of data that can be leveraged in support of the business. Taking a Hadoop-type environment as an example, a key advantage is that data of any kind can be harnessed quickly in its raw format, without the need for a full-scale data modelling exercise.

It is important to position how some of the Big Data technologies fit with the traditional data warehouse approach. One clear difference is the nature of how the data is stored and made available for analysis. Traditional data warehouses store data in well-defined structures to support Online Analytical Processing (OLAP) in the context of business intelligence initiatives. Typically a data warehousing project involves significant analysis to determine the business data structures into which the data is to be loaded for consumption in this way.

In a Big Data scenario, the source data is typically accessed in its raw format, e.g. log files, audio, text. There can be a number of reasons for this, ranging from the sheer volume of data, which would make traditional handling inefficient and costly, to the uncertainty of the requirements and the primitive nature of the data, which would render a traditional data modelling exercise extremely difficult. Furthermore, the rapidly changing nature of Big Data sources, the business pressures of time to market and agility, and the fact that we are only just starting to understand the possibilities, all mean that a traditional approach is unlikely to be effective.
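To make that concrete, below is a minimal sketch of the kind of MapReduce job this enables: counting HTTP status codes straight from raw web-server log lines, with no prior modelling of the data. I should stress this is my own illustration rather than code from any particular product; the class names are invented, and the assumption that the status code sits in the ninth space-separated field (as in the common log format) is purely for illustration.

    // A minimal sketch, not production code: counting HTTP status codes
    // directly from raw log lines, with no prior data model.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class StatusCodeCount {

        // Map: pull the status code out of each raw log line as-is.
        public static class LogMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(" ");
                if (fields.length > 8) {          // common log format: 9th field is the status
                    context.write(new Text(fields[8]), ONE);
                }
            }
        }

        // Reduce: sum the occurrences of each status code.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "status code count");
            job.setJarByClass(StatusCodeCount.class);
            job.setMapperClass(LogMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The point is not the code itself but what is absent from it: no schema, no load phase, no data model, just the raw files and a job run against them.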

Data may be drawn from a variety of sources inside and outside the enterprise, including the public internet.

Another key point is that, from a data ownership perspective, it may not just be about you any more. The “Big” in Big Data may refer to size, but it may equally refer to scope, i.e. bigger than one organisation alone. It may of course refer simply to sources within an enterprise that have not been brought together before, for example the analysis of call centre records combined with an existing data warehouse. Social media analytics of the public internet is a good example where data beyond the “four walls” can be integrated with business-as-usual processes to improve performance.

The data itself may be analysed either in a static data store or as a continually changing data flow.

As discussed previously, Big Data embraces a multitude of interpretations, one of which is the concept of “Big” indicating the speed of data movement, or at least that the underlying data set may be fluid and/or have a temporal element to the business use case.

Again, the field of social media analytics offers a good example, wherein we are harnessing a constantly varying source of data. This in turn may be coupled with a fluid stream of business queries — for example, measuring the impact of recently-launched or enhanced marketing campaigns. This is a good example of a varying data set where the analysis occurs on a static, point-in-time snapshot of the data — data “at rest”.

In Financial Markets, algorithmic trading is a well-known example where “Big” refers to the velocity of change and the demand for fast response times. In this scenario, the data is analysed “in motion” as a continuous stream, with the Big Data tools providing the capability to spot potentially valuable patterns that indicate particular circumstances are occurring, in this case triggering an order automatically at the right time.
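To illustrate the contrast with data “at rest”, here is a minimal sketch of the “in motion” style: each tick is inspected as it arrives against a short rolling window, and an action fires the moment a pattern is spotted, with no retrospective query over a stored data set. The class name, window size and threshold are invented for illustration; a real algorithmic trading platform is of course vastly more sophisticated.

    // An illustrative sketch of "data in motion": state lives in a rolling
    // window, and each tick is examined once, as it arrives.
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class TickMonitor {
        private final Deque<Double> window = new ArrayDeque<>();
        private final int windowSize;
        private final double jumpThreshold;   // e.g. 0.02 = 2% move vs. the rolling mean

        public TickMonitor(int windowSize, double jumpThreshold) {
            this.windowSize = windowSize;
            this.jumpThreshold = jumpThreshold;
        }

        // Called once per tick, in arrival order; no historical data set
        // is ever scanned retrospectively.
        public void onTick(double price) {
            if (window.size() == windowSize) {
                double mean = window.stream()
                        .mapToDouble(Double::doubleValue).average().orElse(price);
                if (Math.abs(price - mean) / mean > jumpThreshold) {
                    placeOrder(price);        // react immediately, while the pattern holds
                }
                window.removeFirst();
            }
            window.addLast(price);
        }

        private void placeOrder(double price) {
            System.out.println("Pattern detected - order triggered at " + price);
        }

        public static void main(String[] args) {
            TickMonitor monitor = new TickMonitor(4, 0.02);   // invented demo values
            for (double p : new double[] {100.0, 100.1, 99.9, 100.0, 103.5}) {
                monitor.onTick(p);            // the final 3.5% jump triggers the order
            }
        }
    }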

Requirements for applications in the environment are often fluid and evolutionary.

As discussed above, to a large degree this is unsurprising given the emerging nature of the subject area. Technology-led exploration fosters an increased appreciation of “the art of the possible”, and technologies such as Hadoop are very amenable to agile, rapid experimentation; indeed, the ability to get started quickly and cost-effectively, and the agility of the environment, are among the key value propositions of Hadoop.

The ability for technology to handle Big Data in solving business problems removes some of the traditional IT constraints on thinking, and this naturally tends towards an exploratory approach to innovation with analytics. The flexibility inherent in IT tools such as Hadoop enables new degrees of business innovation, potential for value creation, and differentiated products and services. Factor into this the highly competitive and market-driven nature of consumer-facing fields such as retail and consumer finance, and this is a recipe for an ever-changing set of requirements.

“Big” is a subjective measure and specific to the context in question.

“Big” is very much in the eye of the beholder. Earlier in this post I talked about the variety of definitions for the term Big Data, and largely this stems from the use of this inherently subjective word. “Big” to a business analyst at a bank may mean too many rows for their standard spreadsheet to handle. On the other hand, “Big” to a data-centric organisation like Google means something else entirely.

Another definition of “Big” is not as a measure as such, but as an indicator of being “outside of conventional bounds”, for example drawing in data from social media or third-party organisations. In this sense “Big” becomes synonymous with “uncharted” and possibly “hard to manage” within the confines of the traditional enterprise scope.

Having concluded that there are many possible perspectives on Big Data, we can nevertheless identify an emerging set of recurring attributes of a Big Data environment when we drop down a level of detail to examine the technical requirements.

Business scenarios for Big Data

It is interesting to note that the terminology itself is inherently technical, which instinctively leads a lot of the current thinking into the world of implementation technology. This naturally produces a “bottom up” view of the problem space: here is what a particular technology allows you to do, now think how you can apply that capability to your business and see what fits. From a technologist’s perspective this is exciting, because one can see the possibilities, and it naturally enables an entrepreneurial approach to IT. It can, however, end up becoming the archetypal technical solution looking for a problem.

It is also notable that there is no one obvious place to start in terms of a business problem space addressed by Big Data. A few are emerging, for example those associated with social media analytics (marketing and campaign management, product development and so on), but in many cases the Big Data thought is likely to be something one goes armed with when the top-down analysis and requirements gathering begins, rather than a precise piece part that fits a specific problem. For example, there is not the same well-defined link as exists between a “single view of the customer” type of business problem and a master data solution.

It is that new art of the possible, and the suspension of judgement on what can be done, that is the real benefit of the Big Data thought from a top-down business perspective.

Whilst there is a growing family of technology pieces in the Big Data solution story, you may not realise you have a Big Data business problem until you get there.


13 responses to “Big Data – what’s the Big Idea?”

  1. Hi Martin,

    Very useful – thank you for writing this up.

    The top-down vs bottom-up discussion triggered a thought around requirements. As you say, the bottom-up view yields exciting ideas about what *could* be done. But what is the top-down thought process that leads to Big Data as a solution to a particular problem? For problems where “Big Data” provides a good solution, one can imagine three scenarios that would have existed before Hadoop and the like:

    1) A valid business requirement has been articulated in the past but, in the absence of Big Data, was solved sub-optimally using a different technology

    2) A valid business requirement has been articulated in the past but, in the absence of Big Data, was rejected as infeasible

    3) A valid business requirement hasn’t even been articulated because the requester self-censored… they simply assumed it was impossible.

    The first and second cases are interesting – but can be addressed through the normal techniques of socialisation, education and the rest.

    But the third case is tricky: if we are truly saying that previously *impossible* requirements can now economically be addressed, what does this mean in a requirements-driven, top-down world?

    Example:

    Before Google came along, how many search-engine designers said: “if only I could store the whole web in a single database table, I could solve the page-ranking problem trivially”? My assertion is that almost none of them did. The idea would have seemed so ridiculous as to not be worth expressing. Yet Google’s Bigtable paper uses precisely this scenario to describe their design.

    So… if we’re serious about the possibility of Big Data technologies, does this imply a need to “re-educate” consumers of enterprise technology about the art of the possible?

  2. You make a good point — your third scenario is rather the one I had in the back of my mind, and I think there is an element of removing constraints (often assumed rather than explicit) about what can and can’t be done with data.
    A colleague of mine recently very eloquently described how the very presence of the internet and resources like Google after the mid/late 1990s, and the subsequent rise of consumer IT, raised expectations of what enterprise IT solutions should be able to achieve. They described the traditional approach to solution development as being akin to putting hardware, software and the users inside a box, which until the internet revolution was accepted as the normal working scope. The internet and tools like Google opened users’ eyes to seemingly limitless global data resources, which increased the tension against the then conventional constraints and broke down the sides of the box. To continue this metaphor, the many perspectives and viewpoints on the Big Data term could be considered the windows that people have created onto the world outside the traditional IT box.
    As described in my last section, from the top down I think Big Data becomes an addition to the traditional kit bag of approaches for meeting business requirements, rather than the leading topic in and of itself. It is this new and emerging appreciation of the world beyond that creates, at a business level, a new context in which to build.

  3. ChrisPaWilliams

    One of the issues with Big Data is that it still requires quite a bit of programming and is therefore less consumable than, for example, a traditional database. To what extent will this limit its early take-up?

    • Depends on the technology I guess, but I’d agree that’s certainly true of plain Hadoop. Without being a detailed subject matter expert, I’ve written some MapReduce for Hadoop to get a feel for what’s involved, and my observation would be that in plain vanilla form you do indeed need a high degree of programming skill, plus an understanding of MapReduce, to get the best out of it. I do, however, really like what IBM has done with BigSheets to provide a higher-level set of tools for exploiting large data sets; I think tools that layer on top of the infrastructure software will drive the key business-level use cases around developing insight from very large data sets. The other thing is that this will remain a moving target for a while; I’ll be interested to reflect back on this in a year’s time and see where we are then.

    • Apart from the issue of programming, there is also the question of traceability/visibility/governance, i.e. how do I address requirements like:

      * “Show me the sequence of tasks that led to the generation of this data”
      * “Tell me which tasks (and versions) were deployed and active on Wednesday morning last week”

      And so on… I’m still learning in this area and I’m fully confident these questions could soon be answered (if they haven’t been already), but bringing them to the surface and showing how to solve them would go a long way towards overcoming some frequently heard objections.

  4. Martin, I am particularly interested in the “continually changing data flow” element of Big Data. I am struggling to understand how the current architectures support this without “freezing” the data. Are there any good examples of data-in-motion architectures and patterns?

    • I think this highlights the loose fit of the terminology, since the current “data in motion” thought is largely focused on stream analytics, which is a flavour of Complex Event Processing (CEP): a very different architectural pattern to the processing of a static data set, and historically one more commonly aligned with messaging and middleware than with data management.
      In a CEP scenario the model works on the basis that what is examined is the trend in the data captured in the events fired into the CEP system, rather than surveying the full set of events retrospectively as a collective. For example, the trend in a set of temperature readings might indicate a defective component in a generator, and trigger an automated action as a result. Typically what is specified is a set of event definitions (of some kind) and a set of pattern descriptions that define which patterns are of interest. These definitions are loaded into the CEP server, which in turn integrates with the digital event sources.
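      To make the shape of that concrete, here is a toy sketch of the model: event definitions, pattern descriptions registered up front, and actions fired when a pattern matches. The names and the three-rising-readings “trend” are invented for illustration; this is not the API of any particular CEP product.

      // A toy sketch of the CEP model described above; all names invented.
      import java.util.ArrayList;
      import java.util.List;
      import java.util.function.Consumer;
      import java.util.function.Predicate;

      public class TinyCep {
          // Event definition: a timestamped temperature reading from one component.
          record TemperatureReading(String componentId, long timestamp, double celsius) {}

          // Pattern description: a predicate over recent readings, plus an action.
          record Pattern(Predicate<List<TemperatureReading>> matches,
                         Consumer<List<TemperatureReading>> action) {}

          private final List<Pattern> patterns = new ArrayList<>();
          private final List<TemperatureReading> recent = new ArrayList<>();

          public void register(Pattern p) { patterns.add(p); }

          // Each event is evaluated as it arrives against the registered patterns.
          public void onEvent(TemperatureReading r) {
              recent.add(r);
              if (recent.size() > 5) recent.remove(0);   // keep a short rolling window
              for (Pattern p : patterns) {
                  if (p.matches().test(recent)) p.action().accept(recent);
              }
          }

          public static void main(String[] args) {
              TinyCep cep = new TinyCep();
              // "Rising temperature trend": three consecutive increasing readings.
              cep.register(new Pattern(
                  rs -> rs.size() >= 3
                        && rs.get(rs.size() - 1).celsius() > rs.get(rs.size() - 2).celsius()
                        && rs.get(rs.size() - 2).celsius() > rs.get(rs.size() - 3).celsius(),
                  rs -> System.out.println("Possible defective component: "
                        + rs.get(rs.size() - 1))));
              long t = 0;
              for (double c : new double[] {60, 61, 64, 70}) {
                  cep.onEvent(new TemperatureReading("generator-7", t++, c));
              }
          }
      }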
      I used to work with CEP systems in a previous life, so I’m happy to follow up if that would help.

  5. A very good endeavor to define the sometimes undefinable. My concern remains that in all of this dialog on technology there is little, if any, business representation. My experience tells me that the “hype cycle” for Big Data will be short-lived (creating a substantial “sea of disenchantment”) if business leaders do not define strategic objectives and outcomes (i.e. sources of competitive advantage) that Big Data can deliver either on its own or in tow with conventional analytics and information management endeavors. Promoting the voice of the business in this conversation should not be given short shrift by any of us in 2012.

    • Thank you, glad the post was of interest.

      In my mind, the Big Data topic is fundamentally about untapped potential enabled by technology. I think most people get that piece (i.e. the technical view that there’s an explosion of available data), but the challenge then is to help business leaders understand what that potential means (or could mean) to them in particular. I think that currently, in many cases, people get that it should be important but are not quite 100% sure of specifically how. I do think, however, that much of this will emerge through the traditional solution lifecycle: we have a business objective to achieve a given benefit, and Big Data becomes part of the modern palette of technology to help achieve it in more efficient and cost-effective ways than were possible before.

      I’m certainly intrigued to see where this space is this time next year, 2012 will be an interesting year.

      • I agree completely. We must all try to keep this new “tool” in perspective. It must solve a real business problem (or alternatively create an opportunity for competitive advantage) and then be applied in a way that complements existing investments and architectures. This is not going to be a “sweep the floor” wave of change. The best application of Big Data today is complementing well-characterized structured assets to deliver on the “360-degree view” that all of us have been pursuing for some time now. I can think of many other use cases where Big Data can make major contributions in terms of success or differentiation. 2012 will indeed be an interesting year to participate in this new wave of disruption.

  6. One of the reasons Hadoop took off was because relational warehouses couldn’t handle the data volumes, nor could they efficiently handle unstructured data. That first constraint has now gone away with the adoption of MPP in analytic databases (including SQL ones) as in Hadoop. The second constraint (structured data only, for SQL databases, the mainstream) remains. The challenge will be to identify which Big Data use cases are best implemented with which technologies. Most analytic requirements can probably be met, and in some cases there may be multiple technology answers (e.g. I know of telcos analysing call data records, CDRs, in Hadoop and others using Netezza; CDRs aren’t relational, but they are pretty well structured, so easy to map to relational). It’s a Venn diagram of technologies and use cases.

  7. I agree with all of these comments and am pleased that we are not endeavoring to treat these new tools/platforms as a magic wand to eliminate all of our challenges. I have been pursuing what I call “Total Information Exploitation” for many years now, and we are clearly getting closer with each new generation of technology and practitioners. However, cost vs. latency, privacy vs. completeness, etc. must always be reckoned with, no matter how much information we have available to us.
