By Mark Wills
The first thought that comes to mind about "Big Data" is that there is a lot of it. And while that is true, "Big Data" is more than that. It tries to address the complexity of being able to bring together both structured and unstructured data from an increasing variety of sources so it can be analysed in a concise and coherent way with a high degree of confidence.
Not so long ago, Big Data was certainly considered to be in the realm of "for the very big corporations", but that is starting to change. Part of the reason is that the technology behind all the hype and the latest buzzwords is becoming more available to the not so big. There are now viable choices of solutions that won't break the bank.
Because the "Big Data industry" is still in its infancy (relatively speaking), there is a small luxury of time before it becomes the expected "norm". That time can be well used to make sure you can learn all there is to know and decide if you will actually benefit from it.
One can see the anticipation of great things to follow in marketing departments the world over, talking up "big data" as if they have a clear idea about what it is, how to get it, and how it can be used. Importantly, you need to know what it might mean for your business as a competitive advantage or how disadvantaged you will quickly become if you are not currently thinking about it.
Taking a small step back in time, for years, when we needed to analyse our ERP systems with maybe a smattering of other structured databases (like CRM) from within an enterprise, we used some kind of Business Intelligence system (BI).
Arguably, a lot of BI systems fell a little short of the deliverables because of a predisposition toward post analysis. They were only able to report or predict based on what was actually captured in fairly traditional sources like invoicing and accounts receivable. But we all know there is a lot more information out there -- information that adds a new dimension to the traditional data and a more realistic perspective of the enterprise.
Think about your own systems in use, and all the different methods available to interact with your company (POI, or Points of Interaction). For a start, you have the obvious on-premise solutions, but think broader. Maybe you have a website being accessed to create orders, or log issues. Let us suppose a range of different devices is used to access that -- the desktop, mobile devices or maybe even telemetry. And more recently, your Marketing team launched the corporate social sites, and let us not forget the slightly more traditional forms such as EDI and Telephony.
Looking at the variety of different POI, we realise that those points are being monitored or logged somehow. Those new sources of information are being stored in computer logs, Facebook or Twitter feeds from the corporate social sites, geographic information, cookies, activity logs, and Clickstream data. Combine them with those traditional sources and you have grown out of BI and entered the world of "Big Data".
When you start to really think about your business and everyone it touches, imagine the coincidental information available. That's the information from various logs and devices themselves and is not restricted just to what people have been entering on your site. It includes their IP addresses and datetime activities, their clickstream data, their mobile geo-location services, and tracking information via on-board telemetry from vehicles. Most importantly, that coincidental and associated data is being collected by machines logging the activity (at a significant frequency) whilst fulfilling other tasks. And, as we automate more functions, there is an ever increasing diversity as to what can be captured.
Thinking about the coincidental data, it becomes quite significant when associated with, for example, geographic locations. The "coincidental" part transforms into strategic data revealing geographic market strengths and opportunities. Combine that with various sources of feedback, and you expose vulnerabilities and consumer sentiment.
One particular scenario I worked on with a large corporation was dealing with excessive warranty work and wanting to gain insights into the "real" consumer for their product. The company dealt with resellers and agents (dealers), so it was always difficult to find out what happened next in the retail space. We were able to gather data via the myriad social forums and dealer logs about the real consumer. Gaining insights into the consumer space revealed a few significant issues that were relatively easy to solve. Without a "Big Data" attitude, the information flow was trapped in the reseller domain.
That's part of the problem with "Big Data". It is a big buzzword, and it is full of big ideas, and needs new attitudes toward managing and processing data.
As we have said before, it is not just size, it is the variety of potential sources that really generates the volume. So we now find a need to make sense of the variety of data. That can mean formalising data relationships, or extracting elements from unstructured data, or undergoing various transformations so it can be used.
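As a rough illustration of "extracting elements from unstructured data", the sketch below pulls an IP address, a timestamp and a requested path out of a single web server log line. The log layout, the sample line and the function name parse_log_line are all assumptions made up for this example; a real feed would need its own pattern.

```python
import re
from datetime import datetime

# Assumed layout: IP, two placeholder fields, [timestamp], "METHOD path protocol"
LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<when>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"'
)

def parse_log_line(line):
    """Return a dict of structured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Turn the timestamp text into a real datetime so it can be sorted and compared.
    fields["when"] = datetime.strptime(fields["when"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Hypothetical sample line, not taken from any real system.
sample = '203.0.113.7 - - [12/Mar/2013:10:15:32 +0000] "GET /orders/new HTTP/1.1" 200 512'
record = parse_log_line(sample)
print(record["ip"], record["when"].isoformat(), record["path"])
```

Lines that do not fit the pattern come back as None rather than raising an error, so they can be counted and reviewed instead of silently polluting the analysis.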
Take, for example, our large corporation having warranty issues with resellers. The company's customer is the reseller, and the reseller has their own customers, which we know as consumers. However, those individuals are also known to our company as a registered user name via the website, and something different again on the corporate social media sites. So, how do we get all those different identifiers to mean one and the same thing? The business must define rules for the different and sometimes disparate data sources.
That is the first real challenge: creating a business dictionary that defines the data correctly, consistently and uniformly. The business also needs to understand what data elements are available and what they can translate to in terms of achieving business goals.
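To make the idea of a business dictionary a little more concrete, here is a minimal sketch in which each (source, identifier) pair is mapped to one agreed consumer ID. The source names, identifiers, consumer IDs and function names are all hypothetical; in practice the rules would come from the business and would rarely be a simple lookup table.

```python
# Hypothetical "business dictionary": each (source, identifier) pair maps to
# one canonical consumer ID agreed on by the business.
IDENTITY_RULES = {
    ("reseller", "CUST-0042"): "consumer-1",
    ("website", "jsmith77"): "consumer-1",
    ("social", "@john_s"): "consumer-1",
    ("website", "mary.k"): "consumer-2",
}

def resolve(source, identifier):
    """Look up the canonical consumer ID; None means no rule exists yet."""
    return IDENTITY_RULES.get((source, identifier))

def group_events(events):
    """Group raw events by canonical consumer, keeping unmatched ones aside."""
    grouped, unmatched = {}, []
    for event in events:
        consumer = resolve(event["source"], event["id"])
        if consumer is None:
            unmatched.append(event)
        else:
            grouped.setdefault(consumer, []).append(event)
    return grouped, unmatched

# Made-up events from three different points of interaction.
events = [
    {"source": "reseller", "id": "CUST-0042", "note": "warranty claim"},
    {"source": "social", "id": "@john_s", "note": "complaint about part"},
    {"source": "website", "id": "mary.k", "note": "product registration"},
    {"source": "social", "id": "@unknown", "note": "question"},
]
grouped, unmatched = group_events(events)
print(sorted(grouped), len(unmatched))
```

Keeping the unmatched events in their own list is deliberate: they show where the dictionary still has gaps that the business needs to fill.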
With all the different data sets coming together, we need disk space. Potentially lots of it. It has to store the individual data elements and allow for any new data feeds. There are technology solutions, and arguably a contributor to the rise in "big data" popularity could be all the discussion around cloud based solutions. An enterprise doesn't have to build its own ginormous data centre -- but that might be a more viable alternative, depending on the business.
Then there is analysing the data. Getting results that are reliable, trustworthy, usable and repeatable takes a new kind of thinking (as a technologist) and very clear goals set by the business. I say usable, because one of the possible risks with the variety of data sources available is a perceived or real possibility of contradicting privacy clauses, proprietary rights, copyright and ownership of data, and how all of those impact the marketing and selling of data.
Fortunately, there is a lot of information about "Big Data" being written out there, and a quick search can yield a heck of a lot of information (did I mention that Bing reckons they analyze over 100 petabytes of data to deliver their "high quality" search results?).
One thing you will find is reference to the three "Vs":
Volume. Many factors contribute to the increase in data volume.
Velocity. Data is streaming in at unprecedented speed.
Variety. Data today comes in all types of formats.
It is a term first penned by Doug Laney in 2001 (in Volume-Velocity-and-Variety.pdf), before "Big Data" became the current "hype", and is rather poetically used to describe "Big Data".
The other term you will find is Hadoop, which is basically a software platform that distributes the storage and processing of data across a wide range of machines, and is worthy of a separate article in and of itself.
The other thing you will find is all the major suppliers offering up their own summations and recommended reading, and how they support Hadoop and/or other technologies. So the first stop for a lot of information would be your preferred hardware or database supplier. Here are a few links to get you started.
IBM : Bigdata-Enterprise
SAS : Big-Data
ORACLE : Big-Data
MCKINSEY : big_data_the_next_frontier_for_innovation
MICROSOFT : business-intelligence big-data
Now a cautionary tale... Big Data is not just gathering everything you can. That only becomes (quite simply) lots of data. People make the mistake of believing they must have Information because of all the data they are gathering. But a lot of the time, it is nothing more than consuming disk space for the sake of data collection and has no strategic business value in terms of Information (insight). Have a read about the NSA's dilemma. (You might sleep better too.)
So, beware the hype, and take the time to understand before the boss walks in with the next "Big Idea". I hope this brief introduction inspires you to seek out more information about "Big Data".