Leon Katsnelson gave a presentation on Big Data and Social Business. His general theme is that we are seeing a LOT of data. This huge volume can fundamentally change how we address the business.
In today’s transactions we want more information about each one:
- What was the transaction?
- Who bought the widget?
- Where did they get the information to come buy the widget?
- Why did they buy the widget?
- What more did they buy?
These questions are very different from the operationally oriented questions asked a while ago.
Nice Quote: Data is the product
Think about LinkedIn, Yahoo!, Twitter, Experian, Facebook, and Zynga
- Each web page on Yahoo! is customized. That’s what helps draw the audience. They can do this because they harness all the data. Yahoo! was the pioneer of Hadoop.
- Zynga. Over 90% of their business comes from Facebook. They are a gaming company. Ken Rudin, VP of Analytics, said, “We’re an analytics company masquerading as a game company.”
- 232 million people play their games
- A small percentage of users will buy virtual products and pay Zynga money for that.
- Electronic Arts is now worth less than Zynga. That’s a huge disruption to the gaming industry
- Apple sells a lot of stuff, but iTunes and the App Store hold the largest collection of customers in the world. They have all the credit card data. This includes music, subscriptions, apps, movies, etc. They hold your data.
Illustrations of big numbers and data
- US tax revenue of $2,170,000,000,000
- Federal budget of $3,820,000,000,000
- Current deficit of $1,650,000,000,000
- National debt of $14,271,000,000,000
- Budget cuts of just $38,500,000,000
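To see how these figures relate to each other, here is a quick sanity check in Python. It is only a sketch using the numbers listed above; the variable names are my own.

```python
# Figures quoted in the talk, in US dollars.
tax_revenue = 2_170_000_000_000
federal_budget = 3_820_000_000_000
deficit = 1_650_000_000_000
budget_cuts = 38_500_000_000

# The deficit is the gap between what is spent and what comes in.
computed_deficit = federal_budget - tax_revenue

# Budget cuts expressed as a share of the whole federal budget.
cuts_share_of_budget = budget_cuts / federal_budget

print(computed_deficit == deficit)                       # True
print(f"Cuts are {cuts_share_of_budget:.1%} of budget")  # Cuts are 1.0% of budget
```

The quoted deficit is exactly the budget minus the revenue, and the "just" in the budget cuts line works out to about 1% of the budget.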
Other stats:
- Facebook is expected to have 1 billion users by August 2012. That’s 1/7 of the population of the world. It generates 12TB of data a day.
- Pets have Facebook pages. 14% of pet owners do this.
- Google+ is expected to reach 400+ million users by the end of 2012
- LinkedIn has 130 million members today
- Twitter has 100 million active users. It also generates 12+ TB of tweet data a day
What is this all about? Answer: it’s about getting a 360-degree view of a customer
- Who is she?
- What social networks does she use?
- What competitors does she use?
- Is she profitable?
- Is she a shopping maven?
- Does she influence others?
Quote: Twitter is not a technology. It’s a conversation
Quote: Posting press releases on Twitter is a dumb idea
How cool can it be? Check out the advanced search. But once you do a search, the amount of data is just plain huge. You can’t deal with it all individually.
Other Data
Think about the data in sensors. At Lotusphere, our RFID tags are scanned in every room we enter. 5,000+ users * 15 rooms a day * 5 days equals a lot of data.
Now think about engines on a commercial airliner. They generate a huge amount of data. It’s 1TB of data every 30 minutes. That’s terabytes of data generated, but it is all erased after landing. Why? They can’t handle that amount of data.
Now think about electric meters that are sampled 4 times an hour
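To put rough numbers on these sensor examples, here is a small back-of-the-envelope sketch in Python. The rates come from the talk; the 6-hour flight and the fleet of 1,000,000 meters are hypothetical values I picked only for illustration.

```python
# Badge scans at the conference: attendees * rooms per day * days.
attendees = 5_000
rooms_per_day = 15
conference_days = 5
badge_scans = attendees * rooms_per_day * conference_days

# Airliner engines: 1 TB every 30 minutes, per the talk.
TB_PER_HALF_HOUR = 1

def engine_data_tb(flight_hours):
    """Terabytes generated over a flight of the given length."""
    return flight_hours * 2 * TB_PER_HALF_HOUR

# Electric meters sampled 4 times an hour.
readings_per_meter_per_day = 4 * 24
meters = 1_000_000  # hypothetical fleet size, not from the talk
readings_per_day = readings_per_meter_per_day * meters

print(badge_scans)        # 375000
print(engine_data_tb(6))  # 12 (assumed 6-hour flight)
print(readings_per_day)   # 96000000
```

Even the "small" conference example is 375,000 scans in a week, and a single assumed 6-hour flight would produce 12TB.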
Quote: Data generated by machines and sensors will exceed that of people
This all adds up to information overload but a lack of insight. That makes Big Data a big problem and a big opportunity.
IBM Big Data Platform
So yes, we knew it was coming. IBM does have a tool to handle these levels of data. It’s called the MPP data warehouse.
Think
- Netezza 1000
- IBM InfoSphere BigInsights for Hadoop
- InfoSphere Streams for streaming data or quickly moving data
- InfoSphere Information Server to consolidate and integrate the data you have
What does a data platform do?
- Analyze a variety of information
- Analyze information in motion
- Analyze extreme volumes of information
- Discover and experiment. Do ad hoc analysis, data discovery, and experimentation. Experimentation is key because you don’t know what to do with that data until you experiment.
- Manage and Plan. Enforce data structure, integrity, and control to ensure consistency for repeatable queries
A lot of research has gone into this. Think about what IBM has done with Watson. That’s text analytics.
What are the uses for big data? Log analysis and storage, smart grid, fraud detection, a 360-degree view of the customer, and email and call center transcript analysis.
A few more words about Hadoop
IBM embraced and extended it. It’s a nice way to embrace. It’s not forked, not ported. It’s extended and then contributed back to open source. The same analytics are used for in-motion and at-rest data. The two teams share a fair amount of information and code.
Case Study
UOIT is capturing preemie sensor data. It used to be captured every 30-60 minutes and discarded after 72 hours. Their system captures this information, analyzes it, and makes the nurses aware of changes to a preemie’s health. It resulted in a 20% drop in mortality rates.
Case Study
Sprint processes call detail records (CDRs). It knows if it dropped your call and whether it needs to contact you about it.
Cloud use of Hadoop?
IBM can give you a 100-node cluster for $34/hr. They work with Amazon, cloud.com, IBM SmartCloud, Rackspace, etc.
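For a feel of what that rate means, here is a tiny cost sketch in Python. The $34/hr and 100 nodes are from the talk; the 8-hour job is a hypothetical example of mine, and real pricing may well be structured differently.

```python
CLUSTER_RATE_PER_HOUR = 34.0  # dollars per hour for the whole cluster, per the talk
NODES = 100

def cluster_cost(hours):
    """Total cost in dollars of running the cluster for the given hours."""
    return CLUSTER_RATE_PER_HOUR * hours

# Effective price of one node for one hour.
cost_per_node_hour = CLUSTER_RATE_PER_HOUR / NODES

print(cost_per_node_hour)  # 0.34
print(cluster_cost(8))     # 272.0 (assumed 8-hour job)
```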
How is Big Data different from data warehouses?
First and foremost it’s about taking unstructured data instead of structured data. It’s about taking a more holistic approach. It’s a spreadsheet metaphor.
Text analytics can be a big problem. For example, is Spam the horrible food or the horrible email? They have text analytics tools and are investing a lot to improve upon them.
At least two of the companies you mention above (LinkedIn and Zynga) are Splunk customers. Splunk has been doing this kind of stuff for a while now. It’d be interesting to see how they compare to IBM MPP Data Warehouse.
I was talking to someone last night about the fact that Splunk has some big data characteristics, especially in the fact that it too analyzes unstructured data. Given that IBM is making a bet on Hadoop with Big Data, perhaps the better question is how Splunk compares to Hadoop. Thoughts on that?
Short answer is that Splunk and Hadoop are different, and Splunk is even embracing Hadoop. Check this out:
http://blogs.splunk.com/2011/12/05/introducing-shep/