Sunday, April 14, 2013

For readers who aren't familiar with the term 'Big Data Analytics', I would like to start with this well-received definition:

"Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue." [Source]

During the last couple of years, thanks to the power of cloud computing, and special thanks to large-scale distributed system middleware (yeah, thanks to the research engineers at Google and the people behind Apache Hadoop), we can analyze relatively large amounts of both structured and unstructured data on demand, in quasi real time, more easily than ever before.  Also, a bunch of social media sites and networks provide massive amounts of data in VVV (Velocity, Variety, Volume), so it is very easy to capture the 'Big Data' to feed the 'Analytics' engines.  That sounds great to optimistic and humble people like me, but problems still exist between the 'Big Data' and the 'Analytics'.  That gap is the main issue under discussion in the data analytics world.

Recently, people at UC Berkeley announced their well-defined, unified software stack for data analytics called BDAS (Berkeley Data Analytics Stack).  It sounds like 'Bad Ass', and when Professor Ion Stoica was asked about this during his talk at Stanford University on March 8 (Hewlett building, room 201), he answered that it might be an easy way to remember the name of the project.  Anyhow, I'm sure it is an open source project managed by many people, just like the other Berkeley projects we know: BSD (I mean the OS), Berkeley DB, and Berkeley sockets.  This might be a way to contribute to the world of data analytics.

Figure 1. Brief Overview of Berkeley Data Analytics Stack (BDAS) [image source:]
Figure 1 is a little old (it is from a talk in February 2013), but it looks almost the same as the whole architecture view I saw on March 8 at Stanford.  AFAIK, they are pursuing more value and more technologies than the Hadoop-based software stack does.  Much of their software stack is still in progress (due to the nature of open-source projects), and its components are replaceable (substitutable) with Hadoop stack components, and vice versa.

The most interesting parts of BDAS are as follows:
  • MLBase: provides declarative machine-learning facilities as a layer of the software stack
  • Shared RDDs: allow the system to keep data in memory shared among the distributed machines (if my understanding is correct), so that BDAS can work way faster than data analysis technologies based on disk (secondary storage).  Or, you can just think of it as a cache layer for a bunch of data stored on disk.
  • Spark & Shark: enable an easy and robust way to query data in various ways.  The BDAS people have several prior publications, so if you're interested, please read their articles!
  • BlinkDB: provides analytics on sampled data rather than on the full data set.  As you may guess, there is definitely a tradeoff between the quality (confidence level) of the results and the speed of analysis, but if the required quality can be met by a smaller subset of the source data, sampling before analyzing is a clever approach.  For example, if you're interested in the average daytime temperature of each month for the last 20 years, do you still think you need every single reading?  Well, there is no straightforward way to resolve this tradeoff between quality and speed, but the folks at Berkeley will provide some answers this year.
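
To make the sampling idea a bit more concrete, here is a minimal sketch in plain Python.  It is not BlinkDB's API or algorithm, just my own illustration of the tradeoff: the temperature readings are made-up synthetic data, and sampled_mean is a hypothetical helper that estimates an average from a uniform random sample and reports an approximate 95% confidence margin (normal approximation).

```python
import math
import random
import statistics


def sampled_mean(values, fraction, rng, z=1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval (normal approximation).
    This is an illustration only, not how BlinkDB works internally.
    """
    # Sample at least 2 values (needed for stdev), at most all of them.
    n = min(len(values), max(2, int(len(values) * fraction)))
    sample = rng.sample(values, n)
    estimate = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    margin = z * statistics.stdev(sample) / math.sqrt(n)
    return estimate, margin


# Hypothetical data: 20 years of July daytime temperatures (deg C),
# one reading per day, drawn from a made-up normal distribution.
rng = random.Random(42)
july_temps = [rng.gauss(24.0, 3.0) for _ in range(20 * 31)]

exact = statistics.fmean(july_temps)
for fraction in (0.05, 0.25, 1.0):
    estimate, margin = sampled_mean(july_temps, fraction, rng)
    print(f"{fraction:>5.0%} of data: {estimate:.2f} +/- {margin:.2f} "
          f"(exact {exact:.2f})")
```

The smaller the sample, the less data you touch and the wider the margin gets; whether that margin is acceptable depends on the question you are asking.  Note that this simple sketch ignores the finite-population correction, so it still reports a nonzero margin even when the "sample" is the whole data set.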

Hope BDAS becomes a real 'Bad Ass' in big data analytics soon!

- written by ANTICONFIDENTIAL, in SF, on April 13, 2013