Berkeley Data Analytics Stack (BDAS) is an open-source analytics stack for complex computations on massive datasets (big data). It supports efficient, large-scale in-memory (RAM) processing and lets users and applications trade off query accuracy, time, and cost. The most important goal of massive-dataset analysis is to provide a basis for decision making, such as personalized gamification and advertisement targeting, through sophisticated exploratory analysis.
Current Tools Are Slow...
Today's analytics tools can perform complex processing on massive datasets, but they are slow. Achieving sub-second response times on a massive dataset is hard because the data is usually spread over many storage partitions, possibly thousands, and even if all of it were in RAM, just scanning the 200-300 GB of memory on a large server takes tens of seconds. This makes existing tools poorly suited to real-time complex computations such as machine learning algorithms. To provide low-latency computation on massive datasets, for both historical and live data, the AMPLab at UC Berkeley is developing the Berkeley Data Analytics Stack (BDAS), an open-source, next-generation data analytics stack.
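The scan-time figure above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the 10 GB/s effective scan bandwidth is an assumed value chosen for illustration, not a number from the BDAS material, and the function names are hypothetical.

```python
import math

# Assumed effective single-server scan bandwidth (illustrative, not measured).
ASSUMED_BANDWIDTH_GB_PER_S = 10.0


def scan_seconds(size_gb, bandwidth_gb_per_s=ASSUMED_BANDWIDTH_GB_PER_S):
    """Time to read size_gb of memory once at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


def partitions_needed(size_gb, target_seconds,
                      bandwidth_gb_per_s=ASSUMED_BANDWIDTH_GB_PER_S):
    """Equal partitions, scanned in parallel, needed to meet a latency target."""
    return math.ceil(scan_seconds(size_gb, bandwidth_gb_per_s) / target_seconds)


for size_gb in (200, 300):
    print(size_gb, "GB ->", scan_seconds(size_gb), "s on one server;",
          partitions_needed(size_gb, 1.0), "parallel partitions for 1 s")
```

Under that assumption a full scan takes 20 to 30 seconds, which is why a one-second target forces either parallel scans over many partitions or reading less data.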
Figure 1 - Open Analytics Stack, BDAS Overview, Stoica, 2013
The Berkeley Data Analytics Stack builds on today’s open analytics stack (fig. 1), which consists of four layers: the infrastructure layer (storage networks), the storage layer (Hadoop), the data processing layer (Pig, HBase, Hive), and the application layer.
BDAS (fig. 2) adds an extra layer, Mesos, between the infrastructure and storage layers for resource management. Mesos uses fine-grained sharing, which lets multiple frameworks share the same infrastructure and so increases parallelism. BDAS extends the storage layer with Tachyon, which adds in-memory storage for better data management. BDAS also extends the data processing layer with Spark, which adds capabilities such as pre-computation, in-memory processing, and data sampling.
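To see why sharing one pool of resources across frameworks can increase parallelism, consider the toy comparison below. It is not Mesos's scheduler or API: it assumes unit-length tasks that each occupy one slot, and the function names and the task counts are made up for illustration.

```python
import math


def rounds_static(tasks_per_framework, total_slots):
    """Each framework owns a fixed, equal share of the slots."""
    share = total_slots // len(tasks_per_framework)
    return max(math.ceil(tasks / share)
               for tasks in tasks_per_framework.values())


def rounds_shared(tasks_per_framework, total_slots):
    """Any free slot may run a task from any framework."""
    return math.ceil(sum(tasks_per_framework.values()) / total_slots)


# Hypothetical demand: one busy framework and one mostly idle framework.
demand = {"hadoop": 6, "spark": 2}
print(rounds_static(demand, total_slots=8))  # prints 2
print(rounds_shared(demand, total_slots=8))  # prints 1
```

With static partitioning the busy framework needs two rounds while half of the other framework's slots sit idle; with a shared pool all eight tasks run at once.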
Mesos: a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
Tachyon: a high-throughput, fault-tolerant in-memory storage system with an HDFS-compatible interface and support for Spark and Hadoop.
Spark: a data processing framework designed for low-latency and iterative computation (e.g. machine learning algorithms) on historical data. It provides a fault-tolerant and efficient memory abstraction called Resilient Distributed Datasets (RDDs). It supports the Hadoop Distributed File System (HDFS) API, the Amazon S3 API, and Hive metadata. Spark provides an easy-to-program interface that is available in Java, Python, and Scala.
Shark: a port of Apache Hive onto Spark. Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive warehouses. Shark can run Hive queries up to 100 times faster when the dataset fits in memory and up to 5-10 times faster when the dataset is stored on disk.
Spark Streaming: extends Spark with streaming functionality and low-latency computation on live data. It accepts input from sources such as Kafka, Flume, Twitter, and TCP sockets. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
GraphX: provides the GraphLab API and toolkits on top of Spark, and gains fault tolerance by leveraging Spark.
MLbase: makes machine learning (ML) accessible to non-experts through a declarative approach to ML.
BlinkDB: an approximate query processing engine that allows users to trade between accuracy, time, and cost.
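The accuracy-versus-time trade-off behind approximate query processing can be illustrated with plain random sampling. The sketch below is not BlinkDB's implementation or API: it assumes a uniform random sample and a normal-approximation 95% error margin, and `approximate_mean` and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is an approximate 95% error
    bound based on the sample's standard error.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error


# Synthetic data standing in for a large table column.
data_rng = random.Random(42)
data = [data_rng.uniform(0, 100) for _ in range(100_000)]

for n in (100, 1_000, 10_000):
    estimate, margin = approximate_mean(data, n)
    print(f"sample={n:>6}  mean ~ {estimate:.2f} +/- {margin:.2f}")
```

Reading fewer rows answers faster but widens the error margin, and reading more rows narrows it. A query engine can expose that choice to the user as an error bound or a time budget.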
In summary, the BDAS framework and its components work together to provide an order-of-magnitude speed improvement, making iterative analysis possible on massive historical and live datasets.
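Why in-memory processing matters for iterative analysis can be shown with a toy example. The sketch below is not Spark's API: `ToyDataset`, its `cache` and `collect` methods, and the `load` function are hypothetical stand-ins that only count how often slow storage would be read.

```python
class ToyDataset:
    """A lazily loaded dataset that can optionally keep its contents in memory."""

    def __init__(self, loader):
        self._loader = loader    # function that (re)reads data from slow storage
        self._use_cache = False
        self._cached = None
        self.loads = 0           # number of times slow storage was read

    def cache(self):
        """Ask the dataset to keep its contents in memory after the first read."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, reading from storage unless a cached copy exists."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def iterative_mean(dataset, iterations=10, step=0.5):
    """Gradient descent on squared error; every iteration is one full pass."""
    estimate = 0.0
    for _ in range(iterations):
        points = dataset.collect()
        gradient = sum(estimate - x for x in points) / len(points)
        estimate -= step * gradient
    return estimate


def load():
    """Stand-in for an expensive read from disk or the network."""
    return [1.0, 2.0, 3.0, 4.0]


uncached = ToyDataset(load)
cached = ToyDataset(load).cache()
result_uncached = iterative_mean(uncached)
result_cached = iterative_mean(cached)
print(uncached.loads, cached.loads)  # prints 10 1
```

Both runs compute the same estimate, but the uncached dataset is read once per iteration while the cached one is read only once. For algorithms that make many passes over the same data, that difference dominates the running time.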