Berkeley Data Analytics Stack (BDAS) is an open-source analytics stack for complex computations on massive dataset (big data) which supports efficient, large-scale in-memory (RAM) processing, and allows users and applications to trade between query accuracy, time, and cost.  The most important goal of massive dataset analysis is to provide the basis for decision-makings, such as personalized gamification and advertisement targeting through sophisticated exploratory analysis.

Current Tools Are Slow...

Today's analytics tools can provide complex processing on massive datasets but they are slow.  Achieving speedy responses (less than 1 second latency) on massive dataset is hard because information is usually stored over several, possibly up to thousands, of storage partitions and even if all information were in RAM memory, it takes 10s of seconds to just scan 200-300 GB RAM memory of a large server.  This makes the existing tools less suitable for real time complex computations, such as machine learning algorithms.  To try to provide low latency computations on massive datasets for both historical and live data, the AMPLab at UC Berkeley is developing Berkeley Data Analytics Stack or  BDAS which is an open source, next-generation data analytics stack.  

Figure 1 - Open Analytics Stack, BDAS Overview, Stoica, 2013Credit: BDAS Overview slides, Stoica, 2013

Figure 1 - Open Analytics Stack, BDAS Overview, Stoica, 2013

Berkely Data Analytics Stack leverages today’s open analytics stack (fig. 1)  which consists of four layers -- infrastructure layer (storage networks), storage layer (Hadoop), data processing layer (Pig, HBase, Hive) and the application layer.

BDAS Architecture

BDAS (fig. 2) adds an extra layer, Mesos, between infrastructure and storage for resource management which uses fine grain sharing and thus enables multiple frameworks to share the same infrastructure resulting in increased parallelism. BDAS extends the storage layer to enable better data management by adding the capability of in-memory processing, with Tachyon. BDAS also extends the data processing layer through Spark by adding the capabilities such as pre-computation, in-memory processing and data sampling.

 Figure 2 - Berkeley Data Analytics Stack (BDAS)Credit:

Figure 2 - Berkeley Data Analytics Stack (BDAS),

BDAS Components

Mesos:  a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.

Tachyon: High-throughput, fault-tolerant in-memory storage with interface compatible to HDFS and  support for Spark and Hadoop

Spark: a data processing layer framework designed for low-latency and iterative computation (eg. Machine Learning algorithms) on historical data. It provides fault-tolerant and efficient memory abstraction called Resilient Distributed Database (RDDs). Support Hadoop Distributed File System (HDFS) API, Amazon S3 API, and Hive metadata.  Spark provides an easy-to-program interface that is available in Java, Python, and Scala.

Shark: a port of Apache Hive onto Spark, Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive workhouses. Shark can run Hive queries 100 times faster when the dataset fits in memory and up to 5-10 times faster when the dataset is stored on disk.

Spark Streaming: extends Spark to provide streaming functionality and low latency computations on live dataset.  Accept inputs from Kafka, Flume, Twitter, TCP sockets.  With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.

GraphX: provides GraphLab API and Toolkits on top of Spark,  fault tolerance by leveraging Spark.

 MLbase: Make Machine-Learning (ML) accessible to non-experts  with declarative approach to ML.

BlinkDB: an approximate query processing engine that allows users to trade between accuracy, time, and cost.

In summary, BDAS framework and its components work together to provide a magnitude of speed improvement, allowing iterative analysis possible with massive historical and live datasets.