As part of the HP Software agenda of moving applications to Cloud and SaaS models, many developers and projects are dealing with the NoSQL dilemma. Do we need to use NoSQL? (Everybody is talking about NoSQL, hence, probably we need to use it too.) If yes, what is the most effective way to use it? This blog is not the definitive guide to a specific NoSQL solution. Instead, we will review the different NoSQL families and examples of when each one best suits your problem. So, if you came here to learn about Redis, MongoDB, Hadoop or Vertica, you will be very disappointed!
Let’s start with a simple question – what is NoSQL?
The common definition is a database that is not relational. This definition is not very precise: there are NoSQL relational DBMS systems, including the first system to carry the NoSQL name, created by Carlo Strozzi in 1998, which he called NoSQL because it does not express its queries using SQL. NoSQL could also be read as non-SQL, but this is incorrect too, since many NoSQL databases support the SQL99 standard.
Another definition we can find in many communities for NoSQL is Not Only SQL. This is also very misleading since not all NoSQL databases support SQL.
Finally, the latest movement in this rapidly changing domain is to use a different term – NewSQL. The best definition I found is in Matthew Aslett’s blog post: NewSQL describes SQL databases that provide scalable, high-performance services while changing the SQL language with which you manipulate the database data. Since this implies horizontal scalability, which is not necessarily a feature of all the products, “NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.”
So what is the problem with Relational Databases Management Systems (RDBMS)?
Most RDBMS systems don’t scale to the size required by 21st century systems. They focus on reliability and on supporting the ACID model (Atomicity, Consistency, Isolation and Durability). As presented by Michael Stonebraker, VoltDB CTO, in “Urban Myths about SQL”, existing RDBMS systems suffer from archaic architectures that are slow by design. They don’t leverage new hardware capabilities that would enable them to get rid of the disk buffer pool, locking, crash recovery overhead and inefficient multi-threading. The following picture shows the TPC-C CPU cycle distribution during ACID operations:
In addition, ODBC/JDBC data access is costly because of roundtrips between client and server. Stored procedures with compiled code either to intermediate representation or to C++ (or another programming language) are significantly more efficient.
The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees (you need to pick two).
- Consistency – all database nodes see the same data, even with concurrent updates.
- Availability – every request receives a response about whether it succeeded or failed.
- Partition tolerance – the system continues to operate despite arbitrary message loss or failure of part of the system.
To support an ACID RDBMS, you need C + A.
Now we need to understand the additional term frequently used by NewSQL/NoSQL solutions – Map-Reduce.
MapReduce is a patented software framework introduced by Google and used to simplify data processing across massive data sets. As people rapidly increase their online activity and digital footprint, a huge amount of data is generated continuously. This data can be of multiple types (text, rich text, RDBMS, graph, etc.) and organizations are finding it vital to quickly analyze this huge amount of data generated by their customers and audiences to better understand and serve them. MapReduce is the tool that is helping those organizations in quick and efficient analysis and brings business value to the organization.
The MapReduce framework is inspired by two functions, map and reduce, which are commonly used in functional programming languages like LISP. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In other words, data sets are broken down into several subsets that are then individually processed in parallel over multiple machines and the resultant data sets are merged together. The requirements for MapReduce can be broken down as follows:
- The data set should be big enough to ensure that splitting up the data will increase overall performance and will not be detrimental to it.
- The computations generally do not depend on external input; only the data set being processed is required.
- The results of processing one subset of the data must be mergeable with the results from the other subsets.
- The resultant data set should be smaller than the initial data set.
The MapReduce framework is implemented in many solutions, sometimes explicitly exposed to user programs and sometimes only implicitly, accessible through a different data-retrieval approach.
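The map and reduce functions described above can be illustrated with a minimal, single-process word-count sketch. This is only a toy model of the pattern; a real framework distributes the map and reduce phases across many machines.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in a line of text.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce: merge all intermediate values associated with the same key.
def reduce_fn(key, values):
    return (key, sum(values))

def map_reduce(lines):
    intermediate = defaultdict(list)
    for line in lines:  # in a real framework, lines are split across machines
        for key, value in map_fn(line):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

counts = map_reduce(["big data", "big fast data"])
print(counts)  # {'big': 2, 'data': 2, 'fast': 1}
```

The shuffle step (grouping intermediate values by key) is what the `intermediate` dictionary stands in for here; in Hadoop or similar systems it is the expensive, network-bound phase.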
As we’ll see soon, there are many different NewSQL/NoSQL solutions which are focusing on solving different problems. Let’s review the problems and their possible solutions.
1. New RDBMS systems
The following systems support ACID (C + A) and also provide horizontal scale:
- VoltDB – the product of Michael Stonebraker which is designed to support ACID for millions of complex transactions per second and linear scalability.
- NuoDB – a re-think of relational database technology, targeted at an elastic cloud of computers rather than a single computer system.
- SAP HANA – real-time platform combines high-volume transactions with analytics and text mining.
- JustOne – radically re-designed from the ground up and exploits modern hardware characteristics.
- Akiban – uses table-grouping™ technology that enables breakthrough performance, scalability and programmability.
- ScaleArc – provides an abstraction layer that sits in between application servers and database servers.
- ScaleBase – provides an abstraction layer with a single point of management for a distributed database environment.
- GenieDB – fuses the best of scalable NoSQL and hard-to-scale SQL architectures.
- Clustrix – optimized MySQL-compatible software and hardware appliance.
- Drizzle – a community-driven open source project forked from the popular MySQL database. The Drizzle team has removed non-essential code, re-factored the remaining code into a plugin-based architecture and modernized the code base, moving to C++.
- TokuDB – MySQL extended with Fractal Tree® Indexes.
- Citrusleaf – real-time NoSQL distributed database with ACID compliance, immediate consistency and 24×7 uptime.
Let’s review the architecture of VoltDB in order to understand this section:
- Designed by Michael Stonebraker, one of the best computer scientists specializing in database research. His previous projects include Vertica, StreamBase, Ingres, Postgres and many others.
- Stored procedures. To avoid ODBC/JDBC interactions, VoltDB works only with stored procedures.
- Based on in-memory storage. To avoid buffer pools and blocking disc operation, VoltDB stores its data in memory.
- Single threaded and transactional. VoltDB executes transactions using a single-threaded architecture in timestamp order. When a transaction has to run on multiple nodes, VoltDB uses “speculative execution”. In this model every CPU guarantees that it will process transactions in timestamp order. However, multi-partition transactions require inter-CPU messages and there may be some delay involved. Instead of waiting to process transactions with higher timestamps (and incurring a stall), the idea is to process a transaction in “tentative” mode until the multi-partition transaction is committed. If no conflict is observed, the tentative transactions can be committed; otherwise one or more need to be backed out. Effectively, this is a form of optimistic concurrency control (OCC).
- No logs. Instead of using logs, persistence is guaranteed by having the data replicated to multiple nodes. Each transaction is run on each node in the same timestamp order. Each replication is enabled for active-active usage.
- Sharding. Each table could be either replicated to all nodes or partitioned based on sharding configuration.
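The “tentative mode” idea behind speculative execution can be sketched as a toy optimistic concurrency control loop. The class and method names below are illustrative only, not VoltDB’s actual implementation: work proceeds on a tentative copy and is either published on commit or backed out on conflict.

```python
# Toy optimistic concurrency control: execute tentatively, commit on success,
# back out on conflict. Names are illustrative, not VoltDB's API.
class TentativeStore:
    def __init__(self, data):
        self.committed = dict(data)
        self.tentative = None

    def begin(self):
        self.tentative = dict(self.committed)  # work on a private copy

    def write(self, key, value):
        self.tentative[key] = value

    def commit(self, conflict):
        if conflict:
            self.tentative = None       # back out the tentative work
            return False
        self.committed = self.tentative  # publish the tentative state
        return True

store = TentativeStore({"balance": 100})
store.begin()
store.write("balance", 80)
store.commit(conflict=False)      # no conflict: tentative work is published
print(store.committed["balance"])  # 80

store.begin()
store.write("balance", 0)
store.commit(conflict=True)       # multi-partition conflict: rolled back
print(store.committed["balance"])  # still 80
```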
VoltDB will not replace DWH and OLAP systems, nor specialized data stores: graph-oriented, document-oriented, stream-oriented, object-oriented and key-value stores. As Michael Stonebraker mentions in his blog, non-ACID NoSQL solutions should be considered for applications when:
- The data does not lend itself to relational organization.
- You can say with certainty that your application will NEVER need competent transaction processing AND no future application that will use the database you’re building will need competent transaction processing.
- Use of a standard data language (SQL) is not beneficial to your team/organization.
Unless ALL of the above are true, it’s recommended to find an RDBMS that meets your needs.
2. MPP data warehouse solutions
“MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface. In some implementations, up to 200 or more processors can work on the same application. An “interconnect” arrangement of data paths allows messages to be sent between processors. Typically, the setup for MPP is more complicated, requiring thought about how to partition a common database among processors and how to assign work among the processors. An MPP system is also known as a “loosely coupled” or “shared nothing” system. An MPP system is considered better than a symmetrically parallel system (SMP) for applications that allow a number of databases to be searched in parallel. These include decision support system and data warehouse applications” – from http://whatis.techtarget.com
MPP solutions can be column-oriented, row-oriented, and a combination of both. In some cases the solution is based only on a software product, in other cases it also leverages specific hardware architecture.
The following are examples of MPP DWH solutions:
- Vertica – High performance MPP columnar database with a User-Defined Load System combined with the suite of built-in analytic functions including time-series, event-series and pattern matching.
- Aster Data – Hybrid row and column massively parallel processing (MPP) Analytic Platform, a software solution that embeds both SQL and MapReduce analytic processing with data stores for deeper insights on multi-structured data sources and types to deliver new analytic capabilities with breakthrough performance and scalability.
- GreenPlum – Shared-nothing hybrid (row- and column-oriented storage) MPP database with a high-performance parallel dataflow engine, software interconnect technology, and advanced high-performance parallel import and export of compressed and uncompressed data from Hadoop clusters.
- Netezza – High-performance data warehouse appliances and advanced analytics application for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning.
- InfiniDB – Column database designed to service the needs of Business Intelligence (BI), data warehouse, and analytical applications where scalability, performance and simplicity are paramount.
Let’s review major points of Vertica’s architecture and understand their importance:
- Column architecture
i. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
ii. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
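The read-efficiency argument above can be seen in a toy layout comparison. This is only a sketch of the row-versus-column idea, not Vertica’s actual storage format: aggregating one attribute in the columnar layout touches only that attribute’s data, while a row store must read every full record.

```python
# Row store: each record is kept together.
rows = [
    {"id": 1, "price": 10.0, "region": "EU"},
    {"id": 2, "price": 12.5, "region": "US"},
    {"id": 3, "price": 7.5,  "region": "EU"},
]

# Column store: the same data, with each attribute kept together.
columns = {
    "id": [1, 2, 3],
    "price": [10.0, 12.5, 7.5],
    "region": ["EU", "US", "EU"],
}

# Aggregating one column touches only that column's data...
total_price = sum(columns["price"])
# ...whereas a row store must read every full record to reach one field.
total_price_rows = sum(r["price"] for r in rows)

print(total_price, total_price_rows)  # 30.0 30.0
```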
- Aggressive data compression
i. Compression can save big bucks on storage costs, increase data center density, or allow more data to be kept, while simultaneously increasing query performance in cases where I/O is the bottleneck.
ii. Compression is better on column store – if you walk down attribute columns there is more similarity than if you cut across the rows.
iii. Data that is well organized compresses better than data that is located haphazardly.
iv. Vertica Doesn’t Do In-Place Updates. Since new values could come along that don’t compress as well as the old values, some empty space must be left, or updates foregone. Since Vertica puts updates in a separate place (such as the Write Optimized Store), we can squeeze every last bit out of the data.
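The “well organized data compresses better” point can be illustrated with run-length encoding, one of the simple schemes used on sorted columns. This is a sketch of the principle, not Vertica’s actual codec: the same values compress into far fewer runs once the column is sorted.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# The same column data, haphazard vs sorted.
haphazard = ["EU", "US", "EU", "US", "EU", "US"]
sorted_col = sorted(haphazard)  # ['EU', 'EU', 'EU', 'US', 'US', 'US']

print(run_length_encode(haphazard))   # six runs - no compression at all
print(run_length_encode(sorted_col))  # [('EU', 3), ('US', 3)] - two runs
```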
- A hybrid transaction architecture that supports concurrent, parallelized data loading and querying
i. Data warehouses are often queried by day and bulk-loaded by night. The problem is, there’s too much data to load at night and users are demanding more real-time data. Vertica features a hybrid architecture that allows querying and loading to occur in parallel across multiple projections. Each Vertica site contains a memory-resident Write-Optimized Store (WOS) for recording inserts, updates and deletes and a Read-Optimized Store (ROS) for handling queries. WOS contents are continuously moved into the associated ROS asynchronously.
ii. Lightweight transaction management prevents database reads and writes from conflicting so queries can run against data in the ROS, WOS or in both.
- Automatic physical database design
i. Based on DBA-provided logical schema definitions and SQL queries, Vertica automatically determines what projections to construct and where to store them to optimize query performance and high availability.
- Multiple physical sort orders (“projections”) and grid/shared-nothing hardware architecture
i. Vertica supports logical relational models; physically, it stores data as “projections”—collections of sorted columns (similar to materialized views).
ii. Multiple projections stored on networked, shared-nothing machines (“sites”) can contain overlapping subsets of columns with different sort orders to ensure high availability and enhance performance by executing queries against the projection(s) with the most appropriate columns and sort orders.
- Automatic “log-less” recovery by query and High availability without hardware redundancy
i. Rather than having a mirrored database backup sitting idle for failover purposes, Vertica leverages the redundancy built into the database’s projections. It queries projections not only to handle user requests, but also for rebuilding the data in a recently restored projection or site. The Database Designer builds the necessary redundancy into the projections it creates such that a DBA-specified number of site failures can occur without compromising the system.
ii. Vertica’s approach to recovery avoids bogging down database performance with expensive logging and two-phase commit operations.
- Connectivity to applications, ETL, and reporting via SQL, JDBC and ODBC
i. Vertica offers this industry-standard connectivity through ODBC, JDBC, ADO.Net and our rich API as well as native integrations and certifications with a variety of tools like Cognos, MicroStrategy, Tableau and others.
ii. Vertica offers an Informatica plug-in for ETL processing.
iii. Vertica provides Hadoop and Pig connectors; users have unprecedented flexibility and speed in loading data from Hadoop to Vertica and querying data from Vertica in Hadoop as part of MapReduce jobs, for example. The Vertica Hadoop and Pig connectors are open source, supported by Vertica, and are available for download.
3. Distributed Batch processing
This type of data processing is getting a lot of buzz these days and many projects are evaluating it or using it in their content life cycle. Mike Olson (Cloudera CEO) said “these technologies attempt to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. These technologies apply to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But distributed batch processing can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them.”
The best examples of this type of solution are:
- Apache™ Hadoop™ – The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
- HPCC Systems – High Performance Computing Cluster is a massively parallel-processing computing platform by LexisNexis that solves Big Data problems.
- The Disco Project – Disco is an open-source project developed by Nokia Research Center to solve real problems in handling massive amounts of data.
Let’s review Hadoop ecosystem to better understand the problems it attempts to solve.
A little history (by Mike Olson): The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.
Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together.
The major parts of Hadoop ecosystem are:
- Hadoop Distributed File System (HDFS™) – A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce – A software framework for distributed processing of large data sets on compute clusters.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A Scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
Let’s review the HPCC solution and see how it differs from Hadoop.
HPCC Systems was created 10 years ago within the LexisNexis Risk Solutions division that analyzes huge amounts of data for its customers in intelligence, financial services and other high-profile industries.
According to Armando Escalante, CTO of LexisNexis Risk Solutions, the company decided to release HPCC now because it wanted to get the technology into the community before Hadoop became the de facto option for big data processing (from Derrick Harris blog). But in order to compete for mindshare and developers, he said, the company felt it had to open-source the technology.
Instead of MapReduce technique, HPCC uses ECL (Enterprise Control Language): a declarative, data-centric language that abstracts a lot of the work necessary within MapReduce. For certain tasks that take a thousand lines of code in MapReduce, he said, ECL only requires 99 lines. Furthermore, he explained, ECL doesn’t care how many nodes are in the cluster because the system automatically distributes data across however many nodes are present. Technically, though, HPCC could run on just a single virtual machine. And, says Escalante, HPCC is written in C++ — like the original Google MapReduce on which Hadoop MapReduce is based — which he says makes it inherently faster than the Java-based Hadoop version.
HPCC offers two options for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Escalante said Thor — so named for its hammer-like approach to solving the problem — crunches, analyzes and indexes huge amounts of data a la Hadoop. Roxie, on the other hand, is more like a traditional relational database or database warehouse that even can serve transactions to a web front end.
The community support around Hadoop provides a significant benefit to that project but, on the other hand, the simplicity of ECL may allow analytics to become more pervasive across the enterprise market.
4. Key-Value databases
Key-value stores are schema-less: each record consists of a string that represents the key and the actual data, which is considered the value. The value is usually a primitive or an object marshaled by the programming language. Binding the key to the value replaces the need for a fixed data model and relaxes the requirement for properly formatted data.
The most common cases for key-value store usage are caching mechanisms, temporary storage for some computation processing (like hash-maps) and queue based applications.
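The caching use case can be sketched with a toy key-value cache that supports per-key expiry, the kind of pattern stores like Redis or Memcached are typically used for. The class is illustrative only, not any product’s API.

```python
import time

# Toy key-value cache with expiry. Keys map to (value, expires_at) pairs;
# expired entries are lazily evicted on read.
class KVCache:
    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds=60):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:  # entry outlived its TTL
            del self._data[key]
            return None
        return value

cache = KVCache()
cache.set("session:42", {"user": "alice"}, ttl_seconds=30)
print(cache.get("session:42"))  # {'user': 'alice'}
print(cache.get("missing"))     # None
```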
The following are several examples of key-value databases:
- Redis – an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
- Memcached/Membase – an open source distributed key-value database management system optimized for storing data behind interactive web applications.
- Voldemort – an open source fault tolerant distributed key-value storage system which supports distribution across data centers that are geographically far apart.
- Amazon Dynamo – a highly available, proprietary key-value structured storage system or a distributed data store. It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a web service, but is used to power parts of other Amazon Web Services such as Amazon S3.
- RaptorDB – an open source embedded NoSQL persisted dictionary using B+tree or MurmurHash indexing.
- RethinkDB – memcached compatible persistent store designed for solid-state discs and multicore systems.
- LevelDB – a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- Kyoto Tycoon – a persistent lightweight cache server.
5. Document-oriented DB
The principles of object-oriented design reflect that information is naturally organized in trees or graphs and not in tables. The best method found to close this gap is ORM (object-relational mapping). The ORM layer is responsible for translating an object-oriented design into a logical/physical ERD (entity-relationship diagram). It is also responsible for generating SQL queries for different DB providers (taking care of SQL dialects). Unfortunately, there are many cases when ORM can’t close the gap, such as bulk operations, complex data processing, etc. In addition, multiple joins are a serious barrier to understandable and maintainable business logic. In all these cases there is a need to break the object-oriented design and use alternative techniques.
The document-oriented store can store the hierarchical documents without a predefined data model (schema-less approach). Schema-less store also fits the DevOps approach, which is much harder to achieve with standard RDBMS systems. So, if you have a frequently changing data model you should consider using document-oriented databases.
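The schema-less approach can be sketched as a collection of JSON-like dictionaries queried by field values. This is illustrative only, not any product’s API: documents with different shapes coexist in one collection, so a changing data model needs no migration.

```python
# Minimal schema-less document collection: documents with different fields
# can live side by side, and the "schema" can evolve without migrations.
class DocumentCollection:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, **criteria):
        # Return documents whose fields match all given criteria.
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

people = DocumentCollection()
people.insert({"name": "Ada", "role": "engineer"})
# A later document introduces a new field - no schema change required.
people.insert({"name": "Carl", "role": "engineer", "team": "platform"})

print(people.find(role="engineer", team="platform"))
# [{'name': 'Carl', 'role': 'engineer', 'team': 'platform'}]
```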
Below are several examples of document-oriented databases:
- MongoDB – a scalable, high-performance, open source NoSQL database with JSON-style documents and dynamic schemas
- RavenDB – a transactional, open-source Document Database written in .NET, offering a flexible data model
- Clusterpoint – a high-performance document-oriented NoSQL DBMS platform software combining XML/JSON database, clustering and rich enterprise SEARCH features
- SisoDb – a document-oriented DB-provider for SQL-Server written in C#. It lets you store object graphs of POCOs without having to configure any mappings
6. Object-oriented DB
The reasonable question to ask here is why do we need to stop at document-oriented store, instead of going directly to the object-oriented DB (ODBMS)?
To understand the difference, let’s see how a JSON-like document is different from Object Definition Language (ODL) (from MongoDB blog):
- Objects have methods, predefined schema, and inheritance hierarchies. These are not present in a document database; code is not part of the database
- While some relationships between documents may exist, pointers between documents are de-emphasized
- The document store does not persist “graphs” of objects — it is not a graph database.
- In the document store there is an explicit declaration of indexes on specific fields for the collection. This is not always possible in object stores.
- Object-oriented models are usually better bound to a specific environment. If you use exclusively Java, for example, you have a lot of choices. But if your environment is hybrid (Java, COBOL, C++, etc.) there is no common denominator.
Below are several examples of object-oriented databases:
- Db4o – is an embeddable open source object database for Java and .NET developers. It is developed, commercially licensed, and supported by Versant.
- NeoDatis – a very simple Object Database that currently runs on Java, .Net, Google Android, Groovy and Scala.
- Objectivity/DB – a commercial object database produced by Objectivity, Inc. It allows applications to make standard C++, Java, Python or Smalltalk objects persistent without having to convert the data objects into the rows and columns used by a relational database management system (RDBMS). Objectivity/DB supports the most popular object-oriented languages plus SQL/ODBC and XML.
- ObjectStore – a commercial object database for Java and C++ applications that can cache data to deliver performance.
- Starcounter – a memory centric, ACID compliant, transactional database, optimized for modern CPUs with .NET object API. It is reliable since transactions are secured on disk, and it supports replication and full checkpoint recovery.
- Perst – An open source, object-oriented embedded database for Java and C#.
7. Graph DB
So what is the difference between object-oriented and graph-oriented databases? Both operate with an object’s topology and have to deal with complex relations issues. I found this interesting answer in InfoGrid blog:
Object and graph databases operate on two different levels of abstraction. An object database’s main data elements are objects, the way we know them from an object-oriented programming language. A graph database’s main data elements are nodes and edges. An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity, etc. A graph database does not have the notion of a pointer that can be NULL. A graph database is independent of the application platform (Java, C# etc). On the other hand, using a graph database you can’t simply take an arbitrary object and persist it.
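The nodes-and-edges abstraction with automatic referential integrity can be sketched as follows. This is a toy adjacency structure, not any graph database’s engine: adding an edge maintains both directions, and removing a node leaves no dangling references.

```python
from collections import defaultdict

# Toy graph: bidirectional edges with automatic referential integrity.
class Graph:
    def __init__(self):
        self.neighbors = defaultdict(set)

    def add_edge(self, a, b):
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)  # the reverse direction is kept for free

    def remove_node(self, node):
        for other in self.neighbors.pop(node, set()):
            self.neighbors[other].discard(node)  # no dangling references

g = Graph()
g.add_edge("alice", "bob")
g.add_edge("bob", "carol")
print(sorted(g.neighbors["bob"]))  # ['alice', 'carol']

g.remove_node("bob")
print(g.neighbors["alice"])        # set()
```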
Below are several examples of graph databases:
- Bigdata® – a scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates.
- Neo4J – an open-source / commercial graph database with an embedded, disk-based, fully transactional Java persistence engine.
- InfoGrid – a Web Graph Database with many additional software components that make the development of REST-ful web applications on a graph foundation easy. InfoGrid is open source and is developed in Java
- Infinite Graph – a highly scalable, distributed, and cloud-enabled commercial product with flexible licensing for startups.
- Trinity – distributed in-memory graph engine under development at Microsoft Research Labs.
- HyperGraphDB – a general purpose, open-source graph and Java object-oriented (hybrid) database based on a powerful knowledge management formalism known as directed hypergraphs.
- AllegroGraph – a closed source, scalable, high-performance graph database.
- OrientDB – a high-performance open source document-graph (hybrid) database.
- VertexDB – high performance graph database server that supports automatic garbage collection.
In addition, we can’t finish with graph databases without mentioning a new concept in graph processing, called Pregel, which was created by Google (on top of BSP – the Bulk Synchronous Parallel computational model).
Claudio Martella describes Pregel as follows: Pregel is a system for large-scale graph processing. It provides a fault-tolerant framework for the execution of graph algorithms in parallel over many machines. Think of it as MapReduce re-thought for graph operations.
But what’s wrong with MapReduce and graph algorithms? Nothing in particular, though it can lead to suboptimal performance because the graph state has to be passed from one phase to the other, generating a lot of I/O. It also has some usability issues, since it doesn’t provide a way to do a per-vertex calculation. In general, it’s not easy to express graph algorithms in M/R. Pregel fills a gap, since there are no frameworks for graph processing that address both distribution and fault tolerance.
Like M/R, it provides the possibility to define Combiners in order to reduce message passing overhead – it combines messages together where semantically possible. Like Sawzall (a procedural domain-specific programming language, used by Google to process large numbers of individual log records), Pregel provides Aggregators which allow global communication by receiving messages from multiple vertices, combining them and sending the result back to the vertices. They are useful for statistics (think of a histogram of vertex degrees) or for global controlling (for example an aggregator can collect all the vertices’ PageRank deltas to calculate the convergence condition).
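The vertex-centric model can be sketched as a BSP loop: in each superstep every vertex reads its incoming messages, updates its value, and sends messages to its neighbors, until no vertex changes. The example below propagates the maximum value through the graph, a classic Pregel illustration; it is a single-process toy, without the distribution and fault tolerance that Pregel provides.

```python
# Toy Pregel-style computation: each vertex repeatedly takes the max of the
# messages it receives and forwards its value, until no vertex changes
# (i.e., all vertices would "vote to halt").
def propagate_max(edges, values):
    changed = True
    while changed:                          # one iteration = one superstep
        changed = False
        inbox = {v: [] for v in values}
        for src, dst in edges:              # message-passing phase
            inbox[dst].append(values[src])
        for v, messages in inbox.items():   # per-vertex compute phase
            new_value = max([values[v]] + messages)
            if new_value != values[v]:
                values[v] = new_value
                changed = True
    return values

edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
result = propagate_max(edges, {"a": 3, "b": 6, "c": 2})
print(result)  # {'a': 6, 'b': 6, 'c': 6}
```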
This framework inspired several projects to provide frameworks with similar APIs:
- Giraph – a graph-processing framework which provides Pregel’s API that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure.
- Apache Hama – a distributed computing framework based on BSP computing techniques for massive scientific computations, e.g., matrix, graph, and network algorithms.
- GoldenOrb – a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.
- Phoebus – a Pregel implementation in Erlang.
- Signal/Collect – a framework for synchronous and asynchronous parallel graph processing. It allows programmers to express many algorithms on graphs in a concise and elegant way.
- HipG – a library for high-level parallel processing of large-scale graphs. HipG is implemented in Java and is designed for a distributed-memory machine. Besides basic distributed graph algorithms, it handles divide-and-conquer graph algorithms and algorithms that execute on graphs created on-the-fly.
8 CEP (Complex Event Processing)
Complex event processing is one of the most important parts of business decision making and real-time business analytics (the opposite of batch processing). Wikipedia defines CEP as follows:
Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
CEP relies on a number of techniques, including:
- Event-pattern detection
- Event abstraction
- Modeling event hierarchies
- Detecting relationships (such as causality, membership or timing) between events
- Abstracting event-driven processes
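As a toy illustration of the first technique, event-pattern detection, the following sketch (all names and the failed-login scenario are hypothetical) scans an ordered stream of (timestamp, source, kind) events and flags any source that produces three or more failures within a sliding 60-second window:

```python
from collections import deque

def detect_bursts(events, threshold=3, window=60):
    """Flag (timestamp, source) pairs where `source` has produced
    `threshold` or more 'login_failed' events within the last `window`
    seconds. `events` must be ordered by timestamp."""
    recent = {}                       # source -> timestamps of recent failures
    alerts = []
    for ts, src, kind in events:
        if kind != "login_failed":
            continue
        q = recent.setdefault(src, deque())
        q.append(ts)
        while ts - q[0] > window:     # expire failures outside the window
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, src))
    return alerts
```

A real CEP engine such as Esper or StreamBase expresses the same idea declaratively, as a pattern or window query over the stream, rather than with hand-written window bookkeeping.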
CEP is primarily focused on operational data and dealing with operational issues, but it is also connected to ex-post analysis of event data: it enables efficient archiving and querying of valuable event data for later analysis.
As Heinz Roth explained in his work, the knowledge gained from event-data analysis can help to adapt and improve the underlying CEP application.
An Event Data Warehouse (EDWH) is a subject-oriented, integrated, time-variant and non-volatile collection of event data in support of operational or management’s decision-making process. The EDWH stores data items which originate from a continuous stream of events and stores the following types of data items:
- Data items representing the original events in event streams
- Data items capturing relationship information for event sequences resulting from event correlations
- Derived or calculated data items from events or event streams.
An EDWH supports a query language for accessing all three types of data items. In order to analytically process event data ex-post, it has to be permanently archived somewhere. The common solution for this problem is to make use of a data warehouse (DWH) approach where data is added periodically in a batch using an extract, transform and loading (ETL) process.
An additional definition for these kinds of systems is known as active data warehousing. It is described as follows:
An active data warehouse (ADW) is a data warehouse implementation that supports near-time or near-real-time decision making. It features event-driven actions triggered by a continuous stream of queries (generated by people or applications) against a broad, deep, granular set of enterprise data.
The following are several examples of such CEP systems:
- StreamBase – the commercial version of the project called Aurora set up by Mike Stonebraker at MIT, in conjunction with researchers from Brandeis University and Brown University. StreamBase’s web site says that its complex event processing (CEP) platform allows for the rapid building of systems that analyze and act on real-time streaming data for instantaneous decision-making, and combines a rapid application development environment, an ultra-low-latency high-throughput event server, and connectivity to real-time and historical data.
- Sybase Aleri Event Stream Processor – the complex event processing platform for rapid development and deployment of business critical applications that analyze and act on high velocity and high volume streaming data – in real-time.
- Esper – a component for complex event processing with Event Processing Language (EPL) for dealing with high frequency time-based event data. Esper supports storing the state to Apache Cassandra.
- Yahoo! S4 (Simple Scalable Streaming System) – general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
- Twitter Storm – a free and open source distributed realtime computation system.
- Apache Kafka – a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn’s activity stream and operational data processing pipeline.
- RuleCore CEP Server – a Complex Event Processing server for real-time detection of complex event patterns from live event streams. It is a system-level building block, providing reactivity in an event-driven architecture.
- Splunk – software to search, monitor and analyze machine-generated data by applications, systems and IT infrastructure at scale via a web-style interface. Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
9 Column Family Databases
Wikipedia defines column family databases (CFDB) as follows:
A column family is a NoSQL object that contains columns of related data. It is a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is a set of columns. In analogy with relational databases, a column family is like a “table”, with each key-value pair being a “row”. A column family represents how the data is stored on disk: all the data in a single column family will be located in the same set of files. A column family can contain columns or super columns. Each column is a tuple (triplet) consisting of a column name, a value, and a timestamp. In a relational database, this data would be grouped together within a table with other non-related data. A super column is a dictionary; it is a column that contains other columns (but not other super columns).
Please don’t confuse column-oriented MPP systems (like Vertica) with column-family databases. The following is based on Greg Linden’s post about C-Store (Vertica’s prototype):
CFDB is also column-oriented and optimized for reads. It is designed for sparse table structures and compresses data. It has relaxed consistency. It is extremely fast. There are some big differences:
- CFDB is not designed to support arbitrary SQL; it is a very large, distributed map.
- CFDB emphasizes massive data and high availability on very large clusters, including cross data centers clusters.
- CFDB is designed to support historical queries (e.g. get data as it looked at time X).
- CFDB does not require explicit table definitions and strings are the only data type.
The following are several examples of such CFDB systems:
- Google BigTable – a compressed, high performance, and proprietary data storage system built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies. It is not distributed outside Google, although Google offers access to it as part of its Google App Engine.
- Hypertable – an open source database inspired by publications on the design of Google’s BigTable. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++.
- Apache Cassandra – a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. Jeff Hammerbacher, who led the Facebook Data team at the time, described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure.
- Apache HBase (Hadoop) – provides Bigtable-like capabilities on top of Hadoop Core.
- Apache Accumulo – a sorted, distributed key/value store that is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift.
- Cloudata – an open source implementation of Google’s BigTable.
10 In Memory Data Grid
An In Memory Data Grid (IMDG) decouples RAM resources from individual systems in the data center or across multiple locations, and aggregates those resources into a virtualized memory pool available to any computer in the cluster. An IMDG uses a key-value structure, where the value is object-based, and a shared-nothing architecture in which data is partitioned onto nodes connected into a seamless, expandable, resilient fabric capable of spanning process, machine, and geographical boundaries.
In many cases a user will be able to configure ACID support, similar to some key-value databases. To better understand the IMDG solution, let’s take a look at the GemFire architecture diagram:
In this diagram we can see how it is possible to connect three different datacenters across the globe into one coordinated virtual environment with shared data.
Let’s review the main distinguishing features of IMDG, using GemFire as an example:
- Wide Area Data Distribution – GemFire’s WAN gateway allows distributed systems to scale out in an unbounded and loosely-coupled fashion without loss of performance, reliability and data consistency.
- Heterogeneous Data Sharing – C#, C++ and Java applications can share business objects with each other without going through a transformation layer such as SOAP or XML. A change to a business object in one language can trigger reliable notifications in applications written in the other supported languages.
- Object Query Language (OQL) – a subset of SQL-92 with some object extensions, to query and register interest in data stored in GemFire.
- Map-reduce framework – a programming model in which an application function can be executed on just one fabric node, in parallel on a subset of nodes, or in parallel across all the nodes.
- Continuous Availability – in addition to guaranteed consistent copies of data in memory across servers and nodes, applications can synchronously or asynchronously persist the data to disk on one or more nodes. GemFire’s shared-nothing disk architecture ensures very high levels of data availability.
- High Scalability – scalability is achieved through dynamic partitioning of data across many member nodes and spreading the data load across the servers. For ‘hot’ data, the system can be dynamically expanded to have more copies of the data. Application behavior can also be provisioned and routed to run in a distributed manner in proximity to the data it depends on.
- Low and Predictable Latency – GemFire uses a highly optimized caching layer designed to minimize context switches among threads and processes.
- Very High Throughput – GemFire uses concurrent main-memory data structures and a highly optimized distribution infrastructure, offering 10X or more read and write throughput, compared with traditional disk-based databases.
- Co-located Transactions to Dramatically Boost Throughput – Multiple transactions can be executed simultaneously across several partitioned regions.
- Improved Scale-out Capabilities – Subscription processing is now partitioned to enable access by many more subscribers with even lower latency than before. Clients communicate directly with each data-hosting server in a single hop, increasing access performance 2 to 3 times for thin clients.
- Spring Integration and Simplified APIs for Greater Development Ease – Thanks to the Spring GemFire Integration project, developers will be able to easily build Spring applications that leverage GemFire distributed data management. In addition, GemFire APIs have been modified for ease of startup and use. The developer samples included with GemFire have been updated to reflect the new APIs.
- Enhanced Parallel Disk Persistence – Our newly redesigned “shared nothing” parallel disk persistence model now provides persistence for any block of data: partitioned or replicated. This enables all your operational data to safely “live” in GemFire, greatly reducing costs by relegating the database to an archival store.
- L2 Caching for Hibernate – With L2 caching, developers can implement GemFire’s enterprise-class data management features for their Spring Hibernate applications. Highly scalable and reliable GemFire L2 caching vastly increases Hibernate performance, reduces database bottlenecks, boosts developer productivity, and supports cloud-scale deployment.
- HTTP Session Management for Tomcat and vFabric tc Server – GemFire lets you decouple session management from your JSP container. You can scale application server and HTTP session handling independently, leveraging GemFire’s ability to manage very large sessions with high performance and no session loss. GemFire HTTP Session Management is pre-configured and can launch automatically with tc Server. For Tomcat, the module is enabled via minor configuration modifications.
Below are several examples of IMDG systems:
- VMware vFabric™ GemFire® – a distributed data management platform providing dynamic scalability, high performance, and database-like persistence. It blends advanced techniques like replication, partitioning, data-aware routing, and continuous querying.
- GigaSpaces XAP Elastic Caching Edition – an in-memory data grid for fast data access, extreme performance, and scalability. XAP eliminates database bottlenecks and guarantees consistency, transactional security, reliability, and high availability of your data.
- Infinispan – an extremely scalable, highly available data grid platform – 100% open source, and written in Java. The purpose of Infinispan is to expose a data structure that is highly concurrent, designed from the ground up to make the most of modern multi-processor/multi-core architectures, while at the same time providing distributed cache capabilities. At its core, Infinispan exposes a Cache interface which extends java.util.Map. It is also optionally backed by a peer-to-peer network architecture to distribute state efficiently around a data grid.
- Hazelcast – an open source clustering and highly scalable data distribution platform for Java.
- Oracle Coherence – replicated and distributed (partitioned) data management and caching services on top of a reliable, highly scalable peer-to-peer clustering protocol.
- Terracotta BigMemory – stores “big” amounts of data in machine memory for ultra-fast access.
This overview of Big Data problems and NewSQL/NoSQL solutions aims to help you navigate the data management world, which is evolving at an insane pace. In addition, I hope it helped to remove the bias against non-standard solutions, and understand the trend of moving solutions to SaaS and Cloud environments.
Since we didn’t dive deep enough into any of the presented solutions, I recommend collecting real stories from different HP Software groups who are currently tackling big data problems. They may have solved the problems using one of the proposed solutions, or perhaps one that isn’t included in this blog.
 Urban Myths about SQL, Michael Stonebraker, 2011
 Clarifications on the CAP Theorem and Data-Related Errors, Michael Stonebraker, October 2010
 Towards Robust Distributed Systems, Eric A. Brewer, July 2000
 Event Data Warehousing for Complex Event Processing, Heinz Roth, May 2010