In an era when we eat, sleep, talk, and work data, databases are one of the most relevant topics to discuss. A database is an organized collection of data (information) that can easily be accessed, managed, and updated. It typically comprises schemas, tables, queries, reports, views, and other objects. There are different types of databases, and Apache Cassandra falls into the NoSQL category. Before we proceed, let us first understand what a NoSQL database is.
NoSQL Database – Definition
A NoSQL database (often referred to as Not Only SQL) is a non-relational and largely distributed database system that enables fast, ad-hoc organization and analysis of high-volume, disparate data types. It provides a mechanism to store and retrieve data that is modeled differently from the tabular relations used in relational databases. NoSQL encompasses a wide range of technologies and architectures and is particularly useful when an organization needs to access and analyze huge amounts of unstructured data stored remotely on multiple servers in the cloud. For this reason, NoSQL databases are sometimes also described as cloud databases.
The main objective of a NoSQL database is to offer:
- simplicity of design
- horizontal scaling
- precise control over availability
What is Apache Cassandra?
As many of our techie friends will know, Apache Cassandra is a distributed database designed to handle large amounts of data across many commodity servers. It is a type of NoSQL database.
Cassandra was born at Facebook, where it was developed for inbox search, and its design draws on Amazon’s Dynamo and Google’s Bigtable. Facebook open-sourced the database in July 2008, and it was accepted into the Apache Incubator in March 2009. Cassandra became an Apache top-level project in February 2010.
Interestingly, the name of this Apache project is inspired by Greek mythology. Cassandra, the daughter of King Priam of Troy, had the ability to accurately predict the future, but nobody believed her. Ironically, Apache Cassandra also faces strong resistance (or disbelief) from the traditional RDBMS world in spite of its capabilities.
Features of Apache Cassandra
Apache Cassandra offers outstanding technical capabilities including operational simplicity and easy data distribution across multiple data centers. Some of the important features of Cassandra are as follows:
- Linear-scale performance – Cassandra is linearly scalable: throughput increases as you add nodes to the cluster, while response times stay fast.
- Continuous availability – Cassandra offers continuous availability with no single point of failure. It provides an always-on architecture for applications that are critical to a business.
- Elastic scalability – Cassandra is massively scalable thanks to its masterless architecture, in which all nodes are the same. You can add more hardware to accommodate more data and more customers as requirements grow.
- Strong data protection – Cassandra's commit log design protects against data loss: writes are recorded in the log so that they can be recovered after a failure. Built-in security and backup/restore features help keep data protected.
- Flexible data storage – Cassandra can accommodate various data formats including semi-structured, structured, and unstructured data. It also enables you to incorporate changes to your data structures according to your need.
- Multi-data center replication – Cassandra offers the flexibility to distribute data wherever it is needed by replicating it across multiple data centers in different geographies.
- Data compression – Cassandra can compress data substantially, reportedly by up to 80%, with little performance overhead. Actual ratios depend on the data and the compression settings.
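- Transaction support – Cassandra does not offer full ACID (Atomicity, Consistency, Isolation, Durability) transactions in the way a relational database does. Instead, it provides atomic, isolated, and durable writes at the level of a single row, together with tunable consistency that lets you choose how many replicas must acknowledge each read or write.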
- Fast writes – Initially designed to run on cheap commodity hardware, Cassandra can perform fast writes and store hundreds of terabytes of data, without compromising the read efficiency.
- Cassandra Query Language (CQL) – Cassandra comes with an SQL-like language that helps you to shift from a relational database seamlessly.
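To make the commit log idea from the feature list more concrete, here is a minimal Python sketch of a log-first write path: every write is appended to a log on disk before it is applied in memory, so the in-memory state can be rebuilt after a crash. The `TinyStore` class, the JSON record format, and the `demo` function are hypothetical names invented for this illustration. Cassandra's actual commit log format, sync policy, and flushing to disk are considerably more involved.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store (hypothetical): log to disk first, then update memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}  # in-memory view of the data
        self._replay()

    def _replay(self):
        # On startup, rebuild the in-memory state from the commit log, if any.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def write(self, key, value):
        # 1. Append sequentially to the commit log and force it to disk.
        #    (Syncing on every write is a simplification made for this sketch.)
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the write to the in-memory table.
        self.memtable[key] = value

    def read(self, key):
        return self.memtable.get(key)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "commit.log")
        store = TinyStore(path)
        store.write("user:1", "alice")
        store.write("user:2", "bob")
        # Simulate a crash: ignore the old object and recover from the log alone.
        recovered = TinyStore(path)
        return recovered.read("user:1"), recovered.read("user:2")
```

Because the log is only ever appended to, writes stay sequential on disk, which is one reason this style of design pairs well with fast writes.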
Apache Cassandra Architecture
Unlike master-slave or manually sharded architectures, Cassandra uses a masterless “ring” design. All nodes in Cassandra play an identical role and communicate with each other as equals; there is no master node. Cassandra has a built-for-scale architecture that is capable of handling huge amounts of data and large numbers of concurrent users or operations per second. Because the architecture is masterless, there is no single point of failure, which is what makes continuous availability and uptime possible.
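The ring idea can be sketched in a few lines of Python. In this toy model, each node is hashed to a position (a token) on the ring, a row key is hashed the same way, and the row is stored on the next few nodes found by walking clockwise from the key's position. The `Ring` class, the `token` helper, the use of MD5, and the one-token-per-node layout are all assumptions made for illustration. Cassandra itself uses a configurable partitioner and replication strategy, which this sketch does not attempt to reproduce.

```python
import bisect
import hashlib


def token(value):
    # Map a string to a position on the ring with a stable hash.
    # MD5 is used here purely for illustration.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring (hypothetical): every node is treated identically."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node owns one position on the ring; keep entries sorted by token.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key):
        # Walk clockwise from the key's token and take the next distinct nodes.
        start = bisect.bisect_right(self.tokens, token(row_key))
        count = min(self.replication_factor, len(self.entries))
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")  # three distinct nodes responsible for this key
```

Any node can run this computation for any key, so no node needs to act as a coordinator of record, which is the essence of having no master.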
On the surface, the data model of Apache Cassandra appears quite relational. On closer inspection, however, it can look like an unfinished RDBMS that lacks features such as JOINs and other rich-query capabilities. To understand this kind of data model, we need to remember that distributed storage services like Cassandra were designed to handle enormous amounts of data, stored in billions of rows on large clusters. At that scale, Cassandra deliberately drops some of the relational features that tie tables together, and it does not allow operations, such as JOINs or rich queries, that would require scanning extensive parts of the data set. Cassandra has only two ways to query: by key or by key range.
This bare-minimum data model rests on the fact that a single table is easier to distribute over multiple machines than a number of normalized relations or graphs. Cassandra uses row-oriented storage, with all the contents of a row serialized together on disk. Every row has a unique key, and each row can hold up to 2 billion columns. In Cassandra, data is partitioned solely by row key.
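The two query paths described above, by key and by key range, can be illustrated with a small single-machine Python sketch. The `KeyedTable` class and its methods are hypothetical names invented for this example, and the sketch assumes that row keys are kept in sorted order, which is what makes a range lookup cheap. It says nothing about how keys are ordered or distributed across a real cluster.

```python
import bisect


class KeyedTable:
    """Toy table (hypothetical): rows are reachable only by key or key range."""

    def __init__(self):
        self.keys = []  # row keys kept in sorted order
        self.rows = {}  # row key -> dict of column name -> value

    def put(self, key, columns):
        if key not in self.rows:
            bisect.insort(self.keys, key)
            self.rows[key] = {}
        # A row is a flexible set of columns; new columns can be added any time.
        self.rows[key].update(columns)

    def get(self, key):
        # Query by key: a direct lookup that never touches other rows.
        return self.rows.get(key)

    def get_range(self, start, end):
        # Query by key range: all rows with start <= key <= end, in key order.
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_right(self.keys, end)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]


table = KeyedTable()
table.put("user:1", {"name": "alice"})
table.put("user:3", {"name": "carol"})
table.put("user:2", {"name": "bob"})
table.put("user:1", {"email": "a@example.com"})  # add a column to an existing row
```

Notice that there is no way to ask for "all rows where name is bob" without scanning every row. That is exactly the kind of operation this data model leaves out.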
With its truly masterless architecture, Apache Cassandra serves developers, administrators, and central IT teams alike. Cassandra has accomplished a lot in the last 8 years and has emerged as one of the most scalable databases.