Intro to Apache CASSANDRA (NoSQL Database)

Blog | Mar 3, 2017

Apache Cassandra is a database known for fault tolerance and linear scalability. It runs on commodity hardware or cloud infrastructure, which makes it a strong platform for mission-critical data.

Introduction:
This blog gives an overview of Apache Cassandra, a non-relational database, and discusses its components to explain how the database operates and manages data.

Overview:
An organization that needs scalability and high availability for its day-to-day operational data, without compromising database performance, can benefit from using Apache Cassandra.

The Apache Cassandra database supports replication across multiple geographic locations and provides lower latency for your users while guaranteeing that any regional outage will not impact the entire database system.

Cassandra is an open source, distributed and decentralized storage system (database). It is used for managing very large amounts of structured data spread across many nodes and locations. It provides a highly available service with no single point of failure. It is a type of NoSQL database.

Some of the unique facts about Apache Cassandra are:

  • Apache Cassandra was originally developed at Facebook and later became a top-level Apache Software Foundation project. It differs sharply from relational database management systems.
  • It is a wide-column (column-family) store, often loosely described as a column-oriented database.
  • Apache Cassandra implements a Dynamo-style replication model with no single point of failure, and adds a more powerful “column family” data model.
  • Apache Cassandra has been used by some of the biggest companies, including Facebook, GitHub, GoDaddy, Instagram, Cisco, Rackspace, eBay, Twitter and Netflix.
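
The Dynamo-style replication model mentioned above can be pictured as a ring of tokens. The following is a minimal sketch, not Cassandra's actual implementation: the node names, the Ring class and the use of MD5 as the hash are assumptions made only for illustration. It shows how a key is hashed onto the ring and stored on the next few nodes clockwise, so that no single node is the only owner of the data.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is only an illustrative hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring: one token per node, replicas are successors."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sorting the (token, node) pairs arranges the nodes around the ring.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the nodes holding a key: first owner clockwise, then successors."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # three distinct nodes; which ones depends on the hash values
```

Because every node can compute the same ring, any node can work out where a key lives, which is one way to read the "decentralized" property described earlier.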

Features of Cassandra:

  • Elastic scalability: It is highly scalable, so you can add hardware as required.
  • Always-on architecture: It has no single point of failure and is continuously available for business-critical applications.
  • Fast linear-scale performance: It is linearly scalable, i.e., throughput increases as you increase the number of nodes in the cluster.
  • Transaction support: Cassandra does not provide full ACID (Atomicity, Consistency, Isolation, Durability) transactions in the relational sense. Writes are atomic, isolated and durable within a single partition, and consistency is tunable per operation.
  • Fast writes: Writes are appended to the commit log and applied to an in-memory mem-table, so they stay fast even on the cheap commodity hardware Cassandra was designed to run on.
  • Easy data distribution: It provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
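
The "no single point of failure" claim depends on how many copies of each piece of data exist. As a rough sketch of the arithmetic, assume requests are made at Cassandra's QUORUM consistency level, a setting not otherwise covered in this post. The function names below are hypothetical helpers, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 3, 5):
    print(rf, quorum(rf), tolerated_failures(rf))
```

With a replication factor of 3, two replicas form a quorum, so one replica can be unavailable without failing such requests.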

Architecture – Apache Cassandra

[Figure: Apache Cassandra architecture]

Given below are the key components of Cassandra:

  • Node: The place where data is stored.
  • Data centre: A collection of related nodes.
  • Commit log: The crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
  • Cluster: A component that contains one or more data centres.
  • Mem-table: A memory-resident data structure. After a write is recorded in the commit log, the data is written to the mem-table. Sometimes a single column family has multiple mem-tables.
  • SSTable: A disk file to which data is flushed from the mem-table when its contents reach a threshold value.
  • Bloom filter: A quick, probabilistic data structure for testing whether an element is a member of a set. It can return false positives but never false negatives. It acts as a special kind of cache: Cassandra consults Bloom filters on reads to skip SSTables that cannot contain the requested data.
  • Compaction: The process of freeing up space by merging accumulated data files. During compaction, the data is merged, indexed, sorted, and stored in a new SSTable. Compaction also reduces the number of seeks required for reads.
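
To see how the commit log, mem-table, SSTable, Bloom filter and compaction fit together, here is a toy in-memory sketch. It is a simplification under stated assumptions, not how Cassandra is implemented: the class names (BloomFilter, SSTable, TinyStore), the flush threshold and the hash choice are all hypothetical, and real details such as on-disk formats, indexes and deletes are left out.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable, sorted copy of a flushed mem-table plus its Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


class TinyStore:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}
        self.sstables = []     # oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for crash recovery
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(SSTable(self.memtable))  # 3. flush to a new SSTable
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            # The Bloom filter lets us skip tables that cannot hold the key.
            if table.bloom.might_contain(key) and key in table.rows:
                return table.rows[key]
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:  # later tables overwrite earlier ones
            merged.update(table.rows)
        self.sstables = [SSTable(merged)]


store = TinyStore(flush_threshold=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(k, v)
tables_before = len(store.sstables)
store.compact()
tables_after = len(store.sstables)
```

In this run the four writes produce two SSTables, and compaction merges them into one while keeping the newest value for each key.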

Installation: It is easy to install an Apache Cassandra database.

Configure the Cassandra database:

Change at least the following parameters in the /etc/cassandra/conf/cassandra.yaml file to configure the Cassandra database.

  • cluster_name: 'ClientName_CC_Lifecycle_Project' --> where the environment may be Dev, Test or Prod.
  • data_file_directories: /css_data/data --> this directory holds the database data files. In cassandra.yaml this parameter is written as a YAML list.
  • commitlog_directory: /css_data/commitlog
  • saved_caches_directory: /css_data/saved_caches
  • authenticator: PasswordAuthenticator --> this parameter enables password authentication for the database.
  • MAX_HEAP_SIZE="1G" --> JVM heap size. This is set in cassandra-env.sh rather than in cassandra.yaml.
  • HEAP_NEWSIZE="250M" --> size of the JVM young generation, also set in cassandra-env.sh.

To start the database, type: cassandra

To check the status of the Cassandra cluster, type: nodetool status
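
The status command prints one line per node, starting with a two-letter code: the first letter is the status (U for Up, D for Down) and the second is the state (N for Normal, L for Leaving, J for Joining, M for Moving). As a small sketch of how that output could be checked from a script, the function below parses such text. The sample output, addresses and host IDs are made up for illustration, and parse_nodetool_status is a hypothetical helper, not part of Cassandra.

```python
# Made-up sample in the general shape of `nodetool status` output.
SAMPLE_OUTPUT = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.1   103.5 KiB  256     66.7%             11111111-1111-1111-1111-111111111111  rack1
DN  10.0.0.2   98.2 KiB   256     66.6%             22222222-2222-2222-2222-222222222222  rack1
"""

STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}


def parse_nodetool_status(text):
    """Return (address, status, state) for each node line in the given text."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Node lines start with a two-letter status/state code such as "UN".
        if len(code) == 2 and code[0] in STATUS and code[1] in STATE:
            nodes.append((fields[1], STATUS[code[0]], STATE[code[1]]))
    return nodes


nodes = parse_nodetool_status(SAMPLE_OUTPUT)
down = [address for address, status, _ in nodes if status == "Down"]
print(nodes)
print(down)
```

Against a live cluster, the text could instead come from running the command through Python's subprocess module and reading its standard output.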

(Note: Once the Cassandra database is up, we can create a keyspace as per our requirements.)

Note: Although you can install Apache Cassandra by following the steps outlined above, further database configuration is required to fine-tune the database for your workload.

Conclusion:
To handle big data workloads, a massively scalable NoSQL database is a major requirement. While there are a number of NoSQL databases available in the market, Apache Cassandra provides linear-scale performance and key enterprise-class features that set it apart from the others in meeting the requirements of big data systems. For any questions on the topic, feel free to reach out to me by clicking below:

Ask Amit