Apache Cassandra: best practices in the search for performance?!
Improving the performance of Apache Cassandra: best practices and a little more! :)
Cassandra is a NoSQL database developed to ensure rapid scalability and high availability of data. It is open source and maintained mainly by the Apache Software Foundation and its community.
Its main features are:
“Decentralization”: all nodes have the same functionality.
“Resilience”: several nodes replicate data; it also supports replication by multiple data centers.
“Scalability”: adding new nodes to the cluster is fast and does not affect the system's performance; there are systems that use Cassandra with thousands of nodes today.
According to Apache, we can look at Cassandra as being: “The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.”
This publication summarizes some possibilities and good practices to improve Cassandra’s performance, and presents an alternative to Cassandra.
Good habits:
As with any NoSQL database, there are points we have to take into account. According to the website www.howtouselinux.com, we can look at the following:
Do not put load balancers in front of Cassandra: Cassandra already distributes the data between the nodes, and most Cassandra drivers have an integrated algorithm to route requests properly. Adding a load balancer introduces an additional layer, potentially breaking the intelligent routing used by the driver and introducing a single point of failure where none exists in the Cassandra world.
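As an illustration, here is a minimal sketch with the DataStax Python driver (the node addresses and the data center name “dc1” are assumptions; newer driver versions may prefer execution profiles): the driver’s own token-aware, DC-aware policy already routes each request to a replica node, which is exactly the job an external load balancer would get in the way of.

```python
from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Let the driver do the routing: token-aware on top of DC-aware round-robin.
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],        # hypothetical nodes
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")      # keep requests in the local DC
    ),
)
session = cluster.connect()
```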
Avoid secondary indexes whenever possible. In Cassandra, secondary indexes are local, so queries that use a secondary index can cause performance problems in the cluster if multiple nodes are accessed. Using a secondary index occasionally on a column with low cardinality (for example, a column with a few hundred unique state names) is fine, but in general try to avoid it, and never use it on high-cardinality columns.
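A hedged illustration (the `users` table and its columns are hypothetical): an occasional index on a low-cardinality column is tolerable, while an index on a high-cardinality column is the pattern to avoid.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

# Occasionally acceptable: a low-cardinality column (a few dozen distinct values).
session.execute("CREATE INDEX IF NOT EXISTS users_by_state ON users (state)")

# Anti-pattern (kept commented out): a high-cardinality column. Because
# secondary indexes are local to each node, this query fans out across the cluster.
# session.execute("CREATE INDEX users_by_email ON users (email)")
# session.execute("SELECT * FROM users WHERE email = 'someone@example.com'")
```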
Avoid full table scans. Cassandra distributes partitions among all nodes in the cluster, and a large cluster can hold billions of rows. A complete scan of a table involving all records can therefore cause network bottlenecks and extreme heap pressure. Instead, adjust your data model so that you don’t have to scan the whole table.
Keep partition sizes within 100 MB. The hard practical limit is two billion cells per partition, but you should try to keep the size of your partitions within 100 MB. Enormous partitions create considerable pressure on the heap and make compaction slow.
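One common way to keep partitions bounded, sketched below with a hypothetical sensor-readings table, is to add a bucket (here, the day) to the partition key so that a single logical key never grows without limit.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

# Bucketing the partition key by day bounds the size of each partition:
# one (sensor_id, day) pair holds at most one day of readings.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id  text,
        day        date,
        reading_ts timestamp,
        value      double,
        PRIMARY KEY ((sensor_id, day), reading_ts)
    )
""")
```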
Do not use batches for bulk loading. Batches should be used when you want to keep a set of denormalized tables in sync. Do not use them for bulk loading (especially when multiple partition keys are involved), as this puts significant pressure on the coordinator node and hurts performance.
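For bulk loads, a common alternative (a sketch using the Python driver’s concurrency helper; the table and the generated rows are hypothetical) is to send many individual prepared inserts concurrently, so each write is routed straight to its own replicas instead of piling up on one coordinator.

```python
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
rows = [(i, f"user-{i}") for i in range(10_000)]     # hypothetical payload

# Many small, independent writes instead of one huge multi-partition batch.
execute_concurrent_with_args(session, insert, rows, concurrency=100)
```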
Take advantage of prepared statements when possible. Use a prepared statement when running a query with the same structure multiple times. Cassandra parses the query string once and caches the result; the next time you run the query, you only bind the variables to the cached prepared statement. This increases performance by skipping the parsing phase for each query.
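A minimal Python-driver sketch (the `users` table is hypothetical): the statement is parsed and cached once by `prepare()`, and each later execution only sends the bound values.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

# Parsed by Cassandra once...
get_user = session.prepare("SELECT name, email FROM users WHERE id = ?")

# ...then executed many times, binding only the values.
for user_id in (1, 2, 3):
    print(session.execute(get_user, [user_id]).one())
```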
Avoid IN-clause queries with many values spanning multiple partitions. An IN-clause query with a large number of values across multiple partitions places significant pressure on the coordinator node, and if the coordinator fails to process the query due to excessive load, everything must be retried. Prefer separate queries in these cases, to avoid a single point of failure and an overloaded coordinator.
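A sketch of the alternative (same hypothetical `users` table): one query per partition key, issued asynchronously, so each request goes to its own replicas rather than through a single coordinator.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

get_user = session.prepare("SELECT * FROM users WHERE id = ?")
user_ids = [1, 2, 3, 4, 5]                           # hypothetical keys

# Fire the per-partition queries asynchronously and gather the results.
futures = [session.execute_async(get_user, [uid]) for uid in user_ids]
results = [future.result() for future in futures]
```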
Do not use SimpleStrategy in multi-datacenter environments. SimpleStrategy is meant for single data center environments: it places replicas on the next nodes clockwise around the ring without considering rack or data center location, so it should not be used in environments with multiple data centers.
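A sketch of a keyspace created for two hypothetical data centers, “dc1” and “dc2”: NetworkTopologyStrategy places replicas per data center, which SimpleStrategy cannot do.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()            # hypothetical node

# Replication defined per data center instead of a single ring-wide factor.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
```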
Prefer local consistency levels in environments with multiple data centers. If the use case allows, always prefer local consistency levels (such as LOCAL_QUORUM) in multi-datacenter environments: only local replicas are consulted to prepare the response, which avoids the latency of communication between data centers.
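A sketch with the Python driver (the `shop.orders` table is hypothetical): LOCAL_QUORUM waits only for replicas in the local data center, so the read does not pay for a cross-DC round trip.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect()            # hypothetical node

# Only replicas in the local data center are consulted to satisfy the quorum.
query = SimpleStatement(
    "SELECT * FROM shop.orders WHERE order_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(query, ("42",)).one()
```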
Avoid queue-like data models. These data models generate many tombstones, and a slice query that scans through many tombstones to find a match is far from ideal: it causes latency problems and increases heap pressure, as it scans a lot of discarded data to find the small amount of data actually needed.
Avoid the read-before-write pattern. Reads of non-cached data may require multiple disk accesses, which considerably reduces write throughput by turning Cassandra’s otherwise sequential write path into random I/O.
Try to keep the total number of tables in a cluster within a reasonable range. Too many tables in a cluster cause excessive heap and memory pressure. Because several factors are involved, it is difficult to pin down a single good number, but experience from many tests suggests keeping the number of tables within 200 (warning level) and never crossing 500 tables (failure level).
Choose the leveled compaction strategy for read-heavy workloads if enough I/O is available.
Assuming nearly uniform rows, the leveled compaction strategy ensures that 90% of all reads are satisfied from a single SSTable, which makes it great for read-heavy and latency-sensitive use cases. Of course, it compacts more often, requiring more I/O during compaction.
Note: it is always a good idea to choose the leveled compaction strategy when creating the table. Once the table is created, changing the strategy later becomes a little tricky: it can be done, but do it carefully, as it can overload your nodes with a lot of I/O.
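A sketch of creating a hypothetical read-heavy table with leveled compaction from the start, precisely because switching the strategy later triggers a lot of compaction I/O.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

# Leveled compaction chosen at table-creation time for a read-heavy table.
session.execute("""
    CREATE TABLE IF NOT EXISTS products (
        product_id text PRIMARY KEY,
        name       text,
        price      decimal
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")
```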
Keep multi-partition batches within 5 KB. Large batches can cause a significant performance penalty by overloading the coordinator node, so try to keep the batch size to about 5 KB per batch when batching across multiple partitions.
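For contrast with the bulk-load warning above, a sketch of the kind of batch Cassandra is designed for: a tiny logged batch that keeps two hypothetical denormalized tables in sync, with a payload far below the ~5 KB guideline.

```python
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["10.0.0.1"]).connect("shop")      # hypothetical node / keyspace

insert_by_id    = session.prepare("INSERT INTO users_by_id (id, email) VALUES (?, ?)")
insert_by_email = session.prepare("INSERT INTO users_by_email (email, id) VALUES (?, ?)")

# A small logged batch keeping two denormalized views of the same user in sync.
batch = BatchStatement()
batch.add(insert_by_id, (42, "someone@example.com"))
batch.add(insert_by_email, ("someone@example.com", 42))
session.execute(batch)
```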
These are some of the points we have to pay attention to when using Cassandra. Do you know another one? Do you agree with the points raised?
Looking at performance:
The most common complaint from professionals who use Cassandra daily is related to its performance. To deal with this, we can, whenever possible, adopt Apache Spark to run queries in parallel, paying attention to the points below (a sketch with the Spark Cassandra Connector follows the list):
Always try to reduce the amount of data sent over the network.
Filter the data as early as possible, so that you don’t process data that will be discarded later.
Define the right number of partitions.
Avoid data skew.
Configure the partitioning (repartition) before expensive or multiple joins.
When writing to Cassandra, never repartition the data yourself before writing it to storage; with the Spark Cassandra Connector this is done automatically and in a much more performant way.
Define the right number of executors, cores, and memory.
Study and define the serialization that Spark will use.
Whenever you use Spark SQL, take advantage of the Catalyst optimizer!
Use the correct file format: Avro for row-level operations and Parquet or ORC for column-based operations.
Always test these and other optimizations. As a tip, whenever possible use an identical environment, a clone of production, to serve as a laboratory! :)
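As promised above, a minimal sketch with PySpark and the Spark Cassandra Connector (the connector is assumed to be on the classpath; the “shop” keyspace, its tables, and the node address are hypothetical), showing early filtering, column pruning, and writing back without manual repartitioning:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("cassandra-report")
    .config("spark.cassandra.connection.host", "10.0.0.1")   # hypothetical node
    .getOrCreate()
)

orders = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")
    .load()
    # Filter and select as early as possible: simple predicates and column
    # pruning are pushed down to Cassandra, so less data crosses the network.
    .filter(F.col("status") == "OPEN")
    .select("order_id", "customer_id", "total")
)

summary = orders.groupBy("customer_id").agg(F.sum("total").alias("total_spent"))

# Write back through the connector: no manual repartition before the write,
# the connector groups rows by partition key on its own.
(
    summary.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="customer_totals")
    .mode("append")
    .save()
)
```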
If none of this made the difference that you expected or needed, then the time has come to look at adopting another database like …
ScyllaDB
“ The Real-Time Big Data Database”
ScyllaDB is implemented in C++ and exposes Apache Cassandra’s Cassandra Query Language (CQL) interface, with the same horizontal scalability and fault tolerance. It is also built around algorithms that make better use of computational resources, that is, it will always use 100% of the available resources to deliver its unbelievable performance, in addition to handling compaction and compression of data automatically.
With this database, we have the possibility of integration with:
Apache Spark, a framework for data processing;
Apache Kafka, an event streaming platform;
Datadog, a service for monitoring cloud solutions;
Akka, an extremely resilient Scala framework for message-driven applications;
Presto, an SQL query engine;
Apache Parquet, a popular format for columnar storage.
Together with ScyllaDB, we can also use some data stores such as:
Comparing ScyllaDB with Cassandra, we can notice:
It occupies 10 times less space than Cassandra.
Much fewer latency spikes.
Up to 11x improvement in 99th-percentile latency.
It costs 2.5 times less than Cassandra in the same environment (AWS EC2 bare metal).
Possibility of deploying in container environments (Kubernetes, OpenShift, Rancher, clouds …) natively with an Operator.
Don’t believe it? Then take a look at the benchmarks:
This was a publication about Apache Cassandra: good usage practices, some ways to improve its performance, and an alternative, ScyllaDB, with a comparison between them.
References:
Follow me on Medium :)