Apache Hive in your veins!
Discover Apache Hive, its power, and more!! :)
Big data
More and more, we have to deal with large Volumes of data that are created and need to be used at unbelievable Velocity, with huge Variety that is almost impossible for a human being to keep up with, while we remain concerned with their Veracity and able to add Value to the business effectively (the 5 Vs of Big Data).
To deal with this, the term “Big Data” emerged, along with several solutions for handling these problems in different scenarios, such as Apache Hive.
Hive
According to IBM, “Apache Hive is open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in the Apache Hadoop Distributed File System (HDFS) or in other data storage systems, such as Apache HBase. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements for querying and analyzing data. It was designed to make MapReduce programming easier, because you don’t need to know and write extensive Java code. Instead, you can write queries more simply in HQL, and Hive can create the map and reduce functions.”
As with any database management system (DBMS) today, it can be accessed via commands in a command-line interface, via a JDBC or ODBC connection, or through a custom driver/connector.
In addition to being based on the SQL we are already used to seeing in other databases, its query language (HiveQL) also integrates Hadoop functions; with that, we have the possibility of using MapReduce, for example.
Looking for performance with HiveQL, we can use files in the RCFile, Avro, ORC, or Apache Parquet formats, enable vectorization, serialize or deserialize the data, identify the workload in our queries, use skew joins, use concurrent connections or cursors, and adopt the Tez execution engine.
These are just a few alternatives, and there is much more; the sketch below shows how a couple of these settings can be applied.
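Below is a minimal sketch of applying some of these options. It assumes a running HiveServer2 reachable through the PyHive library (covered later in this article); the table is hypothetical, and the setting names, taken from the Hive documentation, should be verified against your Hive version.

```python
# Minimal sketch: apply a few Hive performance options over a DB-API
# connection. Host, port, and the table are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)  # adjust to your cluster
cursor = conn.cursor()

# Store the table in a columnar format (ORC) so vectorized reads are possible.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_orc (id INT, amount DOUBLE)
    STORED AS ORC
""")

# Enable vectorized query execution (processes rows in batches).
cursor.execute("SET hive.vectorized.execution.enabled=true")

# Use the Tez execution engine instead of plain MapReduce.
cursor.execute("SET hive.execution.engine=tez")

# Let Hive handle skewed join keys with a separate job.
cursor.execute("SET hive.optimize.skewjoin=true")
```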
Because it is a Hadoop-based solution, it is widely used in integration with other solutions in this ecosystem, such as Apache Spark, often serving as part of a pipeline to implement data extraction (Extract), transformation (Transform), and loading (Load).
Spark SQL
Most of the time, users want to take the data being processed in Spark and record it in Hive, or vice versa; for that, we can configure Spark or create a new session to establish this connection, as in the sketch below.
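Here is a minimal sketch of that round trip, assuming Spark was built with Hive support and can reach the Hive metastore; the table names are hypothetical.

```python
# Minimal sketch: read a Hive table into Spark, process it, and record
# the result back in Hive.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("hive-integration")
    .enableHiveSupport()  # lets Spark use the Hive metastore and HiveQL
    .getOrCreate()
)

# Read data managed by Hive...
df = spark.sql("SELECT * FROM default.sales_orc")

# ...process it in Spark, then save the result as a new Hive table.
summary = df.groupBy("id").agg(F.sum("amount").alias("total_amount"))
summary.write.mode("overwrite").saveAsTable("default.sales_summary")
```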
Many people do not like to use Spark SQL because of its performance when manipulating data, since it is generally used in conjunction with DataFrames; to try to optimize this, we can make use of PyArrow.
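A minimal sketch of turning Arrow on is shown below. The configuration key applies to Spark 3.x (older releases used “spark.sql.execution.arrow.enabled”), and it assumes the pyarrow package is installed alongside PySpark; the table name is again hypothetical.

```python
# Minimal sketch: enable Apache Arrow to speed up Spark <-> pandas transfers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Use Arrow's columnar format when converting between Spark and pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# With Arrow enabled, this conversion avoids row-by-row serialization.
pandas_df = spark.sql("SELECT * FROM default.sales_orc").toPandas()
```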
Spark runs on the Java Virtual Machine (JVM), and with it comes the villainous garbage collector (the manager of resource allocation and deallocation) and its unwanted behavior when subjected to multiple processors and large amounts of data. Newer versions of Java bring different garbage-collector implementations that can be used during Spark code execution to improve performance.
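One way to pick a collector is shown in the sketch below. It requests the G1 collector, which tends to behave better with many cores and large heaps; treat the flag as a starting point only, since the right choice depends on your JVM version and workload.

```python
# Minimal sketch: ask executor JVMs to use the G1 garbage collector.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning")
    # Driver-side options are usually passed on spark-submit instead,
    # since the driver JVM is already running when this code executes.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)
```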
NOTE: Before using them, check the software versions and the compatibility between them!
PyHive
To use Hive with Python, in addition to the possibility of making a JDBC connection, we can use the PyHive library; besides further simplifying the use of Hive, it gives us the option of applying cursors to work with large volumes of data or of using the Presto (or PrestoDB) interface.
NOTE: Make sure your operating system has the necessary libraries to connect to Hive!
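Here is a minimal sketch of querying Hive with PyHive and fetching a large result set in batches through a cursor; the host, port, username, and query are illustrative assumptions.

```python
# Minimal sketch: query Hive via PyHive, streaming rows through a cursor.
from pyhive import hive

connection = hive.connect(host="localhost", port=10000, username="hadoop")
cursor = connection.cursor()
cursor.execute("SELECT id, amount FROM default.sales_orc")

# fetchmany() pulls rows in chunks, keeping memory usage bounded even
# for very large result sets.
while True:
    rows = cursor.fetchmany(size=10_000)
    if not rows:
        break
    for row in rows:
        ...  # process each row here

# The same library also exposes a Presto interface:
# from pyhive import presto
# presto_conn = presto.connect(host="localhost", port=8080)
```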
Hive on Docker
With the arrival of DataOps initiatives in the data field, many solutions are starting to gain adoption in container environments. This type of implementation abstracts away much of the configuration-management work and can be used to build infrastructure quickly, easily, and effectively.
Imagine the difficulty of configuring 1,000 nodes with Hive by hand; with Terraform, Docker, and Ansible, you can pull it off. Will you give it a try? …
Some examples of Hive implementations with Docker:
References:
Follow me on Medium :)