Apache Arrow integration for accelerated analytics

Cloudera engineers have been collaborating for years with open-source engineers to take advantage of Apache Arrow for columnar in-memory processing and interchange. The integration of Apache Arrow in Cloudera Data Platform (CDP) works with Hive to improve analytics performance.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory-layout permitting random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms which process Arrow data structures.

  • Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.

  • A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.

Arrow isn’t a standalone piece of software but rather a component used to accelerate analytics within a particular system and to allow Arrow-enabled systems to exchange data with low overhead. It is sufficiently flexible to support most complex data models.

Arrow improves the performance for data movement within a cluster in these ways:

  • Two processes utilizing Arrow as their in-memory data representation can relocate the data from one process to the other without serialization or deserialization. For example, in Cloudera Data Platform, executors run tasks that read Arrow data from LLAP when you run hive.executeQuery("SELECT * FROM t").sort("A").show(). The Hive Warehouse Connector returns ArrowColumnVectors to Spark.
  • Arrow data can be received from Arrow-enabled database-like systems without costly deserialization on receipt. For example, LLAP demons can send Arrow data to Hive for analytics purposes.