@284km
Last active November 28, 2018 05:21
[Unofficial English translation] The latest information on Apache Arrow (September 5, 2018) by Sutou-san

Original article

The latest information on Apache Arrow (September 5, 2018)

Do you know Apache Arrow? Most people have probably never heard of it, or know only the name or the general idea; few have actually used it. Apache Arrow is a project that will become an important component of data processing within a few years. It is worth knowing about if you are interested in data processing, so I will introduce the latest information as of September 2018.

I am the only Japanese member of the PMC, and I have the third-largest number of commits, so I am probably as familiar with Apache Arrow as anyone in Japan. Since there is not much information on Apache Arrow in Japanese, I am introducing it in Japanese. There is, however, plenty of information in English: useful sources include the official Apache Arrow blog, the official mailing lists, and the blogs and presentations of individual developers. This article links to those sources along the way, so please make use of the English information as well.

What Apache Arrow does

What Apache Arrow aims for is efficient in-memory data processing.

"Efficient" has two perspectives. One is "performance". Faster is that it is more efficient. The other is "implementation cost". The lower the implementation cost, the more efficient it is.

Performance

Let's start with "performance".

Apache Arrow assumes that you are processing large amounts of data, so when thinking about performance, assume large data volumes.

You can speed up large-scale data processing by removing bottlenecks and by optimizing the parts where performance can be improved.

The bottleneck Apache Arrow focuses on is data exchange.

For optimization, Apache Arrow focuses on parallel processing.

I will explain each in turn.

Data Exchange

First, solving the bottleneck. The processing in question is data exchange.

When handling large amounts of data, multiple systems often work together to process it: for example, a system that collects data, a system that persists it, a system that preprocesses it, and a system that runs computations on it. Data must be exchanged between these systems, and the cost of serializing and deserializing it each time is considerable. For example, if you exchange data as JSON, the cost of formatting the data into JSON and the cost of parsing the JSON back into program objects are both significant.

Therefore, Apache Arrow defines a specification (the Apache Arrow format) with very low serialization/deserialization cost that can be used across many systems. The basic idea is "do not parse": by laying out data in a form that a program can use directly, the data can be processed without parsing. This also makes serialization efficient; there is almost no serialization cost because the data the program is already working with can be written out as-is. Incidentally, FlatBuffers is used to realize this no-parse format.
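
To make this concrete, here is a minimal sketch using pyarrow (the Python binding of the C++ implementation); the column names and values are illustrative. Writing a record batch to the Arrow stream format is essentially writing the existing column buffers, and reading it back does not parse individual values.

import pyarrow as pa

# Build a record batch (columnar data) in memory.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    ["id", "name"])

# Serialize: mostly just writing out the existing buffers.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()

# Deserialize: the batch is usable without value-by-value parsing.
reader = pa.RecordBatchStreamReader(pa.BufferReader(buf))
restored = reader.read_next_batch()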

The benefit of Apache Arrow appears especially when every system involved supports the Apache Arrow format. If one system does not support it, you cannot use the Apache Arrow format when exchanging data with that system; in other words, exchanging data with that system stays expensive.

To realize performance improvements by reducing data exchange cost, the Apache Arrow format must be usable widely. For this reason, development is under way to handle the Apache Arrow format in more languages. The languages that can currently read and write the Apache Arrow format are as follows (in alphabetical order).

  • C
  • C++
  • Go
  • Java
  • JavaScript
  • Lua
  • MATLAB
  • Python
  • R
  • Ruby
  • Rust

The most complete implementations are C++ and Java. C, Lua, Python, and Ruby, which are implemented as bindings of the C++ implementation, are also highly complete. MATLAB and R are also C++ bindings, but their development has just begun, so they are not very complete yet. The JavaScript, Go, and Rust libraries are implemented natively in those languages rather than as bindings; they support the major types (e.g., numeric types), but some types are not yet supported. A Julia implementation is also in progress. The implementation status for each language is described in detail later.

Parallel processing

Next, optimizing where performance can be improved. The target is parallel processing.

By doing either or both of the following, you can process large amounts of data with high performance.

  • Process each piece of data faster
  • Process multiple pieces of data at the same time (parallel processing)

The Apache Arrow format lays data out with as much locality as possible. Locality makes it easier to exploit the CPU cache, which helps "process each piece of data faster". Aligning data boundaries to 64 bits in addition to localizing them makes it easier to use SIMD effectively. With SIMD, multiple values can be processed with a single instruction, which helps "process multiple pieces of data at the same time".

Apache Arrow's data layout is designed to perform better for OLAP (OnLine Analytical Processing, e.g., data analysis) than for OLTP (OnLine Transaction Processing, e.g., purchases on an e-commerce site). OLTP and OLAP process data differently, so the layout that makes fast processing easy differs too. For those familiar with RDBMSs: OLTP is dominated by row-oriented operations (such as INSERT and DELETE), while OLAP is dominated by column-oriented operations (such as COUNT and GROUP BY). Row-major layouts make OLTP easier to make fast, and column-major layouts make OLAP easier to make fast. Since Apache Arrow targets OLAP, it stores data column by column. This kind of layout is called "columnar".
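
As a small illustration (a pyarrow sketch; the column names and values are made up), a table stores each column as its own contiguous array, so an OLAP-style aggregate over one column scans one contiguous region instead of striding across whole rows.

import pyarrow as pa

# Two columns, each stored contiguously in memory.
table = pa.Table.from_arrays(
    [pa.array([100, 250, 80]), pa.array(["book", "pen", "pen"])],
    ["price", "item"])

# Summing "price" only touches the price column's memory.
price = table.column(0)
total = sum(price.to_pylist())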

Summary of "performance"

The Apache Arrow data format is designed so that the following can be done quickly when computing over large amounts of data.

  • Data Exchange
  • Parallel processing

It is designed to reduce the serialization/deserialization cost of data exchange, and libraries are being developed so that the format can be read and written from more languages.

For parallel processing, the data layout is carefully devised: data is arranged so that recent CPUs/GPUs can process it with high performance.

Implementation cost

Apache Arrow aims to process in-memory data efficiently, where "efficiency" covers both "performance" and "implementation cost". So far I have explained "performance"; next is "implementation cost".

There is a lot of software that processes data in memory, and for much of it performance matters. Software that needs high performance ends up implementing all sorts of speed-up techniques, and that takes time.

Apache Arrow provides, as a library, the functionality needed to process large amounts of in-memory data with high performance. By using Apache Arrow as shared infrastructure, developers can avoid repeating similar work: the important parts that many data processing tools have in common can be developed cooperatively and shared.

Recently, there have also been moves to bring such shared processing into Apache Arrow itself; the Gandiva execution engine described later is one example.

Summary of what Apache Arrow realizes

Apache Arrow aims to process large amounts of in-memory data efficiently. To that end, it does the following.

  1. Define the specification of the Apache Arrow format, which is easy to exchange and fast to process
  2. Develop libraries for reading and writing the Apache Arrow format in many languages
  3. Develop libraries for computing over large amounts of in-memory data with high performance

That is what Apache Arrow does. Next, I will explain what it is suited for, its current status, and its future direction.

Suitable use of Apache Arrow

Apache Arrow is not a panacea. It has strengths and weaknesses, so use it where it fits.

Apache Arrow is designed for use cases where multiple systems cooperate to compute over large amounts of data. It is therefore suited to the following uses.

  • Exchanging large amounts of data
  • Analytical processing of large amounts of in-memory data

Because the format has very low serialization/deserialization cost, it is well suited to "exchanging large amounts of data".

It is suitable for "analysis process of large amount of data on memory" in that data allocation and processing suitable for high performance processing are provided in the library.

If you don't care about file size, it is also useful as a temporary cache of processing results. If file size matters, the Apache Parquet format is a better fit.

On the other hand, it is not suited to persistence. For example, it is not a good format for storing logs or for storing a database's data.

For persistence formats, size is often the critical factor: the smaller the size, the more data you can store, and the less I/O is needed to read and write it. Because Apache Arrow prioritizes low serialization/deserialization cost, it cannot do much to reduce size. Moreover, compressing the data with Zstandard or the like erodes one of Apache Arrow's main merits: "zero copy".

Zero copy means not copying data. The Apache Arrow format is designed so that data can be computed on directly, without copying. Specifically, it is designed as follows.

  • Lay data out so that a program can process it directly and efficiently
  • Treat data as basically read-only

If the data layout can be used directly and efficiently by a program, the data can be handled as-is; using the memory-mapping facility provided by the OS, it can be handled with zero copy. However, if such directly usable data is compressed, it cannot be used until it is decompressed: it is no longer "a layout that can be handled directly". A new region must be allocated to hold the decompressed data, which behaves like a copy. Since Apache Arrow assumes large amounts of data, this decompression cost often cannot be ignored. That said, when exchanging data over a network, the transfer cost can exceed the compression/decompression cost, in which case compressed data wins; it depends on the application. (Compression is not supported in the format yet, although the code for handling Zstandard and LZ4 is already implemented; see ARROW-300 [Format] Add buffer compression option to IPC file format.)
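
Here is a hedged sketch of zero-copy reading with pyarrow; the file path is made up. An Arrow file is written first so the example is self-contained, then memory-mapped, and the record batch read from it references the mapped bytes directly instead of copying them.

import pyarrow as pa

# Write a small Arrow file so the example is self-contained.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
with pa.OSFile("/tmp/example.arrow", "wb") as f:
    writer = pa.RecordBatchFileWriter(f, batch.schema)
    writer.write_batch(batch)
    writer.close()

# Memory-map the file and read: the arrays point into the mapping,
# so no per-value parsing or copying happens.
mmap = pa.memory_map("/tmp/example.arrow", "r")
reader = pa.RecordBatchFileReader(mmap)
restored = reader.get_batch(0)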

Read-only data also relates to zero copy. If data might be modified, then processing it safely requires either copying it or using exclusive locking. Since Apache Arrow assumes large amounts of data, the cost of copying cannot be ignored; and with exclusive locking, the degree of parallelism tends to drop and performance suffers. Therefore, Apache Arrow is designed to treat data as read-only.

In this way, Apache Arrow is implemented with great care for zero copy. Remember that "zero copy" is a recurring keyword in Apache Arrow.


I explained that Apache Arrow is suited to the following uses.

  • Exchanging large amounts of data
  • Analytical processing of large amounts of in-memory data

Next, I will explain the current status and what will happen in the future.

Current status of Apache Arrow

Here I describe the specification and the implementations of the Apache Arrow data format as of September 2018.

The data format specification is expected to be finalized within this year. Once it is, Apache Arrow 1.0 will be released. The latest version at the moment is 0.10.0, and 0.11.0 is scheduled for release in September.

The implementations for data exchange (for reading and writing the Apache Arrow format) are nearly complete, and the weight is shifting to processing Apache Arrow data with high performance.

Now for the details.

Supported data

First, the kinds of data Apache Arrow supports.

Apache Arrow supports the following kinds of data.

  • Data frame
  • Dense multidimensional array

I will explain each.

Data frame

Apache Arrow was originally developed to process "data frame" data. A data frame is tabular data, like an RDBMS table: a table has one or more columns, and each column can have a different type.

The types currently supported by Apache Arrow are as follows; data of these types can be processed.

  • Boolean value (1 bit)
  • Integer
    • 8 bit unsigned integer (little endian)
    • 8 bit integer (little endian)
    • 16 bit unsigned integer (little endian)
    • 16 bit integer (little endian)
    • 32 bit unsigned integer (little endian)
    • 32 bit integer (little endian)
    • 64 bit unsigned integer (little endian)
    • 64 bit integer (little endian)
  • Floating point number
    • 16 bit floating point number
    • 32 bit floating point number
    • 64 bit floating point number
  • Decimal (specify precision / scale)
  • Variable length UTF-8 character string
  • Binary data
    • Variable length binary data
    • Fixed-length binary data
  • Date
    • Number of days elapsed since UNIX epoch (32 bits)
    • Number of milliseconds elapsed since UNIX epoch (64 bits)
  • Time stamp (64 bit integer)
    • Number of elapsed seconds since UNIX epoch
    • Number of milliseconds since UNIX epoch
    • Number of microseconds since UNIX epoch
    • Number of nanoseconds since the UNIX epoch
  • Time
    • Number of seconds elapsed since midnight (32 bit integer)
    • Number of milliseconds elapsed since midnight (32 bit integer)
    • Number of microseconds elapsed since midnight (64 bit integer)
    • Number of nanoseconds elapsed since midnight (64 bit integer)
  • List (a type holding zero or more values of the same type)
  • Structure (a type with one or more fields; each field can have a different type)
  • Union (a type with one or more fields, each of which can have a different type; only one field's value is set)
  • Dictionary
    • In statistical terms: nominal-scale categorical data
    • In implementation terms: a type that assigns an integer ID to each distinct value and represents values by those IDs
    • For those who know scikit-learn: a type whose values are like the transform results of sklearn.preprocessing.LabelEncoder

Any type can also hold null.
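
As a brief pyarrow sketch of a few of these types (the values are made up), note that any of them can hold null:

import pyarrow as pa

ints = pa.array([1, None, 3], type=pa.int32())     # 32 bit integer
floats = pa.array([1.5, None], type=pa.float64())  # 64 bit floating point
strings = pa.array(["hello", None])                # variable length UTF-8
print(ints.null_count)                             # => 1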

Multidimensional array

Apache Arrow started out with data frames in mind, but there was feedback that being able to process multidimensional arrays would be nice, and support was added in Apache Arrow 0.3.0.

Reference: ARROW-550 [Format] Add a TensorMessage type

The differences between data frames and multidimensional arrays are as follows.

  • Number of dimensions:
    • A data frame is two-dimensional
    • A multidimensional array is n-dimensional
  • Type:
    • Each column of a data frame can have a different type
    • All element values of a multidimensional array have the same type

In Python, pandas provides data frames and NumPy provides multidimensional arrays.

Multidimensional arrays come in dense and sparse variants. A dense multidimensional array stores the values of all elements; a sparse multidimensional array stores only the nonzero elements.

For example, for the (one-dimensional) array [0, 0, 0, 1, 0], a dense representation holds all 5 element values, while a sparse representation holds only the information that the fourth element is 1 and the rest are 0.
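
In pyarrow, the dense case looks like the following sketch: the Tensor wraps a NumPy array and stores all five values.

import numpy as np
import pyarrow as pa

ndarray = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
tensor = pa.Tensor.from_numpy(ndarray)  # dense: all 5 values are stored
print(tensor.shape)                     # one dimension, 5 elements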

Apache Arrow currently supports only dense multidimensional arrays. Sparse multidimensional arrays are planned as well; people are currently being sought to help design a suitable data format, so if you have knowledge of sparse multidimensional arrays, please take a look.

Reference: ARROW-854 [C++] Support sparse tensor

Implementation of data processing part

Apache Arrow emphasizes developing high-performance data processing functionality that various data processing tools can share. Because the focus so far has been on the data format, not much of this exists yet, but here is what can be done at the moment.

The data processing part is implemented only in C++, JavaScript, and Go, so I will introduce it per language implementation.

Even where the data processing part is not implemented, processing is still possible: data loaded in the Apache Arrow format can be converted so that existing libraries can process it. This is somewhat inefficient because it requires copying and conversion, but since the Apache Arrow format makes data exchange cheaper, it is often still worthwhile.

For example, Apache Spark has greatly improved performance by using Apache Arrow for data exchange. Apache Spark has a feature called PySpark that lets you implement parts of your processing in Python. Since Apache Spark itself is implemented in Scala, data must be passed to Python running in a separate process; Apache Arrow data is passed over a socket. (Previously, data was transferred using pickle, which ships with Python.) The data received in Apache Arrow format is converted to pandas objects, and the actual processing uses pandas. Incidentally, depending on the type, the conversion from Apache Arrow data to pandas objects can be done with zero copy, making the exchange even more efficient.

Reference: Speeding up PySpark with Apache Arrow
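
Outside Spark, the same conversion is available directly in pyarrow; a minimal sketch with illustrative values:

import pyarrow as pa

table = pa.Table.from_arrays([pa.array([1.0, 2.0, 3.0])], ["x"])
df = table.to_pandas()  # for some types (e.g. floats without nulls)
                        # this conversion can avoid copying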

Let me introduce the data processing functions implemented so far, language by language.

C++

Processing implemented in C++ is as follows.

  • Cast (type conversion)
    • Example: convert a 16 bit integer to a 32 bit integer
    • Conversion to the dictionary type is also implemented
  • Element-wise logical NOT
  • Element-wise logical AND
  • Element-wise logical OR
  • Element-wise exclusive OR

For the list of kernels to be implemented, search for "kernel" in JIRA, Apache Arrow's project management tool.
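
Of these, the cast kernel is exposed through the Python binding; a hedged sketch, assuming your pyarrow version provides Array.cast and dictionary_encode (the boolean kernels were C++-level APIs at the time and are not shown):

import pyarrow as pa

int16s = pa.array([1, 2, 3], type=pa.int16())
int32s = int16s.cast(pa.int32())  # cast: 16 bit -> 32 bit integer

# Conversion to the dictionary type is also implemented.
dictionary = pa.array(["a", "b", "a"]).dictionary_encode()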

JavaScript

Processing implemented in JavaScript is as follows.

  • Element-wise logical NOT
  • Element-wise logical AND
  • Element-wise logical OR
  • Element-wise exclusive OR
  • Element-wise == comparison
  • Element-wise < comparison
  • Element-wise <= comparison
  • Element-wise > comparison
  • Element-wise >= comparison
  • Counting the number of occurrences of each value
    • Dictionary type only

Go

Processing implemented in Go is as follows.

  • Computing the sum of all elements

Only one operation is implemented so far, but its use of SIMD is interesting, so let me explain it a bit.

SIMD can only be used if the CPU supports it. Recent C/C++ compilers are smart enough to automatically generate SIMD code depending on CPU support, but Go's compiler cannot do such optimization yet. So the developers used Clang ahead of time to output SIMD assembly and incorporated it into the Go code. With this approach alone, however, the binary would include SIMD instructions that the running CPU might not support, so the usable SIMD level is detected at runtime and the appropriate SIMD implementation is invoked. For details, see the reference URL below, a blog post by the InfluxData developer who wrote the Go implementation.

Reference: InfluxData Working on Go Implementation of Apache Arrow | InfluxData

Incidentally, InfluxData is the company developing the time-series database InfluxDB. InfluxDB is implemented in Go, and InfluxData participates in Apache Arrow development in order to use Arrow with InfluxDB.

Summary of implementation of data processing part

It may be disappointing that so little of the data processing part is implemented yet, so let me also say a bit about what is coming.

Apache Arrow is considering not only per-element operations but also an execution engine that evaluates complex expressions with high performance. A module that is likely to become that implementation will be brought into Apache Arrow shortly.

That module is named Gandiva. Gandiva takes the description of a computation at runtime, compiles it with LLVM, and then executes it. Because it compiles, it runs about as fast as ahead-of-time compiled code, and since LLVM can also optimize for SIMD, it is SIMD-ready as well. For details, see the following blog post by Gandiva's developers.

Reference: Introducing the Gandiva Initiative for Apache Arrow - Dremio

Let me also explain the status behind "will be brought into Apache Arrow shortly".

For an external module to be imported into Apache Arrow, several steps are required: the PMC votes to accept the donation, IP clearance is carried out, and then the code is merged into the repository.

Gandiva has completed the PMC approval step.

Reference: [RESULT] [VOTE] Accept donation of Gandiva to Apache Arrow

Next is IP clearance; for that, a pull request needs to be created in the Apache Arrow repository, but that has not been done yet.

Since 0.11.0 is scheduled for release this month, I don't think Gandiva will make it into that release, but I expect it to land within a few months.

Plasma

Apache Arrow also has functionality for sharing objects within a single machine: a module named Plasma. Apache Arrow's mechanism for exchanging data frame data at low serialization/deserialization cost is used in its implementation. Note that Plasma is not a function for exchanging data frame data as such, but a lower-level function for sharing raw in-memory data.

Plasma was originally developed as part of Ray, but it moved to Apache Arrow because it looked widely useful. Ray is a high-performance distributed processing framework for large-scale machine learning and reinforcement learning, developed by RISELab at the University of California, Berkeley.

With Plasma, data on the CPU and data on the GPU can be shared with zero copy. By sharing data through Plasma, multiple processes can carry out distributed processing.

To use Plasma, you start a server process. Each process connects to the server process, places the data it wants to share, and reads the shared data as processing proceeds.

Reference: Plasma In-Memory Object Store
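
A hedged sketch of that flow from Python, assuming the server was started separately with something like "plasma_store -m 1000000000 -s /tmp/plasma" (the socket path and memory size are illustrative, and the exact connect() signature has varied between versions):

import pyarrow.plasma as plasma

# Connect to the already-running Plasma store process.
client = plasma.connect("/tmp/plasma", "", 0)

# Place a Python object in shared memory; an ObjectID comes back.
object_id = client.put("data to share")

# Any process on the same machine connected to the same store
# can fetch the object by its ID.
value = client.get(object_id)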

Implementation completeness in each language

Next, I will explain how complete each language's implementation is.

The Apache Arrow format is more beneficial the more environments it can be used in, so development aims to increase the number of languages that can use it. It can currently be used from the following languages (repeating the earlier list), though completeness varies.

  • C
  • C++
  • Go
  • Java
  • JavaScript
  • Lua
  • MATLAB
  • Python
  • R
  • Ruby
  • Rust

In particular, the Java, C++, Python, C, and Ruby implementations are highly complete.

Development started with the Java and C++ implementations, so those are highly complete. The Python, C, and Ruby implementations are also highly complete because they are bindings of the C++ implementation (they make the C++ implementation's functionality available in other languages). The Lua implementation is also (indirectly) a binding of the C++ implementation, but its API is a little less complete in terms of usability.

The MATLAB and R implementations are also bindings of the C++ implementation, but their development has just begun, so they are not very complete yet.

The Go, JavaScript, and Rust implementations are written from scratch, so they are not yet as complete as the Java or C++ implementations. However, the JavaScript implementation is quite complete with regard to the supported data frame types.

It is not yet in the Apache Arrow repository, but a Julia implementation, also written from scratch, is making progress as well.

Here is an overview of each language's implementation and how complete it is, in alphabetical order.

C

The C implementation is a binding of the C++ implementation. It is a module I started; currently @shiro615 and I are its main developers.

It is characterized by its use of GLib, a convenient library for the C language. The reason for using GLib is the automatic generation of bindings in various languages. To be precise, this does not mean that bindings of the C++ implementation can be generated automatically; it means that bindings of the C implementation, which is itself a binding of the C++ implementation, can be generated automatically.

The Ruby and Lua bindings work through this mechanism. GLib includes GObject, a library for object-oriented programming in C, and the C implementation's API is object-oriented, built on GObject. As a result, the automatically generated Ruby/Lua bindings also have an object-oriented, easy-to-use API. Normally, automatically generated bindings for a C library expose the raw C functions as they are, and a hand-written wrapper layer is needed to make them pleasant to use; with the auto-generation mechanism used by the C implementation, that layer is unnecessary.

The auto-generation mechanism used by the C implementation is GObject Introspection.

The C implementation exposes all of the C++ implementation's functionality except Plasma; Plasma support is planned soon.

C++

The C++ implementation is written from scratch. It is mainly developed by @wesm, @xhochy, @pitrou, @pcmoritz, and @cpcloud.

The C++ implementation uses C++11, so it cannot be built with the old g++ that comes standard on CentOS 6. To build it on CentOS 6, install the devtoolset-6 or devtoolset-7 package to get a newer g++.

The C++ implementation supports all types of data frames and multidimensional arrays.

Plasma's server process is implemented in C++. The Plasma client functionality is, of course, implemented as well.

The C++ implementation includes interconversion with several other formats; the formats currently implemented are Apache Parquet, Apache ORC, and Feather.

There is also functionality for placing data on the GPU rather than in main memory. It has not been used much yet, but it should see more use in the future, because libraries are being developed to process Apache Arrow format data on the GPU.

libgdf is one such library. libgdf is a data frame library for the C language, implemented using CUDA. In addition to value comparisons, it implements advanced features such as grouping and joins.

libgdf is developed by GoAi (the GPU Open Analytics Initiative), whose mission is to build a platform for analyzing data on GPUs.

In addition, MapD, a database that runs on the GPU, supports the Apache Arrow format. MapD can process data using SQL. The company developing MapD also participates in GoAi.

A module for Python linkage is also included in the C++ implementation. The Python linkage module provides functions for converting between C++ Apache Arrow objects and Python pandas/NumPy objects. It lives in the C++ implementation rather than in the Python binding because it is meant to be shared by various libraries. pandas 2.0, one of the initially assumed users, does not use it yet, but there is already a user: Red Arrow PyCall, a library that shares Apache Arrow data between Ruby and Python by running both in the same process. It uses the Python linkage module to convert between Apache Arrow data and Python objects.

Reference: ARROW-341 [Python] Making libpyarrow available to third parties

In September 2018, the Apache Parquet C++ implementation moved into the Apache Arrow repository. The reason is that the two projects are collaborating very closely at the moment, and the developers driving that collaboration find it easier to proceed within a single repository. In the future, once the APIs stabilize, the projects may be separated again.


Go

The Go implementation is written from scratch. It was developed by people at InfluxData; since being merged into the Apache Arrow repository, it has been developed by @sbinet (who is not at InfluxData).

The supported data frame types are as follows.

  • Boolean value (1 bit)
  • Integer
    • 8 bit unsigned integer (little endian)
    • 8 bit integer (little endian)
    • 16 bit unsigned integer (little endian)
    • 16 bit integer (little endian)
    • 32 bit unsigned integer (little endian)
    • 32 bit integer (little endian)
    • 64 bit unsigned integer (little endian)
    • 64 bit integer (little endian)
  • Floating point number
    • 32 bit floating point number
    • 64 bit floating point number
  • Binary data
    • Variable length binary data
  • Time stamp (64 bit integer)
    • Number of elapsed seconds since UNIX epoch
    • Number of milliseconds since UNIX epoch
    • Number of microseconds since UNIX epoch
    • Number of nanoseconds since the UNIX epoch
  • List (a type holding zero or more values of the same type)
  • Structure (a type with one or more fields; each field can have a different type)

Multidimensional arrays are not yet supported.

The Plasma client functionality is not yet supported.

Interconversion with other formats is not yet supported.

Java

The Java implementation is written from scratch. It is mainly developed by @julienledem, @BryanCutler, @StevenMPhillips, @siddharthteotia, @elahrvivaz, and @icexelloss.

The Java implementation supports most of the data frame types; only the following are not supported.

  • Floating point number
    • 16 bit floating point number

Multidimensional arrays are not yet supported.

The Plasma client functionality is supported.

A function that returns data fetched over JDBC as Apache Arrow objects is implemented, so RDBMS data is easy to handle as Apache Arrow objects.

JavaScript

The JavaScript implementation is written from scratch, mainly by @trxcllnt and @TheNeuralBit.

It is developed in TypeScript and runs in Web browsers as well as on Node.js.

For example, you can use it to visualize results processed by MapD in a Web browser. Because MapD supports Apache Arrow, it can exchange data with JavaScript in the browser at low cost.

Reference: Supercharging Visualization with Apache Arrow

The JavaScript implementation supports all data frame types.

Multidimensional arrays are not yet supported.

The Plasma client functionality is not supported yet; there is an idea to create a Node.js module called node-plasma to support it.

Reference: Connecting JS to modern GPU and ML frameworks: Update from Nvidia GTC 2018

Julia

There is no official Julia implementation yet, but @ExpandingMan is making progress on Arrow.jl, and discussions are under way to make it an official implementation.

Reference: collaboration with Apache Arrow org · Issue #28 · ExpandingMan/Arrow.jl

The data frame types supported by the Julia implementation are as follows.

  • Boolean value (1 bit)
  • Integer
    • 8 bit unsigned integer (little endian)
    • 8 bit integer (little endian)
    • 16 bit unsigned integer (little endian)
    • 16 bit integer (little endian)
    • 32 bit unsigned integer (little endian)
    • 32 bit integer (little endian)
    • 64 bit unsigned integer (little endian)
    • 64 bit integer (little endian)
  • Floating point number
    • 32 bit floating point number
    • 64 bit floating point number
  • Variable length UTF-8 character string
  • Date
    • Number of days elapsed since UNIX epoch (32 bits)
  • Time stamp (64 bit integer)
    • Number of elapsed seconds since UNIX epoch
    • Number of milliseconds since UNIX epoch
    • Number of microseconds since UNIX epoch
    • Number of nanoseconds since the UNIX epoch
  • Time
    • Number of seconds elapsed since midnight (32 bit integer)
    • Number of milliseconds elapsed since midnight (32 bit integer)
    • Number of microseconds elapsed since midnight (64 bit integer)
    • Number of nanoseconds elapsed since midnight (64 bit integer)
  • List (a type holding zero or more values of the same type)
  • Dictionary
    • In statistical terms: nominal-scale categorical data
    • In implementation terms: a type that assigns an integer ID to each distinct value and represents values by those IDs
    • For those who know scikit-learn: a type whose values are like the transform results of sklearn.preprocessing.LabelEncoder

Multidimensional arrays are not yet supported.

The Plasma client functionality is not yet supported.

Lua

The Lua implementation is generated automatically from the C implementation using lgi. It is not packaged; it can be used as follows.

local lgi = require 'lgi'
local Arrow = lgi.Arrow

Reference: arrow/c_glib/example/lua at master · apache/arrow

All functions available in the C implementation can be used.

MATLAB

The MATLAB implementation is a binding of the C++ implementation. As of September 10, 2018 there is only one commit so far; the main developer is @kevingurney, who works at MathWorks, the developer of MATLAB.

Currently only a function for reading Feather format data exists; it cannot read Apache Arrow format data yet.

Python

The Python implementation is a binding of the C++ implementation. Its main developers are roughly the same people who develop the C++ implementation.

Everything available in the C++ implementation can be used, and beyond merely exposing it, the API is being developed to be more convenient to use from Python.

It is implemented using Cython.

Functions for easy conversion to and from pandas/NumPy objects are also implemented, so it can be used seamlessly with existing libraries.
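
A minimal sketch of that interoperability (the data is illustrative):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})
table = pa.Table.from_pandas(df)  # pandas -> Apache Arrow
df2 = table.to_pandas()           # Apache Arrow -> pandas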

R

The R implementation is a binding of the C++ implementation. As of September 10, 2018 there is only one commit so far; the main developer is @romainfrancois. Several people from the R community are commenting on the first pull request below, and others are filing new issues in JIRA, so more people will likely join the development.

Reference: ARROW-1325: [R] Initial R package that builds against the arrow C++ library by romainfrancois · Pull Request #2489 · apache/arrow

Currently only the hookup to the C++ implementation exists; it cannot read Apache Arrow format data yet. However, the development infrastructure is coming along: it can already be tested on Travis CI.

How to design the R implementation is being discussed in Google Docs; once the discussion has settled to some extent, it will be broken down into JIRA tickets.

Reference: Apache Arrow for R - Initial Roadmap - Google Docs

Ruby

The Ruby implementation is generated automatically from the C implementation using the gobject-introspection gem. Its main developers are the people who develop the C implementation.

All functions available in the C implementation can be used. Beyond merely generating bindings, APIs for more convenient use, as in the Python implementation, are also in place.

To avoid increasing dependencies, they are not included in the Ruby implementation itself, but related libraries are available separately.

For example, there is Red Parquet for reading data in the Apache Parquet format.

Rust

The Rust implementation is written from scratch. It is mainly developed by @andygrove.

The data frame types supported by the Rust implementation are as follows.

  • Boolean value (1 bit)
  • Integer
    • 8 bit unsigned integer (little endian)
    • 8 bit integer (little endian)
    • 16 bit unsigned integer (little endian)
    • 16 bit integer (little endian)
    • 32 bit unsigned integer (little endian)
    • 32 bit integer (little endian)
    • 64 bit unsigned integer (little endian)
    • 64 bit integer (little endian)
  • Floating point number
    • 16 bit floating point number
    • 32 bit floating point number
    • 64 bit floating point number
  • Variable length UTF-8 character string
  • List (a type holding zero or more values of the same type)
  • Structure (a type with one or more fields; each field can have a different type)

Multidimensional arrays are not yet supported.

The Plasma client functionality is not yet supported.

Summary of implementation completeness in each language

I have described each language's Apache Arrow implementation.

In terms of supported types, every implementation covers the numeric and boolean types. All implementations except Go and Rust cover the time-related types. The composite types are covered by every language's implementation, except for unions. The C++, Java, and JavaScript implementations, and the bindings of the C++ implementation, support all types.

The C++ implementation family is furthest along in interconversion with other formats: it can convert to and from the Apache Parquet, Apache ORC, and Feather formats.

GPU support is also most advanced in the C++ implementation family.

If you want to use Plasma, use the C++, Java, or Python implementation. It will be available soon in the C implementation family as well.

From the viewpoint of processing Apache Arrow data with high performance, every implementation still covers only a narrow range.

The future of Apache Arrow

So far I have explained the outline of Apache Arrow and its current status. Finally, the future.

Importing the execution engine Gandiva

Handling Apache Arrow data with high performance is still a weak point. The module that covers it is Gandiva. Gandiva will be imported in the near future and become the basis for high-performance processing of Apache Arrow data.

Since Gandiva is implemented in C++, the C++ implementation will be able to use Gandiva soon.

Dremio, the company developing Gandiva, uses Apache Arrow in its own service. Since that service is implemented in Java, Dremio is also developing a binding for using the C++-implemented Gandiva from Java. So Gandiva will be usable from the Java implementation right away as well.

The Python implementation and the C implementation family, which are based on the C++ implementation, will be able to use it soon after.

Add CSV parser

CSV is still used frequently in the world, so a parser is being developed that parses CSV and creates Apache Arrow objects.

Reference: Building a fast Arrow-native delimited file reader (e.g. for CSVs) - Apache Mail Archives

This is being implemented in C++. Once implemented, it will be readily available across the C++ implementation family.

Add Apache Arrow compatible client

The Java implementation has a function that returns data fetched over JDBC as Apache Arrow objects; similar functionality will be included in the C++ implementation.

First, a function went in that retrieves data from Apache Hive / Apache Impala and returns it as Apache Arrow objects.

There are also plans to implement a function that retrieves data from PostgreSQL and returns it in the Apache Arrow format.

Reference: Developing native Arrow interfaces to database protocols - Apache Mail Archives

A Python library already does something similar: turbodbc, which can return data retrieved via ODBC as Apache Arrow objects. Perhaps in the future this functionality will move into Apache Arrow, and turbodbc will use it from there.

Summary

I have explained the outline of Apache Arrow, its status as of September 2018, and its future. Apache Arrow is a project that will become an important component of the data processing area within a few years. I think it would be good to have more people in Japan who know about Apache Arrow, so I put this together in Japanese. I hope more people will use Apache Arrow, and that more people will participate in its development.

I left it out of this explanation, but there is an interesting read about Apache Arrow's birth, written by the first chair of the PMC (which is something like the project management team). If you are interested, please read that as well.

I have written up what I know, but there may be things I could not cover. If you find yourself thinking "I want to know more!", please say so in the Red Data Tools chat, and I will add to this article.

If you would like to talk with us about Apache Arrow, please contact us via the inquiry form.

We would like to work on the development of data processing tools, and of course that includes Apache Arrow development. If you want to work on this together, or would like support in using Apache Arrow in your own service, please contact us via the inquiry form.

@kou

kou commented Oct 4, 2018

Maybe "performance" is better than "speed"!

@284km
Author

284km commented Oct 4, 2018

Good point! I changed it to "performance"!
