Why build Apache Arrow from source on ARM?
Apache Arrow is an in-memory columnar data format used by several projects. Its Python module, pyarrow, can write in-memory data to disk from Python code and is commonly used in machine learning projects. ARM devices with little RAM can benefit from it, but there seems to be a configuration error in the packaged binaries as of version 0.15.1, so we're forced to build and install from source.
I'm using an NVIDIA Jetson Nano:
Quad-core ARM® Cortex®-A57 MPCore processor
NVIDIA Maxwell™ architecture with 128 NVIDIA CUDA® cores
4 GB 64-bit LPDDR4 1600MHz - 25.6 GB/s
Ubuntu 18.04 LTS
Preparing the environment
I created a separate directory for building Arrow and downloaded the sources into it. Apache Arrow sources are available from https://github.com/apache/arrow/releases.
    mkdir build
    cd build
    wget https://github.com/apache/arrow/archive/apache-arrow-0.15.1.zip
    # wget saves the archive under the name used in the URL
    unzip apache-arrow-0.15.1.zip
    cd arrow-apache-arrow-0.15.1/cpp
    mkdir release
    cd release
    # Install prefix; with -DCMAKE_INSTALL_LIBDIR=lib the libraries land in $ARROW_HOME/lib
    export ARROW_HOME=/usr/local
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
Note: /usr/local/lib is the path where the arrow *.so files would finally be installed.
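If you run the export more than once in the same shell, LD_LIBRARY_PATH collects duplicate entries. This is harmless but noisy. A minimal sketch of a guard against that follows; prepend_once is a hypothetical helper name of mine, not something that ships with Arrow.

```shell
# prepend_once DIR LIST: print LIST with DIR prepended, unless DIR is already in LIST.
# Hypothetical helper for illustration; not part of Arrow.
prepend_once() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;          # already present: leave unchanged
        *) printf '%s\n' "${dir}${list:+:$list}" ;;   # prepend, without a trailing colon
    esac
}

LD_LIBRARY_PATH=$(prepend_once /usr/local/lib "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Running it a second time leaves the variable unchanged.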
Depending on your setup, some of these packages may already be installed; if so, skip them in this step.
    sudo apt-get install libjemalloc-dev libboost-dev \
        libboost-filesystem-dev \
        libboost-system-dev \
        libboost-regex-dev \
        python3-dev \
        autoconf \
        flex \
        bison \
        libssl-dev \
        curl \
        cmake
    pip3 install six numpy pandas cython pytest psutil
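To see quickly which of the command-line tools are already present, something like the following may help. It is only a sketch: missing_tools is a hypothetical helper, and it checks executables on PATH only, so it says nothing about the -dev library packages.

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
# Hypothetical helper for illustration.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools cmake autoconf flex bison curl python3 pip3)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All listed build tools were found."
fi
```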
Build the cpp files & install the binary
I built with every available component to showcase the best-case scenario. You likely won't need several of them, so check what each component does before enabling it.
-DARROW_CUDA=ON because I have a CUDA-capable ARM board. If you don't have an NVIDIA ARM board, you don't need this.
-DPYTHON_EXECUTABLE=/usr/bin/python3 because my python3 resides at this path; replace it with your python3 path if required.
make -j4 because my board has a quad-core CPU, and building with 4 parallel jobs improves the build time significantly. Adjust this flag to the number of cores and threads available on your CPU.
sudo make install installs the compiled binaries (*.so) in the aforementioned directory.
    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DARROW_FLIGHT=ON \
        -DARROW_GANDIVA=ON \
        -DARROW_ORC=ON \
        -DARROW_WITH_BZ2=ON \
        -DARROW_WITH_ZLIB=ON \
        -DARROW_WITH_ZSTD=ON \
        -DARROW_WITH_LZ4=ON \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_BROTLI=ON \
        -DARROW_PARQUET=ON \
        -DARROW_PYTHON=ON \
        -DARROW_PLASMA=ON \
        -DARROW_CUDA=ON \
        -DARROW_BUILD_TESTS=ON \
        -DPYTHON_EXECUTABLE=/usr/bin/python3 \
        ..
    make -j4
    sudo make install
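On boards with little RAM, one compile job per core can exhaust memory, so it may be worth capping the job count by available memory as well. The sketch below assumes roughly 1 GiB per compile job. That figure is my own rule of thumb, not something from the Arrow documentation, and pick_jobs is a hypothetical helper.

```shell
# pick_jobs CORES MEM_MB: print min(CORES, MEM_MB / 1024), but never less than 1.
# The 1 GiB-per-job budget is an assumed rule of thumb.
pick_jobs() {
    cores=$1
    by_mem=$(( $2 / 1024 ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    if [ "$cores" -lt "$by_mem" ]; then
        echo "$cores"
    else
        echo "$by_mem"
    fi
}

# Core count and total memory in MB, with fallbacks if they cannot be read.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null || true)
mem_mb=${mem_mb:-1024}

echo "Suggested: make -j$(pick_jobs "$cores" "$mem_mb")"
```

If the build runs fine with one job per core on your board, there is no reason to lower it.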
Build and install pyarrow
As with Arrow C++, not all environment variables are required for building and installing pyarrow. If you enabled a component while building the C++ files, you'll likely need the matching variable here as well.
    # From cpp/release, go up two levels to reach the python directory
    cd ../../python
    pip3 install -r requirements.txt
    export PYARROW_WITH_FLIGHT=1
    export PYARROW_WITH_GANDIVA=1
    export PYARROW_WITH_ORC=1
    export PYARROW_WITH_PARQUET=1
    export PYARROW_WITH_CUDA=1
    export PYARROW_WITH_PLASMA=1
    python3 setup.py build_ext --inplace
    sudo -E python3 setup.py install
Note: If you build and install on your ARM box across several sessions, you may lose the environment variables. Ensure the required ones are set before building and installing. If you install with sudo, use sudo -E to pass your environment through to sudo.
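A quick way to confirm nothing was lost between sessions is to list the variables that are currently unset. This is a sketch only: unset_vars is a hypothetical helper, and the variable list simply mirrors the exports used in this post, so trim it to the components you actually enabled.

```shell
# unset_vars NAME...: print each variable name that is unset or empty.
# Hypothetical helper for illustration.
unset_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"    # indirect lookup of the variable called $name
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

missing=$(unset_vars ARROW_HOME LD_LIBRARY_PATH \
    PYARROW_WITH_FLIGHT PYARROW_WITH_GANDIVA PYARROW_WITH_ORC \
    PYARROW_WITH_PARQUET PYARROW_WITH_CUDA PYARROW_WITH_PLASMA)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Unset or empty:" $missing
else
    echo "All build variables are set."
fi
```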
Add the library directory to LD_LIBRARY_PATH
LD_LIBRARY_PATH must be set for arrow and pyarrow to function properly. Add the export to ~/.bashrc.
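In my case, that is the same export used while preparing the environment:
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH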
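To avoid adding the same line to ~/.bashrc every time you repeat these steps, the append can be made idempotent. This is a minimal sketch, assuming /usr/local/lib as the library directory; add_line_once is a hypothetical helper.

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line already exists.
# Hypothetical helper for illustration.
add_line_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines only, -F treats LINE as a fixed string rather than a regex.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example call, commented out so that pasting this block changes nothing:
# add_line_once "$HOME/.bashrc" 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH'
```

The single quotes keep $LD_LIBRARY_PATH literal, so it is expanded when ~/.bashrc is read rather than when the line is written. Open a new shell or run `. ~/.bashrc` afterwards for the change to take effect.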
I tested pyarrow by importing it at the Python command line.
    python3 -v
    >>> from pyarrow import compat
If the import statement above doesn't produce an error, all is good. If it does, ensure LD_LIBRARY_PATH is set correctly, as explained in the previous section.
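The same check can be run non-interactively, which is handy after a reboot or in a script. This is a rough sketch: check_import is a hypothetical helper that only reports whether the import succeeds. It does not diagnose why an import fails.

```shell
# check_import MODULE: report whether python3 can import MODULE.
# Hypothetical helper; returns non-zero when the import fails.
check_import() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "$1: import OK"
    else
        echo "$1: import FAILED"
        return 1
    fi
}

check_import pyarrow || echo "Check LD_LIBRARY_PATH, currently: ${LD_LIBRARY_PATH:-<unset>}"
```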