@eladroz
Last active August 12, 2022 06:09
Packaging Apache Arrow 2.0 on AWS Graviton2 (ARM64)

I'm now working on big data processing with Pandas at scale, as a lightweight alternative to Spark. Fortunately, the Apache Arrow project brings with it an excellent and very fast Parquet reader and writer.

With the current push to ARM in both personal computers and the data center, I was curious to check the performance of my code on ARM - running on AWS' homegrown Graviton2 processor. Their c6g instance types are 20% cheaper than the equivalent Intel-based c5's, while promising faster performance. If that's the future, why not start getting ready now?

While there are already Python wheels for NumPy and Pandas, there is no official build yet for PyArrow. There's a pull request in the works, but I was interested to know how smooth it would be to build from source for the latest stable Python (3.8).

Fortunately, the answer is: pretty smooth! The main hurdle was just getting the right packages for Amazon Linux 2, a relative of CentOS that the PyArrow documentation doesn't explicitly cover. Here's what you need to do:

  1. Launch a Graviton2-based instance (I've tested on both m6g and c6g) with Amazon Linux 2 - the ARM variant, of course.

  2. As usual with Amazon Linux images, initial login with your keypair is via the ec2-user user. I did not create an additional non-privileged user in this case, as my goal was only to build the wheel file for later deployments.
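
For reference, first login typically looks like this (the key file name and host are placeholders for your own):

# connect as ec2-user with the keypair chosen at launch
ssh -i ~/.ssh/my-keypair.pem ec2-user@<instance-public-dns>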

  3. After first login, a run of sudo yum -y update is always advised.

  4. Installing Python 3.8 from packages is the easiest method. You'll also need the header files (python38-devel). Note that upgrading pip is necessary: installing cmake later will simply fail with the stock pip.

sudo amazon-linux-extras enable python3.8
sudo yum install -y python38 python3-pip python38-devel
sudo python3.8 -m pip install --upgrade pip
sudo python3.8 -m pip install virtualenv
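
A quick sanity check that the right interpreter and pip are now in place:

python3.8 --version
python3.8 -m pip --version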

You probably won't want to make Python 3.8 the default system Python, as it breaks yum.

  5. Now, install the common development tools:
sudo yum -y groupinstall "Development Tools"

This will not only install a decently modern version of gcc, but also bring in some build dependencies of PyArrow that you'll need: boost and jemalloc.
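
If you want to double-check what landed (the package names here are the Amazon Linux 2 ones; rpm will simply report if they're missing):

gcc --version
rpm -q boost-devel jemalloc-devel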

  6. Install cmake, the build tool used by Apache Arrow:
pip3.8 install --user scikit-build
pip3.8 install --user cmake

This will install CMake both as a command-line tool and a Python package. The build requires both.
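
One note on --user installs: pip puts the binaries under ~/.local/bin, which Amazon Linux 2's default shell profile already includes in the PATH (re-login if the command doesn't resolve):

which cmake
cmake --version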

  7. Head over to the Environment Setup and Build section of the documentation, and follow the instructions to create directories, clone the Git repo and init its submodules.

The documentation assumes you're building the master branch. I've cloned the latest release instead (git clone --branch apache-arrow-2.0.0 https://github.com/apache/arrow.git).
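
Condensed, the setup looks roughly like this (the submodule step is from the docs; double-check there for anything I've glossed over):

# clone the latest release tag rather than master
git clone --branch apache-arrow-2.0.0 https://github.com/apache/arrow.git
pushd arrow
git submodule update --init
popd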

  8. Jump to the point in the documentation with the code snippet beginning with virtualenv pyarrow. You'll want to make sure it picks up Python 3.8 as the interpreter. To prevent any confusion, I generally prefer invoking the virtualenv module we've installed earlier explicitly via Python 3.8, i.e. python3.8 -m virtualenv pyarrow.
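
Concretely:

python3.8 -m virtualenv pyarrow
source pyarrow/bin/activate
python --version   # should now report 3.8.x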

  9. Continue following the guide to build the Arrow C++ libraries and then PyArrow:
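
As a rough sketch of the C++ step - the flag names are from the Arrow 2.0 docs, but check there for the full list; note ARROW_DATASET=ON, to match the PyArrow tweak below:

# the docs have Arrow install into a local dist/ directory
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
mkdir -p arrow/cpp/build
pushd arrow/cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_BUILD_TYPE=Release \
      -DARROW_PYTHON=ON \
      -DARROW_PARQUET=ON \
      -DARROW_DATASET=ON \
      ..
make -j4
make install
popd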

I'm using the dataset module of PyArrow, whose interface is still experimental and thus not included in the build by default. To build a wheel which includes that module and the underlying C++ library, I've slightly tweaked the code snippet that builds PyArrow:

pushd arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=1
pip install wheel
python setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel
popd
  10. To test that the wheel file works, let's install it (within our current virtualenv) and try to import a module from it:
mkdir -p dist/wheel
cp arrow/python/dist/pyarrow*.whl dist/wheel/
pip install dist/wheel/pyarrow*.whl
python -c "import pyarrow.parquet"
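# the dataset module was the point of the tweaked build - check that it imports too
python -c "import pyarrow.dataset"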
  11. To be sure that everything works as intended, you can also build PyArrow in place and run the unit tests, as the docs suggest. They pass without issues.

That's it! Although there will probably soon be an official wheel on PyPI, it does feel reassuring to know that no "funny stuff" is needed for it to work on arm64.

As for the performance comparison, I've done a few basic tests with my code - loading Parquet files and manipulating them with Pandas. In my little test (and YMMV), performance was basically the same on a c6g.xlarge as on a c5.xlarge. I'd say this is still a good thing, because: (a) it costs 20% less, and (b) judging by the recent progress of ARM-based chips, it's bound to get much better pretty quickly. I think AWS' own progress with Graviton is a testament to that. It's not 5nm yet, but 7nm ain't too bad either ;-)
