Skip to content

Instantly share code, notes, and snippets.

@bskaggs
Last active July 21, 2024 18:49
Show Gist options
  • Save bskaggs/fc3c8d0d553be54e2645616236fdc8c6 to your computer and use it in GitHub Desktop.
Save bskaggs/fc3c8d0d553be54e2645616236fdc8c6 to your computer and use it in GitHub Desktop.
Install pyarrow on alpine in docker
FROM python:3.7-alpine3.8
RUN apk add --no-cache \
build-base \
cmake \
bash \
jemalloc-dev \
boost-dev \
autoconf \
zlib-dev \
flex \
bison
RUN pip install --no-cache-dir six pytest numpy cython
RUN pip install --no-cache-dir pandas
ARG ARROW_VERSION=0.12.0
ARG ARROW_SHA1=2ede75769e12df972f0acdfddd53ab15d11e0ac2
ARG ARROW_BUILD_TYPE=release
ENV ARROW_HOME=/usr/local \
PARQUET_HOME=/usr/local
#Download and build apache-arrow
RUN mkdir /arrow \
&& apk add --no-cache curl \
&& curl -o /tmp/apache-arrow.tar.gz -SL https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz \
&& echo "$ARROW_SHA1 *apache-arrow.tar.gz" | sha1sum /tmp/apache-arrow.tar.gz \
&& tar -xvf /tmp/apache-arrow.tar.gz -C /arrow --strip-components 1 \
&& mkdir -p /arrow/cpp/build \
&& cd /arrow/cpp/build \
&& cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on \
-DARROW_PYTHON=on \
-DARROW_PLASMA=on \
-DARROW_BUILD_TESTS=OFF \
.. \
&& make -j$(nproc) \
&& make install \
&& cd /arrow/python \
&& python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet \
&& python setup.py install \
&& rm -rf /arrow /tmp/apache-arrow.tar.gz
@arthurdevries
Copy link

I can't make this work either. If someone has knowledge about what the underlying problem is I will gladly put in some time and effort and try to make this work. Unfortunately, my knowledge about this thus far is quite limited. I would really like to work with the Alpine base image as it is a safe and small starting point. I am using python:3.11-alpine as a base.

@sergeibelov113
Copy link

sergeibelov113 commented Jul 21, 2024

I can't make this work either. If someone has knowledge about what the underlying problem is I will gladly put in some time and effort and try to make this work. Unfortunately, my knowledge about this thus far is quite limited. I would really like to work with the Alpine base image as it is a safe and small starting point. I am using python:3.11-alpine as a base.

I've managed to build pyarrow with apache arrow finally, but the resulted image is 3,5GB
And building lasts about 30 min. Here is the confirmed docker file:

FROM --platform=linux/amd64 python:3.12-alpine AS base

# Setup env
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONFAULTHANDLER=1
ENV ACCEPT_EULA=Y

RUN apk update && apk add --no-cache \
    gcc \
    g++ \
    curl \
    unixodbc-dev \
    bash \
    libffi-dev \
    openssl-dev \
    cargo \
    musl-dev \
    postgresql-dev \
    cmake \
    rust \
    linux-headers \
    libc-dev \
    libgcc \
    libstdc++ \
    ca-certificates \
    zlib-dev \
    bzip2-dev \
    xz-dev \
    lz4-dev \
    zstd-dev \
    snappy-dev \
    brotli-dev \
    build-base \
    autoconf \
    boost-dev \
    flex \
    libxml2-dev \
    libxslt-dev \
    libjpeg-turbo-dev \
    ninja \
    git \
    && pip install --upgrade pip && pip install pipenv cython numpy

ARG ARROW_VERSION=17.0.0
ARG ARROW_SHA256=8379554d89f19f2c8db63620721cabade62541f47a4e706dfb0a401f05a713ef
ARG ARROW_BUILD_TYPE=release

ENV ARROW_HOME=/usr/local \
    PARQUET_HOME=/usr/local

RUN mkdir /arrow \
    && wget -q https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz -O /tmp/apache-arrow.tar.gz \
    && echo "${ARROW_SHA256} *apache-arrow.tar.gz" | sha256sum /tmp/apache-arrow.tar.gz \
    && tar -xvf /tmp/apache-arrow.tar.gz -C /arrow --strip-components 1

# Create the patch file for re2
RUN echo "diff --git a/util/pcre.h b/util/pcre.h" > /arrow/re2_patch.diff \
    && echo "index e69de29..b6f3e31 100644" >> /arrow/re2_patch.diff \
    && echo "--- a/util/pcre.h" >> /arrow/re2_patch.diff \
    && echo "+++ b/util/pcre.h" >> /arrow/re2_patch.diff \
    && echo "@@ -21,6 +21,7 @@" >> /arrow/re2_patch.diff \
    && echo " #include \"re2/filtered_re2.h\"" >> /arrow/re2_patch.diff \
    && echo " #include \"re2/pod_array.h\"" >> /arrow/re2_patch.diff \
    && echo " #include \"re2/stringpiece.h\"" >> /arrow/re2_patch.diff \
    && echo "+#include <cstdint>" >> /arrow/re2_patch.diff

# Configure the build using CMake
RUN cd /arrow/cpp \
    && cmake --preset ninja-release-python

# Pre-fetch dependencies without building
RUN cd /arrow/cpp \
    && cmake --build . --target re2_ep -- -j1 || true

# Apply the patch to re2 after the dependencies are fetched but before the build
RUN cd /arrow/cpp/re2_ep-prefix/src/re2_ep \
    && patch -p1 < /arrow/re2_patch.diff

# Continue with the build and install Apache Arrow
RUN cd /arrow/cpp \
    && cmake --build . --target install \
    && rm -rf /arrow /tmp/apache-arrow.tar.gz

COPY Pipfile .
COPY Pipfile.lock .

RUN PIPENV_VENV_IN_PROJECT=1 pipenv install --deploy

# Final Stage
FROM base AS runtime
COPY --from=base /.venv /.venv
ENV PATH="/.venv/bin:$PATH"

WORKDIR /app

COPY src .

CMD ["python3", "main.py"]

I use pipenv to build dependencies, so you can customize the image for your needs.
or just add RUN pip install pyarrow or update these lines from original Dockerfile above like:

RUN cd /arrow/cpp \
    && cmake --build . --target install \
    && cd /arrow/python \
    && python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet \
    && python setup.py install \
    && rm -rf /arrow /tmp/apache-arrow.tar.gz

there is also a bug with pcre.h in apache arrow source code, so I applied patch within the Dockerimage. maybe it's not bug though, but I reported it here: #43350

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment