@yutannihilation
Last active August 1, 2019 05:28
Read Parquet files from R using Apache Arrow

Read Parquet files from R

Prerequisite

Build and install the Arrow C++ libraries (libarrow and libparquet) from source.

git clone https://github.com/apache/arrow/

cd arrow/cpp
mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DARROW_BOOST_USE_SHARED:BOOL=Off \
  -DARROW_PARQUET=ON

make
sudo make install
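
The patched r/configure in this gist asks pkg-config for both `arrow` and `parquet`, so it can help to confirm that both are visible before building the R package. The following is a minimal sketch; `check_module` is a helper name made up for this example, not part of Arrow.

```shell
# Hypothetical helper: report whether pkg-config can find a module.
# Returns non-zero when the module (or pkg-config itself) is missing.
check_module() {
  if pkg-config --exists "$1" 2>/dev/null; then
    echo "$1: found ($(pkg-config --modversion "$1"))"
  else
    echo "$1: NOT found" >&2
    return 1
  fi
}

# These are the two names the patched r/configure passes to pkg-config.
for module in arrow parquet; do
  check_module "$module" || true
done
```

If a module is reported as missing even though `make install` succeeded, the `.pc` files may sit outside pkg-config's search path; setting `PKG_CONFIG_PATH` to the install prefix's `lib/pkgconfig` directory is the usual remedy, though the exact directory depends on your system.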

Usage

tmp_pqt <- tempfile(fileext = ".parquet")
download.file("https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet",
              destfile = tmp_pqt, mode = "wb")  # binary mode; needed on Windows

arrow:::read_parquet_file(tmp_pqt)
#> # A tibble: 1,000 x 13
#>    registration_dttm      id first_name last_name email gender ip_address
#>    <dttm>              <int> <chr>      <chr>     <chr> <chr>  <chr>     
#>  1 2016-02-03 16:55:29     1 Amanda     Jordan    ajor… Female 1.197.201…
#>  2 2016-02-04 02:04:03     2 Albert     Freeman   afre… Male   218.111.1…
#>  3 2016-02-03 10:09:31     3 Evelyn     Morgan    emor… Female 7.161.136…
#>  4 2016-02-03 09:36:21     4 Denise     Riley     dril… Female 140.35.10…
#>  5 2016-02-03 14:05:31     5 Carlos     Burns     cbur… ""     169.113.2…
#>  6 2016-02-03 16:22:34     6 Kathryn    White     kwhi… Female 195.131.8…
#>  7 2016-02-03 17:33:08     7 Samuel     Holmes    shol… Male   232.234.8…
#>  8 2016-02-03 15:47:06     8 Harry      Howell    hhow… Male   91.235.51…
#>  9 2016-02-03 12:52:53     9 Jose       Foster    jfos… Male   132.31.53…
#> 10 2016-02-04 03:29:47    10 Emily      Stewart   este… Female 143.28.25…
#> # … with 990 more rows, and 6 more variables: cc <chr>, country <chr>,
#> #   birthdate <chr>, salary <dbl>, title <chr>, comments <chr>

Created on 2018-12-04 by the reprex package (v0.2.1)
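
Note that `read_parquet_file()` is not part of the arrow R package as cloned; it comes from the diff that follows, which has to be applied before installing the package. A rough sketch of that step is below. `ARROW_DIR`, `PATCH_FILE`, and `apply_and_install` are placeholder names chosen for this example, and the sketch assumes the R package generates its bindings with Rcpp attributes, as the `// [[Rcpp::export]]` tag in the patch suggests. The patch was written against a late-2018 checkout and may not apply cleanly to other revisions.

```shell
# Sketch only: ARROW_DIR and PATCH_FILE are placeholders, not names from the gist.
ARROW_DIR="${ARROW_DIR:-$PWD/arrow}"
PATCH_FILE="${PATCH_FILE:-$PWD/parquet.diff}"

apply_and_install() {
  # Stop early with a clear message if either input is missing.
  [ -d "$ARROW_DIR/r" ] || { echo "no R package under $ARROW_DIR" >&2; return 1; }
  [ -f "$PATCH_FILE" ] || { echo "patch file $PATCH_FILE not found" >&2; return 1; }

  (
    cd "$ARROW_DIR" || exit 1
    # Dry run first so a non-matching patch leaves the tree untouched.
    git apply --check "$PATCH_FILE" || exit 1
    git apply "$PATCH_FILE" || exit 1
    # Regenerate RcppExports so the new exported C++ function is callable from R.
    Rscript -e 'Rcpp::compileAttributes("r")' || exit 1
    R CMD INSTALL r
  )
}
```

Save the diff to a file, point `PATCH_FILE` at it, and call `apply_and_install` from a shell. If pkg-config cannot locate the libraries, the comment at the top of r/configure describes passing `INCLUDE_DIR` and `LIB_DIR` through `--configure-vars` instead.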

diff --git a/r/configure b/r/configure
index 28f6a73a..644c2e95 100755
--- a/r/configure
+++ b/r/configure
@@ -26,13 +26,13 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow"
+PKG_CONFIG_NAME="arrow parquet"
 PKG_DEB_NAME="arrow"
 PKG_RPM_NAME="arrow"
 PKG_CSW_NAME="arrow"
 PKG_BREW_NAME="apache-arrow"
-PKG_TEST_HEADER="<arrow/api.h>"
-PKG_LIBS="-larrow"
+PKG_TEST_HEADER="<arrow/api.h>\n<parquet/types.h>"
+PKG_LIBS="-larrow -lparquet"
 
 # Use pkg-config if available
 pkg-config --version >/dev/null 2>&1
diff --git a/r/src/parquet.cpp b/r/src/parquet.cpp
new file mode 100644
index 00000000..64de6ba4
--- /dev/null
+++ b/r/src/parquet.cpp
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <parquet/arrow/reader.h>
+#include <parquet/arrow/writer.h>
+#include <parquet/exception.h>
+#include "arrow_types.h"
+
+using namespace Rcpp;
+using namespace arrow;
+
+// [[Rcpp::export]]
+List read_parquet_file(String path_) {
+  // TODO: expand path
+  std::string path(path_);
+
+  // original code: https://github.com/apache/arrow/blob/0729cb771bd51f60423b52d44a50bddc45653d90/cpp/examples/parquet/parquet-arrow/src/reader-writer.cc#L64-L72
+  std::shared_ptr<arrow::io::ReadableFile> infile;
+  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(
+      path, arrow::default_memory_pool(), &infile));
+
+  std::unique_ptr<parquet::arrow::FileReader> reader;
+  PARQUET_THROW_NOT_OK(
+      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
+  std::shared_ptr<arrow::Table> table;
+
+  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
+
+  // original code: https://github.com/apache/arrow/blob/0729cb771bd51f60423b52d44a50bddc45653d90/r/src/table.cpp#L49-L63
+  int nc = table->num_columns();
+  int nr = table->num_rows();
+  List tbl(nc);
+  CharacterVector names(nc);
+  for (int i = 0; i < nc; i++) {
+    auto column = table->column(i);
+
+    tbl[i] = ChunkedArray__as_vector(column->data());
+    names[i] = column->name();
+  }
+  tbl.attr("names") = names;
+  tbl.attr("class") = CharacterVector::create("tbl_df", "tbl", "data.frame");
+  tbl.attr("row.names") = IntegerVector::create(NA_INTEGER, -nr);
+  return tbl;
+}