@Moelf
Last active August 31, 2022 02:58
Re-use Awkward / pyarrow IPC for Julia Arrow.jl

In principle, this lets Julia read anything that uproot/awkward can read and represent, as long as the conversion with to_arrow succeeds.

We use the following packages to demonstrate the round trip:

julia> using PythonCall, Arrow, DataFrames

julia> const ak = pyimport("awkward")

julia> ak.__version__
Python str: '1.9.0rc10'

julia> const pa = pyimport("pyarrow");

First, let's make some non-trivial data to represent:

julia> arr = ak._v2.from_iter([pydict(("one"=>1, "two"=>[2.0])), pydict(("one"=>2, "two"=>[1.0, 2.0]))])
Python Array: <Array [{one: 1, two: [2]}, {...}] type='2 * {one: int64, two: var * float64}'>

julia> arr.one
Python Array: <Array [1, 2] type='2 * int64'>

One can almost always get a pyarrow table out of an Awkward Array:

julia> pa_table = ak._v2.to_arrow_table(arr)
Python Table:
pyarrow.Table
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null
----
one: [[1,2]]
two: [[[2],[1,2]]]

julia> pa_batches = pa_table.to_batches()
Python list:
[pyarrow.RecordBatch
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null]

The list always contains exactly one batch because of how awkward builds the table: https://github.com/scikit-hep/awkward/blob/dd2a3f400e29fc9ea908fc7d8267f592091457bb/src/awkward/operations/convert.py#L2590

julia> batch = only(pa_batches)
Python RecordBatch:
pyarrow.RecordBatch
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null

julia> batch.num_rows
Python int: 2

julia> batch.num_columns
Python int: 2
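
Calling `only` here doubles as a check on the single-batch assumption. A minimal sketch of that behavior, using plain Julia vectors as stand-ins for the Python list (the strings are placeholders, not real record batches):

```julia
# `only` returns the sole element of a collection and throws an
# ArgumentError if the collection is empty or has more than one element.
single = ["batch-0"]                 # placeholder for a one-batch list
first_batch = only(single)

multiple = ["batch-0", "batch-1"]    # placeholder for a multi-batch list
threw = try
    only(multiple)
    false
catch err
    err isa ArgumentError
end
```

If the list ever held several batches, the call would fail loudly rather than silently dropping rows.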

Here's the important bit:

We can write the whole block of IPC stream bytes into a Julia buffer, and Arrow.jl can reuse that block of memory and turn it into a table:

julia> jl_sink = IOBuffer()

julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
               writer.write_batch(batch)
           end;

julia> DataFrame(Arrow.Table(take!(jl_sink)))
2×2 DataFrame
 Row │ one    two        
     │ Int64  Array     
─────┼───────────────────
   1 │     1  [2.0]
   2 │     2  [1.0, 2.0]
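
The last step does not depend on Python: `Arrow.Table` accepts a byte vector holding an Arrow IPC stream, whoever produced it. A minimal sketch using only Arrow.jl, assuming the same two-column data rebuilt as a plain Julia NamedTuple (the names `tbl`, `io`, `bytes`, and `roundtrip` are illustrative and not part of the original session):

```julia
using Arrow

# Stand-in for the awkward data: a NamedTuple of columns is a valid table.
tbl = (one = [1, 2], two = [[2.0], [1.0, 2.0]])

# Writing to an IO produces the Arrow IPC stream format by default.
io = IOBuffer()
Arrow.write(io, tbl)

# take! hands back the buffer contents as a Vector{UInt8}.
bytes = take!(io)

# Arrow.Table reads the table straight from the byte vector.
roundtrip = Arrow.Table(bytes)

# Materialize the Arrow columns as ordinary Julia vectors for comparison.
ones_col = collect(roundtrip.one)
twos_col = collect.(roundtrip.two)
```

This is handy for checking the Julia half of the round trip in isolation before involving pyarrow.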