In principle, this allows one to read anything that uproot/awkward can read and represent (as long as to_arrow works).
We use the following packages to demonstrate the round trip:
julia> using PythonCall, Arrow, DataFrames
julia> const ak = pyimport("awkward")
julia> ak.__version__
Python str: '1.9.0rc10'
julia> const pa = pyimport("pyarrow");
First, let's make some non-trivial data to represent:
julia> arr = ak._v2.from_iter([pydict(("one"=>1, "two"=>[2.0])), pydict(("one"=>2, "two"=>[1.0, 2.0]))])
Python Array: <Array [{one: 1, two: [2]}, {...}] type='2 * {one: int64, two: var * float64}'>
julia> arr.one
Python Array: <Array [1, 2] type='2 * int64'>
One can almost always get a pyarrow table out of an awkward array:
julia> pa_table = ak._v2.to_arrow_table(arr)
Python Table:
pyarrow.Table
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null
----
one: [[1,2]]
two: [[[2],[1,2]]]
julia> pa_batches = pa_table.to_batches()
Python list:
[pyarrow.RecordBatch
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null]
There is always exactly one batch, due to how awkward constructs the table:
https://github.com/scikit-hep/awkward/blob/dd2a3f400e29fc9ea908fc7d8267f592091457bb/src/awkward/operations/convert.py#L2590
julia> batch = only(pa_batches)
Python RecordBatch:
pyarrow.RecordBatch
one: extension<awkward<AwkwardArrowType>> not null
two: extension<awkward<AwkwardArrowType>> not null
julia> batch.num_rows
Python int: 2
julia> batch.num_columns
Python int: 2
Here's the important bit: we can write the whole block of IPC stream bytes into a Julia buffer, and Arrow.jl
can reuse that memory blob and turn it into a table:
julia> jl_sink = IOBuffer()
julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
writer.write_batch(batch)
end;
julia> DataFrame(Arrow.Table(take!(jl_sink)))
2×2 DataFrame
 Row │ one    two
     │ Int64  Array…
─────┼───────────────────
   1 │     1  [2.0]
   2 │     2  [1.0, 2.0]