Since writing out ROOT files is currently not possible with UnROOT, one can instead write Arrow files directly from an `UnROOT.LazyTree` object, which can then be read back in Julia. With some consideration of the chunking, this won't use much memory.
```julia
using UnROOT
using Arrow
using Tables

treename = "Events"
filename = "18BCCE71-15B8-194B-8738-EC993C8DD3BD.root"
branches = [r"^MET_(pt|phi)$", "Jet_pt", "Jet_eta", "Muon_pt"]

const f = ROOTFile(filename)
const t = LazyTree(f, treename, branches)

# `Arrow.write` determines the record-batch size via `Tables.partitions()`.
# By default it is
#     Tables.partitions(t::LazyTree) = (t,)
# which writes out the whole table at once, but we often cannot hold a large
# materialized table in memory.
# For NanoAOD, the tree has `fClusterRangeEnd` defined, which is essentially
# the aligned basket entry ranges. For other kinds of trees it may be
# necessary to change the chunking logic here.
function Tables.partitions(t::LazyTree)
    tree = f[treename]
    edges = [0, (tree.fClusterRangeEnd .+ 1)..., tree.fEntries]
    ranges = [(edges[i]+1):edges[i+1] for i in 1:(length(edges)-1)]
    return (t[r] for r in ranges)
end

Arrow.write("out.arrow", t; compress=:lz4, ntasks=1)
```
The file can then be read back in Python into the awkward ecosystem with the `pyarrow` package. Remember to iterate over all `f.num_record_batches` record batches.
```python
>>> import awkward1 as ak
>>> import pyarrow
>>> f = pyarrow.ipc.open_file("out.arrow")
>>> f.num_record_batches
100
>>> batch = f.get_batch(0)  # first batch
>>> ak.from_arrow(batch)[0]  # first event
<Record ... 0.708, -2.77], Muon_pt: [64.7]} type='{"MET_phi": float32, "MET_pt":...'>
```