@martindurant
Created September 23, 2022 15:27

icechunk1
{
"cells": [
{
"cell_type": "markdown",
"id": "c4a2f7a2",
"metadata": {},
"source": [
"### Iceberg + kerchunk =\n",
"\n",
"# IceChunk\n",
"\n",
"Kerchunk is ... https://fsspec.github.io/kerchunk/\n",
"\n",
"Apache Iceberg is versioned parquet datasets, by immutable files and \"manifest\" listings."
]
},
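{
"cell_type": "markdown",
"id": "b3d1a0f1",
"metadata": {},
"source": [
"A kerchunk reference set is simply a mapping from Zarr store keys to either inline data or `[url, offset, length]` triples. The fragment below is purely illustrative (made-up offsets and lengths), sketching the shape of such a \"version 1\" reference set rather than the contents of the real `gridS.json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3d1a0f2",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: the general shape of a kerchunk version-1 reference set.\n",
"# Keys are Zarr store keys; values are inline data or [url, offset, length].\n",
"example_refs = {\n",
"    \"version\": 1,\n",
"    \"refs\": {\n",
"        \".zgroup\": '{\"zarr_format\": 2}',\n",
"        \"vosaline/0.0.0.0\": [\"s3://testfred/gridS.tar\", 1024, 32768],  # made-up offset and length\n",
"    },\n",
"}"
]
},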
{
"cell_type": "code",
"execution_count": null,
"id": "eafbe9d9",
"metadata": {},
"outputs": [],
"source": [
"import fsspec\n",
"import xarray as xr\n",
"import zarr"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6986acb7",
"metadata": {},
"outputs": [],
"source": [
"# s3://testfred/gridS.tar --endpoint-url https://object-store.cloud.muni.cz 20GB of netCDF4\n",
"s3 = {\n",
" \"anon\": True,\n",
" \"client_kwargs\": {\"endpoint_url\": \"https://object-store.cloud.muni.cz\"}\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "fcafdb6c",
"metadata": {},
"source": [
"### Innovation 1\n",
"indexed multiple file within a single remote TAR\n",
"\n",
"- can open file-like object in the remote for scanning\n",
"- can find the offsets to each of the member files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eabbc318",
"metadata": {},
"outputs": [],
"source": [
"with fsspec.open(\"tar://SEDNA-DELTA_y2014m01d01.1d_gridS.nc::s3://testfred/gridS.tar\", s3=s3) as f:\n",
" print(f.read(4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98635113",
"metadata": {},
"outputs": [],
"source": [
"import tarfile\n",
"with fsspec.open(\"s3://testfred/gridS.tar\", **s3) as tf:\n",
" tar = tarfile.TarFile(fileobj=tf)\n",
" offsets = {ti.name: ti.offset_data for ti in tar.getmembers()}\n",
"offsets"
]
},
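{
"cell_type": "markdown",
"id": "a7c11e02",
"metadata": {},
"source": [
"The reference file `gridS.json` used next was prepared ahead of time. Below is a plausible sketch of how such a reference set could be built with kerchunk: scan each netCDF member through the `tar://` protocol, shift the chunk offsets by the member offsets found above so that every reference points directly at the TAR, and combine. The concatenation dimension (`time_counter`) and other details are assumptions, not a record of what was actually run."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7c11e03",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: one plausible way a file like gridS.json could have been generated.\n",
"# Uses `offsets` and `s3` from the cells above; \"time_counter\" is an assumed\n",
"# concatenation dimension, not confirmed from the data.\n",
"from kerchunk.hdf import SingleHdf5ToZarr\n",
"from kerchunk.combine import MultiZarrToZarr\n",
"\n",
"per_file = []\n",
"for name, data_offset in offsets.items():\n",
"    url = f\"tar://{name}::s3://testfred/gridS.tar\"\n",
"    with fsspec.open(url, s3=s3) as f:\n",
"        refs = SingleHdf5ToZarr(f, url).translate()\n",
"    # re-point each chunk reference at the TAR itself, shifting by the\n",
"    # member's data offset, so everything resolves to one remote file\n",
"    for key, val in refs[\"refs\"].items():\n",
"        if isinstance(val, list) and len(val) == 3:\n",
"            refs[\"refs\"][key] = [\"s3://testfred/gridS.tar\", val[1] + data_offset, val[2]]\n",
"    per_file.append(refs)\n",
"\n",
"mzz = MultiZarrToZarr(per_file, concat_dims=[\"time_counter\"], remote_protocol=\"s3\", remote_options=s3)\n",
"# mzz.translate(\"gridS.json\")  # would write the combined reference set"
]
},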
{
"cell_type": "code",
"execution_count": null,
"id": "5f3cf694",
"metadata": {},
"outputs": [],
"source": [
"fs = fsspec.filesystem(\"reference\", fo=\"gridS.json\", remote_options=s3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "866c30c4",
"metadata": {},
"outputs": [],
"source": [
"# one file\n",
"{l[0] for l in fs.references.values() if isinstance(l, list)}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e0f4e1c",
"metadata": {},
"outputs": [],
"source": [
"g = zarr.open_group(fs.get_mapper())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c978185",
"metadata": {},
"outputs": [],
"source": [
"g.vosaline.nbytes / 2**30 # apparent in-memory size"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d3d3565",
"metadata": {},
"outputs": [],
"source": [
"# 30180 * 32k chunks\n",
"g.vosaline.chunks, g.vosaline.shape"
]
},
{
"cell_type": "markdown",
"id": "7d6806f9",
"metadata": {},
"source": [
"Data loads concurrently, unlike with TAR driver"
]
},
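{
"cell_type": "markdown",
"id": "c2e9b4d7",
"metadata": {},
"source": [
"One way to see this (an illustration added here, not part of the original demo): ask the reference filesystem for several chunk keys in a single call; as an async filesystem it gathers them concurrently."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2e9b4d8",
"metadata": {},
"outputs": [],
"source": [
"# Fetch a handful of vosaline chunk keys in one call; the reference\n",
"# filesystem issues the byte-range requests concurrently.\n",
"keys = [k for k in fs.references if k.startswith(\"vosaline/0.\")][:4]\n",
"{k: len(v) for k, v in fs.cat(keys).items()}"
]
},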
{
"cell_type": "code",
"execution_count": null,
"id": "92f58026",
"metadata": {},
"outputs": [],
"source": [
"g.vosaline[:, 0, 3000, 3000], g.vosaline[:, 0, 3600, 3001]"
]
},
{
"cell_type": "markdown",
"id": "d78077ee",
"metadata": {},
"source": [
"### Innovation 2\n",
"\n",
"Let's edit it! I do **not** have write access to the remote store."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce325d31",
"metadata": {},
"outputs": [],
"source": [
"g.vosaline[:, 0, 3000, 3000] += 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e3f26e2",
"metadata": {},
"outputs": [],
"source": [
"# created local file and updated references\n",
"fs.references[\"vosaline/0.0.230.0\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c489081c",
"metadata": {},
"outputs": [],
"source": [
"# save modified refs\n",
"fs.save_json(\"gridS-mod.json\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13f0cf7a",
"metadata": {},
"outputs": [],
"source": [
"fs2 = fsspec.filesystem(\n",
" \"reference\", fo=\"gridS-mod.json\", \n",
" fss={\n",
" \"s3\": fsspec.filesystem(\"s3\", **s3),\n",
" \"file\": fsspec.filesystem(\"file\")\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7705cf27",
"metadata": {},
"outputs": [],
"source": [
"# now we refer to one remote and four local files\n",
"{l[0] for l in fs2.references.values() if isinstance(l, list)}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8324a088",
"metadata": {},
"outputs": [],
"source": [
"g2 = zarr.open_group(fs2.get_mapper())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b01dc927",
"metadata": {},
"outputs": [],
"source": [
"# access both new and original data alongside\n",
"g2.vosaline[:, 0, 3000, 3000], g2.vosaline[:, 0, 3600, 3001]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "880b4dd2",
"metadata": {},
"outputs": [],
"source": [
"fs = fsspec.filesystem(\"reference\", fo=\"gridS.json\", remote_options=s3, skip_instance_cache=True)\n",
"g = zarr.open_group(fs.get_mapper())\n",
"g.vosaline[:, 0, 3000, 3000], g.vosaline[:, 0, 3600, 3001]"
]
},
{
"cell_type": "markdown",
"id": "2a1bd324",
"metadata": {},
"source": [
"Put it in an Intake catalog, and you have checkpoint/versioned data."
]
},
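{
"cell_type": "markdown",
"id": "d4f7a9c1",
"metadata": {},
"source": [
"For example (a sketch using the standard kerchunk/xarray pattern, not something run in this demo), a saved reference file can be opened with xarray via the zarr engine, and the same arguments could be recorded as an Intake catalog entry, one entry per checkpoint:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4f7a9c2",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: open the modified reference set as an xarray dataset. For this\n",
"# version the local chunk files written above must still exist; an Intake\n",
"# catalog entry would capture exactly these arguments.\n",
"ds = xr.open_dataset(\n",
"    \"reference://\",\n",
"    engine=\"zarr\",\n",
"    backend_kwargs={\n",
"        \"consolidated\": False,\n",
"        \"storage_options\": {\n",
"            \"fo\": \"gridS-mod.json\",\n",
"            \"remote_protocol\": \"s3\",\n",
"            \"remote_options\": s3,\n",
"        },\n",
"    },\n",
")"
]
},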
{
"cell_type": "code",
"execution_count": null,
"id": "b043d188",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@martindurant
Author

Requires https://github.com/martindurant/filesystem_spec/tree/icy

Note that the two datasets could have been put into an Intake catalog for loading with xarray. Xarray does not support updating part of a dataset, however.

cc @rabernat

@rabernat

Martin, I'm glad to see you digging into this problem.

As I mentioned over email, @jhamman and I are currently carefully writing a detailed design document around this idea. We will be sharing this within a few weeks. Until that is done, we will probably refrain from diving into much coding. I appreciate your patience.

@martindurant
Author

This was not a call to action, just a POC for myself to see how simple it could be. The finished UX will be far more comprehensive! I don't intend to do more on this before seeing your architecture design.
