@d-chambers
Created January 19, 2021 21:45
Milling about
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mill\n",
"\n",
"A device to convert trees (like Catalog and friends) to tables (dataframes).\n",
"\n",
"## Requirements\n",
"\n",
"1. Convert to/from obspy.Catalog instances in a lossless fashion\n",
"2. High-level way to define bi-directional slices from mill to dataframes.\n",
"3. Efficient serialization (parquete?)\n",
"4. Needs to efficiently handle > 1_000_000 events\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic construction\n",
"\n",
"Common ways to create compendia"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from obsplus import Mill, load_dataset\n",
"\n",
"# get catalog\n",
"cat = load_dataset('bingham_test').event_client.get_events()\n",
"\n",
"# convert catalog to compendium and visa-versa\n",
"mill = Mill.from_catalog(cat)\n",
"cat2 = mill.to_catalog()\n",
"\n",
"# load and save parquet file\n",
"mill = Mill.from_parquet('some_events.pak')\n",
"mill.to_parquet('some_events.pak')\n",
"\n",
"# convert to/from json\n",
"mill = Mill.from_json('some_json_or_path_to_such')\n",
"json = mill.to_json()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframe conversion\n",
"\n",
"Dataframe should be extractable by common slices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pick_df = mill.get_df('pick')\n",
"origin_df = mill.get_df('origin')\n",
"event_df = mill.get_df('event')\n",
"arrival_df = mill.get_df('arrival')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the compendium should be updateable in the same way. These returen copies of the `Compendium` but can optionally be done in place. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_com = mill.upsert(pick_df, 'pick')\n",
"same_com = mill.upsert(arrival_df, 'arrival', inplace=True)\n",
"\n",
"# I am also considering the name \"put_df\""
]
},
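{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hypothetical round trip under the proposed API (behavior assumed, not specified above): pull the pick slice, shift the times, and upsert the result back into a new `Mill`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Sketch only: assumes get_df returns a pandas DataFrame with a datetime-like\n",
"# 'time' column and that upsert accepts the modified frame back.\n",
"pick_df = mill.get_df('pick')\n",
"pick_df['time'] = pick_df['time'] + pd.Timedelta(seconds=1)\n",
"shifted_mill = mill.upsert(pick_df, 'pick')"
]
},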
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Escape hatch\n",
"If the full tree-like structure of the Mill needs to be accessed it should be able to, something like this?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mill.data # dicts or perhaps awkward.Array"
]
},
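{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration only: if the tree were stored as plain nested dicts mirroring the ObsPy/QuakeML event hierarchy, `mill.data` might look roughly like the sketch below (keys and layout are assumptions, not a spec)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical shape of mill.data; none of these keys are guaranteed.\n",
"example_data = {\n",
"    'events': [\n",
"        {\n",
"            'resource_id': 'smi:local/event/1',\n",
"            'picks': [\n",
"                {'resource_id': 'smi:local/pick/1', 'time': '2021-01-19T21:45:00'},\n",
"            ],\n",
"            'origins': [\n",
"                {'resource_id': 'smi:local/origin/1', 'latitude': 40.5, 'longitude': -112.0},\n",
"            ],\n",
"        },\n",
"    ],\n",
"}"
]
},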
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Structure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The structure of the internal tree is defined using a Heirarchary of classes in a similar fashion to [pydantic](https://pydantic-docs.helpmanual.io/). Some validation may also be provided in a similar (but simplier) way.\n",
"\n",
"This will parady ObsPy structure exactly for the event Mill, but could be extended to any structure. We may even consider plugins to manage deviations from default structures. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from obsplus.mill import Structure, ID, LinkedID, validator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class StreamID:\n",
" network: str\n",
" station: str\n",
" location: str\n",
" channel: str\n",
"\n",
"\n",
"class Pick(Structure):\n",
" resource_id: ID\n",
" time: datetime64\n",
" polarity: Optional[Literal['negative', 'positive', 'undecidable']]\n",
" stream_id: StreamID\n",
" # etc.\n",
"\n",
" \n",
"class Arrival(Structure):\n",
" resource_id: ID\n",
" pick_id: LinkedID[Pick]\n",
"\n",
" \n",
"class Origin(Structure):\n",
" resource_id: ID\n",
" latitude: float\n",
" longitude: float\n",
" \n",
" @validator('latitude')\n",
" def lat_in_range(val):\n",
" \"\"\"Example validator. \"\"\"\n",
" assert abs(val) < 90\n",
" \n",
" @validator('longitude')\n",
" def lon_in_range(val):\n",
" \"\"\"Example validator. \"\"\"\n",
" assert abs(long) < 180\n",
"\n",
" \n",
"class Event(Structure):\n",
" resource_id: ID\n",
" picks: List[Pick]\n",
" origins: List[Origin]\n",
" preferred_origin_id: Optional[LinkedID[Origin]]\n",
" # etc.\n",
"\n"
]
},
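{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal usage sketch, assuming `Structure` accepts keyword arguments and runs its validators on construction (pydantic-style); neither behavior is pinned down above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumed behavior: keyword construction with validation on init.\n",
"origin = Origin(resource_id='smi:local/origin/1', latitude=40.5, longitude=-112.0)\n",
"\n",
"# An out-of-range latitude would then trip the lat_in_range validator:\n",
"# Origin(resource_id='smi:local/origin/2', latitude=100.0, longitude=0.0)  # raises"
]
},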
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Mappings\n",
"\n",
"These will largely be internally defined in ObsPlus, but provide a way for others to define their own dataframe extraction , slices/validation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from obsplus.mill import DFMapping, column, reverse_column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class PickMapping(DFMapping):\n",
" _name = 'pick' # registers this as for use in get_df\n",
" time = Pick.time\n",
" network = Pick.stream_id.network\n",
" station = Pick.stream_id.station\n",
" location = Pick.stream_id.location\n",
" channel = Pick.stream_id.channel\n",
" event_id = Pick._parent.id # parent structures are known and can be transversed.\n",
"\n",
" \n",
"class ArrivalMapping(DFMapping):\n",
" _name = 'arrival'\n",
" resource_id = Arrival.resource_id\n",
" time = Arrival.pick_id.time # notice how we can follow ids (LinkedID) to referred object\n",
" network = Arrival.pick_id.stream_id.network\n",
" station = Arrival.pick_id.stream_id.station\n",
" location = Arrival.pick_id.stream_id.location\n",
" channel = Arrival.pick_id.stream_id.channel\n",
" \n",
" # Lists can be referenced\n",
" event_description = Arrival.pick_id.event_descriptions.get(0)\n",
" \n",
" @column()\n",
" def network_station(self):\n",
" \"\"\"\n",
" A custom column which need some logic to run before returning.\n",
" \n",
" This is used for getting the dataframe column.\n",
" \"\"\"\n",
" seed_id = self.network + '.' self.station\n",
" \n",
" @reverse_column()\n",
" def network_str_len(self):\n",
" \"\"\"\n",
" Return an array which is set as attribute in compendium.\n",
" \"\"\"\n",
" return self.network.str.len()\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@shawnboltz

@d-chambers Quick question: what is a "Mill" in this case?

@shawnboltz

> @d-chambers Quick question: what is a "Mill" in this case?

Scratch that, I should read first

@d-chambers
Author

what is a "Mill" in this case?

You get the pun, right? It's not too obscure? I also thought about SawMill and Compendium.

@shawnboltz

@d-chambers I would be interested in talking with you more about this at some point. I'm free this afternoon or tomorrow afternoon. A few notes:

  • I know what "upsert" means (update and insert new stuff), but would it be better to just use "update"? It's a more familiar term.
  • Since certain object types are explicitly tied to other objects (I'm thinking specifically of Arrivals, which are specific to the Origin... then there's the whole level of weirdness that is StationMagnitudes and StationMagnitudeContributions), how do we verify that the user is getting the objects they want? Would the magnitude or origin id just be columns in the df? Or could there be some way to limit them to the preferred object?
  • I like the idea of using a dict for the tree structure, to eliminate awkwardness.
  • Is pydantic going to become a requirement for everything we do?
  • I'm curious to know more about LinkedID. Will it be used to fetch an object by ID in addition to finding all objects that refer to the object with that ID? (I think you've more or less answered this? Is that what _parent is? What about objects with multiple parents?)
  • On your mapping, does the stuff to the right of the = need to follow ObsPy syntax exactly? If so, you are missing some get_referred_object()s. Or is it the structure you are defining with your Structures?

@shawnboltz

what is a "Mill" in this case?

You get the pun right, its not too obscure? I also thought about SawMill and Compendium

It's not too obscure once you have the context behind it. I think I might recommend something more descriptive, though.
