
@erget
Created August 5, 2019 17:47
MICMoR Daniel Lee.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5 Sep 2019: EDA, ETL, Visualisation\n",
"## D. Lee: Data processing, formats, etc.\n",
"Resources:\n",
"- [Landing page](https://micmor.kit.edu/summer-schools)\n",
"- [GitHub repo](https://github.com/cwerner/kit_micmor_summerschool_2019)\n",
"\n",
"### First block: 9:00 - 9:30\n",
"Why do we care about this? What's the point? Rosetta Stone comparison.\n",
"Maybe a short excursion into the realm of communicating with aliens, making fun of movies.\n",
"How do they even know the endianness? Probably binary is obvious, but how are floats encoded?\n",
"As humans we pretty much have data encodings figured out... Kind of. But actually not. And data formats are even more difficult.\n",
"#### Basic formats\n",
"- CSV\n",
"- Image formats like JPG\n",
"- xarray\n",
"- GeoTIFF\n",
"- netCDF\n",
"- HDF5\n",
"- Meteorological stuff\n",
"- Other stuff you might encounter\n",
"\n",
"Some other stuff\n",
"- What libraries\n",
"- Conventions\n",
" - Coordinates\n",
" - Encoding\n",
"- Sometimes your data is organised stupidly and then you have to put it into the format you can work with\n",
"- Libraries that abstract away this stuff can be helpful, like pandas, numpy, etc.\n",
"#### Beyond formats, how do you process data?\n",
"- Some cool analogies about how much more efficient you are when you stay in one place\n",
"- How to organise your data so that it's optimised for your access patterns (what dimensions increment first, basically?)\n",
"### Second block: 9:30 - 10:00\n",
"#### But what about the cloud?\n",
"- Moving algorithms instead of data.\n",
"\n",
"Then some stuff about cloud-optimised formats:\n",
"- Parquett\n",
"- zarr\n",
"- COG\n",
"- The importance of streaming\n",
"- Object store vs files vs databases\n",
"\n",
"#### Generating interoperable data\n",
"Formats to consider, standards, other data formats engineering questions. Keep this brief."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

erget commented Aug 7, 2019

Make sure represented:

  • CF Conventions kind of in-depth
  • Arrow, Feather, Parquet, ORC?
  • Big data DBs and ways to use DBs to work with them
