@aaronspring
Created March 17, 2020 11:54
intake-esm_require_all_on_usage
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Conda debug"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#!conda info"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"#!which jupyter"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"#!which jupyter-lab"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#!jupyter-troubleshoot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### get resources"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of CPUs: 48, number of threads: 8, number of workers: 6\n"
]
}
],
"source": [
"from dask.distributed import Client\n",
"import multiprocessing\n",
"ncpu = multiprocessing.cpu_count()\n",
"threads = 8\n",
"nworker = ncpu//threads\n",
"print(f'Number of CPUs: {ncpu}, number of threads: {threads}, number of workers: {nworker}')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table style=\"border: 2px solid white;\">\n",
"<tr>\n",
"<td style=\"vertical-align: top; border: 0px solid white\">\n",
"<h3 style=\"text-align: left;\">Client</h3>\n",
"<ul style=\"text-align: left; list-style: none; margin: 0; padding: 0;\">\n",
" <li><b>Scheduler: </b>inproc://136.172.50.56/13728/1</li>\n",
" <li><b>Dashboard: </b><a href='http://localhost:8888/proxy/8787/status' target='_blank'>http://localhost:8888/proxy/8787/status</a>\n",
"</ul>\n",
"</td>\n",
"<td style=\"vertical-align: top; border: 0px solid white\">\n",
"<h3 style=\"text-align: left;\">Cluster</h3>\n",
"<ul style=\"text-align: left; list-style:none; margin: 0; padding: 0;\">\n",
" <li><b>Workers: </b>6</li>\n",
" <li><b>Cores: </b>48</li>\n",
" <li><b>Memory: </b>16.11 GB</li>\n",
"</ul>\n",
"</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<Client: 'inproc://136.172.50.56/13728/1' processes=6 threads=48, memory=16.11 GB>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client = Client(processes=False, threads_per_worker=threads, n_workers=nworker, memory_limit='256GB')\n",
"client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to use the `dask labextension dashboard`, please install your own conda and then intake-esm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intake to load CMIP data\n",
"\n",
"### Using intake-esm on mistral\n",
"\n",
"- install intake-esm: https://intake-esm.readthedocs.io/en/latest/installation.html\n",
"- check the already built catalogs in `/home/mpim/m300524/intake-esm-datastore/catalogs` or at `https://github.com/NCAR/intake-esm-datastore/`; using them lets you skip the long catalog-building process of running `/home/mpim/m300524/intake-esm-datastore/builders/*.ipynb`\n",
"\n",
"Available catalogs:\n",
"- CMIP6: `/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json`\n",
"- CMIP5: `/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip5.json`\n",
"- MPI Grand Ensemble: `/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-MPI-GE.json`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aws-cesm1-le.csv.gz mistral-cmip5.csv.gz mistral-miklip.json\n",
"glade-cmip5.csv.gz mistral-cmip5.json mistral-MPI-GE.csv.gz\n",
"glade-cmip5.json mistral-cmip6.csv.gz mistral-MPI-GE.json\n",
"glade-cmip6.csv.gz mistral-cmip6.json pangeo-cmip6.json\n",
"glade-cmip6.json mistral-miklip.csv.gz\n"
]
}
],
"source": [
"# available catalogs: combination of .json and .csv\n",
"!ls /home/mpim/m300524/intake-esm-datastore/catalogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"You may see some warnings about tornado and bokeh below; I will try to fix this. They do not appear with the default `-p shared`, but then we often have too little memory for the analysis. Therefore, for now I recommend using `-p compute`.\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import intake\n",
"import xarray as xr\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import pprint\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"#import warnings\n",
"#warnings.simplefilter(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2020.3.16'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# should be >= 2019.12.13\n",
"import intake_esm\n",
"intake_esm.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CMIP6"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# json file contains the rules for concat and merge as well as the location for the catalog .csv file\n",
"col_url = \"/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"#!cat /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>activity_id</th>\n",
" <th>institution_id</th>\n",
" <th>source_id</th>\n",
" <th>experiment_id</th>\n",
" <th>member_id</th>\n",
" <th>table_id</th>\n",
" <th>variable_id</th>\n",
" <th>grid_label</th>\n",
" <th>dcpp_init_year</th>\n",
" <th>version</th>\n",
" <th>time_range</th>\n",
" <th>path</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AerChemMIP</td>\n",
" <td>HAMMOZ-Consortium</td>\n",
" <td>MPI-ESM-1-2-HAM</td>\n",
" <td>ssp370-lowNTCF</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Lmon</td>\n",
" <td>npp</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190627</td>\n",
" <td>203501-205412</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AerChemMIP</td>\n",
" <td>HAMMOZ-Consortium</td>\n",
" <td>MPI-ESM-1-2-HAM</td>\n",
" <td>ssp370-lowNTCF</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Lmon</td>\n",
" <td>npp</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190627</td>\n",
" <td>201501-203412</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AerChemMIP</td>\n",
" <td>HAMMOZ-Consortium</td>\n",
" <td>MPI-ESM-1-2-HAM</td>\n",
" <td>ssp370-lowNTCF</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Lmon</td>\n",
" <td>npp</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190627</td>\n",
" <td>205501-205512</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AerChemMIP</td>\n",
" <td>HAMMOZ-Consortium</td>\n",
" <td>MPI-ESM-1-2-HAM</td>\n",
" <td>ssp370-lowNTCF</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Lmon</td>\n",
" <td>tsl</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190627</td>\n",
" <td>205501-205512</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>AerChemMIP</td>\n",
" <td>HAMMOZ-Consortium</td>\n",
" <td>MPI-ESM-1-2-HAM</td>\n",
" <td>ssp370-lowNTCF</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Lmon</td>\n",
" <td>tsl</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190627</td>\n",
" <td>201501-203412</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" activity_id institution_id source_id experiment_id member_id \\\n",
"0 AerChemMIP HAMMOZ-Consortium MPI-ESM-1-2-HAM ssp370-lowNTCF r1i1p1f1 \n",
"1 AerChemMIP HAMMOZ-Consortium MPI-ESM-1-2-HAM ssp370-lowNTCF r1i1p1f1 \n",
"2 AerChemMIP HAMMOZ-Consortium MPI-ESM-1-2-HAM ssp370-lowNTCF r1i1p1f1 \n",
"3 AerChemMIP HAMMOZ-Consortium MPI-ESM-1-2-HAM ssp370-lowNTCF r1i1p1f1 \n",
"4 AerChemMIP HAMMOZ-Consortium MPI-ESM-1-2-HAM ssp370-lowNTCF r1i1p1f1 \n",
"\n",
" table_id variable_id grid_label dcpp_init_year version time_range \\\n",
"0 Lmon npp gn NaN v20190627 203501-205412 \n",
"1 Lmon npp gn NaN v20190627 201501-203412 \n",
"2 Lmon npp gn NaN v20190627 205501-205512 \n",
"3 Lmon tsl gn NaN v20190627 205501-205512 \n",
"4 Lmon tsl gn NaN v20190627 201501-203412 \n",
"\n",
" path \n",
"0 /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO... \n",
"1 /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO... \n",
"2 /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO... \n",
"3 /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO... \n",
"4 /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/HAMMO... "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col = intake.open_esm_datastore(col_url)\n",
"\n",
"# col.df is a pandas.DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame\n",
"col.df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## many experiments"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>activity_id</th>\n",
" <th>institution_id</th>\n",
" <th>source_id</th>\n",
" <th>experiment_id</th>\n",
" <th>member_id</th>\n",
" <th>table_id</th>\n",
" <th>variable_id</th>\n",
" <th>grid_label</th>\n",
" <th>dcpp_init_year</th>\n",
" <th>version</th>\n",
" <th>time_range</th>\n",
" <th>path</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CMIP</td>\n",
" <td>NCAR</td>\n",
" <td>CESM2</td>\n",
" <td>piControl</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Amon</td>\n",
" <td>tas</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190320</td>\n",
" <td>110001-120012</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>CMIP</td>\n",
" <td>NCAR</td>\n",
" <td>CESM2</td>\n",
" <td>piControl</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Amon</td>\n",
" <td>tas</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190320</td>\n",
" <td>010001-019912</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CMIP</td>\n",
" <td>NCAR</td>\n",
" <td>CESM2</td>\n",
" <td>piControl</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Amon</td>\n",
" <td>tas</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190320</td>\n",
" <td>070001-079912</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>CMIP</td>\n",
" <td>NCAR</td>\n",
" <td>CESM2</td>\n",
" <td>piControl</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Amon</td>\n",
" <td>tas</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190320</td>\n",
" <td>080001-089912</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CMIP</td>\n",
" <td>NCAR</td>\n",
" <td>CESM2</td>\n",
" <td>piControl</td>\n",
" <td>r1i1p1f1</td>\n",
" <td>Amon</td>\n",
" <td>tas</td>\n",
" <td>gn</td>\n",
" <td>NaN</td>\n",
" <td>v20190320</td>\n",
" <td>060001-069912</td>\n",
" <td>/work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" activity_id institution_id source_id experiment_id member_id table_id \\\n",
"0 CMIP NCAR CESM2 piControl r1i1p1f1 Amon \n",
"1 CMIP NCAR CESM2 piControl r1i1p1f1 Amon \n",
"2 CMIP NCAR CESM2 piControl r1i1p1f1 Amon \n",
"3 CMIP NCAR CESM2 piControl r1i1p1f1 Amon \n",
"4 CMIP NCAR CESM2 piControl r1i1p1f1 Amon \n",
"\n",
" variable_id grid_label dcpp_init_year version time_range \\\n",
"0 tas gn NaN v20190320 110001-120012 \n",
"1 tas gn NaN v20190320 010001-019912 \n",
"2 tas gn NaN v20190320 070001-079912 \n",
"3 tas gn NaN v20190320 080001-089912 \n",
"4 tas gn NaN v20190320 060001-069912 \n",
"\n",
" path \n",
"0 /work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/... \n",
"1 /work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/... \n",
"2 /work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/... \n",
"3 /work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/... \n",
"4 /work/ik1017/CMIP6/data/CMIP6/CMIP/NCAR/CESM2/... "
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = dict(experiment_id=['historical','piControl'],\n",
" source_id='CESM2',\n",
" variable_id='tas', table_id='Amon'\n",
" )\n",
"cat = col.search(**query)\n",
"cat.df.head()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['piControl', 'historical'], dtype=object)"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# both experiments are present in the search result\n",
"cat.df.experiment_id.unique()"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>activity_id</th>\n",
" <th>institution_id</th>\n",
" <th>source_id</th>\n",
" <th>experiment_id</th>\n",
" <th>member_id</th>\n",
" <th>table_id</th>\n",
" <th>variable_id</th>\n",
" <th>grid_label</th>\n",
" <th>dcpp_init_year</th>\n",
" <th>version</th>\n",
" <th>time_range</th>\n",
" <th>path</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [activity_id, institution_id, source_id, experiment_id, member_id, table_id, variable_id, grid_label, dcpp_init_year, version, time_range, path]\n",
"Index: []"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# but the same query returns no rows once require_all_on='source_id' is added\n",
"query = dict(experiment_id=['historical','piControl'], source_id='CESM2',\n",
" require_all_on='source_id',\n",
" variable_id='tas', table_id='Amon'\n",
" )\n",
"cat = col.search(**query)\n",
"cat.df.head()"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2020.3.16'"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"intake_esm.__version__"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intake-esm",
"language": "python",
"name": "intake-esm"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
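
The two searches in the notebook differ only in `require_all_on='source_id'`. As I read the option, it should keep a `source_id` only if that model provides every requested value (here both `historical` and `piControl`). The following is a minimal pure-Python sketch of that intended filtering, not intake-esm's implementation: the function `search_require_all_on`, the toy `rows`, and the model name `OTHER-MODEL` are all made up for illustration.

```python
from collections import defaultdict


def search_require_all_on(rows, require_all_on, **query):
    """Hypothetical stand-in for a catalog search with ``require_all_on``."""
    # Normalise scalar query values to lists, e.g. 'tas' -> ['tas'].
    query = {k: v if isinstance(v, (list, tuple)) else [v] for k, v in query.items()}

    # Step 1: ordinary search, keep rows that match every queried column.
    hits = [r for r in rows if all(r[k] in vals for k, vals in query.items())]

    # Step 2: group the hits by the ``require_all_on`` column.
    groups = defaultdict(list)
    for r in hits:
        groups[r[require_all_on]].append(r)

    # Step 3: keep only groups that cover every requested value of every column.
    kept = []
    for group in groups.values():
        if all({r[k] for r in group} >= set(vals) for k, vals in query.items()):
            kept.extend(group)
    return kept


# Toy catalog (assumption): CESM2 has both experiments, OTHER-MODEL only one.
rows = [
    {"source_id": "CESM2", "experiment_id": "historical", "variable_id": "tas", "table_id": "Amon"},
    {"source_id": "CESM2", "experiment_id": "piControl", "variable_id": "tas", "table_id": "Amon"},
    {"source_id": "OTHER-MODEL", "experiment_id": "historical", "variable_id": "tas", "table_id": "Amon"},
]

result = search_require_all_on(
    rows,
    "source_id",
    experiment_id=["historical", "piControl"],
    variable_id="tas",
    table_id="Amon",
)
print(sorted({r["source_id"] for r in result}))
```

If this reading is right, the CESM2 query in the notebook should return the same rows with and without `require_all_on='source_id'`, because the plain search already shows both experiments for CESM2. The empty result is what looks wrong to me.
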
@andersy005

By the way, I thought I would let you know that intake-esm now supports relative paths for the csv files. This will make it easy for other users to explore the catalog even when they don't have access to the data storage.

For example,

{
  "esmcat_version": "0.1.0",
  "id": "mistral-cmip6",
  "description": "This is an ESM collection for CMIP6 data accessible on the DKRZ's MISTRAL disk storage system in /work/ik1017/CMIP6/data/CMIP6",
  "catalog_file": "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.csv.gz",
...
}

becomes

{
  "esmcat_version": "0.1.0",
  "id": "mistral-cmip6",
  "description": "This is an ESM collection for CMIP6 data accessible on the DKRZ's MISTRAL disk storage system in /work/ik1017/CMIP6/data/CMIP6",
  "catalog_file": "mistral-cmip6.csv.gz",
...
}

So, if the JSON and the CSV are in a GitHub repo, I can open both without needing access to mistral. In earlier versions of intake-esm, this would have failed, since I cannot read /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.csv.gz without access to the system.
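
To make the idea concrete, here is a rough sketch of how a relative `catalog_file` could be resolved against the location of the JSON. This is only an illustration under my own assumptions, not the code intake-esm uses: `resolve_catalog_file` is a hypothetical helper, and the `https://example.org/...` URL is a placeholder.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_catalog_file(json_location: str, catalog_file: str) -> str:
    """Hypothetical helper: locate the CSV named by ``catalog_file``."""
    # Absolute paths and full URLs are used unchanged.
    if urlparse(catalog_file).scheme or posixpath.isabs(catalog_file):
        return catalog_file
    # JSON served over HTTP(S): resolve the CSV against the JSON's URL.
    if urlparse(json_location).scheme in ("http", "https"):
        return urljoin(json_location, catalog_file)
    # Local JSON: look for the CSV in the same directory as the JSON.
    return posixpath.join(posixpath.dirname(json_location), catalog_file)


# On mistral, the relative name resolves next to the JSON file.
print(resolve_catalog_file(
    "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json",
    "mistral-cmip6.csv.gz",
))

# From a web location (placeholder URL), it resolves next to the JSON URL.
print(resolve_catalog_file(
    "https://example.org/catalogs/mistral-cmip6.json",
    "mistral-cmip6.csv.gz",
))
```

With the old absolute `catalog_file`, the helper would return the mistral path in both cases, which is exactly the situation that fails for users outside the system.
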
