@ccarouge
Created May 19, 2020 01:55
Convert CSV to NetCDF
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data wrangling: CSV to NetCDF\n",
"A user came to us with issues handling some model outputs. The model writes CSV and, as it is a land model, there are outputs only at the land points. The data is organised so that, for each point, each successive year's data is in a new row and each month's data is in that month's column (Jan, Feb, etc.) of the row. See the output below, after reading in the data, to make this clearer.\n",
"\n",
"She wanted to store the data in a (lat, lon, time) NetCDF file. She would like the longitude and latitude to cover both land and ocean (it's a 0.5° regular grid), and we also need to figure out how to combine the year and month information into a single time coordinate.\n",
"\n",
"Xarray does not read CSV, so we will need to read the file with Pandas and then figure out how to convert it to a DataArray and write it to NetCDF."
]
},
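{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that layout concrete, here is a hypothetical miniature of the structure (made-up values, only two month columns shown): one row per (point, year), one column per month."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Hypothetical miniature of the model output layout (made-up values):\n",
"# one row per (land point, year), one column per month.\n",
"demo = pd.DataFrame({'Lon': [39.75, 39.75], 'Lat': [-1.25, -1.25],\n",
"                     'Year': [1901, 1902],\n",
"                     'Jan': [0.1, 0.2], 'Feb': [0.3, 0.4]})\n",
"demo"
]
},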
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get the data and dimensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import xarray as xr"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fname = \"/g/data/w35/lt0205/research/lpj_guess/runs/CRUNCEP/mgpp.out\"\n",
"df = pd.read_csv(fname, header=0, delim_whitespace=True)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The month columns come after Lon, Lat and Year\n",
"months = list(df.columns)[3:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lons = np.unique(df.Lon)\n",
"lats = np.unique(df.Lat)\n",
"years = np.unique(df.Year)\n",
"nyears = len(years)\n",
"nrows = len(lats)\n",
"ncols = len(lons)\n",
"nmonths = 12\n",
"lons.sort()\n",
"lats.sort()\n",
"years.sort()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Original code\n",
"The person requesting help sent us the code she wrote to try to solve the problem. We reproduce one of her attempts below to discuss its good and bad points."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Original code:\n",
"out = np.zeros((ncols, nrows, nyears*nmonths))\n",
"for i,lon in enumerate(lons):\n",
" print(i, ncols)\n",
" for j,lat in enumerate(lats):\n",
" #print(j)\n",
" vals = df[(df.Lat == lat) & (df.Lon == lon)].values[:,3:]\n",
" if len(vals)> 0:\n",
" print(\"Reshape\")\n",
" vals = vals.reshape(nyears*nmonths)\n",
" out[i,j,:] = vals\n",
"\n",
"t1 = pd.to_datetime('1/1/1901')\n",
"time = t1 + pd.to_timedelta(np.arange(nmonths*nyears), 'M')\n",
"\n",
"ds = xr.Dataset(data_vars={\"mgpp\":([\"y\", \"x\", \"time\"],out)},\n",
" coords={\"lat\": ([\"y\"], lats),\n",
" \"lon\": ([\"x\"], lons),\n",
" \"time\": time})\n",
"ds.to_netcdf('test.nc')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The good idea in this code is to loop only over the spatial points, not over the times as well. The bad idea is that the original dataset only contains the land points, so by looping over all (lon, lat) pairs we loop over many points that have no data in the DataFrame.\n",
"\n",
"The original DataFrame has 59190 spatial points (length of the frame / number of years), while the loop goes through 194878 (lon, lat) pairs, i.e. about 3 times more than needed. \n",
"\n",
"So let's estimate the time it needs to run. We'll time the loop for 1 iteration and then multiply by the number of iterations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit\n",
"# Original code:\n",
"out = np.zeros((ncols, nrows, nyears*nmonths))\n",
"for i,lon in enumerate(lons[0:1]):\n",
" #print(i, ncols)\n",
" for j,lat in enumerate(lats[0:1]):\n",
" #print(j)\n",
" vals = df[(df.Lat == lat) & (df.Lon == lon)].values[:,3:]\n",
" if len(vals)> 0:\n",
" print(\"Reshape\")\n",
" vals = vals.reshape(nyears*nmonths)\n",
" out[i,j,:] = vals"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time for total loop in minutes:\n",
"(23.8 * nrows*ncols)/1000/60"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we need something that runs faster than 77 min."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First solution: a loop\n",
"Instead of looping through all the spatial points of the output array, we can loop through the rows of the DataFrame and find the correct place in the output array to put this data.\n",
"\n",
"First, let's create the output DataArray so we can easily find the location of points using longitude and latitude. Instead of using only the land points, we'll create the array for the full grid directly and fill it with NaN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create the axes\n",
"time = pd.date_range(start=f'01/{years[0]}',\n",
" end =f'01/{years[-1]+1}', freq='M')\n",
"# We'll use a generic way to create a regular grid from [-180,180] and\n",
"# [-90, 90] when knowing the resolution. Feel free to reuse as needed.\n",
"dx = 0.5\n",
"Lon = xr.DataArray(np.arange(-180.+dx/2., 180., dx), dims=(\"Lon\"),\n",
" attrs={\"long_name\":\"longitude\", \"units\":\"degrees_east\"})\n",
"nlon = Lon.size\n",
"dy = 0.5\n",
"Lat = xr.DataArray(np.arange(-90.+dy/2., 90., dy), dims=(\"Lat\"),\n",
" attrs={\"long_name\":\"latitude\", \"units\":\"degrees_north\"})\n",
"nlat = Lat.size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"out = xr.DataArray(np.zeros((nlat, nlon, nyears*nmonths)),\n",
" dims=(\"Lat\",\"Lon\",\"Time\"),\n",
" coords=({\"Lat\":Lat, \"Lon\":Lon, \"Time\":time}))\n",
"out[:] = np.nan\n",
"out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, an example with 1 row to see what's happening.\n",
"In that case, we would loop over each row and find the location of its data in the DataArray. That would mean a loop over 6.8M items!\n",
"\n",
"We use the longitude, latitude and year information contained in each row to find the locations to update in the `out` array using `loc`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit\n",
"row = next(df.iterrows())[1]\n",
"out.loc[dict( \n",
" Lon=out.Lon[(out.Lon==row[\"Lon\"])],\n",
" Lat=out.Lat[(out.Lat==row[\"Lat\"])],\n",
" Time=out.Time[(out.Time.dt.year==row[\"Year\"])])] = row[3:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Just a check\n",
"print(out.sel(Lon=39.75, Lat=-1.25, Time=slice(\"1901\", \"1902\")))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see only the data for the year 1901 has been updated, and only at this point, in `out`.\n",
"\n",
"But this is way too slow: 308 ms times the number of rows (6.8M).\n",
"\n",
"To improve on that, remember that all the years for a given point are stored together, in order. So we can read nyears rows at a time and get the data for all years and months of a point at once.\n",
"\n",
"One problem remains: if we select all years for a point at once, we get a 2D DataFrame with years in rows and months in columns. To solve this, we can simply stack the DataFrame over the month columns. Since Pandas doesn't scramble rows when stacking, we can stack the original DataFrame once and then use the fact that all points have data for the same number of years and months. If some points had missing years, we could modify the following code to allow for it, but it would be slower."
]
},
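{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (synthetic values), `stack` turns the month columns into extra rows while preserving row order, so each point's values come out in (year, month) order:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Synthetic frame: 2 years x 2 months for a single point.\n",
"demo = pd.DataFrame({'Jan': [1.0, 3.0], 'Feb': [2.0, 4.0]})\n",
"# Rows are kept in order, so the stacked values come out as\n",
"# Jan year 1, Feb year 1, Jan year 2, Feb year 2.\n",
"demo.stack().values"
]
},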
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We stack the months columns of the whole DataFrame at once since we\n",
"# don't have missing years for any point. Doing it once will save time.\n",
"df_stack = df[months].stack()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit\n",
"#Example for one spatial point\n",
"rows = df[0:nyears]\n",
"# If we had missing years, we could add the missing years rows and then\n",
"# stack only the rows for the point here.\n",
"#rows_stack = rows[months].stack()\n",
"out.loc[dict( \n",
" Lon=out.Lon[(out.Lon==rows[\"Lon\"].min())],\n",
" Lat=out.Lat[(out.Lat==rows[\"Lat\"].min())])] = df_stack[0:nyears*nmonths]\n",
"out.sel(Lon=rows[\"Lon\"].min(),Lat=rows[\"Lat\"].min())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If each block of nyears rows takes 5.37 ms, then the whole DataFrame needs about 5.3 min (see below), which isn't too bad."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(df.index)//nyears*5.37/1000/60"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#df_stack = df[months].stack()\n",
"for nr in range(0,len(df.index),nyears):\n",
" print(nr)\n",
" rows = df[nr:nr+nyears]\n",
" thislon = rows[\"Lon\"].min()\n",
" thislat = rows[\"Lat\"].min()\n",
" out.loc[dict( \n",
" Lon=out.Lon[(out.Lon==thislon)],\n",
" Lat=out.Lat[(out.Lat==thislat)])] = df_stack[nr*nmonths:(nr+nyears)*nmonths]\n",
"out "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And a plot to check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"out.sel(Time=\"2015-01\").plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Another solution: groupby\n",
"In the loop above, we handle all the data for one spatial point at once. That is typically what groupby does if applied to Lon and Lat. Once the data is grouped per spatial point, we need to apply a function. This function should return the stacked data for that point (i.e. the timeseries) with the timestamp as an index. This way, the overall DataFrame resulting from the groupby will have Lon, Lat and Time as indexes and 1 column of data: the timeseries of each point, stacked one after the other. Let's see what it looks like:"
]
},
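{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a small sketch of the groupby + apply pattern on synthetic data (hypothetical names and values): each group is one point, and the applied function returns its stacked timeseries indexed by date."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Synthetic data: 2 points x 2 years x 2 months, made-up values.\n",
"demo = pd.DataFrame({'Lon': [0.25, 0.25, 0.75, 0.75],\n",
"                     'Lat': [0.25, 0.25, 0.25, 0.25],\n",
"                     'Year': [1901, 1902, 1901, 1902],\n",
"                     'Jan': [1.0, 3.0, 5.0, 7.0],\n",
"                     'Feb': [2.0, 4.0, 6.0, 8.0]})\n",
"demo_time = pd.to_datetime(['1901-01-31', '1901-02-28',\n",
"                            '1902-01-31', '1902-02-28'])\n",
"def to_series(group):\n",
"    # Stack the month columns of this point into one timeseries.\n",
"    return pd.DataFrame(group[['Jan', 'Feb']].stack().values, index=demo_time)\n",
"grouped = demo.groupby(['Lon', 'Lat']).apply(to_series)\n",
"# grouped has a (Lon, Lat, date) MultiIndex and one data column.\n",
"grouped"
]
},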
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We reuse the time array we used for defining the out DataArray\n",
"# time = pd.date_range(start=f'01/{years[0]}',\n",
"# end =f'01/{years[-1]+1}', freq='M')\n",
"\n",
"def time_to_date(df):\n",
" df_stack = df[months].stack()\n",
" return(pd.DataFrame(df_stack.values, index=time))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"df2 = df.groupby([\"Lon\",\"Lat\"]).apply(time_to_date)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2.index.names = [\"Lon\", \"Lat\", \"Time\"]\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see the groupby calculation is a lot faster, but we only get a DataFrame as output, not a DataArray, so we still need to convert this DataFrame to a DataArray. Unfortunately, the conversion below crashes the Jupyter kernel, which at least tells us it isn't a fast operation. It seems using `groupby` will not save any time compared to the loop."
]
},
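{
"cell_type": "markdown",
"metadata": {},
"source": [
"On small data the conversion itself is straightforward; a sketch with a synthetic MultiIndex frame (made-up values) shows that `to_xarray` expands the (Lon, Lat, Time) index into a dense cube, which is what makes it expensive on the full grid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Synthetic (Lon, Lat, Time) MultiIndex with made-up values.\n",
"idx = pd.MultiIndex.from_product(\n",
"    [[0.25, 0.75], [0.25], pd.to_datetime(['1901-01-31', '1901-02-28'])],\n",
"    names=['Lon', 'Lat', 'Time'])\n",
"small = pd.DataFrame({'mgpp': [1.0, 2.0, 3.0, 4.0]}, index=idx)\n",
"# Each index level becomes a dimension of the resulting Dataset.\n",
"small.to_xarray()"
]
},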
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"out = df2.to_xarray()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"Although the code with `groupby` is a lot more compact, it might not be as intuitive to write at first. Also, the `groupby` solution might seem faster, but that advantage is lost in the subsequent conversion to a DataArray.\n",
"\n",
"We see loops can sometimes be perfectly fine in Python; it's simply a question of honing the logic so we loop through as few items as necessary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:analysis3-20.01] *",
"language": "python",
"name": "conda-env-analysis3-20.01-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}