Extract regional data from gridded NetCDF files using geometries from a shapefile (Iris, Cartopy, ascend)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting regional values from Met Office Global meteorological data\n",
"\n",
"## Process\n",
"This notebook runs you through how to extract spatial mean values from gridded data using shapefiles. The process includes:\n",
"\n",
"1. Load the Shapefile for the regions we want to subset with.\n",
"2. Determine the full lat-lon extent of the shapefile.\n",
"3. Use lookup table shapefile_attributes.json to determine what geometry attributes we want to use for this shapefile.\n",
"4. Use daterange to generate filenames of NetCDFs we want to load\n",
"5. Load gridded data from NetCDF files into memory using Iris (using [lazy loading](https://scitools.org.uk/iris/docs/latest/userguide/real_and_lazy_data.html)).\n",
"6. Subset the data to only the extent of the shapefile, improving the processing time.\n",
"7. Define the functions to be used in the pipeline.\n",
"8. Loop through all the regions in the shapefile subsetting, collapsing and generating a Pandas DataFrame for each region.\n",
"9. Each DataFrame is saved to CSV in a temporary location.\n",
"10. Collate all the region DataFrames into one DataFrame and save out to CSV\n",
"11. Delete the temporary files.\n",
"\n",
"## Method\n",
"This process uses the polygon of a region (from the shapefile) to subset the gridded data by getting the **latitude-longitude bounding box** of the polygon, as described in this diagram:\n",
"\n",
"<img src=\"images/coarse_spatial_mean_gridded.png\" alt=\"Lat-Lon bounding box using polygon\" style=\"height: 400px;\"/> \n",
"\n",
"Each grid cell (small latitude-longitude box) contains a single value for a meteorological variable. The single value of that variable for the whole region/polygon is the mean of all the grid cell values in the bounding box i.e. lat-lon spatial mean.\n",
"\n",
"For example, here we have air temperature values in a bounding box that covers the a polygon. The temperature value for the region is the mean value of the temperatures in the boundind box; 20.9°C.\n",
"\n",
"<img src=\"images/spatial_mean_example.png\" alt=\"The mean value for the temperature is 20.9°C\" style=\"height: 400px;\"/> \n",
"\n",
"#### Time\n",
"Of course we have ignored the time axis in this example, which is present in the gridded data but is handled for us by the Iris library as just another dimension. In this notebook we use daily data and will simply store the date for each value in the final tabular data.\n",
"\n",
"#### Improvements\n",
"This process could be more accurate by only using the grid cells which actually overlap with the polygon and by weighting the grid cells according to how much of their area is within the polygon. Improvements like these are coming."
]
},
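{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of that improvement using a hypothetical helper, `polygon_weighted_mean`; it is not used in the pipeline in this notebook. It assumes a 2D cube with dimensions ordered (latitude, longitude) and a Shapely polygon in the same lat-lon coordinate system, and it approximates overlap areas in planar degree space; a full implementation would use true spherical cell areas (e.g. via `iris.analysis.cartography.area_weights`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative sketch only: weight each grid cell by how much of it lies inside the polygon\n",
"#Assumes `cube` is 2D with dims ordered (latitude, longitude) and `geometry` is a shapely polygon in lat-lon\n",
"import numpy as np\n",
"import iris.analysis\n",
"from shapely.geometry import box\n",
"\n",
"def polygon_weighted_mean(cube, geometry):\n",
"    '''Mean of a 2D lat-lon cube, weighted by each cell's overlap with geometry.'''\n",
"    x = cube.coord(axis='x')\n",
"    y = cube.coord(axis='y')\n",
"    if not x.has_bounds():\n",
"        x.guess_bounds()\n",
"    if not y.has_bounds():\n",
"        y.guess_bounds()\n",
"    weights = np.zeros(cube.shape)\n",
"    for j, (y0, y1) in enumerate(y.bounds):\n",
"        for i, (x0, x1) in enumerate(x.bounds):\n",
"            #Overlap area in degrees squared: zero for cells outside the polygon,\n",
"            #partial for cells straddling its edge (planar approximation)\n",
"            cell = box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))\n",
"            weights[j, i] = cell.intersection(geometry).area\n",
"    return cube.collapsed([x, y], iris.analysis.MEAN, weights=weights)"
]
},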
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Data\n",
"import iris\n",
"import cartopy.io.shapereader as shpreader\n",
"import pandas as pd\n",
"import cftime\n",
"import datetime\n",
"\n",
"#Plotting\n",
"import cartopy\n",
"import cartopy.crs as ccrs\n",
"import iris.plot as iplt\n",
"import iris.quickplot as qplt\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"#System\n",
"import os\n",
"import sys\n",
"import glob\n",
"import json\n",
"\n",
"#Met Office utils\n",
"import shape_utils as shape\n",
"\n",
"#Supress warnings\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Load shapefile containing region polygons"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"#Define shapefile path\n",
"SHAPEFILE = '../covid-19/it_shapefiles/gadm36_ITA_2.shp'\n",
"# SHAPEFILE = '../covid-19/uk_shapefiles/UK_covid_reporting_regions.shp'\n",
"# SHAPEFILE = '../covid-19/us_shapefiles/US_COUNTY_POP.shx'\n",
"# SHAPEFILE = '../covid-19/ug_shapefiles/gadm36_Uganda_2.shp'\n",
"# SHAPEFILE = '../covid-19/vt_shapefiles/gadm36_Vietnam_2.shp'\n",
"# SHAPEFILE = '../covid-19/br_shapefiles/gadm36_BRA_2.shp'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Get shapefile name\n",
"SHAPEFILE_NAME = SHAPEFILE.split('/')[-1].split('.')[-2]\n",
"SHAPEFILE_NAME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Load the shapefile\n",
"shape_reader = shpreader.Reader(SHAPEFILE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#How many regions are included?\n",
"len([record for record in shape_reader.records()])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Let's take a look at one\n",
"next(shape_reader.records())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(shape_reader.geometries())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Determine the full lat-lon extent of the shapefile\n",
"\n",
"First we need some functions to help us"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_shapefile_extent(shape_reader, buffer=0, **kwargs):\n",
" '''\n",
" Get the extent (x1, x2, y1, y2) of the union of all the polygons\n",
" in a shapefile reader object, with an optional buffer added on.\n",
" \n",
" Arguments:\n",
" shape_reader (cartopy.io.shapereader): Shapefile reader object\n",
" buffer (num): optional buffer to add to the extent of the shapefile\n",
" **kwargs: Keyword arguments for shape_utils.Shape class\n",
" \n",
" Returns:\n",
" extent (tuple): Extent float values in format (x1, x2, y1, y2)\n",
" \n",
" TODO: Split into\n",
" - get_shapefile_shapelist(shape_reader)\n",
" - get_shapelist_union(shapelist, **kwargs)\n",
" - get_shape_extent(shape, buffer=0)\n",
" '''\n",
" #Get all the records from shapefile_reader\n",
" recs = [rec for rec in shape_reader.records()]\n",
" \n",
" #Cycle through all the records in recs,\n",
" #appending Shape object to ShapeList, if shape is valid\n",
" shplist = shape.ShapeList([])\n",
" for rec in recs:\n",
" shp = shape.Shape(rec.geometry, rec.attributes, **kwargs) \n",
" if shp.is_valid:\n",
" shplist.append(shp)\n",
" \n",
" #Get extent of union of all geometries\n",
" wsen = shplist.unary_union().data.bounds\n",
" \n",
" #Rearrange extent from (x1, y1, x2, y2) to (x1, x2, y1, y2), adding buffer\n",
" extent = (wsen[0]-buffer, wsen[2]+buffer, wsen[1]-buffer, wsen[3]+buffer)\n",
" \n",
" return extent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def plot_extent(extent):\n",
" '''\n",
" Use Matplotlib to plot the given extent on a map of the world, \n",
" in Plate Carree projection, with coastlines and country boundaries.\n",
" \n",
" Arguments:\n",
" extent (tuple): Float values for lat-lon coordinates in (x1, x2, y1, y2) format\n",
" \n",
" Returns: \n",
" Displays plot\n",
" '''\n",
" ax = plt.axes(projection=ccrs.PlateCarree())\n",
" ax.coastlines('50m', color='b')\n",
" ax.add_feature(cartopy.feature.BORDERS.with_scale('50m'), linestyle=':')\n",
" ax.set_extent(extent)\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#Get the extent of the shapefile, with a buffer of 1 degree\n",
"SHAPE_EXTENT = get_shapefile_extent(shape_reader, buffer=1)\n",
"SHAPE_EXTENT"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Plot it onto a map of the world to check that it looks how we expect it to\n",
"plot_extent(SHAPE_EXTENT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Use lookup table shapefile_attributes.json to determine what shapefile attributes we want to use later."
]
},
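{
"cell_type": "markdown",
"metadata": {},
"source": [
"The attribute names below are purely illustrative, but they show the structure this notebook assumes `shapefile_attributes.json` to have: a `shapefile_attributes` mapping from shapefile name to a list of attribute names, where the first attribute is the unique region identifier we iterate over."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative example of the assumed structure of shapefile_attributes.json\n",
"#(the attribute names here are hypothetical placeholders)\n",
"example_attributes = {\n",
"    'shapefile_attributes': {\n",
"        'gadm36_ITA_2': ['NAME_2', 'NAME_1', 'NAME_0'],\n",
"    }\n",
"}"
]
},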
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define json path\n",
"json_read = './shapefile_attributes.json'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Load SHAPE_ATTRS from json file as a tuple\n",
"with open(json_read) as file:\n",
" SHAPE_ATTRS = tuple(json.load(file)['shapefile_attributes'][SHAPEFILE_NAME])\n",
" \n",
"SHAPE_ATTRS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#The first attribute in SHAPE_ATTRS will be the shapefile attribute we iterate over in our pipeline\n",
"SHAPE_IDS = tuple(record.attributes[SHAPE_ATTRS[0]] for record in shape_reader.records())\n",
"SHAPE_IDS[0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Use daterange to generate filenames of NetCDFs we want to load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define date range\n",
"start = datetime.date(2020, 3, 1)\n",
"stop = datetime.date(2020, 3, 11)\n",
"step = datetime.timedelta(days=1)\n",
"DATERANGE = pd.date_range(start, stop, freq=step)\n",
"DATERANGE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"filename_fmt = f'*_%Y%m%d.nc'\n",
"FILENAMES = list(DATERANGE.strftime(filename_fmt))\n",
"print(FILENAMES[0])\n",
"print(FILENAMES[-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Load gridded data from NetCDF files into memory using Iris (using [lazy loading](https://scitools.org.uk/iris/docs/latest/userguide/real_and_lazy_data.html))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The files for each variable are contained in a separate folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#List all the filepaths and store in a dict with each variable as a key\n",
"folder = '/data/misc/covid-19/data_nc_daily/'\n",
"filepaths = {}\n",
"for path in os.listdir(folder):\n",
" filepaths[path] = []\n",
" for filename in FILENAMES:\n",
" filepaths[path].extend(glob.glob(os.path.join(folder, path, filename)))\n",
"variables = list(filepaths.keys())\n",
"\n",
"print(variables)\n",
"print(f'Number of files for each variable: {len(filepaths[variables[0]])}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#Run through all the variables and append the loaded cubes to a CubeList\n",
"cubes = iris.cube.CubeList([])\n",
"\n",
"for var in variables:\n",
" cubes.extend(iris.load(filepaths[var]))\n",
" \n",
"print(cubes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Subset global data based on the extent of the shapefile"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define CoordExtent objects for x and y axes using SHAPE_EXTENT\n",
"x_axis = cubes[0].coord(axis='x')\n",
"y_axis = cubes[0].coord(axis='y')\n",
"\n",
"x_extent = iris.coords.CoordExtent(x_axis, SHAPE_EXTENT[0], SHAPE_EXTENT[1])\n",
"y_extent = iris.coords.CoordExtent(y_axis, SHAPE_EXTENT[2], SHAPE_EXTENT[3])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Subset cubes\n",
"shape_cubes = iris.cube.CubeList([cube.intersection(x_extent, y_extent) for cube in cubes])\n",
"print(shape_cubes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Plot the first time step of the first cube in cubes\n",
"#To check that we've subsetted correctly\n",
"qplt.contourf(shape_cubes[0][0], cmap='Purples')\n",
"plt.gca().coastlines('50m', color='blue')\n",
"plt.gca().add_feature(cartopy.feature.BORDERS.with_scale('50m'), linestyle=':')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Extract the coordinate reference system from one of the cubes. We will use this later.\n",
"CRS = shape_cubes[0].coord_system()\n",
"CRS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Define the functions to be used in the pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shapefile functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_shape_record(target, shape_reader=shape_reader, attribute=SHAPE_ATTRS[0]):\n",
" '''\n",
" Get a record from the shape_reader with a target attribute.\n",
" \n",
" '''\n",
" result = None\n",
" for record in shape_reader.records():\n",
" shape_id = record.attributes[attribute]\n",
" if shape_id == target:\n",
" result = record\n",
" break\n",
" if result is None:\n",
" emsg = f'Could not find record with {attribute} = \"{target}\".'\n",
" raise ValueError(emsg)\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create a random ID generator\n",
"from random import randint\n",
"def rand_id(ids=SHAPE_IDS):\n",
" '''\n",
" Return a random id\n",
" Useful for testing\n",
" '''\n",
" rand_i = randint(0, len(ids)-1)\n",
" return ids[rand_i]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Get a random geometry to check it's all working as expected\n",
"i = rand_id()\n",
"print(i)\n",
"get_shape_record(i).geometry"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gridded data functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_cell_method(cube, coord='time'):\n",
" '''\n",
" Get the cell method with coord in coord_names\n",
" '''\n",
" result = None\n",
" for method in cube.cell_methods:\n",
" if coord in method.coord_names:\n",
" result = method.method\n",
" break\n",
" \n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def parse_data_name(cube):\n",
" '''\n",
" Parse the name, cell methods and units in a cube to return a column name\n",
" To be used in a Pandas DataFrame\n",
" '''\n",
" name = cube.name()\n",
" time_method = get_cell_method(cube, 'time').replace('imum', '')\n",
" space_method = get_cell_method(cube, 'longitude')\n",
" units = cube.units\n",
" \n",
" if name == 'm01s01i202':\n",
" name = 'short_wave_radiation'\n",
" if space_method.startswith('var'):\n",
" units = 'W2 m-4'\n",
" else:\n",
" units = 'W m-2'\n",
" \n",
" if space_method:\n",
" result = f'{name}_{time_method}_{space_method.replace(\"imum\", \"\")} ({units})'\n",
" else:\n",
" result = f'{name}_{time_method} ({units})'\n",
" \n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_date(dt):\n",
" '''\n",
" Return date from datetime-like object dt\n",
" '''\n",
" return datetime.date(dt.year, dt.month, dt.day)"
]
},
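{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustrative check (not part of the pipeline), `get_date` works for both standard `datetime` objects and the `cftime` datetimes that Iris uses for time coordinates, since both expose `year`, `month` and `day` attributes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative only: both calls should return datetime.date(2020, 3, 1)\n",
"print(get_date(datetime.datetime(2020, 3, 1, 12)))\n",
"print(get_date(cftime.DatetimeGregorian(2020, 3, 1, 12)))"
]
},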
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_column_order(start, end):\n",
" '''\n",
" \n",
" '''\n",
" starts = tuple(start)\n",
" \n",
" ends = tuple(sorted([c for c in end if c not in starts]))\n",
" \n",
" return starts+ends"
]
},
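{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustrative call (with made-up column names) to make the ordering behaviour concrete: the `start` columns keep their position and everything else is sorted alphabetically after them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative only: expected result is ('shapefile', 'date', 'precip (mm)', 't_max (K)')\n",
"get_column_order(['shapefile', 'date'], end=['t_max (K)', 'date', 'precip (mm)', 'shapefile'])"
]
},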
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def extract_collapse_df(shape_id, cubes=shape_cubes, **kwargs):\n",
" '''\n",
" Extract subcubes from cubes using geometry of shape_id\n",
" Collapse the cube acros x and y coords to get the MEAN and VARIANCE\n",
" Collect data in a dataframe for this shape_id\n",
" \n",
" Extract method:\n",
" Extracts XY bounding box around geometry\n",
" Refer to [Method](#Method) at top of this notebook for detailed description\n",
" \n",
" Arguments:\n",
" shape_id (str): ID of geometry used for subsetting\n",
" cubes (iris.CubeList): List of Iris cubes to be subsetted\n",
" \n",
" Returns: \n",
" df (pandas.DataFrame): DataFrame containing shape_id attributes \n",
" and MEAN+VARIANCE of data in cubes for shape_id geometry\n",
" '''\n",
" #Create a Shape object from the record for shape_id\n",
" region = get_shape_record(shape_id, **kwargs)\n",
" shp = shape.Shape(region.geometry, region.attributes, coord_system=CRS)\n",
" \n",
" #Extract sub_cubes from cubes using shp\n",
" sub_cubes = shp.extract_subcubes(cubes)\n",
" \n",
" #Collapse cubes across x and y coords with to get mean and variance\n",
" mean_cubes = [cube.collapsed([cube.coord(axis='x'),cube.coord(axis='y')], iris.analysis.MEAN) for cube in sub_cubes]\n",
" var_cubes = [cube.collapsed([cube.coord(axis='x'),cube.coord(axis='y')], iris.analysis.VARIANCE) for cube in sub_cubes]\n",
" \n",
" #Line up data and column names for Pandas DataFrame\n",
" time = mean_cubes[0].coord('time')\n",
" length = len(time.points)\n",
" data = {'shapefile': [SHAPEFILE_NAME]*length}\n",
" data.update({name: [region.attributes[name]]*length for name in SHAPE_ATTRS})\n",
" data.update({'date': [get_date(cell.point) for cell in time.cells()]})\n",
" data.update({parse_data_name(cube): cube.data for cube in mean_cubes})\n",
" data.update({parse_data_name(cube): cube.data for cube in var_cubes})\n",
" \n",
" #Get a column order so that all dataframes have the same column order\n",
" column_order = get_column_order(['shapefile']+list(SHAPE_ATTRS)+['date'], end=data.keys())\n",
" \n",
" #Create DataFrame\n",
" df = pd.DataFrame(data, columns=column_order)\n",
"\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cblst = shape_cubes\n",
"extract_collapse_df(rand_id(), cblst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Loop through all the regions in the shapefile; subsetting, collapsing, and generating a Pandas DataFrame<br>9. Each DataFrame is written to CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Specify the folder and part of the filename for the csv files to be written to\n",
"CSV_FOLDER = f\"/data/share/kaedonkers/{SHAPEFILE_NAME.replace('_', '')}_region_csvs\"\n",
"CSV_NAME = 'metoffice_global_daily'\n",
"\n",
"print(CSV_FOLDER)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create CSV_FOLDER if it doesn not exist\n",
"if not os.path.isdir(CSV_FOLDER):\n",
" os.mkdir(CSV_FOLDER)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the csvs already written in CSV_FOLDER\n",
"CSVS_WRITTEN = glob.glob(os.path.join(CSV_FOLDER, '*.csv'))\n",
"len(CSVS_WRITTEN)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_csv_name(folder, shapefile, shape_id, data_name, start_dt, end_dt, separator='_', dtfmt='%Y%m%d'):\n",
" '''\n",
" \n",
" '''\n",
" extension = '.csv'\n",
" \n",
" #Cut out underscores from shapefile name to reduce overall length\n",
" shapefile = shapefile.replace('_', '')\n",
" \n",
" #Format the daterange that be at the end of the filename\n",
" dt_range = f\"{start_dt.strftime(dtfmt)}-{end_dt.strftime(dtfmt)}\"\n",
" \n",
" #Join all the filename parts with separator\n",
" if shape_id:\n",
" shape_id = shape_id.replace('_', '-').replace('.', '-')\n",
" filename = separator.join([shapefile, shape_id, data_name, dt_range])\n",
" else:\n",
" filename = separator.join([shapefile, data_name, dt_range])\n",
" \n",
" #Join filename with folder\n",
" filepath = os.path.join(folder, f\"{filename}{extension}\")\n",
" \n",
" return filepath"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Test get_csv_name\n",
"get_csv_name(CSV_FOLDER, SHAPEFILE_NAME, rand_id(), CSV_NAME, DATERANGE[0], DATERANGE[-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loop through all shape geometries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#This will loop through all the region IDs, executing extract_collapse_df for each region and saving it to a CSV file\n",
"#Any errors will be caught and printed, but the loop will continue onto the next ID\n",
"#Files are writen to CSV_FOLDER\n",
"\n",
"start = len(CSVS_WRITTEN)\n",
"# stop = len(region_ids)\n",
"stop = 3\n",
"for shape_id in SHAPE_IDS[start:stop]:\n",
" try:\n",
" df = extract_collapse_df(shape_id, shape_cubes)\n",
" fname = get_csv_name(CSV_FOLDER, SHAPEFILE_NAME, shape_id, CSV_NAME, DATERANGE[0], DATERANGE[-1])\n",
" df.to_csv(fname, index=False)\n",
" now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n",
" percent = f'{100*(n+start)/stop:.1f}%'\n",
" print(f' {percent} {now} [{shape_id}] {fname}: Success')\n",
" except Exception as e:\n",
" print(f'x {percent} {now} [{shape_id}] {fname}: Error \\n x {e}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Load all the region CSVs, collate into one large DataFrame and save out to CSV."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the csvs written in CSV_FOLDER\n",
"CSVS_READ = glob.glob(os.path.join(CSV_FOLDER, '*.csv'))\n",
"len(CSVS_READ)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"#Now load all the CSVs for each region and combine into one large dataframe\n",
"df = pd.concat([pd.read_csv(csv) for csv in CSVS_READ], ignore_index=True)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Write the collated csv to the parent folder of the regional csvs and generate a filepath\n",
"CSV_COLLATE_FOLDER = '/'.join(CSV_FOLDER.split('/')[0:-1])\n",
"print(CSV_FOLDER)\n",
"print(CSV_COLLATE_FOLDER)\n",
"\n",
"CSV_COLLATE_FILEPATH = get_csv_name(CSV_COLLATE_FOLDER, SHAPEFILE_NAME, None, CSV_NAME, DATERANGE[0], DATERANGE[-1])\n",
"print(CSV_COLLATE_FILEPATH)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#And save to a CSV\n",
"df.to_csv(CSV_COLLATE_FILEPATH, index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#We can read it back in to check that it wrote correctly\n",
"pd.read_csv(CSV_COLLATE_FILEPATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Delete temporary files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the CSV_FOLDER\n",
"!ls -alh {CSV_FOLDER}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Remove CSV_FOLDER\n",
"!rm -rd {CSV_FOLDER}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List CSV_FOLDER again, to check everything is gone\n",
"!ls -alh {CSV_FOLDER}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}