@loleg
Created November 6, 2017 13:50
Generate Data Package
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate Data Package\n",
"\n",
"A simple script used to collect CSV files from subfolders and auto-create Data Packages in each of them. For more information see [Frictionless Data](https://frictionlessdata.io), [datapackage-py](https://github.com/frictionlessdata/datapackage-py) and [tableschema-py](https://github.com/frictionlessdata/tableschema-py).\n",
"\n",
"Install dependencies using: \n",
"\n",
"`pip install datapackage tableschema`\n",
"\n",
"or, with Anaconda: \n",
"\n",
"`conda install datapackage-py tableschema-py`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import io, os, csv, glob\n",
"import datapackage\n",
"from tableschema import infer\n",
"from tableschema.exceptions import SchemaValidationError"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we define some defaults; change these to suit your needs:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Each subfolder in this folder will be considered a \"Data Provider\"\n",
"datadir = 'data'\n",
"\n",
"# A prefix for the Data Package name\n",
"NAME_PREFIX = 'opentourism-'\n",
"\n",
"# Default description string\n",
"DEFAULT_DESCRIPTION = \"\"\"This [Data Package](http://frictionlessdata.io) was prepared at a \n",
"[School of Data](http://schoolofdata.ch) workshop in preparation\n",
"for the [Open Tourism Hackdays](http://tourism.opendata.ch), \n",
"with volunteer contributions.\n",
"\n",
"> This material is currently sourced from third-parties whose data \n",
"publishing rights and licensing policies are unclear. If you intend\n",
"to use these data in a public or commercial product, please make sure\n",
"to contact the sources for any specific restrictions before republishing.\n",
"\n",
"The source data has been obtained with permission to distribute at \n",
"the event only, and republished here in view of extending the Terms \n",
"of Use of provider content to include licenses for open data. \n",
"The ODC-PDDL license may apply only to this metadata descriptor.\n",
"\n",
"You are very much encouraged to try to formulate ideas and create \n",
"prototypes based on this dataset, in order to create use cases that \n",
"may lead to opening this data further. If you have any questions or \n",
"concerns, please talk to the data providers or organisers.\n",
"\"\"\"\n",
"\n",
"# A default license for the data package\n",
"DEFAULT_LICENSES = [{\n",
" \"name\": \"ODC-PDDL-1.0\",\n",
" \"path\": \"http://opendatacommons.org/licenses/pddl/\",\n",
" \"title\": \"Open Data Commons Public Domain Dedication and License v1.0\"\n",
"}]\n",
"\n",
"# If you just want to quickly see the list of files, set this to True\n",
"DEBUG_FILELIST = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a list holding the names of the subfolders (except for hidden folders or those with an underscore _ in their name):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Arosa Bergbahnen', 'AirBnB', 'National Statistics', 'Arosa Skischule', 'Arosa Hotels', 'TOMAS', 'ParkingTec', 'Arosa Tourismus', 'OpenBooking', 'India E-Tourism', 'Destination Arosa', 'Swisscom', 'Arosa Energie', 'Arosa Lenzerheide', 'Guidle']\n"
]
}
],
"source": [
"# Generate the list of providers from the immediate subfolders of datadir,\n",
"# skipping hidden folders and those with an underscore in their name\n",
"PROVIDERS = []\n",
"for subdir in next(os.walk(datadir))[1]:\n",
"    if not subdir.startswith('.') and '_' not in subdir:\n",
"        PROVIDERS.append(subdir)\n",
"\n",
"print(PROVIDERS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we iterate through this list, creating a Data Package object for each folder using the defaults above."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"** No data package created for Arosa Bergbahnen\n",
"Reading: AirBnB - CH Cantons Overview - Quaterly - September 2017.csv (3803 bytes)\n",
"Reading: AirBnB - CH Cantons Overview - Snapshot - September 2017.csv (1459 bytes)\n",
"Generating: data/AirBnB/datapackage.json\n",
"Reading: 2017 Januar bis Juli - Logiernaechte pro Jahr und Gemeinde.csv (80999 bytes)\n",
"Generating: data/National Statistics/datapackage.json\n",
"Reading: teilnehmer anonymisieren.csv (1352337 bytes)\n",
"Reading: gruppenart.csv (428 bytes)\n",
"Reading: kurstyp.csv (29292 bytes)\n",
"Generating: data/Arosa Skischule/datapackage.json\n",
"Reading: Logiernächte per 16. August 2017 Hotel 1.csv (17480 bytes)\n",
"Reading: Umsätze Hotel Hotel 2.csv (549673 bytes)\n",
"Reading: Bettenbelegung Hotel 3.csv (522 bytes)\n",
"Generating: data/Arosa Hotels/datapackage.json\n",
"** No data package created for TOMAS\n",
"Reading: Movements.csv (2414930 bytes)\n",
"Generating: data/ParkingTec/datapackage.json\n",
"Reading: Gästebefragung Sommer 2017 Rohdaten.csv (24757 bytes)\n",
"Generating: data/Arosa Tourismus/datapackage.json\n",
"Reading: [opendata.swiss] hotels - hotels.csv (44301 bytes)\n",
"Reading: [opendata.swiss] apartments - apartments.csv (735390 bytes)\n",
"Reading: [opendata.swiss] simple queries 2015 - queries.csv (1309040 bytes)\n",
"Generating: data/OpenBooking/datapackage.json\n",
"Reading: VisaDataSet.csv (48823 bytes)\n",
"Generating: data/India E-Tourism/datapackage.json\n",
"Reading: Tourist-Info Anfragen 2017 - Sommer.csv (55883 bytes)\n",
"Reading: Tourist-Info Anfragen 2016 17 - Winter.csv (54701 bytes)\n",
"Generating: data/Destination Arosa/datapackage.json\n",
"Reading: AT_Categories(2017).csv (1701 bytes)\n",
"Reading: AT_Results(2017).csv (26540 bytes)\n",
"Reading: AT_Athletes.csv (36982 bytes)\n",
"Reading: AT_Results(History).csv (28473 bytes)\n",
"Reading: SMR_Results(2017).csv (10896 bytes)\n",
"Reading: SMR_Categories(2017).csv (1053 bytes)\n",
"Reading: SMR_Results(History).csv (24532 bytes)\n",
"Reading: SMR_Athletes.csv (36593 bytes)\n",
"Reading: trips-der-reisenden-zwischen-schweizer-kantonen0.csv (1245196 bytes)\n",
"Reading: swisscom-hotspot.csv (1463128 bytes)\n",
"Reading: reisen-nach-arosa.csv (6023 bytes)\n",
"Generating: data/Swisscom/datapackage.json\n",
"Reading: 2016-2017 Stromtankstelle Transaktionen.csv (39676 bytes)\n",
"Generating: data/Arosa Energie/datapackage.json\n",
"Reading: Anreisedatum Hotels.csv (70445 bytes)\n",
"Reading: Buchungsdatum Ferienwohnungen 01.01.2016-01.01.2017.csv (15616 bytes)\n",
"Reading: Übersicht Ferienwohnungen & Hotels.csv (77704 bytes)\n",
"Reading: Belegungsübersicht.csv (34686 bytes)\n",
"Reading: Anreisedatum Ferienwohnungen 01.01.2014 - 01.01.2017.csv (45331 bytes)\n",
"Reading: Buchungen nach Objekt.csv (125544 bytes)\n",
"Reading: Ankunftsstatistik Durchschn. Aufenthalte_20170825_1700.csv (728 bytes)\n",
"Generating: data/Arosa Lenzerheide/datapackage.json\n",
"** No data package created for Guidle\n"
]
}
],
"source": [
"# Collect every CSV file below datadir once, then assign files to providers\n",
"fullglob = glob.glob(os.path.join(datadir, '**', '*.csv'), recursive=True)\n",
"for prov in PROVIDERS:\n",
"\n",
"    dp = datapackage.DataPackage()\n",
"    dp.descriptor['title'] = prov\n",
"    dp.descriptor['name'] = NAME_PREFIX + prov.lower().replace(' ', '-')\n",
"    dp.descriptor['description'] = DEFAULT_DESCRIPTION\n",
"    dp.descriptor['licenses'] = DEFAULT_LICENSES\n",
"    dp.descriptor['resources'] = []\n",
"\n",
"    pathspec = os.path.join(datadir, prov)\n",
"    for filepath in fullglob:\n",
"        # Match on the folder plus separator, so that a provider whose name\n",
"        # is a prefix of another provider's name does not pick up its files\n",
"        if not filepath.startswith(pathspec + os.sep): continue\n",
"        filesize = os.path.getsize(filepath)\n",
"        basepath, filename = os.path.split(filepath)\n",
"        # Path of the containing folder relative to the provider folder\n",
"        basepath = '.' + basepath[len(pathspec):]\n",
"\n",
"        if DEBUG_FILELIST:\n",
"            print(filepath)\n",
"            print('File: %s' % filename)\n",
"            print('Basepath: %s' % basepath)\n",
"            continue\n",
"\n",
"        print(\"Reading: %s (%d bytes)\" % (filename, filesize))\n",
"        try:\n",
"            # Infer a Table Schema from the CSV contents\n",
"            descriptor = infer(filepath)\n",
"            fn = os.path.splitext(filename)[0]\n",
"            dp.descriptor['resources'].append(\n",
"                {\n",
"                    'name': fn.lower().replace(' ', '-'),\n",
"                    'title': fn,\n",
"                    'format': 'csv',\n",
"                    'mediatype': 'text/csv',\n",
"                    'encoding': 'utf-8',\n",
"                    'path': os.path.join(basepath, filename),\n",
"                    'bytes': filesize,\n",
"                    'schema': descriptor\n",
"                }\n",
"            )\n",
"        except SchemaValidationError as ex:\n",
"            print('** Could not validate %s' % ex)\n",
"            continue\n",
"\n",
"    # Skip providers without any usable CSV resources\n",
"    if len(dp.descriptor['resources']) == 0:\n",
"        print('** No data package created for %s' % prov)\n",
"        continue\n",
"    filename = os.path.join(pathspec, 'datapackage.json')\n",
"    print('Generating: %s' % filename)\n",
"    with open(filename, 'w', encoding='utf-8') as f:\n",
"        f.write(dp.to_json())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
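
The folder filtering, package naming and file-to-provider matching used in the notebook can be tried in isolation with the standard library alone. The following is a minimal sketch, not part of the original notebook: the helper names `list_providers`, `package_name` and `belongs_to` are hypothetical, and the demo folders are created in a temporary directory purely for illustration.

```python
import os
import tempfile


def list_providers(datadir):
    """Return the immediate subfolder names of datadir, sorted,
    skipping hidden folders and those with an underscore in the name."""
    with os.scandir(datadir) as entries:
        return sorted(
            entry.name for entry in entries
            if entry.is_dir()
            and not entry.name.startswith('.')
            and '_' not in entry.name
        )


def package_name(prefix, title):
    """Build a Data Package name: prefix + lowercased title with dashes."""
    return prefix + title.lower().replace(' ', '-')


def belongs_to(filepath, provider_dir):
    """True only if filepath lies inside provider_dir.

    Joining with '' appends a trailing separator, so a sibling folder
    that merely shares a name prefix does not match.
    """
    return filepath.startswith(os.path.join(provider_dir, ''))


if __name__ == '__main__':
    # Hypothetical folder layout, created only for this demonstration
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('Arosa', 'Arosa Hotels', '_drafts', '.git'):
            os.mkdir(os.path.join(tmp, name))
        print(list_providers(tmp))
    print(package_name('opentourism-', 'Arosa Hotels'))
    csv_path = os.path.join('data', 'Arosa Hotels', 'rooms.csv')
    print(belongs_to(csv_path, os.path.join('data', 'Arosa Hotels')))
    print(belongs_to(csv_path, os.path.join('data', 'Arosa')))
```

Matching on the folder path plus a trailing separator matters whenever one provider name is a prefix of another; a plain `startswith` on the folder path would assign the same file to both.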