@OriHoch
Last active July 15, 2018 06:02
downloading committee protocol parts
pip3 install -U datapackage-pipelines
committee-protocol-parts:
  pipeline:
  # Load the committee sessions table from the published datapackage.
  - run: load_resource
    cache: true
    parameters:
      url: https://storage.googleapis.com/knesset-data-pipelines/data/committees/kns_committeesession/datapackage.json
      resource: kns_committeesession
  # Keep only sessions of the 20th Knesset.
  - run: filter
    parameters:
      in:
      - KnessetNum: 20
  # Inline step: replace the tabular resource with one streamed file resource per session.
  - run: ''
    parameters: {'limit-files': 5}
    code: |
      import os
      from datapackage_pipelines.wrapper import ingest, spew
      from datapackage_pipelines.utilities.resources import PROP_STREAMED_FROM
      parameters, datapackage, resources, stats = ingest() + ({'num_files': 0},)
      datapackage['resources'] = []
      for resource in resources:
          for row in resource:
              # Once the limit is reached, keep consuming rows but add no more files.
              if parameters.get('limit-files') and stats['num_files'] >= parameters['limit-files']:
                  continue
              if row['parts_parsed_filename']:
                  url = 'https://storage.googleapis.com/knesset-data-pipelines/data/committees/meeting_protocols_parts/'
                  url += row['parts_parsed_filename']
                  filename = str(row['CommitteeID']) + '/' + str(row['CommitteeSessionID']) + '.txt'
                  # Skip files that were already downloaded by a previous run.
                  if not os.path.exists(filename):
                      datapackage['resources'].append({'name': str(row['CommitteeSessionID']),
                                                       PROP_STREAMED_FROM: url,
                                                       'path': [filename]})
                      stats['num_files'] += 1
      spew(datapackage, [], stats)
  # Write the descriptor and download the streamed files.
  - run: dump.to_path
    parameters:
      pretty-descriptor: true
      handle-non-tabular: true
dpp run --verbose ./committee-protocol-parts
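The inline step in the pipeline maps each session row to a download URL and a local path. A minimal standalone sketch of that mapping, useful for checking the logic outside of a pipeline run; the function name `part_url_and_path` and the sample row values are hypothetical and not part of datapackage-pipelines or the Knesset data:

```python
BASE_URL = 'https://storage.googleapis.com/knesset-data-pipelines/data/committees/meeting_protocols_parts/'


def part_url_and_path(row):
    """Return (url, local_path) for a session row, or None if it has no parsed parts file."""
    if not row.get('parts_parsed_filename'):
        return None
    url = BASE_URL + row['parts_parsed_filename']
    # Same layout as the pipeline step: <CommitteeID>/<CommitteeSessionID>.txt
    path = str(row['CommitteeID']) + '/' + str(row['CommitteeSessionID']) + '.txt'
    return url, path


# Hypothetical row; the IDs and file name are made up for illustration.
sample_row = {'CommitteeID': 1, 'CommitteeSessionID': 2,
              'parts_parsed_filename': 'files/1/2.csv'}
print(part_url_and_path(sample_row))
```

Rows with an empty `parts_parsed_filename` return `None`, matching the rows the pipeline step skips.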