@lizconlan
Last active August 29, 2015 14:19
Update.md

To re-run the Hansard parser and pick up old files that may have been skipped previously:

[If you are in the UK, change your DNS servers to 8.8.8.8 and 8.8.4.4 (i.e. Google) to get a more stable connection.]

Delete old sources from the database which refer to the old parliament.go.ke site:

    -- check whether there are any, and how many
    SELECT COUNT(*) FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%';
    
    -- delete matching sittings
    DELETE FROM hansard_sitting WHERE source_id IN (SELECT id FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%');
    
    -- delete the old sources
    DELETE FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%';
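
The order matters here: the sittings are found through the sources they point to, so they have to be deleted before the sources. Running both deletions as one transaction means a failure part-way through leaves nothing half-deleted. The sketch below illustrates that pattern in Python against an in-memory SQLite database. The two-table schema, the sample rows and the helper names `make_demo_db` and `delete_old_sources` are assumptions made up for illustration, not the project's real schema or tooling.

```python
import sqlite3

# Same filter as the SQL in this note: sources that never processed
# successfully and whose URL points at the old Plone site.
OLD_SOURCE_FILTER = "last_processing_success IS NULL AND url LIKE '%/plone/%'"


def delete_old_sources(conn):
    """Delete matching sittings, then their sources, in one transaction.

    Returns (sittings_deleted, sources_deleted).
    """
    with conn:  # commits on success, rolls back if either statement raises
        sittings = conn.execute(
            "DELETE FROM hansard_sitting WHERE source_id IN "
            f"(SELECT id FROM hansard_source WHERE {OLD_SOURCE_FILTER})"
        ).rowcount
        sources = conn.execute(
            f"DELETE FROM hansard_source WHERE {OLD_SOURCE_FILTER}"
        ).rowcount
    return sittings, sources


def make_demo_db():
    """Build an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hansard_source (
            id INTEGER PRIMARY KEY,
            url TEXT,
            last_processing_attempt TEXT,
            last_processing_success TEXT
        );
        CREATE TABLE hansard_sitting (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES hansard_source(id)
        );
        -- made-up rows: only source 1 is both unprocessed and on /plone/
        INSERT INTO hansard_source VALUES
            (1, 'http://old.example/plone/a.pdf', '2015-01-01', NULL),
            (2, 'http://old.example/plone/b.pdf', '2015-01-01', '2015-01-01'),
            (3, 'http://new.example/docs/c.pdf', '2015-01-01', NULL);
        INSERT INTO hansard_sitting VALUES (10, 1), (11, 2), (12, 3);
        """
    )
    return conn
```

Calling `delete_old_sources(make_demo_db())` removes only the one source that is both unprocessed and on a `/plone/` URL, along with its sitting. The same begin, delete, delete, commit shape can be used directly in the database shell.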

Still in the database, set the last_processing_attempt flag to NULL where there was no recorded success:

    UPDATE hansard_source SET last_processing_attempt = NULL WHERE last_processing_success IS NULL;

Rerun the management command for fetching sources:

    ./manage.py hansard_check_for_new_sources --check-all -v 2

Run the management command to scrape the source PDFs:

    ./manage.py hansard_process_sources -v 2
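
If you re-run these two steps often, a small wrapper can run them in order and stop as soon as one fails. This is a minimal sketch, not part of the project: the helper names `build_commands` and `run_all` are hypothetical, and it assumes `manage.py` is executable from the current directory. It passes only the command names and flags shown in this note.

```python
import subprocess


def build_commands(manage_py="./manage.py"):
    """Return the two management commands from this note, in run order."""
    return [
        [manage_py, "hansard_check_for_new_sources", "--check-all", "-v", "2"],
        [manage_py, "hansard_process_sources", "-v", "2"],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn, stopping at the first failure.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError when a command exits with a non-zero status.
    """
    for argv in commands:
        print("Running:", " ".join(argv))
        runner(argv, check=True)
```

Call `run_all(build_commands())` from the project directory to use it. The `runner` parameter exists only so the sequence can be exercised without invoking Django.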

Expect it to choke while looking for a document named 'Hansard 30.05.06', which appears to have been originally published on the pre-2013-election version of the Parliament website, as per http://web.archive.org/web/20111130004121/http://www.parliament.go.ke/index.php?option=com_content&view=article&id=202&Itemid=165
