lizconlan/gist:88d314958fd3b183ac71

## gistfile1.md

      
To re-run the Hansard parser and pick up old files that may have been skipped previously:

If you are in the UK, change your DNS servers to 8.8.8.8 and 8.8.4.4 (i.e. Google) to get a more stable connection.

Delete old sources from the database that refer to the old parliament.go.ke site:
    -- check whether there are any, and how many
    SELECT COUNT(*) FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%';
    
    -- delete matching sittings
    DELETE FROM hansard_sitting WHERE source_id IN (SELECT id FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%');
    
    -- delete the old sources
    DELETE FROM hansard_source WHERE last_processing_success IS NULL AND url LIKE '%/plone/%';

Still in the database, set the last_processing_attempt flag to NULL where there was no recorded success:
    UPDATE hansard_source SET last_processing_attempt = NULL WHERE last_processing_success IS NULL;

Rerun the management command for fetching sources:
    ./manage.py hansard_check_for_new_sources --check-all -v 2

Run the management command to scrape the source PDFs:
    ./manage.py hansard_process_sources -v 2

Expect it to choke while looking for a document named 'Hansard 30.05.06', which appears to have been originally published on the pre-2013 election version of the Parliament website; see http://web.archive.org/web/20111130004121/http://www.parliament.go.ke/index.php?option=com_content&view=article&id=202&Itemid=165
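
The two management commands above can be chained in a small shell function so that processing only starts once fetching has succeeded. This is a minimal sketch and not part of the project: the function name `rerun_hansard` and the `MANAGE` variable are hypothetical, and it assumes you run it from the directory containing `manage.py` after completing the database steps.

```shell
#!/bin/sh
# Sketch only: rerun_hansard and MANAGE are hypothetical names, not project code.
# MANAGE defaults to ./manage.py; override it if you invoke Django differently.
MANAGE="${MANAGE:-./manage.py}"

rerun_hansard() {
    # $MANAGE is deliberately unquoted so a value like "python manage.py" splits into words.
    # Fetch the list of sources first; only process them if fetching succeeded.
    $MANAGE hansard_check_for_new_sources --check-all -v 2 &&
    $MANAGE hansard_process_sources -v 2
}
```

After sourcing the snippet, run `rerun_hansard` to execute both steps in order.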