@jbothma
Created October 4, 2016 19:43
pyconza2016 code4sa codebridge
title
- hello
- who am I
code4sa
- tech
- informed decision making
- for positive social change
codebridge
- where we work - under a bridge
- a civic tech movement
- rest of pres - code4sa and codebridge projects
MPR
Wazimap
CBM
Open Gazettes South Africa
- published weekly
- free public archive
- searchable and search alerts by email for free
- scrapy for gathering the gazettes that are online
- aleph for indexing, search interface
- Tika for text extraction (replaced pdfminer)
- future: extract notices, entities, produce datasets, corporate data
- get in touch
pip install XlsxWriter
- segue: so that's what we did
medicine stock monitoring
- now DOH receive this and apparently read it
muni money
- make muni finances accessible
- already published in excel and pdf for years
- initiated by national treasury - praveen promised in budget speech
MFMA website
- google hates it
- don't think it's the style
MFMA website - urls
- but maybe it's just the link resulting in 401
- emailed, phoned, didn't get anywhere
- think Jacques is responsible
- maybe lazy, maybe busy, maybe incentives are wrong
- decided can and want to fix regardless
MFMA mirror website
- I think many recognise the theme...or lack of
- jekyll site hosted on github pages - sorry for ruby
- perhaps it's like bringing a revolver to a pistol fight?
Pipeline 1st iteration
- scrape original using scrapy
- do it every 2 days or so on scrapinghub
- link back to resources on original website
- way better but not perfect
Pipeline 2nd iteration
- try to make it publish from scrapinghub
- ...uh.... GitPython actually needs the git executable (C git)
Pipeline 3rd iteration
- push non-html to S3
- link to those
- nice archive
- not yet obeying cache rules like modified date and noticing changes in non-html
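One way the missing cache rule could look, as a rough sketch only: compare the Last-Modified date stored with the archived copy against the one the server reports now, and send If-Modified-Since so the server can answer 304. The function names `needs_refetch` and `conditional_headers` are made up for illustration and are not part of the actual pipeline.

```python
from email.utils import parsedate_to_datetime


def needs_refetch(stored_last_modified, remote_last_modified):
    """Return True if the remote copy looks newer than the archived one.

    Both arguments are HTTP date strings (the Last-Modified header format),
    or None when no value is known.
    """
    if stored_last_modified is None or remote_last_modified is None:
        # Without both dates we cannot show the archive is fresh, so refetch.
        return True
    stored = parsedate_to_datetime(stored_last_modified)
    remote = parsedate_to_datetime(remote_last_modified)
    return remote > stored


def conditional_headers(stored_last_modified):
    """Build request headers that let the server reply 304 Not Modified."""
    if stored_last_modified is None:
        return {}
    return {"If-Modified-Since": stored_last_modified}
```

The same check would apply to the non-html files pushed to S3, provided the original server sends a Last-Modified header for them, which is an assumption here.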
Scrapy spider
- start URLs
- find what you want on page with css or xpath
- emit urls to crawl/spider further
- emit custom items as the scraped data
Scrapy page item
- arbitrary fields
- scrapy checks that only declared fields are set
- very simple script iterates over items and writes yml files for jekyll to build site
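A minimal sketch of what such a script might look like, assuming each item is a plain dict with a `url` key and simple values. `slugify` and `write_yml_files` are illustrative names, not the actual script. It avoids a YAML library by relying on JSON scalars being valid YAML.

```python
import json
import re
from pathlib import Path


def slugify(url):
    """Turn a URL into a filesystem-safe file name stem."""
    return re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")


def write_yml_files(items, out_dir):
    """Write one .yml file per scraped item and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for item in items:
        # JSON scalars are valid YAML, so json.dumps gives safe quoting.
        lines = [
            f"{key}: {json.dumps(value)}"
            for key, value in sorted(item.items())
        ]
        path = out_dir / f"{slugify(item['url'])}.yml"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Pointing `out_dir` at the Jekyll site's `_data` directory would be one way to make the files available to templates; whether the real site does this is an assumption.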
Google analytics sessions
Google analytics outbound + city
Ward Candidates
Candidate IDs
- now next step is to search for candidate IDs in gazettes...
- ...and connect the results with CIPC dataset
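A rough sketch of that next step, assuming candidate IDs are 13-digit numbers and the gazette text has already been extracted. The pattern and the function name `find_candidate_ids` are illustrative assumptions, not existing code.

```python
import re

# Assumption: candidate IDs are 13-digit numbers. The lookarounds stop the
# pattern from matching inside longer digit runs such as reference numbers.
ID_PATTERN = re.compile(r"(?<!\d)\d{13}(?!\d)")


def find_candidate_ids(gazette_text, candidate_ids):
    """Return the known candidate IDs that appear in the gazette text."""
    found = set(ID_PATTERN.findall(gazette_text))
    return sorted(found & set(candidate_ids))
```

The matches could then be used as keys for joining against the CIPC dataset.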
Community Centres
Takeaways
- you can help (make transformation faster)
- appropriate tech
- fun
- maybe profit