Created
October 4, 2016 19:43
pyconza2016 code4sa codebridge
title | |
- hello | |
- who am I | |
code4sa | |
- tech | |
- informed decision making | |
- for positive social change | |
codebridge | |
- where we work - under a bridge | |
- a civic tech movement | |
- rest of pres - code4sa and codebridge projects | |
MPR | |
Wazimap | |
CBM | |
Open Gazettes South Africa | |
- published weekly | |
- free public archive | |
- searchable, with free search alerts by email | |
- scrapy for gathering the gazettes that are online | |
- aleph for indexing, search interface | |
- Tika (instead of pdfminer) for text extraction | |
- future: extract notices, entities, produce datasets, corporate data | |
- get in touch | |
pip install XlsxWriter | |
- segue: so that's what we did | |
medicine stock monitoring | |
- now DOH receive this and apparently read it | |
muni money | |
- make muni finances accessible | |
- already published in excel and pdf for years | |
- initiated by national treasury - praveen promised in budget speech | |
MFMA website | |
- google hates it | |
- don't think it's the style | |
MFMA website - urls | |
- but maybe it's just the link resulting in 401 | |
- emailed, phoned, didn't get anywhere | |
- think Jacques is responsible | |
- maybe lazy, maybe busy, maybe incentives are wrong | |
- decided can and want to fix regardless | |
MFMA mirror website | |
- I think many recognise the theme...or lack of | |
- jekyll site hosted on github pages - sorry for ruby | |
- perhaps it's like bringing a revolver to a pistol fight? | |
Pipeline 1st iteration | |
- scrape original using scrapy | |
- do it every 2 days or so on scrapinghub | |
- link back to resources on original website | |
- way better but not perfect | |
Pipeline 2nd iteration | |
- try to make it publish from scrapinghub | |
- ...uh.... gitpython actually needs cgit | |
Pipeline 3rd iteration | |
- push non-html to S3 | |
- link to those | |
- nice archive | |
- not yet obeying cache rules like modified date and noticing changes in non-html | |
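A rough sketch of the "push non-html to S3, link to those" step, assuming archived files are keyed by host and path. The bucket URL and the function names are made up for illustration, and the upload itself (for example with an S3 client library) is left out.

```python
from urllib.parse import urlsplit

# Hypothetical bucket URL; the real bucket name is not given in these notes.
ARCHIVE_BASE = "https://example-archive.s3.amazonaws.com/"

HTML_TYPES = ("text/html", "application/xhtml+xml")


def is_html(content_type):
    """Return True if a Content-Type header value describes an HTML page."""
    # Treat a missing Content-Type as HTML so the page still gets parsed.
    if not content_type:
        return True
    return content_type.split(";")[0].strip().lower() in HTML_TYPES


def archive_key(url):
    """Key the archived copy by host and path, mirroring the original layout."""
    parts = urlsplit(url)
    # Query strings are ignored here; that is a simplification.
    return f"{parts.netloc}{parts.path}".lstrip("/")


def archive_link(url):
    """Link the mirror site would use instead of the original resource URL."""
    return ARCHIVE_BASE + archive_key(url)
```

Anything where `is_html` is false would be uploaded under `archive_key(url)` and linked with `archive_link(url)`. Honouring modified dates would need an extra check before each upload.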
Scrapy spider | |
- start URLs | |
- find what you want on page with css or xpath | |
- emit urls to crawl/spider further | |
- emit custom items as the scraped data | |
Scrapy page item | |
- arbitrary fields | |
- scrapy raises an error if you set a field that isn't declared on the item | |
- very simple script iterates over items and writes yml files for jekyll to build site | |
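The "very simple script" could be sketched roughly as below, assuming each item is a flat dict with a `url` field and that one `.yml` file per item is what the Jekyll site reads. The file layout and function names are guesses, not the actual script.

```python
import json
import re
from pathlib import Path


def slugify(url):
    """Turn a URL into a safe file name stem."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    return re.sub(r"[^a-z0-9]+", "-", without_scheme).strip("-") or "index"


def write_pages(items, out_dir):
    """Write one YAML file per scraped item and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        # JSON-encoded scalars are also valid YAML for simple values like
        # these, which avoids needing a YAML library in this sketch.
        lines = [
            f"{key}: {json.dumps(value, ensure_ascii=False)}"
            for key, value in sorted(item.items())
        ]
        path = out_dir / f"{slugify(item['url'])}.yml"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

A real script would more likely use a YAML library to handle nested values and multi-line text.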
Google analytics sessions | |
Google analytics outbound + city | |
Ward Candidates | |
Candidate IDs | |
- now next step is to search for candidate IDs in gazettes... | |
- ...and connect the results with CIPC dataset | |
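One way the gazette search could start, assuming a candidate ID is a 13-digit South African ID number whose last digit is a Luhn check digit. Both the assumption and the function names are mine, not from the talk.

```python
import re

# Assumption: a candidate ID is a standalone run of exactly 13 digits.
ID_PATTERN = re.compile(r"(?<!\d)\d{13}(?!\d)")


def luhn_valid(number):
    """Check a digit string against the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        # Double every second digit, counting from the right.
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_id_numbers(text):
    """Return 13-digit numbers in the text that pass the checksum."""
    return [match for match in ID_PATTERN.findall(text) if luhn_valid(match)]
```

The checksum only filters out numbers that cannot be IDs. Matching the remaining hits against the candidate list would still be a separate step.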
Community Centres | |
Takeaways | |
- you can help (make transformation faster) | |
- appropriate tech | |
- fun | |
- maybe profit |