jbothma/gist:e3223120364a7093cb1bfd9cacc16950

## gistfile1.txt
title
 - hello
 - who am I

code4sa
 - tech
 - informed decision making
 - for positive social change

codebridge
 - where we work - under a bridge
 - a civic tech movement
 - rest of pres - code4sa and codebridge projects

MPR

Wazimap

CBM

Open Gazettes South Africa
 - published weekly
 - free public archive
 - searchable, with search alerts by email, for free
 - scrapy for gathering the gazettes that are online
 - aleph for indexing, search interface
 - --pdf-miner-- TIKA for text extraction
 - future: extract notices, entities, produce datasets, corporate data
 - get in touch

pip install XlsxWriter
 - segue: so that's what we did

medicine stock monitoring
 - now DOH receive this and apparently read it

muni money
 - make muni finances accessible
 - already published in excel and pdf for years
 - initiated by National Treasury - Pravin Gordhan promised it in the budget speech

MFMA website
 - google hates it
 - don't think it's the style

MFMA website - urls
 - but maybe it's just the link resulting in 401
 - emailed, phoned, didn't get anywhere
 - think Jacques is responsible
 - maybe lazy, maybe busy, maybe incentives are wrong
 - decided can and want to fix regardless

MFMA mirror website
 - I think many recognise the theme...or lack of one
 - jekyll site hosted on github pages - sorry for ruby
   - perhaps it's like bringing a revolver to a pistol fight?

Pipeline 1st iteration
 - scrape original using scrapy
 - do it every 2 days or so on scrapinghub
 - link back to resources on original website
 - way better but not perfect

Pipeline 2nd iteration
 - try to make it publish from scrapinghub
 - ...uh.... GitPython actually needs the git executable (C git) to be installed

Pipeline 3rd iteration
 - push non-html to S3
 - link to those
 - nice archive
 - not yet obeying cache rules like modified date and noticing changes in non-html
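
A rough sketch of the "push non-html to S3, link to those" step, not the actual pipeline code. The function names, the bucket name and the key scheme are made up for illustration. `client` is assumed to be something like a boto3 S3 client (only its `put_object` call is used), and the returned link assumes the objects are publicly readable.

```python
from urllib.parse import quote, urlsplit


def is_html(content_type):
    """True for HTML responses, which stay in the Jekyll site."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def s3_key_for(url):
    """Hypothetical key scheme: original host plus path."""
    parts = urlsplit(url)
    return f"{parts.netloc}{parts.path}".lstrip("/")


def archive_resource(client, bucket, url, body, content_type):
    """Upload one non-HTML resource and return the link the mirror should use."""
    key = s3_key_for(url)
    # put_object is the boto3 S3 client call; any object with the same
    # method works here, which keeps the sketch easy to try out.
    client.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
```

Keeping the original host and path in the key means the archive mirrors the layout of the source site, so a link can be traced back to where it came from.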

Scrapy spider
 - start URLs
 - find what you want on page with css or xpath
 - emit urls to crawl/spider further
 - emit custom items as the scraped data

Scrapy page item
 - arbitrary fields
 - scrapy checks that only declared fields are set
 - very simple script iterates over items and writes yml files for jekyll to build site
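
The "very simple script" could look something like the sketch below. It is an assumption-heavy illustration, not the actual script: the items are taken to be plain dicts with a `url` field and simple key names, and the file naming scheme is invented. Values are quoted with `json.dumps`, relying on JSON scalars being valid YAML, so no YAML library is needed.

```python
import json
from pathlib import Path


def slug_for(url):
    """Turn a URL into a safe file name (hypothetical naming scheme)."""
    keep = [c if c.isalnum() else "-" for c in url.split("://", 1)[-1]]
    return "".join(keep).strip("-").lower() or "index"


def write_yaml_files(items, out_dir):
    """Write one .yml file per scraped item and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        # One "key: value" line per field, keys in sorted order.
        lines = [
            f"{key}: {json.dumps(value, ensure_ascii=False)}"
            for key, value in sorted(item.items())
        ]
        path = out / f"{slug_for(item['url'])}.yml"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Where the files should land (for example a `_data` directory) depends on how the Jekyll site is laid out, which these notes do not say.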

Google analytics sessions

Google analytics outbound + city

Ward Candidates

Candidate IDs
 - now next step is to search for candidate IDs in gazettes...
 - ...and connect the results with CIPC dataset

Community Centres

Takeaways
 - you can help (make transformation faster)
 - appropriate tech
   - fun
   - maybe profit