@gabriel-tincu
Last active November 1, 2017 09:20

Use this to crawl Romanian news sites and store the content in either MongoDB or Elasticsearch (ES)

install

get python 3 and virtualenv

 virtualenv venv --python python3
 source venv/bin/activate
 pip install -r requirements.txt
 python main.py

docker build

docker build -f docker/Dockerfile -t news-parser-ro:<tag> .
docker tag news-parser-ro:<tag> my_prefix/news-parser-ro:<tag>
docker push my_prefix/news-parser-ro:<tag>

docker launch

install docker

docker stack deploy -c docker/stack.yml crawl

TODO

  • use scrapyd to deploy
  • have a look at deduplication and what it entails in case of service failure
  • figure out why ES doesn't play nicely inside Docker
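
The deduplication item above can be sketched roughly as follows. This is only an illustration of one possible approach, not what the crawler does today: `fingerprint` and `SeenStore` are hypothetical names, the example URL is made up, and a plain text file stands in for whatever persistent store the project would actually use. The point is that an in-memory set of seen articles is lost when the service dies, while a set that is written to disk survives a restart.

```python
import hashlib
import os
import tempfile


def fingerprint(url: str, body: str) -> str:
    """Return a stable SHA-1 fingerprint of an article (URL plus body text)."""
    digest = hashlib.sha1()
    digest.update(url.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(body.encode("utf-8"))
    return digest.hexdigest()


class SeenStore:
    """File-backed set of fingerprints (hypothetical helper, not part of the project)."""

    def __init__(self, path: str):
        self.path = path
        self.seen = set()
        # Reload fingerprints recorded before a previous shutdown or crash.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.seen = {line.strip() for line in fh if line.strip()}

    def add_if_new(self, fp: str) -> bool:
        """Record fp and return True if it was not seen before, else return False."""
        if fp in self.seen:
            return False
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(fp + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make sure the record reaches the disk
        self.seen.add(fp)
        return True


# Usage: the second SeenStore simulates the service restarting after a failure.
state_file = os.path.join(tempfile.mkdtemp(), "seen.txt")
fp = fingerprint("http://example.com/article-1", "article text")

first_run = SeenStore(state_file)
stored_first_time = first_run.add_if_new(fp)    # new article, would be stored

after_restart = SeenStore(state_file)
stored_again = after_restart.add_if_new(fp)     # already recorded, would be skipped
```

The order of operations matters on failure: recording the fingerprint before writing the article risks losing the article if the service dies in between, while recording it afterwards risks storing it twice. Using the fingerprint as the document id in MongoDB or ES is one way to make a repeated write harmless, assuming the storage code sets the id explicitly.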