@rampage644
Created August 2, 2016 13:21
DS dev process comments

Dataservices spider development process

Disclaimer: everything described in this document is my personal opinion and doesn't have to hold true for everyone.

Common

Key information

On a first read it's very hard to keep in mind all the variables defined in the Key Information section.

It would be nice to give a brief description of the producer-consumer pattern.

  1. What is it? A pattern of writing multiple spiders to crawl a website (a minimal sketch follows below).
  2. Why do we tend to use it? It simplifies spider logic, helps to achieve concurrency, makes the crawling process more robust, etc.
  3. Is it mandatory to implement both of them? Yes/no, and why.

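To make the point concrete, here is a minimal sketch of what such a pair might look like. Everything in it is hypothetical (website.org, the selectors, the spider names), and the real hand-off between producer and consumer goes through the frontier (HCF) rather than within one process; it's only meant to show the kind of example that would help.

```python
# Hypothetical producer/consumer pair for the apartments example.
# Names, URLs and selectors are made up; in the real setup the producer's
# requests are written to an HCF frontier slot and the consumer reads them
# back from there instead of both spiders running in one process.
import scrapy


class ApartmentsProducer(scrapy.Spider):
    """Walks listing pages and only discovers apartment URLs."""
    name = 'website.org-producer'
    start_urls = ['http://website.org/apartments/']

    def parse(self, response):
        for href in response.css('a.apartment::attr(href)').extract():
            # With HCF enabled this request ends up in a frontier slot.
            yield scrapy.Request(response.urljoin(href))
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


class ApartmentsConsumer(scrapy.Spider):
    """Pulls the queued URLs from the frontier and extracts the items."""
    name = 'website.org-consumer'

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
            'price': response.css('.price::text').extract_first(),
        }
```
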
It may be good practice to provide a real example (here is the website, let DATASET_FULL be website.org, we're crawling for apartments, so let the collection be apartments...).

A quick intro to the frontier (i.e. HCF) would be highly welcome: links to its API, a link to the wiki explaining what it is, and a description of why and how we use it. What is SLOT_PREFIX?
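
For what it's worth, here is roughly how the frontier can be poked from Python with the hubstorage client. The project id, frontier name, slot prefix and slot count below are all assumptions on my part, and SLOT_PREFIX (as I understand it) is just the common prefix of the slot names that producer and consumer agree on.

```python
# Rough sketch of talking to HCF via python-hubstorage; all ids and names
# below are hypothetical -- adjust to the real project.
from hubstorage import HubstorageClient

APIKEY = 'YOUR_API_KEY'      # hypothetical
PROJECT_ID = '12345'         # hypothetical
FRONTIER = 'website.org'     # hypothetical frontier name
SLOT_PREFIX = 'test'         # prefix the producer and consumer agree on
NUM_SLOTS = 8                # hypothetical

frontier = HubstorageClient(auth=APIKEY).get_project(PROJECT_ID).frontier

# Producer side: queue a request fingerprint into one of the slots.
frontier.add(FRONTIER, SLOT_PREFIX + '0',
             [{'fp': 'http://website.org/apartments/1'}])
frontier.flush()

# Consumer side: read the queued batches back from every slot.
for n in range(NUM_SLOTS):
    slot = SLOT_PREFIX + str(n)
    for batch in frontier.read(FRONTIER, slot):
        for fp, qdata in batch['requests']:
            print(slot, fp)
```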

Development

Every time I meet a statement like 'do this', the question 'why?' arises in my head. Some quick info and, where possible, a link to a detailed explanation would be highly appreciated.

  1. Should the Bitbucket repo be private or not?
  2. The 'Create dataset' section is a bit outdated: the price_modifier field is not mentioned even though it's mandatory.
  3. Name the project ds-DATASET. Why? For convenience and consistency: all dataset projects are prefixed with ds- for a reason.
  4. Use cookiecutter to instantiate the spider template source code. Why? Because we have a nice template/wizard with many things already done; it will simplify your dev process and help you avoid many errors.
  5. Spider managers. What are they? Why do we need them? A link to Managers and Development Protocol would be welcome here, as well as a very brief description.
  6. First mention of Kafka. What is it? Why and how do we use it? Which component actually uses those kafka.topic settings? (A hypothetical sketch follows after this list.)
  7. Monitoring settings. Why? What happens if I don't set them?
  8. Dumpers. Same questions as for managers.
  9. Local testing. Describe each command; it might also be worth mentioning how to achieve the same with the Scrapinghub API. Example: Confirm the producer created requests: either use this convenient script from the ds-dash-scripts package, or query the HCF API with curl .... (The frontier snippet above shows one way to do this from Python.)
  10. Cloud testing. It would be really nice to explain what that entry in dsclients/clients.json means, how it is used, by which tools, etc.
  11. Deployment: a bit outdated after the kumo transition.

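To illustrate what I'm asking for in item 6: even a short, commented snippet like the one below would answer most of the questions. Every setting name here is hypothetical; I'm only guessing at the shape.

```python
# settings.py -- purely hypothetical setting names, shown only to illustrate
# the kind of inline explanation I'd like to see in the guide.
KAFKA_EXPORT_ENABLED = True                   # which component reads this flag?
KAFKA_BROKERS = ['kafka1.example.com:9092']   # who connects to these brokers?
KAFKA_TOPIC = 'ds-website.org-apartments'     # who consumes this topic downstream?
```
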
Schema

As I have written previously, inferring the schema after the spiders run doesn't make sense to me. What if I make a mistake in a spider, forget about some field, or make a typo? The schema would be wrong. My suggestion is to write the schema by hand, alongside the design document.
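
For example (the fields are hypothetical, and the jsonschema package is used only to show the idea), the hand-written schema could live next to the design document and be checked against a sample item:

```python
# Hand-written schema for the hypothetical apartments collection, validated
# against a sample item with the jsonschema package.
import jsonschema

APARTMENT_SCHEMA = {
    'type': 'object',
    'properties': {
        'url':   {'type': 'string'},
        'title': {'type': 'string'},
        'price': {'type': 'number'},
    },
    'required': ['url', 'title', 'price'],
}

sample_item = {
    'url': 'http://website.org/apartments/1',
    'title': 'Two-bedroom flat',
    'price': 1200,
}

# Raises jsonschema.ValidationError if spider output drifts from the schema.
jsonschema.validate(sample_item, APARTMENT_SCHEMA)
```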

Also, it would be nice to explain what ./bin/update_dataset_schemas.py does, and whether it's possible to do the same via the web UI.

I've been told on #dataservices-development that ./bin/check_spiders.py doesn't have to be run locally and is intended for cloud usage. However, it's still in the guide. But (!) I found it really useful for testing locally before doing any cloud test.

Written with StackEdit.
