Solve only one of these problems and send its problem number and your solution via email. The earlier problems are somewhat more challenging, so solving one of them is valued more. Start with problem 1, and move on to the next problem only if you cannot solve it cleanly.
In addition to producing a correct solution, follow clean-code practices and refactor your code neatly. The quality of your solution is of paramount importance.
We need a Scrapy project that crawls filmnet.ir. The solution should crawl movies from this service (only the latest 100 or so). At least the following fields are required:
- title (both Persian and English)
- summary (both Persian and English)
- publish date
- release year
- rate
- duration
- link of the item on the platform (for this problem link on filmnet)
- a list of item's genres
Scraping as much additional data as you can is highly encouraged. Use the Django ORM with an SQLite database to store the data. Create a Django model for movies and name it "Movie". Please include a very brief README explaining how to run your crawler.
- Crawl only movies (no series or episodes).
- First find the APIs you need to fetch the data: inspect the requests that the filmnet web app sends. If you struggle with this, read through this link.
- Downloading and saving photos is required. Read this link to learn how.
- Saving items to the database should be done in a pipeline.
- NEVER use selectors that look like gibberish (like .e1eum8tf0). These are very fragile and will change in the next update of the target website!
- Nice git commits are a plus (no need for a git remote).
- Getting a list of artists for each movie is a plus.
- Using Scrapy item loaders and their input/output processors is a plus.
- Configuring the Django admin panel to view your crawled data is a plus.
- Readability counts. Use comments or docstrings wherever necessary.
- Do not use Selenium or any other tool. Use only Scrapy.
- Postman is a great tool for working with APIs easily.
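The photo and pipeline requirements in the list above could be wired up in the Scrapy settings roughly as follows. This is a minimal sketch: "crawler.pipelines.MoviePipeline" is a hypothetical path for the project's own database pipeline, and the priority numbers and the "images" directory are arbitrary choices. Scrapy's built-in ImagesPipeline reads the URLs from an item's "image_urls" field, writes the download results to its "images" field, and needs Pillow to be installed.

```python
# settings.py (fragment) of the Scrapy project

ITEM_PIPELINES = {
    # Built-in pipeline that downloads every URL listed in an item's
    # "image_urls" field. The lower number makes it run first.
    "scrapy.pipelines.images.ImagesPipeline": 100,
    # Hypothetical project pipeline that saves each item through the Django ORM.
    "crawler.pipelines.MoviePipeline": 300,
}

# Directory where the downloaded images are stored.
IMAGES_STORE = "images"
```

Running the image pipeline before the database pipeline means the stored file paths are already on the item when it is saved.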
Do the same as in problem #1, but for namava. This problem is a little easier because the API links are provided here:
- Movie feed (the page parameter is the page number) : 'https://www.namava.ir/api/v1.0/medias/latest/?pi={page}&ps=30'
- Movie info (id is the ID of the movie) : 'https://www.namava.ir/api/v2.0/medias/{id}/single-movie'
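The two URL templates above can be turned into concrete request URLs with a few lines of Python. The sketch below only builds the URLs. It assumes that "ps" is the page size, that page numbering starts at 1, and that about 100 items are wanted; the helper names are made up for illustration and the response format is not covered.

```python
import math

FEED_URL = "https://www.namava.ir/api/v1.0/medias/latest/?pi={page}&ps=30"
INFO_URL = "https://www.namava.ir/api/v2.0/medias/{id}/single-movie"

PAGE_SIZE = 30       # assumed meaning of the "ps" parameter in FEED_URL
TARGET_MOVIES = 100  # "the latest ~100 movies or so"


def feed_urls(target=TARGET_MOVIES, page_size=PAGE_SIZE, first_page=1):
    """Return the feed URLs needed to cover roughly `target` items."""
    pages = math.ceil(target / page_size)
    return [FEED_URL.format(page=page) for page in range(first_page, first_page + pages)]


def info_url(movie_id):
    """Return the detail URL for a single movie id."""
    return INFO_URL.format(id=movie_id)
```

If the feed also returns series, more pages may be needed to reach the target after the non-movie entries are filtered out.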