@c02y
Last active November 25, 2019 14:09
Download RoundTable mp3 files from http://english.cri.cn/4926/more/11680/more11680.htm
# -*- coding: utf-8 -*-
import os
import urllib.request

import scrapy


class AudioSpider(scrapy.Spider):
    name = 'audio'
    allowed_domains = ['english.cri.cn']
    start_urls = ['http://english.cri.cn/4926/more/11680/more11680.htm']
    prefix_url = 'http://english.cri.cn'

    def parse(self, response):
        # Collect the titles and relative links of all RoundTable entries on this listing page.
        audio_names = response.xpath('//tr[*]/td[*]/a[contains(text(), "RoundTable")]/text()').extract()
        audio_jump_url_tails = response.xpath('//tr[*]/td[*]/a[contains(text(), "RoundTable")]/@href').extract()
        for name, url_tail in zip(audio_names, audio_jump_url_tails):
            audio_jump_url = self.prefix_url + url_tail
            # Pass the title along so the detail-page callback can name the file.
            yield scrapy.Request(audio_jump_url, callback=self.parse_jump_link, dont_filter=True, meta={'name': name})

        # Follow the "Next" link, if there is one, to the next listing page.
        next_page_tail = response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
        if next_page_tail:
            next_page = self.prefix_url + next_page_tail
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_jump_link(self, response):
        # The detail page links to the mp3 file inside the element with id "ccontent".
        audio_link = response.xpath('//*[@id="ccontent"]/a/@href').extract_first()
        if not audio_link:
            return  # No audio link on this page, so there is nothing to download.
        file_name = response.meta.get('name') + '.mp3'
        # Skip files that already exist from a previous run.
        if not os.path.isfile(file_name):
            urllib.request.urlretrieve(audio_link, file_name)
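
Two caveats about the download step in the spider: an interrupted `urlretrieve` can leave a partial file that the `os.path.isfile` check will then skip on the next run, and the link text used as the file name may contain characters that are not valid in file names. The following is a minimal sketch of one way to handle both, using only the standard library. The helper names `safe_file_name` and `download_if_missing`, the `.part` suffix, and the `opener` parameter are hypothetical and not part of the gist.

```python
import os
import re
import shutil
import urllib.request


def safe_file_name(title, extension=".mp3"):
    """Turn link text into a file name (hypothetical helper, not in the gist)."""
    # Replace path separators and other characters that are commonly
    # rejected in file names with an underscore.
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return (cleaned or "untitled") + extension


def download_if_missing(url, file_name, opener=urllib.request.urlopen):
    """Download url to file_name unless the file already exists.

    Returns True if a download happened, False if the file was present.
    The opener parameter is an assumption added so the function can be
    exercised without network access.
    """
    if os.path.isfile(file_name):
        return False
    # Write to a temporary name first, so an interrupted download never
    # leaves a truncated file under the final name.
    part_name = file_name + ".part"
    with opener(url) as response, open(part_name, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(part_name, file_name)
    return True
```

Inside `parse_jump_link`, the last three lines could then become `download_if_missing(audio_link, safe_file_name(response.meta.get('name')))`. The download still blocks the crawler while it runs, as `urlretrieve` does; Scrapy's built-in `FilesPipeline` may be a better fit if that matters.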

c02y commented Nov 25, 2019

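  1. Follow https://letslearnabout.net/tutorial/scrapy-tutorial/python-scrapy-tutorial-for-beginners-01-creating-your-first-spider/ to set up the environment.
  2. Create the project: scrapy startproject audio
  3. Replace the content of audio.py in the project's spiders directory with the code above (create the file if it does not exist).
  4. From inside the project directory, run: scrapy crawl audio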
