@markx
Created June 19, 2014 13:59
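#!/usr/bin/env python
import re
import time

import requests
from lxml.html import fromstring

# Accumulates the full text of the novel.
fiction = ''
# Base URL that the relative chapter links are resolved against.
prefix = 'http://www.ranwen.net/files/article/17/17528/'
s = requests.session()
r = s.get('http://www.ranwen.net/files/article/17/17528/index.html')
# Pages are decoded explicitly as GB18030 from the raw response bytes.
content = r.content.decode('gb18030')
# Chapter links are the href values that follow class="dccss" in the index page.
index = re.findall(r'class="dccss".+?href="(.+?)"', content)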
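# Skip the first 400 links and fetch each remaining chapter in turn.
for i in index[400:]:
    c = s.get("%s%s" % (prefix, i)).content.decode('gb18030')
    m = fromstring(c)
    title = ''.join(m.xpath('//h1/text()'))
    chpt = m.xpath('//div[@id="content"]/text()')
    # Runs of four non-breaking spaces (presumably paragraph indents) become newlines.
    chpt = ''.join(chpt).replace('\xa0\xa0\xa0\xa0', '\n').strip()
    fiction += title
    fiction += '\n'
    fiction += chpt
    fiction += '\n'
    print(title + " done!\n")
    # Pause between requests to avoid hammering the server.
    time.sleep(1)

# Write as UTF-8 explicitly so the output does not depend on the platform default.
with open("大明狼骑.txt", "w", encoding="utf-8") as f:
    f.write(fiction)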
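
The link extraction and text cleanup steps in the script can be checked without touching the network. The following is a minimal sketch, not part of the original gist: the helper names `chapter_urls` and `clean_chapter` and the sample markup are hypothetical, and `urljoin` is swapped in for the `%`-formatting the script uses to build chapter URLs.

```python
import re
from urllib.parse import urljoin

# Same pattern as the script: an href that follows class="dccss".
LINK_RE = re.compile(r'class="dccss".+?href="(.+?)"')


def chapter_urls(index_html, base_url):
    """Return absolute chapter URLs in the order they appear in the index page."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(index_html)]


def clean_chapter(text_nodes):
    """Join text nodes and turn each run of four non-breaking spaces into a newline."""
    return ''.join(text_nodes).replace('\xa0' * 4, '\n').strip()


# Hypothetical sample markup, made up for illustration only.
sample = ('<td class="dccss"><a href="1.html">One</a></td>'
          '<td class="dccss"><a href="2.html">Two</a></td>')

print(chapter_urls(sample, 'http://www.ranwen.net/files/article/17/17528/'))
print(repr(clean_chapter(['\xa0\xa0\xa0\xa0first', '\xa0\xa0\xa0\xa0second'])))
```

Keeping these two steps as pure functions makes it possible to verify the regular expression against a saved copy of the index page before running the full download loop.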