@tzwm
Last active December 29, 2015 13:49
# coding=utf-8
from pyquery import PyQuery as pq


def main():
    page = 0  # zero-based page index; the site is queried 10 events at a time
    tot = 0   # number of events printed so far
    while True:
        d = pq(url='http://shanghai.douban.com/events/future-all?start=%s' % (page * 10))
        # A p.no-result element is treated as "no more events": report and stop.
        if d('p.no-result'):
            print('Total results: %s' % tot)
            break
        for i in range(0, 10):
            li = d("ul.events-list li:eq(%s)" % i)
            # An empty selection means this page holds fewer than 10 events.
            if not li:
                break
            title = li("div.title").text()
            location = li("meta").attr("content")
            counts = li("p.counts").text()
            # Attendee count: the text before '人参加' ("people attending").
            counts_join = counts.split('人参加')[0].strip()
            # Interested count: the text before '人感兴趣' ("people interested").
            counts_interest = counts.split(' ')[1].split('人感兴趣')[0].strip()
            print(title)
            tot = tot + 1
        page = page + 1


if __name__ == "__main__":
    main()
@tzwm
Author

tzwm commented Nov 28, 2013

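I mainly scraped four pieces of data: the title, the location, the number of people attending, and the number of people interested. Only the title is printed.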
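Since only the title is printed, here is a rough sketch of how the two counts could be pulled out as integers and all four fields emitted on one line. It assumes the `p.counts` text looks like `<N>人参加 <M>人感兴趣`, which is inferred from the `split()` calls in the script rather than checked against the live page. `parse_counts` and `format_event` are hypothetical helper names, not part of the gist.

```python
# coding=utf-8
import re

# Assumed shape of the p.counts text, inferred from the script's split() calls:
# "<N>人参加 <M>人感兴趣" (N people attending, M people interested).
COUNTS_RE = re.compile(r'(\d+)\s*人参加\s*(\d+)\s*人感兴趣')


def parse_counts(counts):
    """Return (attending, interested) as ints, or None if the text does not match."""
    m = COUNTS_RE.search(counts)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def format_event(title, location, counts):
    """Join the four fields into one tab-separated line (hypothetical helper)."""
    parsed = parse_counts(counts)
    # Fall back to '?' placeholders when the counts text has an unexpected shape.
    attending, interested = parsed if parsed else ('?', '?')
    return '\t'.join([title, location or '', str(attending), str(interested)])
```

Inside the loop, `print(format_event(title, location, counts))` would then replace `print(title)`. Unlike the chained `split()` calls, a non-matching string gives placeholders instead of raising an `IndexError`.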

@tzwm
Author

tzwm commented Nov 28, 2013

One remaining bug: Douban blocks crawlers, so I plan to limit the number of requests per minute.
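A minimal sketch of such a per-minute limit, using only the standard library. It spaces calls evenly so that roughly `max_calls` requests happen per `period` seconds. The `RateLimiter` class, its parameters, and the figure of 30 requests per minute are all assumptions for illustration; what rate Douban actually tolerates is not known here.

```python
import time


class RateLimiter(object):
    """Space calls evenly so roughly max_calls happen per period seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two consecutive calls, in seconds.
        self.min_interval = period / float(max_calls)
        # clock and sleep are injectable so the limiter can be tested without waiting.
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

In the script, one would create `limiter = RateLimiter(30)` before the `while` loop and call `limiter.wait()` immediately before each `pq(url=...)` request.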
