@allmwh
Last active December 28, 2020 15:04
from PttWebCrawler.crawler import PttWebCrawler
import json
import re
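# Settings to change: words to strip from the comments, and the article to fetch
# (the filter words mean "today", "what", "or/still")
message_filters = ['今天','什麼','還是']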
article_id = 'M.1609135202.A.69E'
c = PttWebCrawler(as_lib=True)
c.parse_article(board='Stock',article_id=article_id)
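# parse_article saves the result as <board>-<article_id>.json; read it back in
with open('Stock-'+article_id+'.json', encoding='utf-8') as f:
    article = json.load(f)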
# Concatenate every push comment into one string (later comments end up first)
all_message = ''
for message in article['messages']:
    all_message = message['push_content'] +' '+ all_message
# Strip URLs, then replace each filtered word with a space
all_message = re.sub(r'http\S+', '', all_message)
for message_filter in message_filters:
    all_message = all_message.replace(message_filter, ' ')
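The join-and-filter steps above can also be tried without running the crawler. The following is a minimal sketch, assuming the same `messages` / `push_content` layout that the script reads; the helper name `clean_push_messages` and the sample data are hypothetical and not part of the original gist. Unlike the loop above, the join leaves no trailing space.

```python
import re

def clean_push_messages(messages, message_filters):
    """Join push comments (latest first), then strip URLs and filtered words."""
    # Reversing before joining mirrors the prepend-in-a-loop order of the script
    text = ' '.join(m['push_content'] for m in reversed(messages))
    # Drop anything that looks like a URL
    text = re.sub(r'http\S+', '', text)
    # Replace each filtered word with a space
    for word in message_filters:
        text = text.replace(word, ' ')
    return text

# Hypothetical sample data standing in for article['messages']
sample = [
    {'push_content': '今天漲停'},
    {'push_content': 'see https://example.com now'},
]
cleaned = clean_push_messages(sample, ['今天'])
print(cleaned.split())
```

Wrapping the steps in a function makes it possible to check the filtering on a small hand-written list before pointing it at a real article.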