DEMO for the daily RSS feed development log
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"# RSS Retrieval DEMO"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"## Preparation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import datetime\n",
"\n",
"import requests\n",
"\n",
"from pytz import timezone\n",
"\n",
"from pprint import pformat\n",
"import xml.dom.minidom"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Set the target time (3 days before execution) and the rssurl format"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"now = datetime.datetime.now().astimezone(timezone('Asia/Tokyo'))\n",
"rssurl_format = 'http://dev.classmethod.jp/{year:04d}/{month:02d}/{day:02d}/feed/'\n",
"\n",
"# Target the date 3 days ago\n",
"target_date = now - datetime.timedelta(days=3)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-01-13 11:56:07.155019+09:00\n"
]
}
],
"source": [
"print(target_date)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Preparation for S3 access"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"import boto3\n",
"\n",
"class S3Util:\n",
"\n",
"    def __init__(self):\n",
"        self.bucket = 'cm-sansan-demo'\n",
"        self.web_host = 'http://{}.s3-website-ap-northeast-1.amazonaws.com'.format(self.bucket)\n",
"        self.feedly_subscription_prefix = 'https://feedly.com/i/subscription/feed/'\n",
"        self.s3 = boto3.client('s3', region_name='ap-northeast-1')\n",
"\n",
"    def putRSS(self, xml_pretty_string):\n",
"        # Upload under a random key so that every upload gets its own URL\n",
"        key_name = str(uuid.uuid4()) + '.rss'\n",
"        self.s3.put_object(Bucket=self.bucket, Key=key_name, Body=xml_pretty_string,\n",
"                           ContentType='application/xml; charset=UTF-8')\n",
"\n",
"        # Return both the raw RSS URL and the Feedly subscription URL\n",
"        rss_url = '{}/{}'.format(self.web_host, key_name)\n",
"        return dict(rss_url=rss_url, feedly_url=self.feedly_subscription_prefix + rss_url)\n",
"\n",
"\n",
"S3 = S3Util()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## feedparser DEMO"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import feedparser"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://dev.classmethod.jp/2018/01/13/feed/\n"
]
}
],
"source": [
"rssurl = rssurl_format.format(year=target_date.year,\n",
" month=target_date.month,\n",
" day=target_date.day)\n",
"print(rssurl)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://dev.classmethod.jp/2018/01/13/\n"
]
}
],
"source": [
"print(rssurl[:-len('feed/')])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Fetching the raw RSS\n",
"\n",
"Fetch the XML file.  \n",
"Thanks to a WordPress feature (possibly a plugin), requesting the path `/{YYYY}/{MM}/{DD}/feed/` returns the RSS feed for the articles published on that day."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss version=\"2.0\"\n",
"\txmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\n",
"\txmlns:wfw=\"http://wellformedweb.org/CommentAPI/\"\n",
"\txmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n",
"\txmlns:atom=\"http://www.w3.org/2005/Atom\"\n",
"\txmlns:sy=\"http://purl.org/rss/1.0/modules/syndication/\"\n",
"\txmlns:slash=\"http://purl.org/rss/1.0/modules/slash/\"\n",
"\t>\n",
"\n",
"<channel>\n",
"\t<title>2018年1月13日 &#8211; Developers.IO</title>\n",
"\t<atom:link href=\"https://dev.classmethod.jp/2018/01/13/feed/\" rel=\"self\" type=\"application/rss+xml\" />\n",
"\t<link>https://dev.classmethod.jp</link>\n",
"\t<description>クラスメソッド発のAWS/iO\n"
]
}
],
"source": [
"print(requests.get(rssurl).text[:600])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Fetching with feedparser\n",
"\n",
"Here `feedparser` parses the RSS (XML) file into a dict-like structure that is easy to use from Python.  \n",
"Note, however, that some tag names are not carried over verbatim.\n",
"\n",
"Examples:\n",
"\n",
"* `<item>` -> `entries`\n",
"* `<sy:updatePeriod>` -> `sy_updateperiod`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'bozo': 0,\n",
" 'encoding': 'UTF-8',\n",
" 'entries': [{'author': '池田晃和',\n",
" 'author_detail': {'name': '池田晃和'},\n",
" 'authors': [{'name': '池田晃和'}],\n",
" 'comments': 'https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/#respond',\n",
" 'guidislink': False,\n",
" 'id': 'https://dev.classmethod.jp/?p=309312',\n",
" 'link': 'https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/',\n",
" 'links': [{'href': 'https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/',\n",
" 'rel': 'alternate',\n",
" 'type': 'text/html'}],\n",
" 'published': 'Sat, 13 Jan 2018 10:19:32 +0000',\n",
" 'published_parsed': time.struct_time(tm_year=2018, tm_mon=1, tm_mday=13, tm_hour=10, tm_min=19, tm_sec=32, tm_wday=5, tm_yday=13, tm_isdst=0),\n",
" 'slash_comments': '0',\n",
" 'summary': 'こんにちは。池田です。某音声操作デバイスの購入招待メールを申し込んでから何日経ったのかは考えないことにしました。 はじめに 今回はAWS再入門2018シリーズとして、Amazon VPC(Virtual Private [&#8230;]',\n",
" 'summary_detail': {'base': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'language': None,\n",
" 'type': 'text/html',\n",
" 'value': 'こんにちは。池田です。某音声操作デバイスの購入招待メールを申し込んでから何日経ったのかは考えないことにしました。 はじめに 今回はAWS再入門2018シリーズとして、Amazon VPC(Virtual Private [&#8230;]'},\n",
" 'tags': [{'label': None, 'scheme': None, 'term': 'AWS'},\n",
" {'label': None, 'scheme': None, 'term': '初心者向け'}],\n",
" 'title': 'AWS再入門2018 Amazon VPC(Virtual Private Cloud)編',\n",
" 'title_detail': {'base': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'language': None,\n",
" 'type': 'text/plain',\n",
" 'value': 'AWS再入門2018 Amazon VPC(Virtual Private Cloud)編'},\n",
" 'wfw_commentrss': 'https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/feed/'}],\n",
" 'etag': '\"1abd23b52dd81f4f53b6142ffb9b698a\"',\n",
" 'feed': {'generator': 'https://wordpress.org/?v=4.8.3',\n",
" 'generator_detail': {'name': 'https://wordpress.org/?v=4.8.3'},\n",
" 'language': 'ja',\n",
" 'link': 'https://dev.classmethod.jp',\n",
" 'links': [{'href': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'rel': 'self',\n",
" 'type': 'application/rss+xml'},\n",
" {'href': 'https://dev.classmethod.jp',\n",
" 'rel': 'alternate',\n",
" 'type': 'text/html'}],\n",
" 'subtitle': 'クラスメソッド発のAWS/iOS/Android技術者必読メディア',\n",
" 'subtitle_detail': {'base': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'language': None,\n",
" 'type': 'text/html',\n",
" 'value': 'クラスメソッド発のAWS/iOS/Android技術者必読メディア'},\n",
" 'sy_updatefrequency': '1',\n",
" 'sy_updateperiod': 'hourly',\n",
" 'title': '2018年1月13日 – Developers.IO',\n",
" 'title_detail': {'base': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'language': None,\n",
" 'type': 'text/plain',\n",
" 'value': '2018年1月13日 – Developers.IO'},\n",
" 'updated': 'Tue, 16 Jan 2018 00:34:29 +0000',\n",
" 'updated_parsed': time.struct_time(tm_year=2018, tm_mon=1, tm_mday=16, tm_hour=0, tm_min=34, tm_sec=29, tm_wday=1, tm_yday=16, tm_isdst=0)},\n",
" 'headers': {'Age': '1799',\n",
" 'Cache-Control': 'max-age=1200',\n",
" 'Connection': 'close',\n",
" 'Content-Length': '1910',\n",
" 'Content-Type': 'application/rss+xml; charset=UTF-8',\n",
" 'Date': 'Tue, 16 Jan 2018 02:56:08 GMT',\n",
" 'ETag': '\"1abd23b52dd81f4f53b6142ffb9b698a\"',\n",
" 'Expires': 'Tue, 16 Jan 2018 03:16:08 GMT',\n",
" 'Last-Modified': 'Tue, 16 Jan 2018 00:34:29 GMT',\n",
" 'Link': '<https://dev.classmethod.jp/wp-json/>; rel=\"https://api.w.org/\"',\n",
" 'Server': 'nginx/1.12.1',\n",
" 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',\n",
" 'Via': '1.1 f8d8adbca93b2103b91125b9af9bf238.cloudfront.net (CloudFront)',\n",
" 'X-Amz-Cf-Id': '1nRBFPMc-zbrNYLO6Zt439QghvfO_LcYEdZnaajFsKA8ynGwFze2bg==',\n",
" 'X-Cache': 'Hit from cloudfront'},\n",
" 'href': 'https://dev.classmethod.jp/2018/01/13/feed/',\n",
" 'namespaces': {'': 'http://www.w3.org/2005/Atom',\n",
" 'content': 'http://purl.org/rss/1.0/modules/content/',\n",
" 'dc': 'http://purl.org/dc/elements/1.1/',\n",
" 'slash': 'http://purl.org/rss/1.0/modules/slash/',\n",
" 'sy': 'http://purl.org/rss/1.0/modules/syndication/',\n",
" 'wfw': 'http://wellformedweb.org/CommentAPI/'},\n",
" 'status': 301,\n",
" 'updated': 'Tue, 16 Jan 2018 00:34:29 GMT',\n",
" 'updated_parsed': time.struct_time(tm_year=2018, tm_mon=1, tm_mday=16, tm_hour=0, tm_min=34, tm_sec=29, tm_wday=1, tm_yday=16, tm_isdst=0),\n",
" 'version': 'rss20'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feed = feedparser.parse(rssurl)\n",
"\n",
"feed"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Using this, you can also extract a simple list of titles like the one below."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018/01/13の記事\n",
"\n",
"title: AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)\n",
" url: https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/\n"
]
}
],
"source": [
"print(target_date.strftime('%Y/%m/%dの記事'))\n",
"print()\n",
"for e in feed['entries']:\n",
"    print('title: {title}({author})'.format(**e))\n",
"    print(' url: {link}'.format(**e))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## feedgenerator DEMO\n",
"\n",
"Next, we generate RSS using `feedgenerator`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Preparation"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import feedgenerator"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Create a feedgenerator instance\n",
"title = '日刊Developers.IO'\n",
"link = 'http://dev.classmethod.jp'\n",
"feed_url = 'http://dev.classmethod.jp'\n",
"description = 'AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ'\n",
"feed_gen = feedgenerator.Rss201rev2Feed(title=title, link=link, feed_url=feed_url, description=description, language=\"ja\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# HTML template for displaying a single Developers.IO article\n",
"feed_template = \"\"\"<div style=\"overflow: hidden; margin-top: 1rem;\">\n",
"<img src=\"{img}\" style=\"width: 6rem; height: 6rem; float: left; border: 1px solid #ddd; border-radius: 0.5rem; margin: 0 1rem 0 0;\">\n",
"<a href=\"{href}\" target=\"_blank\" style=\"font-weight: bold;font-size: 1.2rem;\">{title}</a>\n",
" <div class=\"summary\">{description}</div>\n",
"</div>\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def create_daily_entry(feed, publish_dt):\n",
"    \"\"\"\n",
"    Build a dict that can be passed straight to feedgenerator.\n",
"    title: title of the entry\n",
"    link: URL of the page that lists that day's articles, e.g. http://dev.classmethod.jp/2018/01/08/\n",
"    content: body that combines the summaries of all articles\n",
"    description: article titles and authors joined into one line\n",
"    pubdate: publication date\n",
"    \"\"\"\n",
"    default_icon_url = 'https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2013/09/icatch.png'\n",
"    entry_parts = [dict(img=default_icon_url,\n",
"                        href=e['link'],\n",
"                        title=e['title'],\n",
"                        description=e['summary']) for e in feed['entries']]\n",
"    content = ''.join([feed_template.format(**part) for part in entry_parts])\n",
"    description = ' '.join(['{title}({author})'.format(title=e['title'], author=e['author']) for e in feed['entries']])\n",
"    return dict(title='{}の記事一覧'.format(publish_dt.strftime('%Y/%m/%d')),\n",
"                link=feed['href'][:-len('feed/')],\n",
"                content=content,\n",
"                description=description,\n",
"                pubdate=publish_dt,\n",
"                unique_id=feed['href'][:-len('feed/')])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Running it yields a dict like this"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': '<div style=\"overflow: hidden; margin-top: 1rem;\">\\n'\n",
" '<img '\n",
" 'src=\"https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2013/09/icatch.png\" '\n",
" 'style=\"width: 6rem; height: 6rem; float: left; border: 1px solid '\n",
" '#ddd; border-radius: 0.5rem; margin: 0 1rem 0 0;\">\\n'\n",
" '<a '\n",
" 'href=\"https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/\" '\n",
" 'target=\"_blank\" style=\"font-weight: bold;font-size: '\n",
" '1.2rem;\">AWS再入門2018 Amazon VPC(Virtual Private Cloud)編</a>\\n'\n",
" ' <div '\n",
" 'class=\"summary\">こんにちは。池田です。某音声操作デバイスの購入招待メールを申し込んでから何日経ったのかは考えないことにしました。 '\n",
" 'はじめに 今回はAWS再入門2018シリーズとして、Amazon VPC(Virtual Private '\n",
" '[&#8230;]</div>\\n'\n",
" '</div>\\n',\n",
" 'description': 'AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)',\n",
" 'link': 'https://dev.classmethod.jp/2018/01/13/',\n",
" 'pubdate': datetime.datetime(2018, 1, 14, 2, 0, tzinfo=<DstTzInfo 'Asia/Tokyo' LMT+9:19:00 STD>),\n",
" 'title': '2018/01/14の記事一覧',\n",
" 'unique_id': 'https://dev.classmethod.jp/2018/01/13/'}\n"
]
}
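],
"source": [
"# Set the publication time to 2:00 AM on the day after the target date.\n",
"# Note: pytz recommends tz.localize(naive_dt) over passing the zone as tzinfo;\n",
"# passing it directly is why the offset shows up as LMT+9:19 instead of +09:00.\n",
"publish_dt = datetime.datetime(target_date.year,\n",
"                               target_date.month,\n",
"                               target_date.day,\n",
"                               2, 0, 0, 0, timezone('Asia/Tokyo')) + datetime.timedelta(days=1)\n",
"daily_entry = create_daily_entry(feed, publish_dt)\n",
"print(pformat(daily_entry))"
]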
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Add the entry to the feedgenerator instance"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"feed_gen.add_item(**daily_entry)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Output the XML"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'<?xml version=\"1.0\" encoding=\"utf-8\"?>\\n<rss version=\"2.0\"><channel><title>日刊Developers.IO</title><link>http://dev.classmethod.jp</link><description>AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ</description><language>ja</language><lastBuildDate>Sun, 14 Jan 2018 02:00:00 +0919</lastBuildDate><item><title>2018/01/14の記事一覧</title><link>https://dev.classmethod.jp/2018/01/13/</link><description>AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)</description><pubDate>Sun, 14 Jan 2018 02:00:00 +0919</pubDate><guid isPermaLink=\"false\">https://dev.classmethod.jp/2018/01/13/</guid></item></channel></rss>'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feed_gen.writeString('utf-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The XML is hard to read as is, so we pretty-print it with `xml.dom.minidom`, which was imported during the preparation step at the top."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<?xml version=\"1.0\" ?>\n",
"<rss version=\"2.0\">\n",
"\t<channel>\n",
"\t\t<title>日刊Developers.IO</title>\n",
"\t\t<link>http://dev.classmethod.jp</link>\n",
"\t\t<description>AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ</description>\n",
"\t\t<language>ja</language>\n",
"\t\t<lastBuildDate>Sun, 14 Jan 2018 02:00:00 +0919</lastBuildDate>\n",
"\t\t<item>\n",
"\t\t\t<title>2018/01/14の記事一覧</title>\n",
"\t\t\t<link>https://dev.classmethod.jp/2018/01/13/</link>\n",
"\t\t\t<description>AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)</description>\n",
"\t\t\t<pubDate>Sun, 14 Jan 2018 02:00:00 +0919</pubDate>\n",
"\t\t\t<guid isPermaLink=\"false\">https://dev.classmethod.jp/2018/01/13/</guid>\n",
"\t\t</item>\n",
"\t</channel>\n",
"</rss>\n",
"\n"
]
}
],
"source": [
"_xml = xml.dom.minidom.parseString(feed_gen.writeString('utf-8'))\n",
"print(_xml.toprettyxml())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's upload this XML to S3 for now and check it in Feedly."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"result_urls = S3.putRSS(_xml.toprettyxml())"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://feedly.com/i/subscription/feed/http://cm-sansan-demo.s3-website-ap-northeast-1.amazonaws.com/dd5d862f-80cc-4bed-b1cd-ba92dc981818.rss\n"
]
}
],
"source": [
"print(result_urls['feedly_url'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"As it stands, the output has no `content:encoded` element, which makes for a very bland RSS feed.  \n",
"`content:encoded` comes from the RSS 1.0 content module and is apparently not available by default in RSS 2.0, so the root element has to be extended as follows:  \n",
"`<rss version=\"2.0\">` -> `<rss version=\"2.0\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">`\n",
"\n",
"To make that extension, we need to subclass `feedgenerator.Rss201rev2Feed` and override it to add both the module declaration and the new element."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"class MyRss201rev2Feed(feedgenerator.Rss201rev2Feed):\n",
"\n",
"    def rss_attributes(self):\n",
"        \"\"\"\n",
"        Override that adds the content module namespace.\n",
"        \"\"\"\n",
"        return {'version': self._version,\n",
"                'xmlns:content': 'http://purl.org/rss/1.0/modules/content/'}\n",
"\n",
"    def add_item_elements(self, handler, item):\n",
"        \"\"\"\n",
"        Override that adds the <content:encoded> tag.\n",
"        \"\"\"\n",
"        # Let the parent Rss201rev2Feed write the standard item elements first\n",
"        super(MyRss201rev2Feed, self).add_item_elements(handler, item)\n",
"\n",
"        # Emit <content:encoded> only when the item carries content\n",
"        if 'content' in item and item['content'] is not None:\n",
"            handler.addQuickElement(\"content:encoded\", item['content'], {\"type\": \"html\"})\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Recreate feed_gen as an instance of `MyRss201rev2Feed`"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Create the output feed\n",
"title = '日刊Developers.IO'\n",
"link = 'http://dev.classmethod.jp'\n",
"feed_url = 'http://dev.classmethod.jp'\n",
"description = 'AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ'\n",
"feed_gen = MyRss201rev2Feed(title=title, link=link, feed_url=feed_url, description=description, language=\"ja\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"feed_gen.add_item(**daily_entry)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<?xml version=\"1.0\" ?>\n",
"<rss version=\"2.0\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">\n",
"\t<channel>\n",
"\t\t<title>日刊Developers.IO</title>\n",
"\t\t<link>http://dev.classmethod.jp</link>\n",
"\t\t<description>AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ</description>\n",
"\t\t<language>ja</language>\n",
"\t\t<lastBuildDate>Sun, 14 Jan 2018 02:00:00 +0919</lastBuildDate>\n",
"\t\t<item>\n",
"\t\t\t<title>2018/01/14の記事一覧</title>\n",
"\t\t\t<link>https://dev.classmethod.jp/2018/01/13/</link>\n",
"\t\t\t<description>AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)</description>\n",
"\t\t\t<pubDate>Sun, 14 Jan 2018 02:00:00 +0919</pubDate>\n",
"\t\t\t<guid isPermaLink=\"false\">https://dev.classmethod.jp/2018/01/13/</guid>\n",
"\t\t\t<content:encoded type=\"html\">&lt;div style=&quot;overflow: hidden; margin-top: 1rem;&quot;&gt;\n",
"&lt;img src=&quot;https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2013/09/icatch.png&quot; style=&quot;width: 6rem; height: 6rem; float: left; border: 1px solid #ddd; border-radius: 0.5rem; margin: 0 1rem 0 0;&quot;&gt;\n",
"&lt;a href=&quot;https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/&quot; target=&quot;_blank&quot; style=&quot;font-weight: bold;font-size: 1.2rem;&quot;&gt;AWS再入門2018 Amazon VPC(Virtual Private Cloud)編&lt;/a&gt;\n",
" &lt;div class=&quot;summary&quot;&gt;こんにちは。池田です。某音声操作デバイスの購入招待メールを申し込んでから何日経ったのかは考えないことにしました。 はじめに 今回はAWS再入門2018シリーズとして、Amazon VPC(Virtual Private [&amp;#8230;]&lt;/div&gt;\n",
"&lt;/div&gt;\n",
"</content:encoded>\n",
"\t\t</item>\n",
"\t</channel>\n",
"</rss>\n",
"\n"
]
}
],
"source": [
"_xml = xml.dom.minidom.parseString(feed_gen.writeString('utf-8'))\n",
"print(_xml.toprettyxml())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's upload this one to S3 as well and check it."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"result_urls = S3.putRSS(_xml.toprettyxml())"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://feedly.com/i/subscription/feed/http://cm-sansan-demo.s3-website-ap-northeast-1.amazonaws.com/7927d5d4-1d86-4483-b14a-0fc77e9b2211.rss\n"
]
}
],
"source": [
"print(result_urls['feedly_url'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now that `content:encoded` is included, HTML can be used and the RSS looks somewhat better."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## beautifulsoup4 DEMO\n",
"\n",
"The default eye-catch (featured) image looks a bit dull, so let's use each article's own eye-catch image in the RSS.  \n",
"To do so, we scrape the article's HTML with `beautifulsoup4` and extract the URL of its eye-catch image."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sample, get the URL of the first article."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"url: https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/\n"
]
}
],
"source": [
"html_url = feed['entries'][0]['link']\n",
"print('url:', html_url)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Use `beautifulsoup4` to extract the tag from this article."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"html_page = requests.get(html_url)\n",
"soup = BeautifulSoup(html_page.content, 'html.parser')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Find the <meta property=\"og:image\" /> tag\n",
"og_img = soup.find('meta', attrs={'property': 'og:image', 'content': True})"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`find` returns the first `meta` tag whose attributes include `property=og:image` and a `content` value.  \n",
"If no such tag exists, it returns `None`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n"
]
}
],
"source": [
"print(soup.find('hoge'))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<meta content=\"https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2014/05/Amazon_VPC-320x320.png\" property=\"og:image\"/>\n",
"\n",
"====================\n",
"\n",
"{'property': 'og:image', 'content': 'https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2014/05/Amazon_VPC-320x320.png'}\n",
"\n",
"--------------------\n",
"\n",
"https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2014/05/Amazon_VPC-320x320.png\n"
]
}
],
"source": [
"print(og_img)\n",
"print('\\n'+'='*20+'\\n')\n",
"print(og_img.attrs)\n",
"print('\\n'+'-'*20+'\\n')\n",
"print(og_img.get('content'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Helper for checking images (no code walkthrough here!)\n",
"\n",
"As an aside: Jupyter is normally used to display tables and charts, so showing an image as a cell result is a piece of cake too. This is a DEMO of that."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Preparation"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"from PIL import Image\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"%matplotlib inline\n",
"\n",
"import io\n",
"import requests\n",
"\n",
"def show_img(url):\n",
"    \"\"\"\n",
"    see:\n",
"        https://qiita.com/zaburo/items/5637b424c655b136527a\n",
"        https://teratail.com/questions/71426\n",
"    \"\"\"\n",
"    plt.imshow(np.asarray(Image.open(io.BytesIO(requests.get(url, stream=True).content))))\n",
"    plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Display the eye-catch image fetched earlier"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa2d32b7898>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_img(og_img.get('content'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display the default eye-catch image"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fa2ca569cf8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# This is the default eye-catch image (default_icon_url)\n",
"show_img('https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2013/09/icatch.png')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Apply the fetched eye-catch image to the RSS content"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def create_daily_entry2(feed, publish_dt):\n",
"    \"\"\"\n",
"    Build a dict that can be passed straight to feedgenerator.\n",
"    title: title of the entry\n",
"    link: URL of the page that lists that day's articles, e.g. http://dev.classmethod.jp/2018/01/08/\n",
"    content: body that combines the summaries of all articles\n",
"    description: article titles and authors joined into one line\n",
"    pubdate: publication date\n",
"    \"\"\"\n",
"    def _get_icatch_url(link):\n",
"        # Fall back to the default icon when the page has no og:image meta tag\n",
"        default_icon_url = 'https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2013/09/icatch.png'\n",
"        html_page = requests.get(link)\n",
"        soup = BeautifulSoup(html_page.content, 'html.parser')\n",
"        og_img = soup.find('meta', attrs={'property': 'og:image', 'content': True})\n",
"        if og_img:\n",
"            return og_img.get('content', default_icon_url)\n",
"        return default_icon_url\n",
"\n",
"    entry_parts = [dict(img=_get_icatch_url(e['link']),\n",
"                        href=e['link'],\n",
"                        title=e['title'],\n",
"                        description=e['summary']) for e in feed['entries']]\n",
"    content = ''.join([feed_template.format(**part) for part in entry_parts])\n",
"    description = ' '.join(['{title}({author})'.format(title=e['title'], author=e['author']) for e in feed['entries']])\n",
"    return dict(title='{}の記事一覧'.format(publish_dt.strftime('%Y/%m/%d')),\n",
"                link=feed['href'][:-len('feed/')],\n",
"                content=content,\n",
"                description=description,\n",
"                pubdate=publish_dt,\n",
"                unique_id=feed['href'][:-len('feed/')])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Recreate the `feed_gen` instance"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Create the output feed\n",
"title = '日刊Developers.IO'\n",
"link = 'http://dev.classmethod.jp'\n",
"feed_url = 'http://dev.classmethod.jp'\n",
"description = 'AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ'\n",
"feed_gen = MyRss201rev2Feed(title=title, link=link, feed_url=feed_url, description=description, language=\"ja\")\n",
"\n",
"feed_gen.add_item(**create_daily_entry2(feed, publish_dt))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The XML output looks like this"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<?xml version=\"1.0\" ?>\n",
"<rss version=\"2.0\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">\n",
"\t<channel>\n",
"\t\t<title>日刊Developers.IO</title>\n",
"\t\t<link>http://dev.classmethod.jp</link>\n",
"\t\t<description>AWS/iOS技術者の必読メディア:クラスメソッド株式会社ブログ</description>\n",
"\t\t<language>ja</language>\n",
"\t\t<lastBuildDate>Sun, 14 Jan 2018 02:00:00 +0919</lastBuildDate>\n",
"\t\t<item>\n",
"\t\t\t<title>2018/01/14の記事一覧</title>\n",
"\t\t\t<link>https://dev.classmethod.jp/2018/01/13/</link>\n",
"\t\t\t<description>AWS再入門2018 Amazon VPC(Virtual Private Cloud)編(池田晃和)</description>\n",
"\t\t\t<pubDate>Sun, 14 Jan 2018 02:00:00 +0919</pubDate>\n",
"\t\t\t<guid isPermaLink=\"false\">https://dev.classmethod.jp/2018/01/13/</guid>\n",
"\t\t\t<content:encoded type=\"html\">&lt;div style=&quot;overflow: hidden; margin-top: 1rem;&quot;&gt;\n",
"&lt;img src=&quot;https://cdn-ssl-devio-img.classmethod.jp/wp-content/uploads/2014/05/Amazon_VPC-320x320.png&quot; style=&quot;width: 6rem; height: 6rem; float: left; border: 1px solid #ddd; border-radius: 0.5rem; margin: 0 1rem 0 0;&quot;&gt;\n",
"&lt;a href=&quot;https://dev.classmethod.jp/cloud/aws/2018-aws-re-entering-vpc/&quot; target=&quot;_blank&quot; style=&quot;font-weight: bold;font-size: 1.2rem;&quot;&gt;AWS再入門2018 Amazon VPC(Virtual Private Cloud)編&lt;/a&gt;\n",
" &lt;div class=&quot;summary&quot;&gt;こんにちは。池田です。某音声操作デバイスの購入招待メールを申し込んでから何日経ったのかは考えないことにしました。 はじめに 今回はAWS再入門2018シリーズとして、Amazon VPC(Virtual Private [&amp;#8230;]&lt;/div&gt;\n",
"&lt;/div&gt;\n",
"</content:encoded>\n",
"\t\t</item>\n",
"\t</channel>\n",
"</rss>\n",
"\n"
]
}
],
"source": [
"_xml = xml.dom.minidom.parseString(feed_gen.writeString('utf-8'))\n",
"print(_xml.toprettyxml())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's upload this one to S3 too and check it in Feedly."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"result_urls = S3.putRSS(_xml.toprettyxml())"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://feedly.com/i/subscription/feed/http://cm-sansan-demo.s3-website-ap-northeast-1.amazonaws.com/af86dbef-f130-4e9c-88c0-06ac561791c7.rss\n"
]
}
],
"source": [
"print(result_urls['feedly_url'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# End of the DEMO!"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}