@t2psyto
Last active July 25, 2021 12:30
Fetch PDFs from Nikkei BP BizBoard
# Fetch PDFs from Nikkei BP BizBoard
#
# required: busybox-w32
#
#https://bizboard.nikkeibp.co.jp/simple/parts/search_mag.html
#URL: http://bizboard.nikkeibp.co.jp/simple/parts/mag_LIN2020.html
#URL: http://bizboard.nikkeibp.co.jp/simple/parts/mag_NSW2020.html
# Fetch the article list page for the 2020-01 issue > _index.html
# e.g. a link on the web page → javascript:funcSearch( '44', '20200101', '0230F');
#set PATH=e:\opt\bin;%PATH%
#busybox64 bash
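# Arguments, in the same order as the funcSearch(...) example:
#   ARG1 = value for @writemediaid (e.g. 44)
#   ARG2 = search start/end date as YYYYMMDD (e.g. 20200101)
#   ARG3 = value for @subsubject (e.g. 0230F)
ARG1=$1; ARG2=$2; ARG3=$3
echo "ARG1=$ARG1 ARG2=$ARG2 ARG3=$ARG3 ${ARG2}_${ARG3}_index.html"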
curl -o "${ARG2}_${ARG3}_index.html" 'http://bizboard.nikkeibp.co.jp/simple_notauth/SearchServlet' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Origin: http://bizboard.nikkeibp.co.jp' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Referer: http://bizboard.nikkeibp.co.jp/simple/parts/mag.html' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  --data "ACTION=3&ONLINE_MEDIA=%28%40kind%3D%3D1%26%40announcemediaid%3D%3D24%26%40writemediaid%3D%3D${ARG1}%29%26%28%40subsubject%3A%3D${ARG3}%29&SEARCH_START_DATE=${ARG2}&SEARCH_END_DATE=${ARG2}&j_encoding=UTF-8&NO_TERM_SEARCH=1&MAGAZINE_GET_COUNT=100" \
  --insecure
# Extract the links to the PDF files from _index.html > filelist.txt
# (dots are escaped so they match literally instead of matching any character)
grep -o -E "http://bizboard\.nikkeibp\.co\.jp/houjin/cgi-bin/nsearch/md_pdf\.pl/[0-9]+\.pdf\?NEWS_ID=[0-9]+&CONTENTS=1&bt=[A-Z]+&SYSTEM_ID=HO" "${ARG2}_${ARG3}_index.html" > "${ARG2}_${ARG3}_filelist.txt"
# Download the PDFs listed in filelist.txt (one URL per curl call, up to 4 in parallel)
xargs -n 1 -P 4 curl -LOJ -e "http://bizboard.nikkeibp.co.jp/simple_notauth/SearchServlet" < "${ARG2}_${ARG3}_filelist.txt"
# As a sanity check, show the number of entries in filelist.txt and the number of downloaded PDF files
wc -l "${ARG2}_${ARG3}_filelist.txt"
ls *pdf* | wc -l
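# Clean up the PDF file extensions; 'xxxxxx.pdf_????????' → 'xxxxxx.pdf'
for f in *pdf*; do
  new="${f%%.*}.pdf"
  # Skip an unmatched glob and names that are already clean
  if [ -e "$f" ] && [ "$f" != "$new" ]; then mv -- "$f" "$new"; fi
done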
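The extension cleanup at the end of the script relies on the shell parameter expansion `${name%%.*}`, which removes the longest suffix that starts with a dot. The following is a minimal sketch of that expansion in isolation. The helper name `clean_pdf_name` and the sample file names are hypothetical and do not appear in the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): derive the cleaned
# name by cutting everything from the FIRST dot onward and appending ".pdf".
clean_pdf_name() {
  printf '%s.pdf\n' "${1%%.*}"
}

# Sample names below are made up for illustration.
clean_pdf_name '123456.pdf_ABCDEFGH'   # prints 123456.pdf
clean_pdf_name '123456.pdf'            # prints 123456.pdf (already clean)
clean_pdf_name 'a.b.pdf_ABCDEFGH'      # prints a.pdf (everything after the first dot is dropped)
```

Because `%%` strips from the first dot, a name with an extra dot in its stem would be shortened more than intended. If such names could occur, `${1%.pdf*}` (shortest suffix starting at `.pdf`) may be a safer choice.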