@nezuQ
Created June 1, 2014 15:37
A thorough guide to PDF open data: getting started with PDF text analysis using PDFMiner. ref: http://qiita.com/nezuq/items/75e8366d68c66e56ff53
python setup.py install
pdf2txt.py samples/simple1.pdf
# If "Hello World" is printed repeatedly after running the command, it worked.
# -> Text was successfully extracted from the sample PDF.
# For CJK (Chinese/Japanese/Korean) PDFs: build the CMap data, then reinstall.
make cmap
python setup.py install
rem On Windows (no make): build the same CJK CMap data by hand, then reinstall.
mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
pdf2txt.py -o output.txt input.pdf
dumppdf.py -a foo.pdf
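
The `pdf2txt.py -o output.txt input.pdf` command above converts one file at a time. As a rough sketch of how it might be applied to a whole folder of open-data PDFs, the Python script below builds one such command per PDF and can optionally run them. The function names `build_commands` and `convert_all` and the directory names are hypothetical, chosen only for illustration. The script also assumes `pdf2txt.py` is directly runnable from your PATH, which may not hold on Windows, where you may need to prefix the command with the Python interpreter and the full script path.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_dir, out_dir, script="pdf2txt.py"):
    """Return one `pdf2txt.py -o <name>.txt <name>.pdf` command per PDF.

    `script` is assumed to be runnable as-is (i.e. found on PATH).
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    commands = []
    # sorted() keeps the processing order stable between runs.
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out = out_dir / (pdf.stem + ".txt")
        commands.append([script, "-o", str(out), str(pdf)])
    return commands


def convert_all(pdf_dir, out_dir, script="pdf2txt.py"):
    """Run every command from build_commands; stop at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_dir, out_dir, script):
        # check=True raises CalledProcessError if pdf2txt.py exits non-zero.
        subprocess.run(cmd, check=True)


# Example (hypothetical directory names):
# convert_all("pdfs", "texts")
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before running them.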