@chezou
Created September 11, 2018 13:33
(venv) ➜ pdfminer-test pip freeze
backports.shutil-get-terminal-size==1.0.0
certifi==2018.8.24
chardet==3.0.4
decorator==4.3.0
distro==1.3.0
enum34==1.1.6
idna==2.7
ipython==5.8.0
ipython-genutils==0.2.0
numpy==1.15.1
pandas==0.23.4
pathlib2==2.3.2
pdfminer==20140328
pexpect==4.6.0
pickleshare==0.7.4
pkg-resources==0.0.0
prompt-toolkit==1.0.15
ptyprocess==0.6.0
Pygments==2.2.0
pyPdf==1.13
python-dateutil==2.7.3
python-magic==0.4.15
pytz==2018.5
requests==2.19.1
scandir==1.9.0
simplegeneric==0.8.1
six==1.11.0
tabula-py==1.2.0
traitlets==4.3.2
urllib3==1.23
wcwidth==0.1.7
In [1]: import magic
In [2]: from pyPdf import PdfFileReader
In [3]: import tabula
In [4]: filename = '/home/aki/source/tabula/10736.pdf'
In [5]: magic.from_file(filename, mime=True)
Out[5]: 'application/pdf'
In [6]: ifpdf = PdfFileReader(file(filename, "rb"))
In [7]: pdf_info = ifpdf.getDocumentInfo()
In [8]: pdf_info
Out[8]:
{'/Author': u'U.S. Census Bureau',
'/CreationDate': u"D:20110818133834-04'00'",
'/Creator': u'Adobe InDesign CS4 (6.0)',
'/ModDate': u"D:20110830152334-04'00'",
'/Producer': u'Adobe PDF Library 9.0',
'/Title': u'Arrests by Sex and Age'}
In [9]: nm = ['Info_1', 'Info_2', 'Info_3', 'Info_4']
In [11]: df = tabula.read_pdf(filename, pages='all', lattice=True, pandas_options={'header': None, 'names': nm, 'encoding': 'utf-8'})
Sep 11, 2018 10:30:31 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO: To get higher rendering speed on JDK8 or later,
Sep 11, 2018 10:30:31 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Sep 11, 2018 10:30:31 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
In [12]: df
Out[12]:
Info_1 Info_2 Info_3 Info_4
0 Total Male NaN NaN
1 Under 1818 years\rTotalyearsand over Under 1818 years\rTotalyearsand over NaN NaN
2 11,062 .61,540 .09,522 .6\r467 .969 .1398 .8\r... 8,263 .31,071 .67,191 .7\r380 .256 .5323 .7\r9... NaN NaN
3 Total NaN NaN NaN
4 10,690,561\r456,965\r9,739\r16,362\r100,496\r3... NaN NaN NaN
In [13]: df.to_csv("test.csv", encoding="utf-8")
In [14]: !cat test.csv
,Info_1,Info_2,Info_3,Info_4
0,Total,Male,,
Totalyearsand over,,nder 1818 years
34.034.0(X)",,142.9 .31,071 .67,191 .7
3,Total,,,
73,616",,,561
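The scrambled `cat test.csv` output above is consistent with the bare `\r` characters visible inside the cells in `Out[12]`: a terminal moves the cursor back to the start of the line on each carriage return, so later text overwrites earlier text. A minimal sketch of one possible workaround is to replace those characters before writing the CSV. The helper name `clean_cell` and the sample string (the visible prefix of row 4) are illustrative only and not part of the original session.

```python
import re

try:
    string_types = basestring  # Python 2, as used in the session above
except NameError:
    string_types = str  # Python 3


def clean_cell(value):
    """Collapse carriage returns and newlines inside a cell into single spaces."""
    if isinstance(value, string_types):
        return re.sub(r"[\r\n]+", " ", value).strip()
    return value  # leave NaN and numeric values untouched


# Visible prefix of row 4 from Out[12]; the rest of the cell is elided there.
sample = "10,690,561\r456,965\r9,739"
cleaned = clean_cell(sample)
print(cleaned)
```

On the DataFrame from the session, something like `df.applymap(clean_cell)` before `df.to_csv(...)` would presumably apply the same cleanup to every cell. This only makes the CSV readable; it does not split the merged values back into separate columns.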