@maphew
Last active June 13, 2019 19:12
From http://stackoverflow.com/a/34116472/14420 in answer to "Extract images from PDF without resampling, in python?"
> pip install --upgrade https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
Collecting https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Downloading https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
| 307kB 5.9MB/s
Installing collected packages: PyPDF2
Found existing installation: PyPDF2 1.25.1
Uninstalling PyPDF2-1.25.1:
Successfully uninstalled PyPDF2-1.25.1
Running setup.py install for PyPDF2
Successfully installed PyPDF2-1.25.1
[py27] E:\temp
> pip list
arcplus (0.1, d:\b\code\arcplus)
comtypes (1.1.2)
matplotlib (1.3.0)
numpy (1.7.1)
Pillow (3.0.0)
pip (7.1.2)
pyparsing (1.5.7)
PyPDF2 (1.25.1)
pywin32 (219)
setuptools (15.0)
[py27] E:\temp\pdf-image-extractor
> python pdf-image-extractor.py "Seige of Vicksburg Sample OCR.pdf"
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1722]
Traceback (most recent call last):
File "pdf-image-extractor.py", line 21, in <module>
if xObject[obj]['/Filter'] == '/FlateDecode':
File "C:\Python27\ArcGIS10.3\lib\site-packages\PyPDF2\generic.py", line 512, in __getitem__
return dict.__getitem__(self, key).getObject()
KeyError: '/Filter'
import PyPDF2
from PIL import Image

if __name__ == '__main__':
    ## pdf = r'e:\temp\dctdecode.pdf'
    pdf = r'e:\temp\Seige of Vicksburg Sample OCR.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            # obj is a PDF name such as '/Im0'; obj[1:] drops the leading slash.
            # Some image streams have no /Filter entry; indexing it directly raises KeyError.
            if '/Filter' not in xObject[obj]:
                print("Skipping %s: no /Filter entry" % obj)
            elif xObject[obj]['/Filter'] == '/FlateDecode':
                # Decoded pixel data: rebuild the image and save it as PNG.
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                # The stream is already a complete JPEG file.
                with open(obj[1:] + ".jpg", "wb") as img:
                    img.write(data)
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                # The stream is already a complete JPEG 2000 file.
                with open(obj[1:] + ".jp2", "wb") as img:
                    img.write(data)
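
The script compares `/Filter` against a single name, but a PDF stream may also carry an array of filter names, or no `/Filter` at all (the cause of the `KeyError` in the console log above). A minimal sketch of a lookup that tolerates all three cases follows. `FILTER_EXTENSIONS` and `extension_for_filter` are hypothetical names introduced here for illustration; they are not part of the gist or of PyPDF2.

```python
# Hypothetical helper, not part of the original script.
# Maps the image-codec filter of a PDF stream to the extension the script writes.
FILTER_EXTENSIONS = {
    '/FlateDecode': '.png',  # decoded pixels, re-encoded as PNG by the script
    '/DCTDecode': '.jpg',    # stream is already a JPEG file
    '/JPXDecode': '.jp2',    # stream is already a JPEG 2000 file
}


def extension_for_filter(filter_value):
    """Return a file extension for a /Filter value, or None if missing or unsupported."""
    if filter_value is None:
        return None
    # /Filter is either one name or an array of names.
    if isinstance(filter_value, (list, tuple)):
        names = list(filter_value)
    else:
        names = [filter_value]
    if not names:
        return None
    # Filters are listed in decoding order, so the last one is the image codec.
    return FILTER_EXTENSIONS.get(str(names[-1]))
```

In the loop, the value could be fetched with `xObject[obj]['/Filter'] if '/Filter' in xObject[obj] else None`, and images for which the helper returns `None` could be skipped with a message instead of raising.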
@maphew
Author

maphew commented Dec 11, 2015

I removed the old result console reports, since the new version makes those errors obsolete.

@kadnan

kadnan commented Oct 9, 2016

Traceback (most recent call last):
  File "/xx/xx/extract.py", line 16, in <module>
    data = xObject[obj].getData()
  File "/Library/Python/2.7/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Python/2.7/site-packages/PyPDF2/filters.py", line 361, in decodeStreamData
    raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /CCITTFaxDecode
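
The traceback above shows that `getData()` cannot decode `/CCITTFaxDecode` streams. One common workaround is to leave the fax data compressed and prepend a TIFF header, since TIFF can carry CCITT Group 3 and Group 4 data directly. The sketch below builds such a header. The function names are hypothetical; obtaining the undecoded bytes is left to the caller (in PyPDF2 1.x they appear to live in the stream's private `_data` attribute, which is an assumption, not a documented API), and the Group 4 default assumes the image's `/DecodeParms` has a negative `/K`.

```python
import struct


def tiff_header_for_ccitt(width, height, data_size, group=4):
    """Build a little-endian TIFF header for one strip of CCITT fax data.

    group: 4 for CCITT Group 4 (PDF /K < 0), 3 for Group 3.
    """
    # 8-byte file header, 2-byte tag count, eight 12-byte tags, 4-byte next-IFD offset.
    fmt = '<2sHL' + 'H' + 'HHLL' * 8 + 'L'
    return struct.pack(
        fmt,
        b'II', 42, 8,          # byte order, TIFF magic number, offset of first IFD
        8,                     # number of tags in the IFD
        256, 4, 1, width,      # ImageWidth (LONG)
        257, 4, 1, height,     # ImageLength (LONG)
        258, 3, 1, 1,          # BitsPerSample (SHORT): 1 bit per pixel
        259, 3, 1, group,      # Compression (SHORT): 3 = Group 3, 4 = Group 4
        262, 3, 1, 0,          # PhotometricInterpretation: 0 = WhiteIsZero (assumed)
        273, 4, 1, struct.calcsize(fmt),  # StripOffsets: data starts right after the header
        278, 4, 1, height,     # RowsPerStrip: the whole image is one strip
        279, 4, 1, data_size,  # StripByteCounts: size of the compressed data
        0,                     # no further IFDs
    )


def ccitt_to_tiff(width, height, raw, group=4):
    """Return the bytes of a TIFF file wrapping undecoded CCITT fax data."""
    return tiff_header_for_ccitt(width, height, len(raw), group) + raw
```

If the resulting image comes out inverted, the photometric value may need to be 1 instead of 0, depending on the `/BlackIs1` entry of the image's decode parameters.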

@kreegah

kreegah commented Jun 13, 2019

getpdfimage('ana.pdf')
Traceback (most recent call last):
  File "", line 1, in
    getpdfimage('ana.pdf')
  File "", line 9, in getpdfimage
    data = xObject[obj].getData()
  File "C:\Users\nmb31\Anaconda3\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Users\nmb31\Anaconda3\lib\site-packages\PyPDF2\filters.py", line 361, in decodeStreamData
    raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /DCTDecode
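
This error is what the script produces on an installation whose `getData()` does not handle `/DCTDecode`; the console log at the top of the gist installs the sylvainpelissier fork of PyPDF2 before running it. For JPEG and JPEG 2000 images the undecoded stream is already a complete image file, so one possible fallback is to catch the error and use the raw bytes. This is only a sketch: `read_image_bytes` is a hypothetical helper, and reading the private `_data` attribute is an assumption about PyPDF2 1.x internals, not a documented API.

```python
def read_image_bytes(stream):
    """Return decoded stream data, falling back to the raw bytes if the filter is unsupported.

    The fallback is only meaningful for /DCTDecode and /JPXDecode streams,
    whose raw bytes can be written to disk as .jpg or .jp2 unchanged.
    """
    try:
        return stream.getData()
    except NotImplementedError:
        # ASSUMPTION: PyPDF2 1.x keeps the undecoded stream bytes in `_data`.
        return stream._data
```

In the script, `data = xObject[obj].getData()` would become `data = read_image_bytes(xObject[obj])`; the `/FlateDecode` branch is unaffected because that filter decodes without raising.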
