maphew/2015-Dec-11 - result.txt

## 2015-Dec-11 - result.txt
> pip install --upgrade https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
Collecting https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
     | 307kB 5.9MB/s
Installing collected packages: PyPDF2
  Found existing installation: PyPDF2 1.25.1
    Uninstalling PyPDF2-1.25.1:
      Successfully uninstalled PyPDF2-1.25.1
  Running setup.py install for PyPDF2
Successfully installed PyPDF2-1.25.1

[py27] E:\temp
> pip list
arcplus (0.1, d:\b\code\arcplus)
comtypes (1.1.2)
matplotlib (1.3.0)
numpy (1.7.1)
Pillow (3.0.0)
pip (7.1.2)
pyparsing (1.5.7)
PyPDF2 (1.25.1)
pywin32 (219)
setuptools (15.0)


[py27] E:\temp\pdf-image-extractor
> python pdf-image-extractor.py  "Seige of Vicksburg Sample OCR.pdf"
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1722]
Traceback (most recent call last):
  File "pdf-image-extractor.py", line 21, in <module>
    if xObject[obj]['/Filter'] == '/FlateDecode':
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PyPDF2\generic.py", line 512, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Filter'
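
A quick way to see which entries trigger the `KeyError` above is to list the keys each XObject dictionary lacks before indexing into it. The sketch below is only an illustration: `missing_keys` is a hypothetical helper (not part of PyPDF2), and the plain dicts in `sample` are made-up stand-ins for the `xObject` mapping the script reads from page 0.

```python
def missing_keys(xobjects, required=('/Subtype', '/Width', '/Height',
                                     '/ColorSpace', '/Filter')):
    """Map each XObject name to the required keys its dictionary lacks."""
    report = {}
    for name in xobjects:
        entry = xobjects[name]
        # "in" only tests for presence, so nothing is raised for absent keys
        absent = [key for key in required if key not in entry]
        if absent:
            report[name] = absent
    return report


# Made-up stand-in data; with PyPDF2 this would be the xObject dictionary.
sample = {
    '/Im0': {'/Subtype': '/Image', '/Width': 8, '/Height': 8,
             '/ColorSpace': '/DeviceRGB', '/Filter': '/DCTDecode'},
    '/Im1': {'/Subtype': '/Image', '/Width': 8, '/Height': 8,
             '/ColorSpace': '/DeviceGray'},
}
print(missing_keys(sample))  # {'/Im1': ['/Filter']}
```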

## 2015-Dec-11_pdf-image-extractor.py
import PyPDF2

from PIL import Image

if __name__ == '__main__':
##    pdf = r'e:\temp\dctdecode.pdf'
    pdf = r'e:\temp\Seige of Vicksburg Sample OCR.pdf'

    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

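            # NOTE: '/Filter' is optional in a PDF stream dictionary; indexing it
            # directly raises the KeyError shown in result.txt when it is absent.
            if xObject[obj]['/Filter'] == '/FlateDecode':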
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
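
One possible way around the `KeyError: '/Filter'` in result.txt, offered as a sketch rather than a tested fix. A PDF stream dictionary may omit `/Filter`, in which case the stream data is stored as-is, so the bytes from `getData()` should be usable with `Image.frombytes` in the same way as the `/FlateDecode` branch; that has not been checked against this particular PDF. `extension_for` is a hypothetical helper, not part of PyPDF2, and plain dicts stand in for `xObject[obj]`.

```python
def extension_for(image):
    """Pick an output extension from an image XObject dictionary.

    Returns None for filters this sketch does not handle
    (including a /Filter entry that is an array of filters).
    """
    # Membership test instead of image['/Filter'], so a missing key is not an error.
    name = image['/Filter'] if '/Filter' in image else None
    if name is None or name == '/FlateDecode':
        return '.png'   # raw samples: build the image with Image.frombytes
    if name == '/DCTDecode':
        return '.jpg'   # stream is already JPEG data: write it out unchanged
    if name == '/JPXDecode':
        return '.jp2'   # stream is already JPEG 2000 data
    return None


print(extension_for({'/Subtype': '/Image'}))      # .png
print(extension_for({'/Filter': '/DCTDecode'}))   # .jpg
print(extension_for({'/Filter': '/LZWDecode'}))   # None
```

In the script, the three `if`/`elif` tests on `xObject[obj]['/Filter']` could then branch on the returned extension and skip the object when it is `None`.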