Last active March 14, 2024 20:49
Examples of getting PDF text contents for subsequent use
## use PowerShell and the PSWritePDF module ($strSomePdf holds the path to some local PDF)
Install-Module -Name PSWritePDF -Scope CurrentUser
Convert-PDFToText -FilePath $strSomePdf | fabric --pattern analyze_threat_report
## or, use Python, the pypdf module, and some local PDF; these Python examples can be run from almost any shell (including PowerShell)
pip install pypdf --user ## or however you like to install Python modules
python -c 'strSomePdf = "/tmp/2024-cyber-threat-report.pdf"; from pypdf import PdfReader; from sys import stdout; [stdout.write(page.extract_text()) for page in PdfReader(strSomePdf).pages]' | fabric --pattern analyze_threat_report
## or, use Python, the pypdf module, and some PDF from a URL
python -c 'strSomePdfUri = "https://www.sonicwall.com/medialibrary/en/white-paper/2024-cyber-threat-report.pdf"; from urllib.request import urlopen; from pypdf import PdfReader; import io; from sys import stdout; reader = PdfReader(io.BytesIO(urlopen(strSomePdfUri).read())); [stdout.write(page.extract_text()) for page in reader.pages]' | fabric --pattern analyze_threat_report
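## The two Python one-liners above can also be written as a small module, which may be easier to read and reuse.
## This is only a sketch: the function names (open_pdf_stream, pages_to_text, pdf_text) are made up for illustration,
## and, unlike the one-liners, it puts a newline between pages. It assumes pypdf is installed, as above.

```python
import io
from urllib.request import urlopen


def open_pdf_stream(source):
    """Return a binary file-like object for a local path or an http(s) URL."""
    if source.startswith(("http://", "https://")):
        # Read the whole download into memory so the PDF reader gets a seekable stream.
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    return open(source, "rb")


def pages_to_text(pages):
    """Join the text of each page with newlines (an assumption; the one-liners use no separator)."""
    return "\n".join(page.extract_text() or "" for page in pages)


def pdf_text(source):
    """Extract all text from the PDF at a local path or URL."""
    # Third-party import kept local so the helpers above work without pypdf installed.
    from pypdf import PdfReader

    with open_pdf_stream(source) as stream:
        return pages_to_text(PdfReader(stream).pages)
```

## If saved under a hypothetical name such as pdftext.py, it could be used much like the one-liners:
## python -c 'from pdftext import pdf_text; print(pdf_text("/tmp/2024-cyber-threat-report.pdf"))' | fabric --pattern analyze_threat_report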