@hubgit
Last active June 15, 2023 13:31
Extract tabular data from a PDF to CSV
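# Prerequisites: install the AWS CLI (Homebrew shown) and configure credentials
# brew install awscli
# aws configure
# Upload the PDF to an S3 bucket
aws s3 cp your-file.pdf s3://your-bucket/your-file.pdf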
# Install the amazon-textract command-line helper
# pip install amazon-textract-helper
# https://pypi.org/project/amazon-textract-helper/
# https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper
# Extract the tables from the uploaded PDF and print them as CSV
amazon-textract --input-document s3://your-bucket/your-file.pdf --features TABLES --pretty-print TABLES --pretty-print-table-format=csv
# Background reading: https://aws.amazon.com/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/
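The commands above print the CSV to the terminal. As a minimal sketch, the upload and extraction steps could be wrapped in one shell function that redirects the output to a file. The function name `pdf_tables_to_csv` and its argument order are hypothetical, chosen only for illustration; the `aws` and `amazon-textract` invocations use the same flags as the commands above.

```shell
# Sketch only: pdf_tables_to_csv and its arguments are illustrative names,
# not part of the AWS CLI or amazon-textract-helper.
pdf_tables_to_csv() {
  local pdf="$1" bucket="$2" out="$3"
  local key
  key="$(basename "$pdf")"

  # Upload the PDF so it can be read from S3; stop if the upload fails.
  aws s3 cp "$pdf" "s3://$bucket/$key" || return 1

  # Same flags as the command above; stdout is redirected to the CSV file.
  amazon-textract --input-document "s3://$bucket/$key" \
    --features TABLES \
    --pretty-print TABLES \
    --pretty-print-table-format=csv > "$out"
}

# Usage: pdf_tables_to_csv your-file.pdf your-bucket your-file.csv
```

Everything the helper prints ends up in the one output file, so check the result if the PDF contains more than one table.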