How to use pdfminer.six, PaddleOCR and OpenAI's GPT to OCR and extract text from PDFs and save them into a CSV (or Excel) file for later analysis.
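The overall flow described here (try selectable text first, fall back to OCR plus a cleanup pass, then save everything to a CSV) can be sketched roughly as below. This is only a minimal sketch, not the notebook's actual code: `extract_with_fallback` and `save_texts` are hypothetical names, and the three callables stand in for pdfminer.six, PaddleOCR and the GPT call so the example runs without any PDFs or API keys. It also uses the standard-library `csv` module rather than pandas.

```python
import csv
from typing import Callable, Iterable, Tuple


def extract_with_fallback(
    path: str,
    extract_selectable: Callable[[str], str],  # stand-in for pdfminer.six
    run_ocr: Callable[[str], str],             # stand-in for PaddleOCR
    fix_ocr: Callable[[str], str],             # stand-in for the GPT cleanup call
) -> str:
    """Return selectable text if the PDF has any; otherwise OCR it and clean it up."""
    text = extract_selectable(path).strip()
    if text:
        return text
    return fix_ocr(run_ocr(path))


def save_texts(rows: Iterable[Tuple[str, str]], csv_path: str) -> None:
    """Write (filename, text) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        writer.writerows(rows)


# Hypothetical stand-ins: one "PDF" with selectable text, one without
fake_pdfs = {"a.pdf": "already selectable", "b.pdf": ""}
rows = [
    (name, extract_with_fallback(name, fake_pdfs.get, lambda p: "0cr text", str.upper))
    for name in fake_pdfs
]
print(rows)
```

Passing the three steps in as functions keeps the fallback logic separate from any particular OCR engine or model, which also makes it easy to try without spending API credits.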
{
"cells": [
{
"cell_type": "markdown",
"id": "91111997-8ca1-460d-bf79-13dfd14834ec",
"metadata": {},
"source": [
"# How to do OCR (text extraction) on PDFs with PaddleOCR, and then having AI magic fix up the results\n",
"\n",
"Tesseract is popular because it's popular: it's also not very good! I think [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md) is vaguely state of the art at the moment, and vaguely not-impossible to install and use."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ac0b72db-4045-46e4-aa54-a12fd464731a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --quiet paddlepaddle \"paddleocr>=2.0.1\""
]
},
{
"cell_type": "markdown",
"id": "16f6936a-2fd1-4481-8dc2-6107351ec852",
"metadata": {},
"source": [
"We'll start by trying it on one PDF. It'll take forever to run the first time because it has to download the text recognition model."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "bdb7a2ae-5fbc-48d2-b696-937e12a7588c",
"metadata": {},
"outputs": [],
"source": [
"from paddleocr import PaddleOCR\n",
"\n",
"ocr = PaddleOCR(lang='en', show_log=False)\n",
"pages = ocr.ocr('199416062.pdf')"
]
},
{
"cell_type": "markdown",
"id": "43b5d5a3-2611-41f5-88ac-584b53c28627",
"metadata": {},
"source": [
"The result is a list of pages, with each page having a list of lines of text inside. And each line of text has coordinates for the text, the text, *and* a confidence score. It's a bit much and requires weird double list comprehensions just extract the text."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "acb8ba76-f813-44be-bf36-234a1b65d451",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"STATE OF FLORIDA\n",
"BOARD OF MASSAGE\n",
"Final Order No. BPR-95-04169 Date -/-95\n",
"FILED\n",
"DEPARTMENT OF BUSINESS AND\n",
"Dept. of Business and Professional Regulation\n",
"PROFESSIONAL REGULATION,\n",
"AGENCY CLERK\n",
"Sarah WachmanAgency Clerk\n",
"Petitioner,\n",
"By: Axans C.Kus\n",
"VS-\n",
"CASE NQ.:\n",
"94-16062\n",
"LICENSE NO.:\n",
"MM 0000181\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"FINAL ORDER APPROVING SETTLEMENT STIPULATION\n",
"THIs MAtTeR came before the Board of Massage at a duly\n",
"noticec! public meeting held on July 5, 1995, in St..\n",
"Petersburg Beach, Florida, for consideration of the.\n",
"Administrative Complaint (attached hereto as Exhibit A) and\n",
"the proposed Stipulation (attached hereto as Exhibit B)\n",
"entered into between the parties in the above styled case..\n",
"The Pet:itioner was represented by Susan E. Lindgard..\n",
"The\n",
"Respondent was present and represented by Betty L. Pritchard.\n",
"Upon consideration of the Administrative Complaint and.\n",
"the proposed Settlement Stipulation in this matter, and being\n",
"otherwise fully advised in the premises, it is hereby OrDeRED.\n",
"AND ADJUDGED:\n",
"1. The proposed Stipulation is hereby approved,\n",
"adopted,.and incorporated herein by reference.\n",
"2.\n",
"Respondent will adhere to and abide by all of the.\n",
"terms and conditions of the Stipulation.\n",
"3.\n",
"This Order shall be placed in and become a part of.\n",
"Respondent's official records and shall become effective upon\n",
"filing with the Clerk of the.Department of Business and\n",
"Professional Regulation.\n",
"DONE AND ORDERED this Z4day of\n",
"1995.\n",
"DANIEL A. ULRICH, CHAIR\n",
"BOARD OF MASSAGE\n",
"CERTIFICATE OF SERVICE\n",
"I HEREBy CeRTIFy that a true and correct copy of the\n",
"foregoing Final Order has been furnished by United States.\n",
"Mail to BETTy L. PRITchARD, Holistic Massage Therapy, 2945\n",
"Central Avenue, St. Petersburg, FL 33713, and by hand.\n",
"delivery to Susan LInDgARD, Senior Attorney, Department of\n",
"Business and Professional Regulation, Northwood Centre, 1940\n",
"North Monroe Street, Tallahassee, Florida 32399-0750, this\n",
"Tanmy Hollingsworth,\n",
"Administrative Secretary\n",
"STATE OF FLORIDA\n",
"DEFARTMENT OF BUSINESS AND PROFESSIONAL REGULATION\n",
"BOARD OF MASSAGE\n",
"DEPARTMENT OF BUSINESS AND\n",
"PROFESSIONAL REGULATION,\n",
"Petitioner,\n",
"vs.\n",
"CASE NO.\n",
"94-16062\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"STIPULATION\n",
"Pursuant to FLA. STAT. section 20.165 (1993), the above-named\n",
"parties hereby offer this Stipulation to the Board of Massage as\n",
"disposition of the Administrative Complaint, (attached hereto as\n",
"Exhibit \"g\"), in lieu of any other administrative proceedings. The\n",
"terms herein become effective only if and when a Final Order\n",
"accepting this stipulation is issued by the Board and filed.\n",
"In\n",
"considering\n",
"this Stipulation,\n",
"the\n",
"Board\n",
"may\n",
"review\n",
"all\n",
"investigative materials regarding this case.\n",
" If this Stipulation\n",
"is rejected, it and its presentation to the Board shall not be used\n",
"against either party.\n",
"STIPULATED FACTS\n",
"1.\n",
"Respondent neither admits nor denies the allegations of.\n",
"fact cont.ained in the Administrative Complaint.\n",
"2.\n",
"For all times pertinent hereto, Respondent was a licensed\n",
"massage therapist, having been issued license number Mm ooooi8l.\n",
"100\n",
"STIPULATED LA.\n",
"3.\n",
"Respondent is subject to the provisions of FLA. STAT.\n",
"sections 455 and 480 (1993).\n",
"4.\n",
"Respondent admits that the facts, if true, constitute\n",
"violations of law as charged in the Administrative Complaint.\n",
"STIPULATED DISPOSITION\n",
"5.\n",
"Respondent shall in the future, not violate FLA. STaT.\n",
"sections 455 or 480 (1993), or the rules promulgated thereunder.\n",
"6.\n",
"The Board shall impose an administrative fine in the\n",
"amount of $25o.o0 against the Respondent. The fine shall be paid by\n",
"the Respondent to the Executive Director of the Board of Massage\n",
"within thirty (3o) days of its imposition by Final Order of the\n",
"Board. To ensure payment of the fine in compliance with this\n",
"stipulation, Respondent further stipulates that his license to\n",
"practice nassage therapy shall be suspended with the imposition of\n",
"the suspension being stayed for thirty (30) days. If the ordered\n",
"fine is paid within that thirty (3o) days period, the suspension\n",
"shall not take effect. If Respondent does not pay the above ordered\n",
"fine during said stay, then immediately upon the expiration of the\n",
"stay evidence of licensure shall be mailed to the Board of Massage\n",
"at the Department of Business and Professional Regulation,\n",
"Northwood Centre, 1940 North Monroe Street, Tallahassee, Florida\n",
"32399-0792.\n",
"7.\n",
"Respondent hereby waives any rights to appeal or further\n",
"review ot. the Stipulation made herein.\n",
"8.\n",
"Respondent hereby waives any claim for attorneys fees\n",
"generated by this case..\n",
"101\n",
"9.\n",
" It is expressly understood that a violation of the terms.\n",
"of this Stipulation shall be considered a violation of FLA. STAT.\n",
"section 480 (1993), for which disciplinary action may be.initiated.\n",
"Whererore, the parties hereto request the Board to enter a.\n",
"Final Order accepting and implementing the terms contained herein.\n",
"Signed this\n",
"day o\n",
"1995.\n",
"RESPONOENT\n",
"Case Number 94-16062\n",
"(Respondent signature must be\n",
"notarized below)\n",
"Pritehore.\n",
"whose identity is known to me by FlQr ve\n",
"(type of identification) and who, under oath, acknowledges that\n",
"his/her signature appears above.\n",
"Sworn to and subscribed by Respondent before me this.\n",
"day\n",
"of\n",
"Mo\n",
"1995.\n",
"NOTARY PUBLIC, STATE OF FLORWA\n",
"MYCOMMISSION EXPIRS:J2.1\n",
"BONDED THRU NOTARYPLBLIC UNDERWEIT\n",
"Notary PublicE7\n",
"rdeA.Hor\n",
"My Commission Expires:\n",
"Approved this 3/S\n",
"day of\n",
"1995.\n",
"Richard T. Farrell\n",
"Secretary\n",
"BY: Charles F. Tunnicliff.\n",
"Chief Attorney\n",
"Professions Section\n",
"I n.\n",
"COUNSEL FOR. THE DEPARTMENT:\n",
"Susan E. Lindgard\n",
"Senior Attorney\n",
"Florida Bar0650986\n",
"Department of Professional\n",
"Regulation\n",
"1940 North Monroe Street\n",
"Tallahassee, Florida 32399-0792\n",
"(904922-0114\n",
"CFT/SEL/VE\n",
"STATE OF FLORIDA\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION\n",
"DEPARTMENT OF BUSINESS AND\n",
"PROFESSIONAL REGULATION,\n",
"Petitioner,\n",
"VS.\n",
"CASE NO. 94-16062\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"ADMINISTRATIVE COMPLAINT\n",
"comes Now the Petitioner, the Department of Business and.\n",
"Professional Regulation, (hereinafter \"petitioner\"), and files this.\n",
"Administrative Complaint before the Board of Massage (hereinafter\n",
"\"Board\"),\n",
"against\n",
"HOLISTIC\n",
"MASSAGE\n",
"THERAPY,\n",
"(hereinafter\n",
"\"Respondent\"), and alleges:\n",
"1. Petitioner is the state agency charged with regulating\n",
"the practice of Massage pursuant to FLA. STat. sections 20.165,\n",
"455,and 480 1991)*/.\n",
"2.\n",
"Respondent is a licensed massage establishment in the\n",
"State of Florida, having been issued license number Mm ooooisl.\n",
"3.\n",
"Respondent's location is at 2945 Central Avenue, St..\n",
"PetersburJ, FL 33713-8631.\n",
"*/\n",
"\"FLA, STat.\" is the legal abbreviation for Florida Statutes..\n",
"110\n",
"On or about October 24, 1994, an inspection by the DPR of.\n",
"4.\n",
"the Respordent.establishment revealed that the fire extinguisher\n",
"inspection tag is not current.\n",
"Based upon the foregoing, the Respondent has violated FLA.\n",
"STAT. section 480.046(1)(k)(1993) through violation of FLA. ADmIN.\n",
"cODE 61g11-26.003(3) in that the Respondent did not maintain a fire\n",
"extinguisher in good working condition on the premises, wherein\n",
"\"good working condition\" means meeting the standards for approval.\n",
"by the State Fire Marshal.\n",
"WHEReFoRE, Petitioner respectfully requests the Secretary of\n",
"the Department to enter an order imposing one or more of the\n",
"following penalties: imposition of a notice to cease and desist,\n",
"imposition of an administrative fine, and/or any other relief that\n",
"the Secretary deems appropriate.\n",
"SIGNED this 27\n",
"day of\n",
"1994.\n",
"FILED\n",
"George Stuart,\n",
"Secretary\n",
"Dopartment ol Business and Professional Regulation\n",
"DEPUTY CLERK\n",
"Charles F. Tunnicliff\n",
"CLERK Aorra C tisk\n",
"Chief Attorney\n",
"Professions Section\n",
"DATEarCh 271995\n",
"111\n",
"COUNSEL FOF: PETITIONER\n",
"Susan E. Lindgard\n",
"Senior Attorney\n",
"Florida Bar f0650986\n",
"Department of Business and\n",
"Professl.onal Regulation\n",
"Suite 60\n",
"Northwood Centre\n",
"1940 North Monroe Street\n",
"Tallahassee, Florida 32399-0792\n",
"(904)488-0062\n",
"CFT/SEL/VE\n",
"December 24, 1994\n",
"Case No.\n",
"91-16062\n",
"P D\n",
"3-13-9s\n",
"X#+GC\n",
"112\n"
]
}
],
"source": [
"# go through every line in every page and pull out the text\n",
"pdf_text = '\\n'.join([line[1][0] for page in pages for line in page])\n",
"print(pdf_text)"
]
},
{
"cell_type": "markdown",
"id": "6590dd5c-09a1-4209-b3e4-7d0da3ece511",
"metadata": {},
"source": [
"## OCR error correction"
]
},
{
"cell_type": "markdown",
"id": "975c37e8-7008-4bd8-a7c0-425d25c972eb",
"metadata": {},
"source": [
"Here's where it gets fun. Let's connect to GPT using https://github.com/openai/openai-python and have it *revise the OCR result*.\n",
"\n",
"GPT is an awful machine that serves to give the most probable answers: usually that might be at odds with the truth, but in this case... it's likely to line up."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "6d3ae9f7-b081-4d2f-8ba4-e45cfe0900d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --quiet openai"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ab8e1c81-0454-4df5-b17f-563096a78e7c",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from openai import OpenAI\n",
"\n",
"# You'll use your own API key from https://platform.openai.com/api-keys\n",
"client = OpenAI(api_key='YOUR_API_KEY_HERE')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "407a4d5c-c036-440b-b2f6-5fe8f8168c22",
"metadata": {},
"outputs": [],
"source": [
"prompt = f\"\"\"\n",
"The text below is from a PDF that has been OCR'd. It probably has some errors, please\n",
"revise to correct the errors. Only provide the corrected text, nothing else.\n",
"\n",
"## OCR TEXT\n",
"\n",
"{pdf_text}\n",
"\"\"\"\n",
"\n",
"response = client.chat.completions.create(\n",
" messages=[ { \"role\": \"user\", \"content\": prompt }],\n",
" model=\"gpt-4o\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "3462c5ea-3e1f-47cf-9102-e223293b05dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## STATE OF FLORIDA\n",
"BOARD OF MASSAGE\n",
"Final Order No. BPR-95-04169 Date -/-95\n",
"FILED\n",
"DEPARTMENT OF BUSINESS AND\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION\n",
"PROFESSIONAL REGULATION,\n",
"AGENCY CLERK\n",
"Sarah Wachman Agency Clerk\n",
"Petitioner,\n",
"By: Arona C. Kus\n",
"VS\n",
"CASE NO.:\n",
"94-16062\n",
"LICENSE NO.:\n",
"MM 0000181\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"\n",
"FINAL ORDER APPROVING SETTLEMENT STIPULATION\n",
"\n",
"THIS MATTER came before the Board of Massage at a duly noticed public meeting held on July 5, 1995, in St. Petersburg Beach, Florida, for consideration of the Administrative Complaint (attached hereto as Exhibit A) and the proposed Stipulation (attached hereto as Exhibit B) entered into between the parties in the above-styled case. The Petitioner was represented by Susan E. Lingard. The Respondent was present and represented by Betty L. Pritchard.\n",
"\n",
"Upon consideration of the Administrative Complaint and the proposed Settlement Stipulation in this matter, and being otherwise fully advised in the premises, it is hereby ORDERED AND ADJUDGED:\n",
"1. The proposed Stipulation is hereby approved, adopted, and incorporated herein by reference.\n",
"2. Respondent will adhere to and abide by all of the terms and conditions of the Stipulation.\n",
"3. This Order shall be placed in and become a part of Respondent's official records and shall become effective upon filing with the Clerk of the Department of Business and Professional Regulation.\n",
"\n",
"DONE AND ORDERED this 24th day of\n",
"1995.\n",
"DANIEL A. ULRICH, CHAIR\n",
"BOARD OF MASSAGE\n",
"\n",
"CERTIFICATE OF SERVICE\n",
"\n",
"I HEREBY CERTIFY that a true and correct copy of the foregoing Final Order has been furnished by United States Mail to BETTY L. PRITCHARD, Holistic Massage Therapy, 2945 Central Avenue, St. Petersburg, FL 33713, and by hand delivery to Susan LINGARD, Senior Attorney, Department of Business and Professional Regulation, Northwood Centre, 1940 North Monroe Street, Tallahassee, Florida 32399-0750, this\n",
"Tammy Hollingsworth,\n",
"Administrative Secretary\n",
"\n",
"STATE OF FLORIDA\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION\n",
"BOARD OF MASSAGE\n",
"\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION,\n",
"Petitioner,\n",
"vs.\n",
"CASE NO.\n",
"94-16062\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"\n",
"STIPULATION\n",
"\n",
"Pursuant to FLA. STAT. section 20.165 (1993), the above-named parties hereby offer this Stipulation to the Board of Massage as disposition of the Administrative Complaint, (attached hereto as Exhibit \"A\"), in lieu of any other administrative proceedings. The terms herein become effective only if and when a Final Order accepting this stipulation is issued by the Board and filed.\n",
"\n",
"In considering this Stipulation, the Board may review all investigative materials regarding this case. If this Stipulation is rejected, it and its presentation to the Board shall not be used against either party.\n",
"\n",
"STIPULATED FACTS\n",
"1. Respondent neither admits nor denies the allegations of fact contained in the Administrative Complaint.\n",
"2. For all times pertinent hereto, Respondent was a licensed massage therapist, having been issued license number MM 0000181.\n",
"\n",
"STIPULATED LAW\n",
"3. Respondent is subject to the provisions of FLA. STAT. sections 455 and 480 (1993).\n",
"4. Respondent admits that the facts, if true, constitute violations of law as charged in the Administrative Complaint.\n",
"\n",
"STIPULATED DISPOSITION\n",
"5. Respondent shall in the future, not violate FLA. STAT. sections 455 or 480 (1993), or the rules promulgated thereunder.\n",
"6. The Board shall impose an administrative fine in the amount of $250.00 against the Respondent. The fine shall be paid by the Respondent to the Executive Director of the Board of Massage within thirty (30) days of its imposition by Final Order of the Board. To ensure payment of the fine in compliance with this stipulation, Respondent further stipulates that his license to practice massage therapy shall be suspended with the imposition of the suspension being stayed for thirty (30) days. If the ordered fine is paid within that thirty (30) days period, the suspension shall not take effect. If Respondent does not pay the above ordered fine during said stay, then immediately upon the expiration of the stay evidence of licensure shall be mailed to the Board of Massage at the Department of Business and Professional Regulation, Northwood Centre, 1940 North Monroe Street, Tallahassee, Florida 32399-0792.\n",
"7. Respondent hereby waives any rights to appeal or further review of the Stipulation made herein.\n",
"8. Respondent hereby waives any claim for attorney's fees generated by this case.\n",
"9. It is expressly understood that a violation of the terms of this Stipulation shall be considered a violation of FLA. STAT. section 480 (1993), for which disciplinary action may be initiated.\n",
"\n",
"Wherefore, the parties hereto request the Board to enter a Final Order accepting and implementing the terms contained herein.\n",
"\n",
"Signed this\n",
"day of\n",
"1995.\n",
"RESPONDENT\n",
"\n",
"Case Number 94-16062\n",
"(Respondent signature must be notarized below)\n",
"Pritchard,\n",
"whose identity is known to me by Florida driver's license(type of identification) and who, under oath, acknowledges that his/her signature appears above.\n",
"\n",
"Sworn to and subscribed by Respondent before me this day of\n",
", 1995.\n",
"NOTARY PUBLIC, STATE OF FLORIDA\n",
"MY COMMISSION EXPIRES: BONDED THRU NOTARY PUBLIC UNDERWRITERS\n",
"Notary Public\n",
"My Commission Expires:\n",
"Approved this\n",
"day of\n",
"1995.\n",
"\n",
"Richard T. Farrell\n",
"Secretary\n",
"BY: Charles F. Tunnicliff\n",
"Chief Attorney\n",
"Professions Section\n",
"COUNSEL FOR THE DEPARTMENT:\n",
"Susan E. Lingard\n",
"Senior Attorney\n",
"Florida Bar 0650986\n",
"Department of Professional Regulation\n",
"1940 North Monroe Street\n",
"Tallahassee, Florida 32399-0792\n",
"(904) 922-0114\n",
"CFT/SEL/VE\n",
"\n",
"STATE OF FLORIDA\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION\n",
"DEPARTMENT OF BUSINESS AND PROFESSIONAL REGULATION,\n",
"Petitioner,\n",
"vs.\n",
"CASE NO. 94-16062\n",
"HOLISTIC MASSAGE THERAPY,\n",
"Respondent.\n",
"\n",
"ADMINISTRATIVE COMPLAINT\n",
"\n",
"COMES NOW the Petitioner, the Department of Business and Professional Regulation, (hereinafter \"Petitioner\"), and files this Administrative Complaint before the Board of Massage (hereinafter \"Board\"), against HOLISTIC MASSAGE THERAPY, (hereinafter \"Respondent\"), and alleges:\n",
"\n",
"1. Petitioner is the state agency charged with regulating the practice of Massage pursuant to FLA. STAT. sections 20.165, 455, and 480 (1991).\n",
"2. Respondent is a licensed massage establishment in the State of Florida, having been issued license number MM 0000181.\n",
"3. Respondent's location is at 2945 Central Avenue, St. Petersburg, FL 33713-8631.\n",
"\n",
"On or about October 24, 1994, an inspection by the DPR of the Respondent establishment revealed that the fire extinguisher inspection tag is not current.\n",
"\n",
"Based upon the foregoing, the Respondent has violated FLA. STAT. section 480.046(1)(k)(1993) through violation of FLA. ADMIN. CODE 61G11-26.003(3) in that the Respondent did not maintain a fire extinguisher in good working condition on the premises, wherein \"good working condition\" means meeting the standards for approval by the State Fire Marshal.\n",
"\n",
"WHEREFORE, Petitioner respectfully requests the Secretary of the Department to enter an order imposing one or more of the following penalties: imposition of a notice to cease and desist, imposition of an administrative fine, and/or any other relief that the Secretary deems appropriate.\n",
"\n",
"SIGNED this 27th day of\n",
"1994.\n",
"FILED\n",
"George Stuart,\n",
"Secretary\n",
"Department of Business and Professional Regulation\n",
"DEPUTY CLERK\n",
"Charles F. Tunnicliff\n",
"Chief Attorney\n",
"Professions Section\n",
"DATE March 27, 1995\n",
"\n",
"COUNSEL FOR PETITIONER\n",
"Susan E. Lingard\n",
"Senior Attorney\n",
"Florida Bar 0650986\n",
"Department of Business and Professional Regulation\n",
"Suite 60\n",
"Northwood Centre\n",
"1940 North Monroe Street\n",
"Tallahassee, Florida 32399-0792\n",
"(904) 488-0062\n",
"CFT/SEL/VE\n",
"December 24, 1994\n",
"Case No.\n",
"94-16062\n"
]
}
],
"source": [
"corrected_text = response.choices[0].message.content\n",
"print(corrected_text)"
]
},
{
"cell_type": "markdown",
"id": "4fc0b6be-d32d-453a-ab96-c0ed0d84a73b",
"metadata": {},
"source": [
"Can you trust it? Who knows, but you can't trust the OCR result in the first place so it's kind of the Wild West."
]
},
{
"cell_type": "markdown",
"id": "a2ebda84-e37d-4024-83b0-938e160f2b8c",
"metadata": {},
"source": [
"## Putting it all together\n",
"\n",
"Now we can incorporate this with a \"normal\" [pdfminer.six](https://github.com/pdfminer/pdfminer.six) approach that just pulled selectable text out."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bec31c03-3d49-4dc0-adc7-7f6eddefd992",
"metadata": {},
"outputs": [],
"source": [
"%pip install --quiet pdfminer.six"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "34a65b93-c54a-4302-af0c-bba575a499ed",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# You'll use your own API key from https://platform.openai.com/api-keys\n",
"client = OpenAI(api_key='YOUR_API_KEY_HERE')"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "75db7663-2990-454e-9149-488439eb2e99",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting text from pdfs/199416062.pdf\n",
"No selectable text found, running OCR\n",
"[2024/06/27 22:15:44] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process\n",
"Sending OCR text to GPT for cleanup\n",
"Extracting text from pdfs/200433069.pdf\n",
"No selectable text found, running OCR\n",
"[2024/06/27 22:17:09] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process\n",
"Sending OCR text to GPT for cleanup\n",
"Extracting text from pdfs/202348239.pdf\n",
"No selectable text found, running OCR\n",
"[2024/06/27 22:18:45] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process\n",
"Sending OCR text to GPT for cleanup\n",
"Extracting text from pdfs/199901009.pdf\n",
"No selectable text found, running OCR\n",
"[2024/06/27 22:19:32] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process\n",
"Sending OCR text to GPT for cleanup\n",
"Extracting text from pdfs/200820112.pdf\n"
]
}
],
"source": [
"from pdfminer.high_level import extract_text\n",
"import pandas as pd\n",
"import glob\n",
"from paddleocr import PaddleOCR\n",
"\n",
"ocr = PaddleOCR(lang='en', show_log=False)\n",
"\n",
"texts = []\n",
"filenames = glob.glob(\"pdfs/*.pdf\")\n",
"for filename in filenames:\n",
" print(f\"Extracting text from {filename}\")\n",
"\n",
" text = extract_text(filename).strip()\n",
"\n",
" if not text:\n",
" print(\"No selectable text found, running OCR\")\n",
"\n",
" pages = ocr.ocr(filename)\n",
" pdf_text = '\\n'.join([line[1][0] for page in pages for line in page])\n",
" prompt = f\"\"\"\n",
" The text below is from a PDF that has been OCR'd. It probably has some errors, please\n",
" revise to correct the errors. Only provide the corrected text, nothing else.\n",
" \n",
" ## OCR TEXT\n",
" \n",
" {pdf_text}\n",
" \"\"\"\n",
"\n",
" print(\"Sending OCR text to GPT for cleanup\")\n",
" \n",
" response = client.chat.completions.create(\n",
" messages=[ { \"role\": \"user\", \"content\": prompt }],\n",
" model=\"gpt-4o\",\n",
" )\n",
"\n",
" text = response.choices[0].message.content\n",
"\n",
" texts.append(text)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "808500af-b0d7-4f85-8f48-d7c039058d49",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>filename</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>pdfs/199416062.pdf</td>\n",
" <td>## CORRECTED TEXT\\n\\nSTATE OF FLORIDA\\nBOARD OF MASSAGE\\nFinal Order No. BPR-95-04169 Date -/-95\\nFILED\\nDEPARTMENT OF BUSINESS AND\\nDept. of Business and Professional Regulation\\nPROFESSIONAL REGULATION,\\nAGENCY CLERK\\nSarah Wachman, Agency Clerk\\nPetitioner,\\nBy: Axans C. Kits\\nvs.\\nCASE NO.:\\n94-16062\\nLICENSE NO.:\\nMM 0000181\\nHOLISTIC MASSAGE THERAPY,\\nRespondent.\\nFINAL ORDER APPROVING SETTLEMENT STIPULATION\\nTHIS MATTER came before the Board of Massage at a duly\\nnoticed public meeting held on July 5, 1995, in St.\\nPetersburg Beach, Florida, for consideration of the\\nAdministrative Complaint (attached hereto as Exhibit A) and\\nthe proposed Stipulation (attached hereto as Exhibit B)\\nentered into between the parties in the above-styled case.\\nThe Petitioner was represented by Susan E. Lingard.\\nThe Respondent was present and represented by Betty L. Pritchard.\\nUpon consideration of the Administrative Complaint and\\nthe proposed Settlement Stipulation in this matter, and being...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>pdfs/200433069.pdf</td>\n",
" <td>Final Order No. DOH-05-1377-S-MQA\\nFILED DATE:\\nDepartment of Health\\nBy: \\nDeputy Agency Clerk\\n\\nSTATE OF FLORIDA\\nBOARD OF MASSAGE THERAPY\\n\\nDEPARTMENT OF HEALTH,\\nPetitioner,\\nvs.\\n\\nCASE NO.: 2004-33069\\nLICENSE NO.: MM-15747\\n\\n1st HEALTH, INC.,\\nRespondent.\\n\\nFINAL ORDER\\n\\nTHIS CAUSE came before the Board of Massage Therapy (hereinafter the \"Board\") pursuant to Section 120.57(4), Florida Statutes, on July 28, 2005, in Orlando, Florida, for consideration of a Consent Agreement (attached hereto as Exhibit A) entered into between the parties in the above-styled cause. Respondent was not present. Upon consideration of the Consent Agreement, the documents submitted in support thereof, and being otherwise advised in the premises, it is hereby ordered and adjudged:\\n1. The Consent Agreement as submitted is hereby approved, adopted in toto and incorporated herein by reference. Accordingly, the parties shall adhere to and abide by all the terms of the Consent Agreement.\\n2. As aut...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>pdfs/202348239.pdf</td>\n",
" <td>STATE OF FLORIDA \\nBOARD OF MASSAGE THERAPY \\nDEPARTMENT OF HEALTH, \\nPetitioner, \\nCASE NO. 2023-48239 \\nA &amp; A Holding 1 LLC, \\nRespondent.\\nADMINISTRATIVE COMPLAINT \\nCOMES NOW the Petitioner, Department of Health, and files this Administrative Complaint before the Board of Massage Therapy (\"Board\") against Respondent, A &amp; A Holding 1 LLC, and alleges:\\n\\n1.\\nPetitioner is the state agency charged with regulating the practice of massage therapy pursuant to section 20.43, Florida Statutes; chapter 456, Florida Statutes; and chapter 480, Florida Statutes.\\n\\n2.\\nAt all times material to this Complaint, Respondent was a licensed massage establishment in the state of Florida, having been issued license number MM 37035.\\n\\n3.\\nRespondent's mailing address of record is 6750 N. Orange Blossom Trail, Suite B5, Orlando, Florida 32810.\\n\\n4.\\nOn or about May 23, 2019, Xiao Ping Yuan Jung, L.M.T. (Yuan Jung), MA 87934, entered a plea of guilty in case number 2018MM-007311 in the Coun...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>pdfs/199901009.pdf</td>\n",
" <td>## CORRECTED TEXT\\n\\nSTATE OF FLORIDA \\nBOARD OF MASSAGE THERAPY \\nDEPARTMENT OF HEALTH \\nFILED DATE \\nDepartment of Health \\nPetitioner, \\nByRon \\n \\nvs. \\nCASE NO: 99-01009 \\nLICENSE NO.: MM 000252 \\nDONALD GEATCHES, \\nRespondent. \\n\\nFINAL ORDER APPROVING SETTLEMENT STIPULATION \\nTHIS MATTER came before the Board of Massage Therapy at a duly noticed public meeting held on April 27-28, 2000, in Tampa, Florida, pursuant to Section 120.57(4), Florida Statutes, for consideration of the Administrative Complaint (attached hereto as Exhibit A) and the proposed Stipulation (attached hereto as Exhibit B) entered into between the parties.\\n\\nUpon consideration of the Administrative Complaint and the proposed Settlement Stipulation in this matter, and being otherwise fully advised in the premises, it is hereby ORDERED AND ADJUDGED:\\n1. The proposed Stipulation is hereby approved, adopted, and incorporated herein by reference.\\n2. Respondent will adhere to and abide by all of ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>pdfs/200820112.pdf</td>\n",
" <td>STATE OF FLORIDA \\nDEPARTMENT OF HEALTH \\n\\nDEPARTMENT OF HEALTH, \\n\\nPetitioner, \\n\\nvs. (cid:9)\\n\\nASSA DAY SPA, \\n\\nRespondent \\n\\nCase No. 2008-2013.2 \\n\\nAMENDED ADMINISTRATIVE COMPLAINT \\n\\nCOMES NOW, Petitioner, Department of Health (hereinafter \\n\\n\"Petitioner\"), by and through its undersigned counsel, and files this \\n\\nAmended Administrative Complaint before the Board of Massage Therapy \\n\\nagainst the Respondent, Assa Day Spa (hereinafter \"Respondent\"/ASSA), \\n\\nand in support thereof alleges: \\n\\n1.\\n\\nPetitioner is the state department charged with regulating the \\n\\npractice of massage therapy pursuant to Section 20.43, Florida Statutes; \\n\\nChapter 456, Florida Statutes; and Chapter 480, Florida Statutes. \\n\\n2.\\n\\nAt all times material to this Complaint, Respondent was a \\n\\nlicensed massage establishment within the state of Florida, having been \\n\\nissued license number MM20225 on or about September 5, 2007. \\n\\nJAPSU\\Medical\\DICONCILIO\\MASSAGE BOARD \\Assa Day Spa-...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" filename \\\n",
"0 pdfs/199416062.pdf \n",
"1 pdfs/200433069.pdf \n",
"2 pdfs/202348239.pdf \n",
"3 pdfs/199901009.pdf \n",
"4 pdfs/200820112.pdf \n",
"\n",
" text \n",
"0 ## CORRECTED TEXT\\n\\nSTATE OF FLORIDA\\nBOARD OF MASSAGE\\nFinal Order No. BPR-95-04169 Date -/-95\\nFILED\\nDEPARTMENT OF BUSINESS AND\\nDept. of Business and Professional Regulation\\nPROFESSIONAL REGULATION,\\nAGENCY CLERK\\nSarah Wachman, Agency Clerk\\nPetitioner,\\nBy: Axans C. Kits\\nvs.\\nCASE NO.:\\n94-16062\\nLICENSE NO.:\\nMM 0000181\\nHOLISTIC MASSAGE THERAPY,\\nRespondent.\\nFINAL ORDER APPROVING SETTLEMENT STIPULATION\\nTHIS MATTER came before the Board of Massage at a duly\\nnoticed public meeting held on July 5, 1995, in St.\\nPetersburg Beach, Florida, for consideration of the\\nAdministrative Complaint (attached hereto as Exhibit A) and\\nthe proposed Stipulation (attached hereto as Exhibit B)\\nentered into between the parties in the above-styled case.\\nThe Petitioner was represented by Susan E. Lingard.\\nThe Respondent was present and represented by Betty L. Pritchard.\\nUpon consideration of the Administrative Complaint and\\nthe proposed Settlement Stipulation in this matter, and being... \n",
"1 Final Order No. DOH-05-1377-S-MQA\\nFILED DATE:\\nDepartment of Health\\nBy: \\nDeputy Agency Clerk\\n\\nSTATE OF FLORIDA\\nBOARD OF MASSAGE THERAPY\\n\\nDEPARTMENT OF HEALTH,\\nPetitioner,\\nvs.\\n\\nCASE NO.: 2004-33069\\nLICENSE NO.: MM-15747\\n\\n1st HEALTH, INC.,\\nRespondent.\\n\\nFINAL ORDER\\n\\nTHIS CAUSE came before the Board of Massage Therapy (hereinafter the \"Board\") pursuant to Section 120.57(4), Florida Statutes, on July 28, 2005, in Orlando, Florida, for consideration of a Consent Agreement (attached hereto as Exhibit A) entered into between the parties in the above-styled cause. Respondent was not present. Upon consideration of the Consent Agreement, the documents submitted in support thereof, and being otherwise advised in the premises, it is hereby ordered and adjudged:\\n1. The Consent Agreement as submitted is hereby approved, adopted in toto and incorporated herein by reference. Accordingly, the parties shall adhere to and abide by all the terms of the Consent Agreement.\\n2. As aut... \n",
"2 STATE OF FLORIDA \\nBOARD OF MASSAGE THERAPY \\nDEPARTMENT OF HEALTH, \\nPetitioner, \\nCASE NO. 2023-48239 \\nA & A Holding 1 LLC, \\nRespondent.\\nADMINISTRATIVE COMPLAINT \\nCOMES NOW the Petitioner, Department of Health, and files this Administrative Complaint before the Board of Massage Therapy (\"Board\") against Respondent, A & A Holding 1 LLC, and alleges:\\n\\n1.\\nPetitioner is the state agency charged with regulating the practice of massage therapy pursuant to section 20.43, Florida Statutes; chapter 456, Florida Statutes; and chapter 480, Florida Statutes.\\n\\n2.\\nAt all times material to this Complaint, Respondent was a licensed massage establishment in the state of Florida, having been issued license number MM 37035.\\n\\n3.\\nRespondent's mailing address of record is 6750 N. Orange Blossom Trail, Suite B5, Orlando, Florida 32810.\\n\\n4.\\nOn or about May 23, 2019, Xiao Ping Yuan Jung, L.M.T. (Yuan Jung), MA 87934, entered a plea of guilty in case number 2018MM-007311 in the Coun... \n",
"3 ## CORRECTED TEXT\\n\\nSTATE OF FLORIDA \\nBOARD OF MASSAGE THERAPY \\nDEPARTMENT OF HEALTH \\nFILED DATE \\nDepartment of Health \\nPetitioner, \\nByRon \\n \\nvs. \\nCASE NO: 99-01009 \\nLICENSE NO.: MM 000252 \\nDONALD GEATCHES, \\nRespondent. \\n\\nFINAL ORDER APPROVING SETTLEMENT STIPULATION \\nTHIS MATTER came before the Board of Massage Therapy at a duly noticed public meeting held on April 27-28, 2000, in Tampa, Florida, pursuant to Section 120.57(4), Florida Statutes, for consideration of the Administrative Complaint (attached hereto as Exhibit A) and the proposed Stipulation (attached hereto as Exhibit B) entered into between the parties.\\n\\nUpon consideration of the Administrative Complaint and the proposed Settlement Stipulation in this matter, and being otherwise fully advised in the premises, it is hereby ORDERED AND ADJUDGED:\\n1. The proposed Stipulation is hereby approved, adopted, and incorporated herein by reference.\\n2. Respondent will adhere to and abide by all of ... \n",
"4 STATE OF FLORIDA \\nDEPARTMENT OF HEALTH \\n\\nDEPARTMENT OF HEALTH, \\n\\nPetitioner, \\n\\nvs. (cid:9)\\n\\nASSA DAY SPA, \\n\\nRespondent \\n\\nCase No. 2008-2013.2 \\n\\nAMENDED ADMINISTRATIVE COMPLAINT \\n\\nCOMES NOW, Petitioner, Department of Health (hereinafter \\n\\n\"Petitioner\"), by and through its undersigned counsel, and files this \\n\\nAmended Administrative Complaint before the Board of Massage Therapy \\n\\nagainst the Respondent, Assa Day Spa (hereinafter \"Respondent\"/ASSA), \\n\\nand in support thereof alleges: \\n\\n1.\\n\\nPetitioner is the state department charged with regulating the \\n\\npractice of massage therapy pursuant to Section 20.43, Florida Statutes; \\n\\nChapter 456, Florida Statutes; and Chapter 480, Florida Statutes. \\n\\n2.\\n\\nAt all times material to this Complaint, Respondent was a \\n\\nlicensed massage establishment within the state of Florida, having been \\n\\nissued license number MM20225 on or about September 5, 2007. \\n\\nJAPSU\\Medical\\DICONCILIO\\MASSAGE BOARD \\Assa Day Spa-... "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
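"# Widen the column display so the long text column is not cut off at the default width\n",
"pd.options.display.max_colwidth = 1000\n",
"\n",
"# One row per PDF: the filename plus the text collected for it above\n",
"df = pd.DataFrame({\n",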
" 'filename': filenames,\n",
" 'text': texts\n",
"})\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "82c8a24a-677d-4623-aea4-2d910824270b",
"metadata": {},
"outputs": [],
"source": [
"# index=False leaves out the row numbers, so the CSV holds only the filename and text columns\n",
"df.to_csv(\"output.csv\", index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}