Skip to content

Instantly share code, notes, and snippets.

@nikhilweee
Last active March 29, 2024 10:50
Show Gist options
  • Star 22 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save nikhilweee/24cae428f68c153afda495dc17ef43d6 to your computer and use it in GitHub Desktop.
Save nikhilweee/24cae428f68c153afda495dc17ef43d6 to your computer and use it in GitHub Desktop.
Convert HDFC Bank Credit Card statements from PDF to Excel
# This script is designed to convert bank statements from pdf to excel.
#
# It has been tweaked on HDFC Bank Credit Card statements,
# but in theory you can use it on any PDF document.
#
# The script depends on camelot-py,
# which can be installed using pip
#
# pip install "camelot-py[cv]"
import os
import argparse
import camelot
import pandas as pd
from collections import defaultdict
def extract_df(path, password=None):
# The default values from pdfminer are M = 2.0, W = 0.1 and L = 0.5
laparams = {'char_margin': 2.0, 'word_margin': 0.2, 'line_margin': 1.0}
# Extract all tables using the lattice algorithm
lattice_tables = camelot.read_pdf(path, password=password,
pages='all', flavor='lattice', line_scale=50, layout_kwargs=laparams)
# Extract bounding boxes
regions = defaultdict(list)
for table in lattice_tables:
bbox = [table._bbox[i] for i in [0, 3, 2, 1]]
regions[table.page].append(bbox)
df = pd.DataFrame()
# Extract tables using the stream algorithm
for page, boxes in regions.items():
areas = [','.join([str(int(x)) for x in box]) for box in boxes]
stream_tables = camelot.read_pdf(path, password=password, pages=str(page),
flavor='stream', table_areas=areas, row_tol=5, layout_kwargs=laparams)
dataframes = [table.df for table in stream_tables]
dataframes = pd.concat(dataframes)
df = df.append(dataframes)
return df
def main(args):
for file_name in os.listdir(args.in_dir):
root, ext = os.path.splitext(file_name)
if ext.lower() != '.pdf':
continue
pdf_path = os.path.join(args.in_dir, file_name)
print(f'Processing: {pdf_path}')
df = extract_df(pdf_path, args.password)
excel_name = root + '.xlsx'
excel_path = os.path.join(args.out_dir, excel_name)
df.to_excel(excel_path)
print(f'Processed : {excel_path}')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--in-dir', type=str, required=True, help='directory to read statement PDFs from.')
parser.add_argument('--out-dir', type=str, required=True, help='directory to store statement XLSX to.')
parser.add_argument('--password', type=str, default=None, help='password for the statement PDF.')
args = parser.parse_args()
main(args)
@nikhilweee
Copy link
Author

For the curious minds out there, here's why camelot doesn't work on HDFC Bank Credit Card statements out of the box.

Camelot has two algorithms (lattice and stream) for extracting tables from PDFs. The lattice method uses image processing to find lines, so it is more accurate, but it fails to get the structure of the table in cases where cell borders are not present. The stream method, on the other hand, uses heuristics to detect text edges, but it often fails to estimate the extent of the table from within the page. Sadly, for our use case, the performance of both methods is far from adequate.

But don't let that disappoint you. Interestingly, HDFC Bank Credit Card statements have an explicit table boundary but there is no border between individual rows in the table. This script combines the best of both worlds, where it uses the lattice method to deterministically find table boundaries, and then use the stream method to extract the table structure from within the boundary.

@pushkaryadav
Copy link

pushkaryadav commented Apr 7, 2021

Hello @nikhilweee
Does this script still work? If yes, then can you please write a small guide on how to use it in Windows?

I have download Python 3.9.4 and have installed the per-requsiticis (pip install camelot-py opencv-python-headless) but as soon as I run it, I get this error...

usage: excel.py [-h] --in-dir IN_DIR --out-dir OUT_DIR [--password PASSWORD]
excel.py: error: the following arguments are required: --in-dir, --out-dir

So where do I have to define my input & output folder and the common password for the files?

I have tried replacing '--in-dir' with my Input directory address, '--out-dir' with my output directory address and '--password' with the common password for my 5-6 files in my INPUT folder. I even replaced the "" of windows directory address with "/" but I am still getting an error
TypeError: 'required' is an invalid argument for positionals

@nikhilweee
Copy link
Author

Here is an example invocation. Unfortunately, I don't have a windows machine handy, so you will have to adapt the following to suit windows syntax.

python statement-to-excel.py --in-dir path/to/statements --out-dir path/to/excels --password $e(₹€T

@pushkaryadav
Copy link

pushkaryadav commented Apr 8, 2021

Thank you @nikhilweee for your quick reply. I'm not a coder, only an accountant trying to make my work quicker, so please bear with me 😇

I tried this statement. Looked like it started working but then cmd threw an error
RuntimeError: Please make sure that Ghostscript is installed

After some searching, I installed AGPL version of Ghostscript and after that the error changed to (whole error without file addresses is pasted at the bottom)
ValueError: min() arg is an empty sequence

Will you be kind enough to shed some light on resolving this issue?

Traceback (most recent call last):
  File "...statement-to-excel.py", line 68, in <module>
    main(args)
  File "...statement-to-excel.py", line 54, in main
    df = extract_df(pdf_path, args.password)
  File "...statement-to-excel.py", line 38, in extract_df
    stream_tables = camelot.read_pdf(path, password=password, pages=str(page),
  File "...camelot\io.py", line 113, in read_pdf
    tables = p.parse(
  File "...camelot\handlers.py", line 171, in parse
    t = parser.extract_tables(
  File "...camelot\parsers\stream.py", line 457, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "...camelot\parsers\stream.py", line 330, in _generate_columns_and_rows
    text_x_min, text_y_min, text_x_max, text_y_max = self._text_bbox(self.t_bbox)
  File "...camelot\parsers\stream.py", line 97, in _text_bbox
    xmin = min([t.x0 for direction in t_bbox for t in t_bbox[direction]])
ValueError: min() arg is an empty sequence

@anushka17agarwal1
Copy link

I'm having a similar error

@nikhilweee
Copy link
Author

@pushkaryadav @anushka17agarwal1 I have tested the script again, it works fine on my system. I'm using python 3.8 and camelot-py version 0.8.2. Can you share your configuration and a sample PDF that doesn't work for you?

@pushkaryadav
Copy link

Hello @nikhilweee @anushka17agarwal1
For obvious reasons, i can't share any of my credit card pdf.

To Nikhil, I'd suggest that you try it again on a newer HDFC credit card statement if you haven't already.

To Anuhska, fortunately, i found a windows programme (not expensive) that can somehow remove the copy protection, now i can easily copy my data from the pdf. Spent 5-10 minutes on excel to make a helper sheet with MID and SEARCH functions to extract the data (date, narration and debit & credit amount).

@nikhilweee
Copy link
Author

@pushkaryadav I guess it's impossible for me to help you debug further without a sample.

@simplyrahul
Copy link

@pushkaryadav Got the same error:
File "D:\Python39\lib\site-packages\camelot\parsers\stream.py", line 333, in _generate_columns_and_rows
text_x_min, text_y_min, text_x_max, text_y_max = self._text_bbox(self.t_bbox)
File "D:\Python39\lib\site-packages\camelot\parsers\stream.py", line 97, in _text_bbox
xmin = min([t.x0 for direction in t_bbox for t in t_bbox[direction]])
ValueError: min() arg is an empty sequence

@pushkaryadav
Copy link

pushkaryadav commented Dec 18, 2021

UPDATE: Replaced the formula for Narration.

@simplyrahul
yeah but like I said in my last reply. you can use the windows software named, A-PDF RESTRICTION REMOVER (a-pdf.com) to remove the copy-protection (SECURED) settings from the PDF and then you can manually copy the data into excel, but there's still a small issue that the data is copied into just one column.

To solve that issue, that is, to differentiate DATE, DESCRIPTION and Credit/Debit Amount, I have made a small excel file and with the help of few basic formulas, my excel does the work in a second.

Pasting the formulas I'm using below, assuming data is copied into A2 and below.
Cell B2
=ISERROR(SEARCH(" cr",$A2,LEN($A2)-3))

Cell C2
=TRIM(RIGHT(SUBSTITUTE(SUBSTITUTE(A2," Cr","")," ",REPT(" ",255)),255))

Cell D2 (For Date)
=DATE(MID($A2,7,4),MID($A2,4,2),LEFT($A2,2))

Cell E2 (For Narration/Description)
=MID($A2,12,IF($B2,LEN($A2)-LEN($C2)-12,LEN($A2)-LEN($C2)-12-3))

Cell F2 (For Purchase transactions)
=NUMBERVALUE(IF($B2,$C2,0))

Cell G2 (For Payment transactions)
=NUMBERVALUE(IF($B2,0,$C2))

Hope this helps.
Pushkar

@sethia
Copy link

sethia commented Dec 19, 2021

@nikhilweee same error about min() arg in my system too on a Windows machine.
Have you been able to find a solution?

@pushkaryadav good to know an accountant automating mundane stuff, what other solutions have you tried?

You can use Tabula to extract from PDF and also for HDFC CC, get a YESS statement from bank which consolidates all of your CC statements and easier for accounting

@malharlakdawala
Copy link

@pushkaryadav
thanks on sharing the excel codes..can you change the excel formula for Cell E2 (For Narration/Description)
Its the same as for date

@pushkaryadav
Copy link

pushkaryadav commented Dec 21, 2021

@malharlakdawala
Updated the formula.

These formulas were supposed to be temporary but fortunately they are still working after 6 months and I am yet to see an error/incorrect output. If I ever change the formula, aka make the formulas more flexible I'll share them here.

One good example of making the sheet better is in the formula of B2. Instead of looking for the " cr" in the whole sentence, I should rather look for " cr" using the RIGHT function.

Alternative formula for cell B2
=NOT(RIGHT($A2,3)=" Cr")

Regards.

@pushkaryadav
Copy link

@sethia
Thank you for the suggestion but IIRC, I did try tabula or something similar, it was too much work and didn't really work. My current setup is much easier and user friendly, at least for me.
So currently, I have no need to look for further options.

Bank statement is already being updated daily since it is a necessity. So a combined statement won't help much as I'll have to get in touch with the bank once again and they are not working at their full potential for many months now. It took multiple reminder phone calls and three weeks just to switch our banking e-statement from monthly to daily.

@mntolia
Copy link

mntolia commented Jan 10, 2022

Thanks for this @nikhilweee

@cstambay
Copy link

cstambay commented May 9, 2022

For windows, I modified the code as below and ran it on Jupyter. I just directly mentioned the filepaths in the code, instead of feeding through the function, which was causing the issues. You can change the actual paths as per your own system. I used a sample folder on the desktop to check if this works. Thanks @nikhilweee :)

import os
import argparse
import camelot
import pandas as pd
from collections import defaultdict


def extract_df(path, password=None):
    # The default values from pdfminer are M = 2.0, W = 0.1 and L = 0.5
    laparams = {'char_margin': 2.0, 'word_margin': 0.2, 'line_margin': 1.0}

    # Extract all tables using the lattice algorithm
    lattice_tables = camelot.read_pdf(path, password=password, 
        pages='all', flavor='lattice', line_scale=50, layout_kwargs=laparams)

    # Extract bounding boxes
    regions = defaultdict(list)
    for table in lattice_tables:
        bbox = [table._bbox[i] for i in [0, 3, 2, 1]]
        regions[table.page].append(bbox)

    df = pd.DataFrame()

    # Extract tables using the stream algorithm
    for page, boxes in regions.items():
        areas = [','.join([str(int(x)) for x in box]) for box in boxes]
        stream_tables = camelot.read_pdf(path, password=password, pages=str(page),
            flavor='stream', table_areas=areas, row_tol=5, layout_kwargs=laparams)
        dataframes = [table.df for table in stream_tables]
        dataframes = pd.concat(dataframes)
        df = df.append(dataframes)
    
    return df


pdf_path = 'C:\\Users\\<Your_name>\\Desktop\\HDFCCredit\\HDFCJan.PDF'
print(f'Processing: {pdf_path}')
df = extract_df(pdf_path)
excel_name = 'export.xlsx'
excel_path = 'C:\\Users\\<Your_name>\\Desktop\\HDFCCredit\\export.xlsx'
df.to_excel(excel_path)
print(f'Processed : {excel_path}')

@nikhilweee
Copy link
Author

@cstambay glad that I could be of help :)

@kputtur
Copy link

kputtur commented May 28, 2022

I had to install Ghostscript along with camelot-py[cv] after that it worked perfectly fine - Thank you @nikhilweee

@manishpatwari-dev
Copy link

manishpatwari-dev commented Sep 25, 2022

@simplyrahul I got the same error
ValueError: min() arg is an empty sequence
in my Windows system. After some debugging, I found a fix by changing the _text_bbox(t_bbox) method in camelot-py's file 'stream.py' as below. It works for me now. Hope it helps.

`

def _text_bbox(t_bbox):
"""Returns bounding box for the text present on a page.

    Parameters
    ----------
    t_bbox : dict
        Dict with two keys 'horizontal' and 'vertical' with lists of
        LTTextLineHorizontals and LTTextLineVerticals respectively.

    Returns
    -------
    text_bbox : tuple
        Tuple (x0, y0, x1, y1) in pdf coordinate space.

    """
    xmin = 0
    ymin = 0
    xmax = 0
    ymax = 0
    if len([t.x0 for direction in t_bbox for t in t_bbox[direction]])>0:
        xmin = min([t.x0 for direction in t_bbox for t in t_bbox[direction]])
        ymin = min([t.y0 for direction in t_bbox for t in t_bbox[direction]])
        xmax = max([t.x1 for direction in t_bbox for t in t_bbox[direction]])
        ymax = max([t.y1 for direction in t_bbox for t in t_bbox[direction]])
    text_bbox = (xmin, ymin, xmax, ymax)
    return text_bbox

`

@rdsoze
Copy link

rdsoze commented Jun 3, 2023

Thanks a lot, @nikhilweee for this! Been using it from the last three years.

While running on a new system, got the following error:

PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead

Refer camelot-dev/camelot#339

What worked for me

python3.8 -m pip uninstall PyPDF2
python3.8 -m pip install PyPDF2~=2.0

@fidelis-Ck
Copy link

The code is not working, kindly check...
SAVE_20230618_200714

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment