Last month I had to parse a huge amount of excel files to run some analytics and draw a couple of graphs.
This is a common case: users fill excel reports to document data, they could be documenting inspection reports, surveys, calculating something, you name it. There is a wide variety of stuff you could be doing with excel. Then the user will upload this file to your system, so it gets stored, shared and ready to download. Sooner or later excel files will get into your system.
In my situation users were uploading excel files but nothing was being done with them, yet. Now it was time to parse those excel files, get the data into a database, analyze it and display some neat graphs.
I needed to parse many excel files already uploaded to AWS, and I needed to keep parsing newly added files as they are upload.
I used ActiveJob to write a job that parsed a single file, next I could enqueue all existing files and let Sidekiq do the hard work. Finally I could queue a new job everytime a file is uploaded.
I used Roo to parse the excel spreadsheets, this gem was very straigthforward to use. And it was able to open each file from an url, which is great because I did'nt have to download them.
You can do the following with Roo:
roo_parser = Roo::Spreadsheet.open(excel_url)
, select the sheet you need and access cells roo_parser.sheet("SheetYouNeed").cell("A,4")
. As you can see it's very straigthforward. You can read the documentation to see everything you can do.
The job that needed to be done was simple. Every excel file is linked to an ActiveRecord model, which stores the url file, name, etc. The ActiveJob would take the excel_report_model
as a parameter and it will scrap it.
Let's generate the job:
rails generate job excel_file_parser
The job should do the following:
- Open the excel file
- Scrap the data you need and store it in the database
- Close the excel file to free memory
- Mark the excel file as parsed for future reference.
The code looks like this:
class ExcelFileParserJob < ActiveJob::Base
queue_as :excel_files_parsers
def perform(excel_report_model)
@excel_report_model = excel_report_model
open_excel_file
scrap_data
close_excel_file
mark_excel_file_as_parsed
end
private
def open_excel_file
@excel_report_file = Roo::Spreadsheet.open(excel_report_url)
end
def scrap_data
#Here you can scrap whatever you need
cell_data = @excel_report_file.cell("N", 49)
#Store it in the database!
@excel_report_model.data = cell_data
@excel_report_model.save
end
def close_excel_file
@excel_report_file.close
end
def excel_report_url
# If you use carrierwave, you can access the url with this
@excel_report_model.filename.file.url
end
def mark_excel_file_as_parsed
@excel_report_model.update(parsed: true)
end
end
So, now the only thing left to do is to queue all the existing excel files:
ExcelFilesReports.all.each do |excel_report|
ExcelFileParserJob.perform_later(excel_report)
end
This assumes that you already have Sidekiq up and running.
That would take care of parsing all the existing files, and for every new excel file we can add a create callback to queue the parser job.
class ExcelFileReport < ActiveRecord::Base
after_create :queue_excel_file_parser
private
def queue_excel_file_parser
ExcelFileParserJob.perform_later(self)
end
end
And that's it. Happy parsing!