Skip to content

Instantly share code, notes, and snippets.

@walreyes
Created January 19, 2017 18:38
Show Gist options
  • Save walreyes/2fe8c38a0b9be8b2aa21b95565c7a9d3 to your computer and use it in GitHub Desktop.
Save walreyes/2fe8c38a0b9be8b2aa21b95565c7a9d3 to your computer and use it in GitHub Desktop.

Parsing Excel files with ActiveJob

Last month I had to parse a huge amount of excel files to run some analytics and draw a couple of graphs.

Excel graph

This is a common case: users fill excel reports to document data, they could be documenting inspection reports, surveys, calculating something, you name it. There is a wide variety of stuff you could be doing with excel. Then the user will upload this file to your system, so it gets stored, shared and ready to download. Sooner or later excel files will get into your system.

In my situation users were uploading excel files but nothing was being done with them, yet. Now it was time to parse those excel files, get the data into a database, analyze it and display some neat graphs.

The Approach

I needed to parse many excel files already uploaded to AWS, and I needed to keep parsing newly added files as they are upload.

I used ActiveJob to write a job that parsed a single file, next I could enqueue all existing files and let Sidekiq do the hard work. Finally I could queue a new job everytime a file is uploaded.

I used Roo to parse the excel spreadsheets, this gem was very straigthforward to use. And it was able to open each file from an url, which is great because I did'nt have to download them.

You can do the following with Roo:

roo_parser = Roo::Spreadsheet.open(excel_url), select the sheet you need and access cells roo_parser.sheet("SheetYouNeed").cell("A,4"). As you can see it's very straigthforward. You can read the documentation to see everything you can do.

The Job

The job that needed to be done was simple. Every excel file is linked to an ActiveRecord model, which stores the url file, name, etc. The ActiveJob would take the excel_report_model as a parameter and it will scrap it.

Let's generate the job:

rails generate job excel_file_parser

The job should do the following:

  1. Open the excel file
  2. Scrap the data you need and store it in the database
  3. Close the excel file to free memory
  4. Mark the excel file as parsed for future reference.

The code looks like this:

class ExcelFileParserJob < ActiveJob::Base
  queue_as :excel_files_parsers

  def perform(excel_report_model)
    @excel_report_model = excel_report_model
    open_excel_file
    scrap_data
    close_excel_file
    mark_excel_file_as_parsed
  end

  private

    def open_excel_file
      @excel_report_file = Roo::Spreadsheet.open(excel_report_url)
    end

    def scrap_data
      #Here you can scrap whatever you need
      cell_data = @excel_report_file.cell("N", 49)

      #Store it in the database!
      @excel_report_model.data = cell_data
      @excel_report_model.save
    end

    def close_excel_file
      @excel_report_file.close
    end

    def excel_report_url
      # If you use carrierwave, you can access the url with this
      @excel_report_model.filename.file.url
    end

    def mark_excel_file_as_parsed
      @excel_report_model.update(parsed: true)
    end
end

So, now the only thing left to do is to queue all the existing excel files:

ExcelFilesReports.all.each do |excel_report|
  ExcelFileParserJob.perform_later(excel_report)
end

This assumes that you already have Sidekiq up and running.

That would take care of parsing all the existing files, and for every new excel file we can add a create callback to queue the parser job.

class ExcelFileReport < ActiveRecord::Base
  after_create :queue_excel_file_parser

  private
    def queue_excel_file_parser
      ExcelFileParserJob.perform_later(self)
    end
end

Summary

And that's it. Happy parsing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment