@JesseAldridge
Last active October 23, 2018 01:45
NanoParser - Oct 18, 2018

I've decided to focus on data extraction for now.

I talked to Luke from ModelOptic a lot and we agreed that I would develop a parsing service for him that will take Excel spreadsheets containing financial modelling information and spit out JSON formatted according to a specification he will provide. We agreed that he would pay me an hourly fee to write the code and that we would co-license the code I write. This means he can do whatever he wants with my code and I can use it as a basis for a more generalized data processing service.

I've also been talking to Anuraag from Manifest and he has expressed interest in setting up a similar deal. So out of the seven or so companies in my Startup School group, two are willing to pay for a document parsing service. This is pretty exciting and suggests that there is a ton of demand and opportunity here.

I registered nanoparse.com and nanoparser.com and am planning on using the name NanoParse to encapsulate this idea of a document processing company.

I did a bunch of research into the data processing world. It is quite large, complicated, and growing very quickly. One of the subjects most interesting to me is Data Lakes. The idea there is that companies basically throw all of their documents (Excel files, emails, PDFs, whatever) into a single repository, and a combination of engineers, analysts, and computers take those documents, process them, and stick the resulting data into something like a data warehouse where it can then be used for all sorts of purposes. Informatica is a big company that does data processing; the "Enterprise Data Lake" is the most popular product listed on their product page. AWS has published a guide explaining how to build such a system with their products. These examples demonstrate a clear path from a document processing service to a full enterprise data management solution (and beyond).

Did you know there are about 40 zettabytes of data in the world? A zettabyte is 10^21 bytes, so that's about 40 billion terabytes. And it's doubling in size every 2.5 years. And 80% of it is unstructured, which means it needs to be processed before it can be utilized. It seems like the data economy, as big as it is, is still just getting started.
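As a quick sanity check on those figures, here is a back-of-the-envelope calculation. It assumes decimal units (1 ZB = 10^21 bytes, 1 TB = 10^12 bytes) and a perfectly steady doubling rate, which is a simplification; the function name is just for illustration.

```python
# Back-of-the-envelope check, assuming decimal (SI) units.
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

total_bytes = 40 * ZETTABYTE
total_terabytes = total_bytes // TERABYTE  # 40 ZB expressed in TB


def projected_zettabytes(years, start_zb=40.0, doubling_years=2.5):
    """Project total volume, assuming it doubles every `doubling_years`."""
    return start_zb * 2 ** (years / doubling_years)


unstructured_zb = 0.8 * 40  # the 80% share quoted above

print(f"{total_terabytes:,} TB today")
print(f"{projected_zettabytes(5)} ZB after 5 years")
print(f"{unstructured_zb} ZB unstructured")
```

Under those assumptions, 40 ZB works out to 40 billion TB, and two doublings (5 years) would bring the total to 160 ZB.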

One of the closest services to NanoParse that I found is called Trifacta. The core idea is you import a spreadsheet and you apply transformations to it via a GUI. You can see previews that show what the result of those transformations will be. I tried the service myself and found it to be largely unusable. Yet they still have a valuation of $258 million. The founders are academics from Stanford and they wrote a paper describing the techniques they use: http://vis.stanford.edu/files/2011-Wrangler-CHI.pdf This paper seems to be an excellent catalog of the issues involved in what they refer to as data wrangling.

Coming back to specifics, I looked into adding PDF parsing to NanoParse: http://repoq.com/lists/pdf_to_text

The problem with PDFs is they are generally unstructured. They are basically a sequence of commands to render a particular character at a particular coordinate. In order to extract words from them you basically have to guess which letters form words by looking at which letters are next to each other.
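To make that guessing step concrete, here is a minimal sketch of grouping positioned characters into words. It is not how any particular PDF library does it. The `Glyph` type, the tolerance values, and the sample coordinates are all made up for illustration, and it assumes PDF-style coordinates where y grows upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One rendered character (hypothetical type for this sketch)."""
    char: str
    x: float      # left edge
    y: float      # baseline; larger y is higher on the page
    width: float


def group_into_words(glyphs, gap_tolerance=1.5, line_tolerance=2.0):
    """Cluster glyphs into lines by baseline, then into words by horizontal gap."""
    lines = []  # each entry is a list of glyphs sharing roughly one baseline
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right
        words, current = [], line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_tolerance:   # big gap: guess that a new word starts
                words.append(current)
                current = cur.char
            else:                     # small gap: same word
                current += cur.char
        words.append(current)
        result.append(words)
    return result


# Glyphs arrive in arbitrary drawing order, as they can in a PDF.
glyphs = [
    Glyph("o", 31.0, 700.0, 5.0), Glyph("H", 10.0, 700.0, 5.0),
    Glyph("i", 15.5, 700.0, 3.0), Glyph("y", 25.0, 700.0, 5.0),
    Glyph("k", 15.5, 680.0, 5.0), Glyph("o", 10.0, 680.0, 5.0),
]
print(group_into_words(glyphs))  # [['Hi', 'yo'], ['ok']]
```

The fixed tolerances are the weak point: real documents mix font sizes and kerning, so a real implementation would probably need to scale the gap threshold to the font size.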

This makes it especially hard to extract tables. I found some attempts to do this:

They both kind of work. But I think to make PDF table extraction work well, the best approach will be to simply convert it to lines of text using something like PDFBox or Poppler and then write a parser that consumes the text stream and uses some custom logic to turn it into structured data.
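Here is a rough sketch of what that second step could look like. It assumes the PDF has already been converted to layout-preserving text (for example with Poppler's `pdftotext -layout`), that columns are separated by two or more spaces, and that the first non-empty line is the header. The sample text and function names are invented for illustration; a real parser would need custom logic per document type.

```python
import json
import re


def to_number(cell):
    """Convert cells like '1,200' to a float; leave anything else as text."""
    try:
        return float(cell.replace(",", ""))
    except ValueError:
        return cell


def parse_table(text):
    """Turn layout-preserved text lines into a list of row dictionaries."""
    rows = [
        re.split(r"\s{2,}", line.strip())  # runs of 2+ spaces separate columns
        for line in text.splitlines()
        if line.strip()
    ]
    header, *body = rows
    records = []
    for cells in body:
        if len(cells) != len(header):
            continue  # skip page footers and other lines that are not table rows
        records.append({key: to_number(value) for key, value in zip(header, cells)})
    return records


sample = """
Item             Q1        Q2
Revenue          1,200     1,350
Cost of goods    800       910
Page 1 of 3
"""

records = parse_table(sample)
print(json.dumps(records, indent=2))
```

Splitting on runs of spaces breaks down when a cell is empty or when columns sit close together, so using the column start positions from the header line would probably be a more robust next step.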

Now, between writing code for these two clients and doing my own research I am basically out of bandwidth. If I am going to increase revenue in the short term I think the only way will be to hire developers to help me write parsers and stuff. And then I could maybe hire some salespeople to help me find new clients. And if I'm doing that then maybe I should raise money to start a data processing startup. But I guess that's all a couple of steps down the road.

On an entirely different topic, I ran across this debate about Communism: https://www.youtube.com/watch?v=BcXLLuL78yQ One of the guys in that debate is seriously arguing for the execution of the wealthiest business owners in the United States. I found his position, and the fact that most of the people observing the debate seemed to support him, somewhat alarming. If I actually succeed in building a system that replaces a significant number of humans with computers, are fanatical communist luddites going to murder me? More importantly, there seem to be a lot of unresolved philosophical questions regarding the development of intelligent machines. Automation is clearly going to vastly increase an already huge wealth disparity around the world. It is not clear to me what the majority of humanity will think and feel about this. I do not have the answers, but I guess the point is moot anyway: automation seems to be an unstoppable force, so for better or worse I will continue working on it.
