Building a Search Engine from Scratch - WIP

Context

In the summer of 2020, I worked with Paul Oullette in a Research Experience for Undergraduates (REU) led by Professor Fatemeh Nargesian. Our goal was to build a dataset search engine that would show off some of Professor Nargesian's research.

Goals

  1. Build a publicly available dataset search engine.
  2. Allow users to use keyword search to find datasets.
  3. When a user views a table in a dataset, show columns in other datasets that can be joined with that table's columns. This would show off Professor Nargesian's LSH Ensemble paper.

Who did what?

I focused on the keyword search engine while Paul focused on the joinability engine.

What happened afterwards?

At the end of the summer, I left the project to spend more time learning about programming languages and compilers.

Paul stuck with it. He removed the keyword search engine because it didn't perform well (as described in the Weaknesses section), and replaced it with the faiss search engine. He added the ability to generate a directory structure over the datasets based on this paper from Professor Nargesian. His work eventually culminated in a demo at VLDB '21.

Building a Keyword Search Engine

The purpose of the keyword search engine was to help users find datasets. We wanted the user experience to be similar to that of Google's dataset search engine.

Background

fastText

Simhash

LSHForest

A Keyword Search Engine

Weaknesses
