@jvani
Last active April 22, 2022 20:22

Sayari Data Task

Context

Sayari collects public data from around the globe including: corporate registries, civil litigation registries, customs and import/export data, land and real property ownership, official gazettes, and more. This data powers our products and is leveraged for due diligence, risk management, and financial intelligence and compliance.

For the data to be useful, Sayari often runs entity resolution on what we collect. This allows us to detect when a single company or person is mentioned on two different web pages. For this task you will collect some public data and perform some simple entity resolution on it.
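As a rough illustration of what "simple entity resolution" can mean here, the sketch below reduces a name to a matching key so that trivially different spellings of the same name collapse to one entity. The function name `normalize_name` and the specific rules (accent folding, case folding, punctuation stripping) are assumptions chosen for illustration, not a description of Sayari's actual approach.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return a simplified key for matching names across records."""
    # Decompose accented characters, then drop the combining marks.
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    # Lowercase aggressively and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", folded.casefold())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())
```

Two records whose names produce the same key could then be treated as the same node. Anything beyond this, such as matching abbreviated against spelled-out legal suffixes, would need extra rules.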

Task

The Secretary of State of North Dakota provides a business search web app that allows users to search for businesses by name. Your task:

  1. Play around with the site and figure out how to query companies by name.
    • Hint: Your browser's dev tools are good for this.
  2. Download information for all active companies whose names start with the letter "X" (e.g., Xtreme Xteriors LLC) including their Commercial Registered Agent, Registered Agent, and/or Owners. Save the crawled data in the file format of your choice.
    • Hint: scrapy is a suitable web-crawling framework.
  3. Create and plot a graph of the companies, registered agents, and owners.
    • Hint: NetworkX is a suitable graph library that plays nice with matplotlib.
    • Hint: You may consider names as sufficiently unique to identify each node in the graph.
    • Hint: An example plot output is included below.
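Steps 2 and 3 can be bridged with a small helper that filters the crawled records and turns them into graph edges. This is a minimal sketch under stated assumptions: the record layout (keys `name`, `status`, and the three role fields) and the status value `"Active"` are hypothetical and depend on how you save the crawl.

```python
# Hypothetical field names; adjust to match your saved crawl output.
ROLE_FIELDS = ("commercial_registered_agent", "registered_agent", "owners")


def build_edges(records):
    """Return (company, related_name, role) triples for active "X" companies."""
    edges = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        # Keep only active companies whose names start with "X".
        if rec.get("status") != "Active" or not name.upper().startswith("X"):
            continue
        for role in ROLE_FIELDS:
            value = rec.get(role)
            if not value:
                continue
            # A field may hold a single name or a list of names.
            names = value if isinstance(value, list) else [value]
            for related in names:
                edges.append((name, related.strip(), role))
    return edges
```

Each triple can then be added to a NetworkX graph, for example with `Graph.add_edge(company, related_name, role=role)`. Because names are used as node identifiers, an agent shared by several companies becomes a single connected node.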

Please submit a link to a public Github repository that includes both your data and plot.

Notes

  • As of 2019/08/21 there are 193 such companies (there may be fewer). Please do not spam the web app.
  • Using scrapy and NetworkX, your crawling and graph code together should not run much beyond 100 lines of PEP 8-compliant code.
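One way to avoid spamming the web app is to throttle the crawler through Scrapy's settings. The dictionary below is a sketch: the setting names are standard Scrapy settings, but the variable name `POLITE_SETTINGS` and the specific values are assumptions to tune as needed.

```python
# Assumed values; tune them to what the site can comfortably handle.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request in flight at a time
    "AUTOTHROTTLE_ENABLED": True,         # back off when responses slow down
    "HTTPCACHE_ENABLED": True,            # avoid re-fetching during development
}
```

Assigning this dictionary to a spider's `custom_settings` class attribute applies it to that spider only.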

Example Plot from NetworkX Tutorial

