Sayari collects public data from around the globe, including corporate registries, civil litigation registries, customs and import/export data, land and real property ownership records, official gazettes, and more. This data powers our products and is used for due diligence, risk management, financial intelligence, and compliance.
To make this data useful, Sayari often runs entity resolution on it. This allows us to detect when a single company or person is mentioned on two different web pages. For this task you will collect some public data and perform some simple entity resolution on it.
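As a rough illustration of what "simple entity resolution" can mean here, the sketch below groups name mentions that become identical after basic normalization. The helper names (`normalize_name`, `resolve_entities`) and the sample mentions are made up for this example; real entity resolution is usually more involved.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Casefold, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(cleaned.split())


def resolve_entities(mentions):
    """Group raw name mentions that share the same normalized form."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize_name(mention)].append(mention)
    return dict(groups)


# Made-up mentions, as they might appear on two different pages.
mentions = ["Xtreme Xteriors LLC", "XTREME XTERIORS, L.L.C.", "Example Agent Inc"]
resolved = resolve_entities(mentions)
```

Here the first two mentions end up in the same group, because both normalize to `xtreme xteriors llc`.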
The Secretary of State of North Dakota provides a business search web app that allows users to search for businesses by name. Your task:
- Play around with the site and figure out how to query companies by name.
  - Hint: Your browser's dev tools are good for this.
- Download information for all active companies whose names start with the letter "X" (e.g., Xtreme Xteriors LLC) including their Commercial Registered Agent, Registered Agent, and/or Owners. Save the crawled data in the file format of your choice.
  - Hint: scrapy is a suitable web-crawling framework.
- Create and plot a graph of the companies, registered agents, and owners.
  - Hint: NetworkX is a suitable graph library that plays nice with matplotlib.
  - Hint: You may consider names as sufficiently unique to identify each node in the graph.
  - Hint: An example plot output is included below.
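The graph step could be sketched roughly as follows, using names as node identifiers as the hint suggests. The record fields (`name`, `registered_agent`, `owners`) and the sample records are assumptions for illustration only; they are not the schema returned by the web app, and the crawled data may be shaped differently.

```python
import networkx as nx

# Made-up sample records; the field names are assumed, not the site's schema.
records = [
    {"name": "Xtreme Xteriors LLC", "registered_agent": "Jane Doe", "owners": []},
    {"name": "X Example Inc", "registered_agent": "Jane Doe", "owners": ["John Roe"]},
]


def build_graph(records):
    """Link each company to its agent and owners, keyed by name."""
    graph = nx.Graph()
    for record in records:
        company = record["name"]
        graph.add_node(company, kind="company")
        agent = record.get("registered_agent")
        if agent:
            # Re-adding an existing node only updates its attributes.
            graph.add_node(agent, kind="agent")
            graph.add_edge(company, agent, relation="registered_agent")
        for owner in record.get("owners", []):
            graph.add_node(owner, kind="owner")
            graph.add_edge(company, owner, relation="owner")
    return graph


def plot_graph(graph, path="graph.png"):
    """Draw the graph to an image file (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 12))
    layout = nx.spring_layout(graph, seed=42)
    nx.draw(graph, pos=layout, node_size=20, with_labels=False)
    plt.savefig(path)
    plt.close()


graph = build_graph(records)
components = list(nx.connected_components(graph))
```

Because the two sample companies share an agent name, they fall into a single connected component. Calling `plot_graph(graph)` would write the drawing to `graph.png`.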
Please submit a link to a public GitHub repository that includes both your data and plot.
- As of 2019/08/21 there are 193 such companies (there may be fewer). Please do not spam the web app.
- Together, using scrapy and NetworkX, your crawling and graph code should not run much beyond 100 lines of PEP 8-compliant code.
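One way to honour the request not to spam the web app is to throttle the crawler. The dictionary below is a sketch of conservative Scrapy settings; the name `POLITE_SETTINGS` and the specific values are assumptions chosen for illustration, not requirements of the task.

```python
# Conservative Scrapy settings (the values are illustrative assumptions).
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time
    "AUTOTHROTTLE_ENABLED": True,         # back off if responses slow down
    "HTTPCACHE_ENABLED": True,            # avoid re-fetching during development
}
```

Assigning this dictionary to the `custom_settings` class attribute of a `scrapy.Spider` subclass applies it to that spider only.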