Domain Crawl

Generating usable data from a domain crawl is done in 3 main steps:

  1. Running the crawler on the sites of interest
  2. Running utility scripts to enrich the crawled output; this involves two further steps (their order does not matter):
    1. Running the getUrls tool to extract all of the URLs found on each page
    2. Running the metascraper on those files to extract each page's publication date
  3. Running the post-processor on the improved files

After this, the resulting output JSONs can be further processed into CSVs or visualizations.

1. Running the domain crawler

To run the domain crawler, clone the mediacat-domain-crawler into /media/data/batch-crawl on the Graham instance, then follow the instructions in the README to run the crawl. If storage becomes an issue, run the "clean tmp" script in the background.
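A minimal setup sketch is shown below. The repository URL, crawl entry point, and screen name are assumptions for illustration; the actual crawl command comes from the repository README.

```
# On the Graham instance
cd /media/data/batch-crawl
git clone https://github.com/UTMediaCAT/mediacat-domain-crawler.git   # assumed repo URL
cd mediacat-domain-crawler

# Launch the crawl inside a named screen so it survives logout
screen -S domainCrawl
# Inside the screen, run the crawl command from the README, e.g.:
# node crawl.js <seed-urls>    # placeholder -- use the README's actual invocation
```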

2. Running the utility scripts

Clone the mediacat-backend repository and run the following scripts from its /utils folder.

i. Running getUrls

Following the instructions here, run the getUrls tool on the crawled files to generate a set of files containing all of the URLs found on each page.
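A sketch of this step under assumed names: the script filename and arguments below are placeholders rather than the tool's documented CLI, so defer to the linked instructions.

```
# Clone the backend; the utility scripts live in its utils folder
cd /media/data/processing
git clone https://github.com/UTMediaCAT/mediacat-backend.git   # assumed repo URL
cd mediacat-backend/utils

# Placeholder invocation -- check the linked instructions for the real one:
# node getUrls.js /media/data/batch-crawl/<site>/newCrawler/Results <urls-output-dir>
```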

ii. Running metascraper

Following the instructions here, run the metascraper on the files created in step i to generate a new set of files containing the publication date of each page (where one can be found).
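The invocation below is again a placeholder sketch; only the pattern (run it inside a screen because it is slow, and point it at the getUrls output) comes from this document.

```
# The metascraper can run for days, so start it inside a screen
screen -S metascraperRun
cd mediacat-backend/utils

# Placeholder -- use the linked instructions for the actual command:
# node metascraper.js <urls-output-dir> <dated-output-dir>
```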

3. Running the post-processor

You can now run the post-processor on the data generated in step 2; follow the instructions here, making sure to also follow the advanced-usage instructions.
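As a sketch (the post-processor command is a placeholder; the real one is in the linked instructions), combined with the memory check described under "Additional notes":

```
# Run the post-processor inside a screen and keep an eye on memory
screen -S postProcess
# Placeholder -- follow the linked advanced-usage instructions for the real command:
# node postProcessor.js <dated-output-dir> ./Output

# In another shell, check available memory periodically
free -h
```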

Twitter Crawl

Generating data from a Twitter crawl is done in two to three steps:

  1. Run the Twitter crawler; if you use the expanding feature described here, skip step 2
  2. If you did not use the expanding feature in step 1, run the urlExpander on the data from step 1; otherwise, omit this step
  3. Run the post-processor on the data from step 1 or 2 (whichever applies)

After this, the resulting output JSONs can be further processed into CSVs or visualizations.

1. Running the Twitter crawler

Following the instructions here, run the Twitter crawler on the desired scope. Note that there is currently an issue with the Twint module used by the crawler that prevents it from retrieving all tweets; you can track this issue here.
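A placeholder sketch of launching the crawl; the clone location, script name, and scope argument are assumptions, not the crawler's documented CLI.

```
# Run the Twitter crawler inside a screen, since it can take a long time
screen -S twitterCrawl
cd /media/data/batch-crawl/mediacat-twitter-crawler   # assumed clone location
# Placeholder -- use the linked instructions for the real command and scope file:
# python crawler.py --scope scope.csv
```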

2. urlExpander

If you did not use the URL-expanding feature during the crawl, you must perform this step; otherwise, skip it. Clone the mediacat-backend repository and run the urlExpander script on the data from step 1.
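A sketch with placeholder names (the real script path and arguments are in the mediacat-backend repository):

```
cd mediacat-backend/utils
# Placeholder invocation -- check the repository for the real script name and arguments:
# node urlExpander.js <twitter-crawl-output-dir> <expanded-output-dir>
```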

3. Running the post-processor

You can now run the post-processor on the data generated in step 2 (or step 1 if step 2 was not performed); follow the instructions here, and be sure to follow the advanced-usage instructions.
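Once it finishes, a quick sanity check is to count the resulting JSONs. The path below follows the layout used elsewhere in this document; substitute the actual output directory.

```
ls /media/data/processing/<crawl-name>/Post-Processor/Output | wc -l
```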

Running processes and file locations

Jewish Currents

This has been crawled and completely processed; the final output can be found in /media/data/processing/jewishCurrents/Post-Processor/Output

Jewish Journal

Still being crawled under the screen "getInScopeLinks" (for some reason, the screen was renamed from "jewishJournal" to "getInScopeLinks"). As of now, 109,817 JSONs have been produced. You can continue to monitor this by checking the number of files in /media/data/batch-crawl/jewishJournal/newCrawler/Results/https___jewishjournal_com/
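For example, to count the crawled JSONs and reattach to the crawl's screen:

```
# Count crawled JSONs so far (109,817 at the time of writing)
ls /media/data/batch-crawl/jewishJournal/newCrawler/Results/https___jewishjournal_com/ | wc -l

# Reattach to the crawl's screen if you need to inspect it
screen -r getInScopeLinks
```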

Electronic Intifada

This has been crawled and completely processed; the final output can be found in /media/data/processing/electronic_infatada/Post-Processor/Output

Tablet Mag

This has been crawled and completely processed; the final output can be found in /media/data/processing/tabletMag/Post-Processor/Output

Mondoweiss

Has been crawled and URLs have been extracted; it is still going through the metascraper under the screen "mondoweissDates". This seems to be taking a while and may need to be restarted every once in a while. Currently, 36,802 files have been processed into /media/data/processing/mondoweiss/utils/metascraper/DatedOutput. After completion, the output needs to be run through the post-processor.
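To check progress and restart the job if it has stalled:

```
# Check metascraper progress (36,802 files at the time of writing)
ls /media/data/processing/mondoweiss/utils/metascraper/DatedOutput | wc -l

# Reattach to the metascraper's screen to inspect or restart it
screen -r mondoweissDates
```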

Additional notes

Memory issues

Running the post-processor uses a lot of memory. To ensure that the VM does not run out of memory, use the command free -h to see how much memory is "available". If the available memory gets too low, kill any resource-intensive processes to prevent the instance from stopping. If the instance does stop, go to the dashboard, sign in, click "Instances" on the left side, then click "Start Instance".
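For example, to check memory and find the heaviest processes before deciding what to kill:

```
# How much memory is still available?
free -h

# List the ten most memory-hungry processes
ps aux --sort=-%mem | head -n 10

# Kill a runaway process by its PID if available memory is getting dangerously low
kill <PID>
```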

Using screens

Since the crawls and post-processing may take several days to complete, you should use screens whenever you want to run a crawl or post-process some data in order to be able to log out of the instance without killing the process.
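The basic screen workflow used throughout this document:

```
# Start a named screen session before launching a long-running job
screen -S myCrawl

# Detach without killing the job: press Ctrl-A, then D

# Later, list sessions and reattach, even from a new SSH login
screen -ls
screen -r myCrawl
```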
