Generating usable data from a domain crawl is done in three main steps:
- Running the crawler on the sites of interest
- Running the util scripts to improve the crawled output; this involves two further steps (their order is irrelevant):
  - Running the getUrls tool to extract all of the URLs found on each page
  - Running the metascraper on the files to get their publication dates
- Running the post-processor on the improved files

After this, the resulting output JSONs can be further processed into CSVs or visualizations.
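The steps above can be sketched as a small wrapper script. The tool invocations in the comments are hypothetical placeholders (the real commands come from each tool's README); the useful part is the step-ordering and timestamped logging pattern for multi-day runs:

```shell
#!/usr/bin/env bash
# Sketch of the domain-crawl pipeline. The commented-out commands are
# placeholders, not the repos' actual invocations -- substitute the real
# ones from the READMEs. getUrls and the metascraper can run in either
# order; the crawl must come first and the post-processor last.
set -euo pipefail

run_step() {
  local name=$1; shift
  echo "[$(date '+%F %T')] starting: $name"
  "$@"
  echo "[$(date '+%F %T')] finished: $name"
}

# run_step "crawl"         node crawl.js --sites sites.txt      # placeholder
# run_step "getUrls"       node getUrls.js Results/             # placeholder
# run_step "metascraper"   node metascraper.js Results/         # placeholder
# run_step "post-process"  node postprocessor.js DatedOutput/   # placeholder
```

Wrapping each long-running stage like this makes it easy to tell from the screen log which step a multi-day run is on and when it started.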
To run the domain crawler, clone mediacat-domain-crawler into /media/data/batch-crawl
on the Graham instance, then follow the instructions in its README to run the crawl. If storage is tight, be sure to run the "clean tmp" script in the background to avoid filling the disk.
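I don't have the repo's actual "clean tmp" script in front of me, but the idea is roughly the loop below. The directory path and the 60-minute age threshold are assumptions for illustration; adjust both to match the crawler's real tmp location:

```shell
# Hypothetical stand-in for the "clean tmp" script: delete temp files
# older than 60 minutes so the disk does not fill up mid-crawl.
# The directory argument and age threshold are assumptions, not the
# repo's actual values.
clean_tmp() {
  local tmp_dir=$1
  find "$tmp_dir" -type f -mmin +60 -delete
}

# Run it in the background for the duration of a long crawl, e.g.:
#   while true; do clean_tmp /media/data/batch-crawl/tmp; sleep 300; done &
```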
Clone the mediacat-backend repository and run the following scripts from its /utils folder.
Following the instructions here, run the getUrls tool on the crawled files to generate a set of files containing all of the URLs that were found on each page.
Following the instructions here, run the metascraper on the files created by the getUrls step to generate a new set of files containing the publication date of each page (where one can be found).
You can now run the post-processor on the data generated in step 2; follow the instructions here, and be sure to follow the instructions for advanced usage.
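Before moving the post-processor's output JSONs on to CSV conversion or visualization, a quick sanity check helps catch truncated files from an interrupted run. This is a generic helper, not a script from the repo; it uses python3's stdlib json module to validate each file:

```shell
# Sanity-check an output directory: count the JSON files and flag any
# that do not parse. Not part of the mediacat repos -- just a generic
# check using python3 -m json.tool from the standard library.
check_output() {
  local dir=$1 bad=0 total=0
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue
    total=$((total + 1))
    python3 -m json.tool "$f" > /dev/null 2>&1 || { echo "invalid JSON: $f"; bad=$((bad + 1)); }
  done
  echo "$total files checked, $bad invalid"
}
```

Any file flagged as invalid usually means the producing step was killed mid-write and should be re-run for that input.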
Generating data from a Twitter crawl is done in two or three steps:
- Run the Twitter crawler; if you use its URL-expanding feature, skip step 2
- If you did not use the expanding feature in step 1, run the urlExpander on the data from step 1; otherwise, omit this step
- Run the post-processor on the data from step 1 or step 2 (as applicable)

After this, the resulting output JSONs can be further processed into CSVs or visualizations.
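The branch between those steps can be captured in a tiny helper so a wrapper script always feeds the post-processor the right input. The directory names here are hypothetical placeholders, not the repos' actual layout:

```shell
# Which directory should feed the post-processor? Directory names are
# hypothetical placeholders for illustration only.
post_processor_input() {
  # $1: "yes" if the crawler's URL-expanding feature was used
  if [ "$1" = "yes" ]; then
    echo "crawler_output"      # expanded in-crawl: feed crawl output directly
  else
    echo "urlExpander_output"  # otherwise feed the urlExpander's output
  fi
}
```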
Following the instructions here, run the Twitter crawler on the desired scope. Note that there is currently an issue with the Twint module used by the crawler that prevents it from retrieving all of the tweets; you can track the issue here.
If you did not use the URL-expanding feature during the crawl, you must perform this step; otherwise you may skip it. Clone the mediacat-backend repository and run the urlExpander script on the data from step 1.
You can now run the post-processor on the data generated in step 2 (or step 1, if step 2 was not performed); follow the instructions here, and be sure to follow the instructions for advanced usage.
This has been crawled and completely processed; the final output can be found in /media/data/processing/jewishCurrents/Post-Processor/Output
Still being crawled under the screen "getInScopeLinks" (for some reason, it was renamed from "jewishJournal"). As of now, 109817 JSONs have been produced. You can continue to monitor progress by checking the number of files in /media/data/batch-crawl/jewishJournal/newCrawler/Results/https___jewishjournal_com/
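A small helper for that kind of monitoring is sketched below. `ls | wc -l` can miscount when filenames contain newlines, so find is the safer way to count files in large results directories:

```shell
# Count the JSON files a crawl has produced so far in one directory.
# find is preferred over `ls | wc -l` for large or odd-named file sets.
count_jsons() {
  find "$1" -maxdepth 1 -name '*.json' | wc -l
}

# e.g. watch the jewishJournal crawl grow, refreshing every minute:
#   watch -n 60 "find /media/data/batch-crawl/jewishJournal/newCrawler/Results/https___jewishjournal_com/ -name '*.json' | wc -l"
```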
This has been crawled and completely processed; the final output can be found in /media/data/processing/electronic_infatada/Post-Processor/Output
This has been crawled and completely processed; the final output can be found in /media/data/processing/tabletMag/Post-Processor/Output
This has been crawled and the URLs have been extracted; it is still going through the metascraper under the screen "mondoweissDates". This seems to be taking a while and may need to be restarted every once in a while. So far, 36802 files have been processed into /media/data/processing/mondoweiss/utils/metascraper/DatedOutput. After completion, the data needs to be run through the post-processor.
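Since the metascraper run may need periodic restarts, a simple watchdog loop can save manual babysitting. This is a generic sketch, not a script from the repos, and the supervised command is a placeholder:

```shell
# Generic restart loop: re-run a command until it exits successfully.
# The supervised command is a placeholder -- substitute the real
# metascraper invocation. Pair this with screen for multi-day runs.
keep_alive() {
  local tries=0
  until "$@"; do
    tries=$((tries + 1))
    echo "exited with failure; restart #$tries" >&2
    sleep 1
  done
}

# e.g.: keep_alive node metascraper.js Input/ DatedOutput/   # placeholder
```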
Running the post-processor uses a lot of memory. To ensure that the VM does not run out, use the command free -h
to see how much memory is "available". If the available memory gets too low, kill any resource-intensive processes to prevent the instance from stopping. If the instance is stopped, go to the dashboard, sign in, click "Instances" on the left side, then click "Start Instance".
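To script that check instead of eyeballing the output, you can parse the "available" column, which is the 7th field of the Mem: row in GNU procps `free` output. The 2048 MiB threshold below is an arbitrary example, not a value from the project:

```shell
# Extract the "available" MiB from `free -m` output (GNU procps layout:
# the 7th column of the Mem: row). The threshold in the usage example
# below is arbitrary -- tune it to your instance size.
parse_available_mb() {
  awk '/^Mem:/ {print $7}'
}

# e.g.:
#   avail=$(free -m | parse_available_mb)
#   [ "$avail" -lt 2048 ] && echo "WARNING: only ${avail} MiB available"
```

This could run from cron or a background loop inside a screen session to warn before the instance is at risk of stopping.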
Since the crawls and post-processing may take several days to complete, use screen whenever you run a crawl or post-process data; this lets you log out of the instance without killing the process.