Shawn Graham (shawngraham)
johnadams.csv
id,date,text
1,1753-06-08,"At Colledge. A Clowdy ; Dull morning and so continued till about 5 a Clock when it began to rain ; moderately But continued not long But remained Clowdy all night in which night I watched with Powers."
2,1753-06-09,"At Colledge the weather still remaining Clowdy all Day till 6 o'Clock when the Clowds were Dissipated and the sun brake forth in all his glory."
3,1753-06-10,"At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation."
4,1753-06-11,"At Colledge a fair morning and pretty warm. About 2 o'Clock there appeared some symptoms of an approaching shower attended with some thunder and lightning."
5,1753-06-12,"At Colledge a Clowdy morning heard Dr. Wigglesworth Preach from the 20 Chapter of exodus 8 9 and 10th. Verses."
6,1753-06-13,"At Colledge a Cloudy morning about 10 o'Clock the Sun shone out very warm but abo
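For context, a minimal sketch of loading this file in R, assuming it is saved locally under the same name with the id/date/text layout shown above:

johnadams <- read.csv("johnadams.csv", stringsAsFactors = FALSE)  # read the diary entries
head(johnadams$text)  # peek at the first few entries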
John-Adams-Diaries.ipynb
text-analysis-and-topic-model-from-scraping-one-set-of-diaries.r
#let's fix the first column in scrape
#i want to remove the first three characters, leaving us with a date
#or at least something that looks like a date
#this removes the diary metadata from the date
scrape$id <- substring(scrape$id, 4)
library(stringr) # str_sub() comes from stringr
#this creates a new column with just the month extracted
#in a YYYY-MM-DD string the month is characters 6-7, not 5-6
month <- str_sub(scrape$id, 6, 7)
scrape$month <- month
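A quick check of those index choices on a single value; the "D1 " prefix is a hypothetical stand-in for whatever three metadata characters the scrape actually prepends:

library(stringr)
id <- "D1 1753-06-08"        # hypothetical raw id: three metadata characters, then the date
substring(id, 4)             # "1753-06-08"
str_sub("1753-06-08", 6, 7)  # "06", the month in a YYYY-MM-DD string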
topic-model-from-one-diary-scrape.r
#let's fix the first column in scrape
#i want to remove the first three characters, leaving us with a date
#or at least something that looks like a date
scrape$id <- substring(scrape$id, 4)
library(tm)
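The preview cuts off at the library call; the next step is presumably building a corpus from the scraped text. A one-line bridge, where scrape$text is my assumption about the column name:

docs <- Corpus(VectorSource(scrape$text))  # scrape$text is an assumed column name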
scraping-one-set-of-diaries.r
library("rvest")
library(dplyr)
#https://francojc.github.io/2017/11/02/acquiring-data-for-language-research-web-scraping/
#modified
webpage <- "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php"
html <- read_html(webpage) # read the raw html
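The preview ends at the page load; a sketch of the extraction step that would typically follow, where ".entry" is a placeholder selector rather than the page's actual CSS class:

entries <- html %>%
  html_nodes(".entry") %>%  # placeholder selector for the diary entries
  html_text(trim = TRUE)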
diaries-to-topicmodels.r
setwd("~/diaries")
library(tm)
#turn entries into a corpus object
docs <- Corpus(VectorSource(entries))
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs,content_transformer(tolower))
#Strip digits
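From here the usual route to topics is a document-term matrix fed to an LDA fit; a sketch using the topicmodels package, where k = 10 is a free choice of mine rather than anything the gist specifies:

docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common function words
dtm <- DocumentTermMatrix(docs)
library(topicmodels)
lda <- LDA(dtm, k = 10)  # fit a ten-topic model
terms(lda, 5)            # top five terms per topic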
diary-scraper.r
#after https://francojc.github.io/2015/03/01/web-scraping-with-rvest-in-r/
library(rvest)
library(dplyr)
base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")
# Get link URLs
diary-scrape.r
Last active Nov 11, 2019
why is this not working? the loop is the problem
library(rvest)
base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")
# Get link URLs
urls <- main.page %>%  # feed `main.page` to the next step
  html_nodes("a") %>%  # get the CSS nodes
  html_attr("href")    # extract the URLs
# Get link text
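Given the gist's description ("the loop is the problem"), here is one working shape for that loop; it assumes the diary hrefs are relative paths that need base_url prepended, and the "/digitaladams/" filter is my guess at how to keep only the diary links:

diary_urls <- urls[grepl("/digitaladams/", urls)]  # keep only the diary pages
entries <- character(length(diary_urls))
for (i in seq_along(diary_urls)) {
  page <- read_html(paste0(base_url, diary_urls[i]))  # relative href + base URL
  entries[i] <- page %>%
    html_nodes("p") %>%
    html_text() %>%
    paste(collapse = " ")
  Sys.sleep(1)  # be polite between requests
}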
getpics.py
"""
pip instal fitz
pip install PyMuPDF
"""
import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
for img in doc.getPageImageList(i):
xref = img[0]
golems-in-the-city-with-colds.nlogo