Shawn Graham shawngraham

View topic-model-john-adams-diaries.r
# Topic Modeling John Adams' Diaries
# slightly modified version of
# by Andreas Niekler, Gregor Wiedemann
# go get the diaries
# these were scraped from
View johnadams.csv
id date text
1 1753-06-08 At Colledge. A Clowdy ; Dull morning and so continued till about 5 a Clock when it began to rain ; moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
2 1753-06-09 At Colledge the weather still remaining Clowdy all Day till 6 o'Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
3 1753-06-10 At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation.
4 1753-06-11 At Colledge a fair morning and pretty warm. About 2 o'Clock there appeared some symptoms of an approaching shower attended with some thunder and lightning.
5 1753-06-12 At Colledge a Clowdy morning heard Dr. Wigglesworth Preach from the 20 Chapter of exodus 8 9 and 10th. Verses.
6 1753-06-13 At Colledge a Cloudy morning about 10 o'Clock the Sun shone out very warm but abo
View John-Adams-Diaries.ipynb
View text-analysis-and-topic-model-from-scraping-one-set-of-diaries.r
#let's fix the first column in scrape
#i want to remove the first three characters, leaving us with a date
#or at least something that looks like a date
#this removes the diary metadata from the date
scrape$id <- substring(scrape$id, 4)
#this creates a new column with just the month extracted
month <- str_sub(scrape$id, 5, 6)
scrape['month'] <- month
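
The two steps above (dropping a three-character prefix, then pulling the month out by position) can be sketched on made-up data. The id format below is hypothetical, since the real contents of `scrape` are not shown here, and base R's `substr()` stands in for `stringr::str_sub()` so the sketch runs without extra packages; for positive positions the two behave the same.

```r
# Hypothetical raw ids: a three-character prefix followed by a yyyymmdd date.
# The real id format in `scrape` is an assumption, not taken from the source.
scrape <- data.frame(
  id = c("xyz17530608", "xyz17530609"),
  stringsAsFactors = FALSE
)

# Drop the first three characters, keeping everything from position 4 onward
scrape$id <- substring(scrape$id, 4)

# Characters 5 and 6 of the remaining string hold the month
# (base substr() used here in place of stringr::str_sub())
scrape["month"] <- substr(scrape$id, 5, 6)

print(scrape)
```

If the stripped ids contain separators such as hyphens (for example `1753-06-08`), the month would sit at positions 6 and 7 instead, so the positions are worth checking against a few real rows first.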
View topic-model-from-one-diary-scrape.r
#let's fix the first column in scrape
#i want to remove the first three characters, leaving us with a date
#or at least something that looks like a date
scrape$id <- substring(scrape$id, 4)
View scraping-one-set-of-diaries.r
webpage <- ""
html <- read_html(webpage) # read the raw html
View diaries-to-topicmodels.r
#turn entries into a corpus object
docs <- Corpus(VectorSource(entries))
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs,content_transformer(tolower))
#Strip digits
View diary-scraper.r
base_url <- ""
# Load the page <- read_html(x = "")
# Get link URLs
shawngraham / diary-scrape.r
Last active Nov 11, 2019
why is this not working? the loop is the problem
View diary-scrape.r
base_url <- ""
# Load the page <- read_html(x = "")
# Get link URLs
urls <- %>% # feed `` to the next step
html_nodes("a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
# Get link text
pip install fitz
pip install PyMuPDF
import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]