Robin Edwards geotheory

@geotheory
geotheory / html_similarity.R
Created Mar 23, 2019
R utility to measure the similarity of HTML documents
View html_similarity.R
# Measuring the similarity of style and structure between HTML documents
# An R implementation of https://github.com/matiskay/html-similarity
# Dependencies: magrittr, rvest, stringr, xml2, dplyr
require(magrittr)
jaccard_similarity = function(x, y){
  x = unique(x)
  y = unique(y)
  if(length(x) + length(y) == 0) return(1)
View jaccard_similarity.R
jaccard_similarity = function(x, y){
  x = unique(x)
  y = unique(y)
  if(length(x) + length(y) == 0) return(1)  # two empty sets count as identical
  xy = length(intersect(x, y))              # size of the intersection
  xy / (length(x) + length(y) - xy)         # intersection over union
}
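A minimal usage sketch for the function above, assuming it is applied to sets of CSS class names as in the html-similarity approach. The vectors `page_a` and `page_b` are made-up illustrative data, not taken from the gist, and the function is repeated so the snippet runs on its own.

```r
# Definition repeated from the gist so this example is self-contained
jaccard_similarity = function(x, y){
  x = unique(x)
  y = unique(y)
  if(length(x) + length(y) == 0) return(1)
  xy = length(intersect(x, y))
  xy / (length(x) + length(y) - xy)
}

# Hypothetical CSS class names from two pages (illustrative only)
page_a = c('nav', 'header', 'article', 'footer', 'nav')  # duplicate 'nav' is ignored
page_b = c('nav', 'header', 'sidebar', 'footer')

# 3 shared classes out of 5 distinct classes overall
sim = jaccard_similarity(page_a, page_b)
print(sim)  # 0.6

# Two empty inputs are treated as identical by the early return
empty_sim = jaccard_similarity(character(0), character(0))
print(empty_sim)  # 1
```

Because both inputs pass through `unique()`, repeated elements do not change the score; only set membership matters.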
View Social Media Buzz What people are saying about the Las Vegas shooting (LIVE FEED) | Page 12102.html
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><base href="https://embed.scribblelive.com/Embed/v7.aspx?Id=2678695&amp;Page=12101&amp;overlay=false"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id="bN015htcoyT__google-cache-hdr"><div><span>This is Google's cache of <a hr
View readability.js
let Readability = require('readability');
var fs = require('fs')
var JSDOM = require('jsdom').JSDOM;
var url = 'https://www.example.com/the-page-i-got-the-source-from';
var filename = '~/Downloads/test.html';
View alexa-national-top50-sites.csv
id,site,description,daily_time_on_site,daily_pageviews_per_visitor,percent_of_traffic_from_search,total_sites_linking_in,country
1,google.com,"Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology. ",7.85,9.85,1.8,2581731,AF
2,youtube.com,"YouTube is a way to get your videos to the people who matter to you. Upload, tag and share your videos worldwide! ",8.83333333333333,5.05,11.9,2014485,AF
3,facebook.com,"A social utility that connects people, to keep up with friends, upload photos, share links and videos. ",9.66666666666667,4.01,7.4,5292468,AF
4,google.com.af,NA,6.05,6.12,2,988,AF
5,acbar.org,"ACBAR.org - Your No. 1 job site in Afghanistan, get the latest job vacancy announcement from more than 1500 employers (UN, Government Agencies, NGOs & Private Sector). ACBAR.org is the most visited website in Afghanistan and is also popular in the Southeast Aisa and Europe. ",8.93333333333333,7.16,12,272,AF
View poly-test.csv
vx vy
1e3 21
1071.519305237606 23
1148.1536214968828 27
1230.268770812381 25
1318.2567385564075 28
1412.537544622754 29
1513.5612484362086 32
1621.8100973589299 35
1737.8008287493763 33
geotheory / publicsuffix.R
Last active Mar 18, 2019
Parse the Mozilla-initiated Public Suffix List of TLDs and public subdomains into a usable R data.frame, with a function to return the root private subdomain
View publicsuffix.R
# see https://publicsuffix.org/
require(stringr)
require(dplyr)
require(tibble) # provides enframe()
ps = readLines('https://publicsuffix.org/list/public_suffix_list.dat') %>% paste(collapse='$') %>%
  str_extract('(?<=BEGIN ICANN DOMAINS===).*(?=// ===END PRIVATE DOMAINS)') %>% str_split('[$]') %>% .[[1]] %>%
  enframe(name = NULL) %>% rename(subdom = value) %>% mutate(subdom = str_trim(subdom)) %>%
  filter(subdom != '') %>% filter(!str_detect(subdom, '^//')) %>%
  mutate(tld = subdom %>% str_remove('.*[.]'),
View modes.R
Modes <- function(x) {
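  ux <- unique(x)                # distinct values, in order of first appearance
  tab <- tabulate(match(x, ux))  # count of each distinct value
  ux[tab == max(tab)]            # every value that reaches the highest count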
}
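A short usage sketch for `Modes`, showing that it returns every value tied for the highest count instead of a single mode. The input vectors are illustrative, and the function is repeated so the snippet runs on its own.

```r
# Definition repeated from the gist so this example is self-contained
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# 2 and 3 each occur twice, so both are returned
tied = Modes(c(1, 2, 2, 3, 3))
print(tied)    # 2 3

# Works on character vectors as well
single = Modes(c('a', 'b', 'a'))
print(single)  # "a"
```

Results come back in order of first appearance in the input, since `unique()` preserves that order.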
View R-crash-course-script.R
## A CRASH COURSE IN [R] PROGRAMMING
## Robin Edwards (geotheory.co.uk), March 2018
## In RStudio run through line-by-line using Ctrl + Enter
# basic R environmental functions
x=3.14159; y='hello world'; z=TRUE # create some objects. In RStudio they'll appear in 'Workspace'
ls() # list the objects in the Workspace
print(y) # print information to R 'Console'
rm(y) # remove an object
rm(list=ls()) # remove all