Don Fanning (donfanning)
🗃️ Holding up libraries and robbing them blind for the future.
@donfanning
donfanning / sitecrawler.js
Created August 15, 2018 12:02 — forked from martinjacobs/sitecrawler.js
Site crawler
var phantom = require('phantom');
var Crawler = require("simplecrawler");

var mycrawler = Crawler.crawl("http://www.example.com/");
mycrawler.maxDepth = 3;
mycrawler.interval = 500; // ms between requests

// Skip static assets; only HTML pages are worth fetching
mycrawler.addFetchCondition(function(parsedURL) {
    if (parsedURL.path.match(/\.(css|jpg|pdf|docx|js|png|ico)/i)) {
        // console.log("Ignored ", parsedURL.path);
        return false;
    }
    return true;
});
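The preview stops at the fetch condition; to actually consume pages, simplecrawler's fetch events are the usual hook. A minimal sketch (the handler bodies are assumptions, not part of the original gist; how phantom was meant to render pages is not shown in the preview):

    // Sketch: Crawler.crawl() starts automatically; listen for results.
    mycrawler.on("fetchcomplete", function(queueItem, responseBody, response) {
        // queueItem.url is the fetched URL; responseBody is a Buffer of the page
        console.log("Fetched", queueItem.url, "(" + responseBody.length + " bytes)");
    });

    mycrawler.on("complete", function() {
        console.log("Crawl finished");
    });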
require_relative "xpath_crawler"
require_relative "parser"

module ShareCrawler
  class << self
    # Returns a hash with the parsed "value" and "date" plus a "warning" counter
    def get(crawler)
      xpath_crawler = XPathCrawler.new(crawler["url"])
      parsed = { "warning" => 0 }
      parsed["value"] = Parser.parse_value(xpath_crawler.parse(crawler["xpath_value"]))
      parsed["date"] = Parser.parse_date(xpath_crawler.parse(crawler["xpath_date"]))
      parsed
    end
  end
end
@donfanning
donfanning / file0.txt
Created August 15, 2018 12:05 — forked from YSRKEN/file0.txt
Design patterns explained with real examples: a web-scraping tool as a case study. ref: https://qiita.com/YSRKEN/items/30654cd7f2f628649d6c
# Web-scraping sample (using Rurima Search, the Japanese Ruby docs search, as the example)
require 'open-uri'  # library for downloading
require 'nokogiri'  # library for parsing
Encoding.default_external = "UTF-8"  # set the default encoding to UTF-8
keyword = "include"  # search keyword
# Build the search URL
url = "https://docs.ruby-lang.org/ja/search/query:#{keyword}/"
# Download step (charset will hold the target site's encoding)
charset = nil
html = open(url) do |f|
  charset = f.charset
  f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)  # parse with the detected charset
@donfanning
donfanning / crawler.js
Created August 15, 2018 12:05 — forked from kruny1001/crawler.js
crawler by userAccount
// Create Docker
// Run Docker
// Grab Twitter
// Save to Firebase
// Send link
// Shut it down
const TwitterCrawler = require('twitter-crawler');
const fs = require('fs');
const log = require('winston');
@donfanning
donfanning / gist:c9696905a38e72315d1ce625926a78ad
Created August 15, 2018 12:08 — forked from javelin/gist:a6b5201ae8327e1338210829baeb8797
Crawl Facebook feeds in Node.js with node-simplecrawler
var Crawler = require("simplecrawler");
var Url = require("url");
var target = "https://graph.facebook.com/ledzeppelin/feed?access_token=1597581200507009%7Ce749be55ea86249f92ae56b081c37b38&fields=from%2Cmessage%2Ccreated_time%2Ctype%2Clink%2Ccomments.summary(true)%2Clikes.summary(true)%2Cshares&since=2016-07-11&until=2016-07-14&limit=10";
var url = Url.parse(target);
var crawler = new Crawler(url.host);
crawler.initialPath = url.path;
crawler.initialPort = 443;
crawler.initialProtocol = "https";
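The preview ends after configuring the crawler; with node-simplecrawler you then attach a fetch handler and start the crawl. A minimal sketch (the JSON handling is an assumption about the Graph API response shape, not taken from the gist):

    crawler.on("fetchcomplete", function(queueItem, responseBody, response) {
        // The Graph API returns JSON; each element of data is one feed post
        var feed = JSON.parse(responseBody.toString("utf8"));
        (feed.data || []).forEach(function(post) {
            console.log(post.created_time, post.type, post.message);
        });
    });
    crawler.start();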
@donfanning
donfanning / crawler.rb
Created August 15, 2018 12:11 — forked from jcf/crawler.rb
Hacky crawler using Mechanize
#!/usr/bin/env ruby
require 'uri'
require 'nokogiri'
require 'mechanize'
require 'logger'
trap('INT') { @crawler.report; exit }
class Crawler
var cheerio = require('cheerio');
var Crawler = require('simplecrawler');
var initialTopic = 'SpaceX';
var blacklist = ["#", "/w/", "/static/", "/api/", "/beacon/", "File:",
"Wikipedia:", "Template:", "MediaWiki:", "Help:", "Special:",
"Category:", "Portal:", "Main_Page", "Talk:", "User:",
"User_talk:", "Template_talk:", "Module:"]; //useless special cases from wikipedia
var url = '/wiki/' + initialTopic;
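The preview defines the blacklist but cuts off before it is used; presumably it feeds a fetch condition so the crawler stays on article pages. A sketch of that wiring, assuming a simplecrawler instance pointed at Wikipedia (the host and the handler are assumptions, not part of the original gist):

    var crawler = new Crawler("en.wikipedia.org");
    crawler.initialPath = url; // start at /wiki/SpaceX
    crawler.initialProtocol = "https";

    // Skip any link whose path contains a blacklisted fragment
    crawler.addFetchCondition(function(parsedURL) {
        return !blacklist.some(function(entry) {
            return parsedURL.path.indexOf(entry) !== -1;
        });
    });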
@donfanning
donfanning / crawler.rb
Created August 15, 2018 12:14 — forked from alpha-netzilla/crawler.rb
crawling
#!/usr/bin/ruby
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
require 'nokogiri'
require 'open-uri'
Capybara.configure do |config|
  config.run_server = false             # no local app to serve; we only visit external pages
  config.default_driver = :poltergeist  # headless PhantomJS driver
end
@donfanning
donfanning / README.md
Created August 15, 2018 12:15 — forked from sic2/README.md
Basic crawler for Facebook posts and events

This crawler gets all the posts of a given Facebook group, plus all the events from a set of given Facebook pages (see the sketch below).
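A minimal sketch of the core fetch against the Graph API. GROUP_ID and ACCESS_TOKEN are placeholders, and the pagination handling is an assumption about the API response, not code from the gist:

    var https = require('https');

    var GROUP_ID = 'YOUR_GROUP_ID';   // placeholder
    var ACCESS_TOKEN = 'YOUR_TOKEN';  // placeholder

    // Fetch one page of the group's feed; the Graph API paginates via paging.next
    function fetchFeed(url, posts, done) {
        https.get(url, function(res) {
            var body = '';
            res.on('data', function(chunk) { body += chunk; });
            res.on('end', function() {
                var page = JSON.parse(body);
                posts = posts.concat(page.data || []);
                if (page.paging && page.paging.next) {
                    fetchFeed(page.paging.next, posts, done); // follow pagination
                } else {
                    done(posts);
                }
            });
        });
    }

    var start = 'https://graph.facebook.com/' + GROUP_ID +
                '/feed?access_token=' + ACCESS_TOKEN;
    fetchFeed(start, [], function(posts) {
        console.log('Fetched ' + posts.length + ' posts');
    });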

Things to do:

  • crawl multiple Facebook groups