@unRob
Created June 15, 2014 18:06
Cool doesn't know how to write crawlers for the RAE
zoquete
sagaz
inconsistente
#!/usr/bin/env ruby
# encoding: utf-8
# usage: ./lemas_rae path/to/word.list
require 'typhoeus'
require 'nokogiri'
file = ARGV[0]
begin
  lista = File.read(file).split("\n")
rescue StandardError => e
  # Covers a missing argument (TypeError) and an unreadable file (SystemCallError).
  $stderr << "I need a file with words to process\r\n"
  $stderr << "#{e.message}\n"
  exit 1
end
def url_para_lema(lema)
  "http://buscon.rae.es/drae/srv/search?type=3&val=#{lema}&val_aux=&origen=REDRAE"
end
# Returns the text of every `.b` node in the response, joined with '|'.
def parse_lema(body)
  dom = Nokogiri::HTML(body)
  dom.css('.b').map { |d| d.text.strip }.join('|')
end
$hydra = Typhoeus::Hydra.new(max_concurrency: 100)
$h = { 'Content-Type' => 'application/x-www-form-urlencoded' }
# Form fields posted with every request, serialized as key=value pairs joined with '&'.
$body = {
  'TS014dfc77_id' => 3,
  'TS014dfc77_cr' => 'cf5aba3b3fbec6a7c3dac7226fb1159e:zvwz:cTr5TNpP:1926928421',
  'TS014dfc77_76' => 0,
  'TS014dfc77_md' => 1,
  'TS014dfc77_rf' => 0,
  'TS014dfc77_ct' => 0,
  'TS014dfc77_pd' => 0
}.map { |k, v| "#{k}=#{v}" }.join('&')
# Queue one POST per lemma; Hydra runs them concurrently once `run` is called.
lista.each do |lema|
  url = url_para_lema(lema)
  req = Typhoeus::Request.new(url, method: :post, body: $body, timeout: 60, headers: $h)
  req.on_complete do |res|
    if res.success?
      $stdout << "#{lema}: #{parse_lema(res.body)}\r\n"
    else
      $stderr << "ERROR: #{lema}: #{res.code}\n"
    end
  end
  $hydra.queue req
end
$hydra.run
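
The script interpolates each lemma into the query string as-is, which is fine for plain ASCII words such as the three in the sample list but may not be for accented ones. A minimal sketch of one way to percent-encode the query using only the standard library follows. The helper name `url_para_lema_escaped` is hypothetical and not part of the gist, and whether the RAE endpoint expects UTF-8 percent-encoding is an assumption that has not been verified here.

```ruby
require 'uri'

# Hypothetical variant of url_para_lema: same parameters, but built with
# URI.encode_www_form so that non-ASCII lemmas are percent-encoded as UTF-8.
# Assumption: the server accepts UTF-8 encoded query values.
def url_para_lema_escaped(lema)
  query = URI.encode_www_form(type: 3, val: lema, val_aux: '', origen: 'REDRAE')
  "http://buscon.rae.es/drae/srv/search?#{query}"
end

puts url_para_lema_escaped('sagaz')
puts url_para_lema_escaped('año')
```

For ASCII lemmas this produces the same URL as the original `url_para_lema`, so it could be swapped in without affecting the sample word list.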