@en45masao
Created January 16, 2011 05:06
A script that downloads the questions of a Yahoo! Japan Internet Drill and saves them as a CSV file
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use WWW::Mechanize;
use Text::CSV_XS;
use Web::Scraper;
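# Web::Scraper filter: replace WAVE DASH (U+301C) and MINUS SIGN (U+2212) with
# FULLWIDTH TILDE (U+FF5E) and FULLWIDTH HYPHEN-MINUS (U+FF0D), the code points
# CP932 uses for these characters, so the text can be encoded on output.
sub remap {
    tr/\x{301c}\x{2212}/\x{ff5e}\x{ff0d}/;
    return $_;
}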
my $url = $ARGV[0];
die "Invalid URL" unless defined $url && $url =~ m{^http://stepup\.yahoo\.co\.jp/drill/drill\.html\?};
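# Extract the drill ID ("di" query parameter); it is used in the output file name.
my ($di) = $url =~ /[?&]di=([0-9]+)/
    or die "URL has no di parameter\n";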
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);
die "Failed to fetch $url (HTTP " . $mech->status . ")\n" unless $mech->success();
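# Scrape the drill title and the link to follow next from the drill's top page.
my $scrap_tmp = scraper {
    process '//div[@id="drilltop-entitle"]/h1/cite', 'title' => 'TEXT';
    process '//div[@id="kiroku-watchwrap"]/ul/li/a[1]', 'href' => '@href';
}->scrape($mech->content, $mech->uri);
die "Could not find the link to follow on the top page\n" unless defined $scrap_tmp->{href};
$mech->get($scrap_tmp->{href});
die "Failed to fetch $scrap_tmp->{href} (HTTP " . $mech->status . ")\n" unless $mech->success();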
open my $file, '>:encoding(cp932)', "yahoodrill_${di}.csv" or die $!; # force output in CP932
print $file "$scrap_tmp->{title}\n";
my $scraper = scraper {
    process '//div[@id="question-contents-area"]/div/p[1]', 'question' => ['TEXT', \&remap];
    process '//div[@id="question-contents-area"]/div/ul[1]/li', 'choices[]' => ['TEXT', \&remap];
    process '//dd[@id="kaito-correct"]', 'answer' => ['TEXT', \&remap];
    process '//div[@id="answer-contents-area"]/p[@id="commentary-txt"]', 'description' => ['TEXT', \&remap];
};
my $csv = Text::CSV_XS->new({binary => 1});
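for (my $number = 1;; $number++) {
    sleep 1; # pause between requests
    unless ($mech->follow_link(text => "問題No.$number", url_regex => qr/answer\.html/)) {
        # The question is not on this list page: follow the "次へ" (next) link
        # and retry the same number. "redo" is used because "next" would run
        # $number++ and skip this question.
        if ($mech->follow_link(text => "次へ", url_regex => qr/list\.html/)) {
            redo;
        } else {
            last;
        }
    }
    print "Now scraping No.$number...\n";
    my $scrap = $scraper->scrape($mech->content, $mech->uri);
    # One CSV row per question: question, choices..., answer, explanation.
    $csv->combine($scrap->{question}, @{$scrap->{choices} || []}, $scrap->{answer}, $scrap->{description})
        or die "Cannot build CSV row for No.$number\n";
    print $file $csv->string() . "\n";
    $mech->back();
}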
close $file;

=encoding utf8

=head1 NAME

makecsv_yahoodrill - A script to download a "Yahoo! Japan Internet Drill" as a CSV file.

=head1 SYNOPSIS

    makecsv_yahoodrill <URL>

=head1 EXAMPLES

    makecsv_yahoodrill "http://stepup.yahoo.co.jp/drill/drill.html?co=23&di=23004&gi=03"

=head1 DESCRIPTION

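This script exports the questions of a Yahoo! Japan Internet Drill (L<http://stepup.yahoo.co.jp/drill/>) to a CSV file.

Pass the URL of an individual drill's top page as the argument. This is the page with the 「問題にチャレンジする」 ("Try the questions") and 「問題一覧を見る」 ("View the question list") buttons.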
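
The following is only a sketch of how the generated file might be read back; it is not part of this script. It assumes the layout this script writes (a title line, then one CSV record per question holding the question, the choices, the answer and the explanation, all encoded in CP932). The file name yahoodrill_23004.csv is hypothetical, derived from the di=23004 parameter of the example URL.

    use strict;
    use warnings;
    use Text::CSV_XS;

    # Hypothetical file name: "yahoodrill_<di>.csv" for di=23004.
    my $path = 'yahoodrill_23004.csv';
    open my $in, '<:encoding(cp932)', $path or die "Cannot open $path: $!";
    binmode STDOUT, ':encoding(UTF-8)';

    # The first line is the drill title, not a CSV record.
    my $title = <$in>;
    die "$path is empty\n" unless defined $title;
    chomp $title;
    print "Title: $title\n";

    my $csv = Text::CSV_XS->new({ binary => 1 });
    while (my $row = $csv->getline($in)) {
        # Assumed field order: question, choices..., answer, explanation.
        my ($question, @rest) = @$row;
        my $description = pop @rest;
        my $answer      = pop @rest;
        print "Q: $question\n";
        print "   choice: $_\n" for @rest;
        print "A: $answer\n";
        print "   note: $description\n";
    }
    close $in;

The number of choices is not fixed, so the sketch takes the question from the front of each record and the answer and explanation from the back, treating whatever remains as choices.
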

=head1 AUTHOR

en45masao