Last active
January 25, 2023 19:12
Extract tweets from Togetter pages saved from a browser and format them as a tab-separated file
#!/usr/bin/ruby
#
# Copyright 2023 zunda <zundan at gmail.com>
#
# Permission is granted for use, copying, modification, distribution,
# and distribution of modified versions of this work as long as the
# above copyright notice is included.
#
require 'time'

Dir.glob("togetter-*.html") do |src|
  # Each tweet sits in its own list_box div; data-index is its position on the page.
  File.read(src).scan(%r|<div class="list_box type_tweet impl_profile" data-index="(\d+)">(.*?)</div>|m).each do |i, entry|
    idx = Integer(i)
    username = entry.scan(%r|<span class="status_name">@(.*?)</span>|).flatten.first
    # data-timestamp is treated as seconds since the Unix epoch.
    time = Time.at(Integer(entry.scan(%r|<a class="link" .* data-timestamp="(\d+)"|).flatten.first))
    text = entry.scan(%r|<p class="tweet">(.*?)</p>|m).flatten.first.chomp
    # Replace emoji images with their alt text, then drop remaining paired tags and their contents.
    text.gsub!(%r|<img draggable="false" class="emoji" alt="(.*?)".*?>|, '\1')
    text.gsub!(%r|<(\w+).*?>.*?</\1>|m, "")
    # Decode the angle-bracket entities left in the tweet text.
    text.gsub!(/&gt;/, ">")
    text.gsub!(/&lt;/, "<")
    # Turn &nbsp; (entity name assumed), tabs, and spaces into plain spaces.
    text.gsub!(/&nbsp;|\t| /, " ")
    text.gsub!(/\s+/, " ")
    puts [idx, time.utc.iso8601, username, text].join("\t")
  end
end
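
The text clean-up steps in the script can be tried on their own. The following is a minimal sketch, not part of the original gist: the helper name `clean_tweet_html`, the `ENTITIES` table, and the sample string are hypothetical, and the single-pass entity table (including `&amp;`, `&quot;`, and `&nbsp;`) is a swapped-in alternative to the script's per-entity `gsub!` calls. Entities are decoded only after tags are removed, so that a decoded `<` cannot be mistaken for markup.

```ruby
# Hypothetical entity table; the original script decodes entities one gsub! at a time.
ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&amp;" => "&", "&quot;" => '"', "&nbsp;" => " "
}.freeze

# Hypothetical helper mirroring the script's clean-up of one tweet body.
def clean_tweet_html(html)
  text = html.chomp
  # Replace emoji <img> tags with their alt text.
  text = text.gsub(%r|<img draggable="false" class="emoji" alt="(.*?)".*?>|, '\1')
  # Drop remaining paired tags together with their contents, as the script does.
  text = text.gsub(%r|<(\w+).*?>.*?</\1>|m, "")
  # Decode entities in a single pass so nothing is decoded twice.
  text = text.gsub(/&(?:lt|gt|amp|quot|nbsp);/, ENTITIES)
  # Collapse runs of whitespace, including no-break and ideographic spaces.
  text.gsub(/[\s\u00A0\u3000]+/, " ")
end

# Made-up sample input, not taken from a real Togetter page.
sample = %Q|Hello&nbsp;&lt;world&gt; <a href="https://example.com">link</a><img draggable="false" class="emoji" alt="🎉" src="x.png">\n  done|
puts clean_tweet_html(sample)
# prints: Hello <world> 🎉 done
```

The script itself writes one tab-separated line per tweet to standard output, so redirecting that output to a file should give the tab-separated file described at the top.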