@cabo
Created September 5, 2021 15:50
Get recent minutes for an IETF WG as a big Wurst. Useful for mining and searching. Mining PDF not yet implemented.
#!/usr/bin/env ruby
Encoding.default_external = "UTF-8" # wake up, smell the coffee
require 'kramdown'
# FIX THIS dirname:
# You need to be in the root of the proceedings directory as obtained via rsync,
# e.g. in my case via: rsync -av rsync.ietf.org::proceedings ~/std/proceedings
# Warning: ~ 50 GB!
Dir.chdir(File.expand_path("~/std/proceedings"))
# USAGE: minutes-wurst wgname > wgname.out
# Warning: Sloshing through the proceedings soup takes a while -- 10 seconds for me!
wgname = ARGV.shift
fn = Dir["**/minutes-*-#{wgname}-*"]
fnh = {}
fn.sort.each do |f|
  case f
  when /(\A.*)-\d\d(\.[a-z]+)$/
    fnh[$1 + $2] = f
  else
    fnh[f] = f
  end
end
fn = fnh.values
al = fn.map do |f|
  case f
  when /\A(\d+)\//
    # plenary meeting: three per year, counted from IETF 47 (March 2000)
    n = $1.to_i - 47
    y = 2000 + n/3
    m = n%3*4 + 3
    tag = "%04d-%02d-%02d" % [y, m, 15]
    contents = f
  when /\Ainterim-(\d+)-#{wgname}-(\d+)\/.*(\d\d\d\d)(\d\d)(\d\d)(\d\d\d\d)-/
    # interim meeting: date and start time come from the file name
    y = $3.to_i
    fail f if y != $1.to_i
    m = $4.to_i
    d = $5.to_i
    h = $6
    tag = "%04d-%02d-%02d" % [y, m, d]
    contents = h + " @@ " + f
  else
    p [:IGNORING, f]
  end
  case f
  when /\.txt$/, /\.md$/
    contents = File.read(f).scrub.gsub(/\r\n?/, "\n")
  when /\.html$/
    kd = Kramdown::Document.new(File.read(f), input: 'html')
    kd.to_remove_html_tags # modifies the element tree in place
    contents = "<!-- @@@ #{f} -->\n" + kd.to_kramdown
  else
    p [:HUH, f]
  end
  [tag, f, contents] if tag
end.compact
# p al
# p al.map {_1[0]}
als = al.sort_by {_1[0]}
p als.map {_1[0]}
ct = Hash.new(0)
als.each { ct[_1[0]] += 1}
p ct.select {|k, v| v != 1}
p als.size
sep = "\n# @@@\n"
puts sep
puts als.map { |tag, f, contents|
  "# #{tag} -- #{f}\n\n" << contents
}.join(sep)
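
The file-name handling in the script above can be tried out in isolation. The following is a minimal sketch only: the helper names (`latest_revisions`, `plenary_tag`, `interim_tag`) are hypothetical and do not appear in the script, which does this work inline, and the sample paths in the usage note are made up for illustration.

```ruby
# Hypothetical helpers mirroring the inline logic of the script above.

# Keep only the last revision of each file: "foo-00.md" and "foo-01.md"
# share the key "foo.md", and the later one (in sort order) wins.
def latest_revisions(paths)
  latest = {}
  paths.sort.each do |path|
    md = path.match(/(\A.*)-\d\d(\.[a-z]+)$/)
    key = md ? md[1] + md[2] : path
    latest[key] = path
  end
  latest.values
end

# Approximate date tag for a plenary meeting, assuming three meetings
# a year (months 3, 7, 11) counted from IETF 47 in March 2000.
# The day is fixed at 15, as in the script.
def plenary_tag(meeting_number)
  n = meeting_number - 47
  "%04d-%02d-%02d" % [2000 + n / 3, n % 3 * 4 + 3, 15]
end

# Date tag and start time from the 12-digit timestamp (YYYYMMDDhhmm)
# in an interim minutes file name; nil if there is no such timestamp.
def interim_tag(path)
  md = path.match(/(\d{4})(\d{2})(\d{2})(\d{4})-/)
  return nil unless md
  date = "%04d-%02d-%02d" % [md[1].to_i, md[2].to_i, md[3].to_i]
  [date, md[4]]
end
```

For example, `plenary_tag(111)` gives `"2021-07-15"`, and `interim_tag` applied to a path such as `interim-2021-core-05/minutes/minutes-interim-2021-core-05-202104281600-00.md` gives `["2021-04-28", "1600"]`.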