@kyamaguchi
Last active November 20, 2015 01:53
Extract highlights and notes which are exported from Good Reader app
#!/usr/bin/env ruby
raise "Set input file. $ ruby #{__FILE__} input.txt" if ARGV.empty?
PAGE_SEPARATOR = %r{--- Page (\S+) ---}
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}
lines = File.readlines(ARGV[0])
annotations = []
lines.map(&:chomp)
  .reject{|l| l == "" }
  .slice_before(PAGE_SEPARATOR)
  .select{|page| page.any?{|line| line =~ PAGE_SEPARATOR} }
  .each do |page|
    page_no = page.shift.match(PAGE_SEPARATOR)[1]
    page.slice_before(ANNOTATION_SEPARATOR).each do |annotation|
      info = annotation.shift.match(ANNOTATION_SEPARATOR)
      # text = annotation.map{|h| h.gsub(/[[:space:]]/, '') }.join(' ') # For Japanese
      text = annotation.join(' ') # For English
      annotations << {
        type: info[:type],
        page_no: page_no,
        color: info[:color],
        time: info[:time],
        owner: info[:owner],
        text: text,
      }
    end
  end
### Output
# Change the following part as you like
## --- output as hash
# annotations.each do |a|
#   puts a.inspect
# end
## --- group by type
# annotations.group_by{|a| a[:type] }.each do |type, group|
#   puts "[#{type}]"
#   group.each do |a|
#     puts "#{a[:text]} (p#{a[:page_no]})"
#   end
#   puts
# end
## --- group by color
annotations.group_by{|a| a[:type] }.each do |type, group|
  puts "[#{type}]"
  group.group_by{|a| a[:color] }.each do |color, subgroup|
    puts "--- #{color} ---"
    subgroup.each do |a|
      puts "#{a[:text]} (p#{a[:page_no]})"
    end
    puts
  end
  puts
end
## --- using rainbow
# require 'rainbow'
# annotations.each do |a|
#   puts Rainbow(a[:text]).background(a[:color].to_sym) + "(p#{a[:page_no]})"
# end
## sample input.txt for testing
=begin
File: refactoring-ja-special-edition_p1_0.pdf
Annotation summary:
--- Page xi ---
Highlight (yellow), 2015/03/05 9:10:
2000 年に発行された『リファクタリング プログラミングの体質改善テクニック』
--- Page xix ---
Highlight (yellow), 2015/03/05 9:10:
リファクタリングの父は 2 人います。Ward Cunningham と Kent Beck です。
Highlight (yellow), 2015/03/05 9:10:
John Brant と Don Roberts は単に論文を書くのに止まらず、ツールの作成まで行いました。それが 「Refactoring Browser」 、すなわちリファクタリングを行うための Smalltalk のブラウザです。
--- Page 12 ---
Highlight (yellow), 2015/03/05 9:10:
変更がほんの少しであれば、それによって生じるエラーを見つけるのは簡単 なことです。
--- Page 66 ---
Highlight (red), 2015/03/06 14:56:
決してリファクタリングをしてはいけない場合もあります。第 1 の例は、変更するよりも最 初からの書き直した方が早いという場合です。
Highlight (yellow), 2015/03/06 14:56:
リファクタリングを避けるべき第 2 の例として、 期間が迫っている場合があります。 こうした状況では、リファクタリングをしても生産性の向上が見られるのは締め切り後であり、 時すでに遅しということになってしまいます。
Highlight (blue), 2015/03/06 14:56:
時間が足りなくなるというのは、たいて いの場合、リファクタリングが必要であることを示唆しているのです。
Note (yellow), 2015/03/06 14:56:
あいうえお
=end
@traveller22

Forgive me for these total Ruby newbie questions, but what you have here is the solution to the annotation extraction problem I have been struggling with for the last few weeks!

FYI, I first had to learn how to get the Ruby file running and where to put the input.txt file. Once I finally managed to run the .rb file in a Windows command prompt, I ran it exactly as you provided it.

It came back with the error:

C:\Users\Jochem\Desktop\Ruby test>ruby "C:\Users\Jochem\Desktop\Ruby test\Goodreader test.rb" input.txt
C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:21:in `block (2 levels) in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `<<'
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `each'
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `each'
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `block in <main>'
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:14:in `each'
from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:14:in `<main>'

Can you see or guess what I am doing wrong? I study a lot and have loads of files in GoodReader from which I would like to extract the annotations.

Any help would be greatly appreciated. Thanks for posting this in the first place!

Kind regards,

Traveller

@kyamaguchi
Author

In general, you can investigate an error like this by adding debug output.

For instance, add puts info.inspect or puts annotation.inspect before line 18, then run the program again.

The error says that info is nil around lines 18-22.
This probably means the regular expression ANNOTATION_SEPARATOR doesn't match the text in your input.txt. (match returns nil when nothing matches, so info[:type] then fails.)
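
To make the failure mode concrete, here is a minimal sketch of what happens when a header line does not match. The describe helper is hypothetical (it is not part of the gist), and the second header uses the time format that shows up later in this thread.

```ruby
# Strict pattern quoted in this thread: the time must look like "2015/03/05 9:10".
STRICT = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}

# Hypothetical helper (not in the gist): summarise a header line, or report
# that it did not match instead of calling [] on nil.
def describe(header, pattern)
  info = header.match(pattern) # MatchData on success, nil on failure
  return "no match: #{header.inspect}" if info.nil?

  "#{info[:type]} (#{info[:color]}) at #{info[:time]}"
end

puts describe("Highlight (yellow), 2015/03/05 9:10:", STRICT)
puts describe("Highlight (yellow), 14 mrt. 2015 14:17:", STRICT)
```

Without the nil check, the second call would raise the same NoMethodError as in the report above, because info[:type] would be called on nil.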

I don't know why this happens, but I suspect something differs with Ruby on Windows. (I use a Mac.)

One way to fix the error is to change the regular expression.
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:} is very strict.
You can loosen the expression by deleting parts from the end.

For example, change it to ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})} and try again. (This removes the last part. If you still get the error, you could also remove the next part from the end.)
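
One way to try this systematically is to test progressively shorter patterns against a single header line and see which one matches first. This is only a sketch: the labels, the CANDIDATES hash, and the first_matching helper are made up for illustration, and the sample header is the one quoted later in this thread.

```ruby
HEADER = "Highlight (yellow), 14 mrt. 2015 14:17:"

# Patterns ordered from strictest to loosest; each drops one more part from the end.
CANDIDATES = {
  "full"          => %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:},
  "without owner" => %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})},
  "without time"  => %r{(?<type>Highlight|Note) \((?<color>[^)]+)\)},
}

# Hypothetical helper: returns the label of the first pattern that matches,
# or nil if none of them do.
def first_matching(header, candidates)
  candidates.find { |_label, pattern| header.match(pattern) }&.first
end

puts first_matching(HEADER, CANDIDATES)
```

If only the loosest pattern matches, the part that was removed last (here, the time format) is the one that disagrees with the input.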

I could take a look if you send me the input.txt by email.

@kyamaguchi
Author

Try ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}.

The difference is (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2}) -> (?<time>.*).
The time format in your file is different. (Your header looks like "Highlight (yellow), 14 mrt. 2015 14:17:".)
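
As a quick check, the sketch below runs the relaxed pattern against both time formats seen in this thread. The parse_header helper is hypothetical, and the third header, with a trailing owner name, is an invented example rather than real GoodReader output.

```ruby
RELAXED = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}

# Hypothetical helper: returns the captured fields as a Hash,
# or nil when the line is not an annotation header.
def parse_header(header)
  info = header.match(RELAXED)
  return nil if info.nil?

  { type: info[:type], color: info[:color], time: info[:time], owner: info[:owner] }
end

[
  "Highlight (yellow), 2015/03/05 9:10:",
  "Highlight (yellow), 14 mrt. 2015 14:17:",
  "Note (red), 2015/03/06 14:56, Alice:", # invented header with an owner
].each do |header|
  puts parse_header(header).inspect
end
```

Both real formats now match. Note that because (?<time>.*) is greedy, anything after the time, such as the invented ", Alice" above, ends up inside time and owner stays nil; that is harmless for the outputs in the gist, which do not print the owner.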

@traveller22

This worked like a charm! Thank you!
