@matthieuheitz
Last active June 12, 2023 12:47
djvu2pdf, a conversion script using ocrodjvu and pdfbeads
#!/bin/bash
# Method found here https://askubuntu.com/a/122604/423332
# Dependencies:
# On Ubuntu, you can install ocrodjvu and pdfbeads with:
# sudo apt install ocrodjvu
# gem install pdfbeads
# The path and filename given can only contain ASCII characters
f=$1
# Get filename
filename=$(basename -- "$f")
extension="${filename##*.}"
file_no_ext="${filename%.*}"
# Count number of pages
echo "f=$f"
p=$(djvused -e n "$f")
echo -e "The document contains $p pages.\n"
# Number of digits
pp=${#p}
echo "###############################"
echo "### Extracting page by page ###"
echo "###############################"
# For each page, extract the text, and the image
for i in $(seq 1 "$p")
do
    # Zero-pad the page number so the temp files sort in page order
    ii=$(printf "%0${pp}d" "$i")
    djvu2hocr -p "$i" "$f" | sed 's/ocrx/ocr/g' > "pg$ii.html"
    ddjvu -format=tiff -page="$i" "$f" "pg$ii.tiff"
done
echo ""
echo "##############################"
echo "### Building the final pdf ###"
echo "##############################"
# Build the final pdf
pdfbeads > "$file_no_ext".pdf
echo ""
echo "Done"
# Remove temp files
echo ""
read -p "Do you want to delete temp files? (pg*.html, pg*.tiff, pg*.bg.jpg) " -n 1 -r
echo # (optional) move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    # -f: do not complain if one of the patterns matches nothing
    rm -f pg*.html pg*.tiff pg*.bg.jpg
fi
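
A small preflight check can make the script fail early rather than midway through a long extraction. The sketch below is not part of the original script: the helper names `missing_tools` and `is_ascii` are made up for illustration, and it assumes only the four commands the script itself calls (djvused, djvu2hocr, ddjvu, pdfbeads) plus the ASCII-only restriction stated in the header comments.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the original gist.

# missing_tools NAME...: print each command name that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# is_ascii STRING: succeed only if STRING consists of printable ASCII.
# Deletes every byte in the range space..tilde; anything left over is
# a non-ASCII (or control) byte.
is_ascii() {
    [ -z "$(printf '%s' "$1" | LC_ALL=C tr -d ' -~')" ]
}

# Example use near the top of the script (f is the input path, as above):
#   missing=$(missing_tools djvused djvu2hocr ddjvu pdfbeads)
#   [ -z "$missing" ] || { echo "Missing: $missing" >&2; exit 1; }
#   is_ascii "$f" || { echo "Path must be ASCII only: $f" >&2; exit 1; }
```

The check only verifies that the commands exist. It does not verify that pdfbeads can actually load its Ruby dependencies.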
@marinnen

marinnen commented Feb 6, 2020

I have a problem installing this script on my Ubuntu 16.04 and wonder if you could help.

Installing the current pdfbeads version, 1.1.1, proved far from trivial. First, it required a newer version of ruby than the default 2.3 available through apt on Ubuntu, so I used snap to install the latest version, ruby 2.7. Next, the dependency rmagick was missing and had to be installed. That failed because it requires a newer C compiler, gcc-8, rather than the current Ubuntu default, gcc-5. To be on the safe side, I installed gcc-6, gcc-7, gcc-8 and gcc-9 side by side so that I could select the system default by running sudo update-alternatives --config gcc. After all that, using gcc-8 (v8.3.0), I was able to run gem install pdfbeads successfully. With ocrodjvu already installed, I was finally ready to run your script.

However, running the script produces the following error in its last step:

Building the final pdf
Traceback (most recent call last):
	4: from ~/.gem/bin/pdfbeads:23:in '<main>'
	3: from ~/.gem/bin/pdfbeads:23:in 'load'
	2: from ~/.gem/gems/pdfbeads-1.1.1/bin/pdfbeads:35:in '<top (required)>'
	1: from /snap/ruby/172/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in 'require'
/snap/ruby/172/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in 'require': 
cannot load such file -- iconv (LoadError)

Line 35 in pdfbeads requires iconv and ruby can't find it. Normally ruby tries to load it from the existing ruby LOAD_PATH and, when that fails, searches the installed gems. Apparently, iconv was not installed in any of the previous steps. The iconv gem is a wrapper class for the UNIX 95 iconv() function family, which translates strings between various encodings. Note, however, that iconv is not listed as a pdfbeads 1.1.1 dependency on its rubygems.org dependencies page. Also note that unpacking the pdfbeads 1.1.1 gem reveals that it was not created using the ruby bundler, as it contains no Gemfile. Therefore, modifying the djvu2pdf.sh script to run pdfbeads with bundle exec is of no help.
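
One quick way to confirm this diagnosis, both before and after attempting gem install iconv, is to ask the same ruby interpreter whether it can load the library. The following is a minimal sketch. The helper name `can_require` is made up here, and it relies only on ruby -e and require.

```bash
#!/bin/bash
# Hypothetical helper, not part of pdfbeads or the gist.

# can_require LIB: exit 0 if the ruby found on PATH can `require` LIB,
# 2 if there is no ruby on PATH, and non-zero otherwise.
can_require() {
    command -v ruby >/dev/null 2>&1 || return 2
    ruby -e 'require ARGV[0]' "$1" >/dev/null 2>&1
}

# Example:
#   can_require iconv && echo "iconv loads" || echo "iconv is not loadable"
```

Because the check runs whichever ruby comes first on PATH, it also shows whether the snap ruby or the system ruby is the one that is missing the gem.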

Here's where I ran into the problem I can't solve. When I try to install the missing gem with gem install iconv, I get numerous C compiler error messages, too long to list here. I tried this with each version of gcc (5, 6, 7, 8 and 9) and none of them succeeds. With gcc 6.5.0 I got only one error message; make fails after this error:

$ cat mkmf.log 
"gcc -o conftest -I/snap/ruby/172/include/ruby-2.7.0/x86_64-linux -I/snap/ruby/172/include/ruby-2.7.0/ruby/backward 
-I/snap/ruby/172/include/ruby-2.7.0 -I. -O3 -ggdb3 -Wall -Wextra -Wdeprecated-declarations -Wduplicated-cond 
-Wimplicit-function-declaration -Wimplicit-int -Wmisleading-indentation -Wpointer-arith -Wwrite-strings 
-Wimplicit-fallthrough=0 -Wmissing-noreturn -Wno-cast-function-type -Wno-constant-logical-operand -Wno-long-long 
-Wno-missing-field-initializers -Wno-overlength-strings -Wno-packed-bitfield-compat -Wno-parentheses-equality 
-Wno-self-assign -Wno-tautological-compare -Wno-unused-parameter -Wno-unused-value -Wsuggest-attribute=format 
-Wsuggest-attribute=noreturn -Wunused-variable  -fPIC conftest.c  -L. -L/snap/ruby/172/lib -Wl,-rpath,/snap
/ruby/172/lib -L. -fstack-protector-strong -rdynamic -Wl,-export-dynamic -Wl,-rpath,/snap/ruby/172/lib 
-L/snap/ruby/172/lib -lruby  -lm   -lc"
gcc: error: unrecognized command line option ‘-Wimplicit-fallthrough=0’; did you mean ‘-Wno-fallthrough’?
checked program was:
/* begin */
1: #include "ruby.h"
2: 
3: int main(int argc, char **argv)
4: {
5:   return !!argv[argc];
6: }
/* end */

The unrecognized command line option ‘-Wimplicit-fallthrough=0’ is set in the configuration file /snap/ruby/169/lib/ruby/2.4.0/x86_64-linux/rbconfig.rb, which lies on a partition that is mounted read-only:

$ grep "[[:space:]]ro[[:space:],]" /proc/mounts | grep ruby
/dev/loop15 /snap/ruby/170 squashfs ro,nodev,relatime 0 0
/dev/loop33 /snap/ruby/169 squashfs ro,nodev,relatime 0 0

With gcc 8.3.0, this first gcc invocation does not cause any error; however, the next one does:

"gcc -o conftest -I/snap/ruby/172/include/ruby-2.7.0/x86_64-linux -I/snap/ruby/172/include/ruby-2.7.0/ruby/backward 
-I/snap/ruby/172/include/ruby-2.7.0 -I.    -O3 -ggdb3 -Wall -Wextra -Wdeprecated-declarations -Wduplicated-cond 
-Wimplicit-function-declaration -Wimplicit-int -Wmisleading-indentation -Wpointer-arith -Wwrite-strings 
-Wimplicit-fallthrough=0 -Wmissing-noreturn -Wno-cast-function-type -Wno-constant-logical-operand -Wno-long-long 
-Wno-missing-field-initializers -Wno-overlength-strings -Wno-packed-bitfield-compat -Wno-parentheses-equality 
-Wno-self-assign -Wno-tautological-compare -Wno-unused-parameter -Wno-unused-value -Wsuggest-attribute=format 
-Wsuggest-attribute=noreturn -Wunused-variable -fPIC conftest.c -L. -L/snap/ruby/172/lib -Wl,-rpath,/snap/ruby/172/lib -L. 
-fstack-protector-strong -rdynamic -Wl,-export-dynamic -Wl,-rpath,/snap/ruby/172/lib -L/snap/ruby/172/lib -lruby -lm -lc"
//snap/core18/current/lib/x86_64-linux-gnu/libcrypt.so.1: undefined reference to '__open_nocancel@GLIBC_PRIVATE'
/snap/ruby/172/lib/libruby.so: undefined reference to 'getrandom@GLIBC_2.25'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '_IO_enable_locks@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '__mmap@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '__munmap@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libdl.so.2: undefined reference to '_dl_catch_error@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '__mprotect@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libcrypt.so.1: undefined reference to '__snprintf@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '__tunable_get_val@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libcrypt.so.1: undefined reference to '__read_nocancel@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/librt.so.1: undefined reference to '__close_nocancel@GLIBC_PRIVATE'
/snap/ruby/172/lib/libruby.so: undefined reference to 'copy_file_range@GLIBC_2.27'
//snap/core18/current/lib/x86_64-linux-gnu/libpthread.so.0: undefined reference to '__sigtimedwait@GLIBC_PRIVATE'
//snap/core18/current/lib/x86_64-linux-gnu/libdl.so.2: undefined reference to '_dl_signal_error@GLIBC_PRIVATE'
/snap/ruby/172/lib/libruby.so: undefined reference to '__explicit_bzero_chk@GLIBC_2.25'
collect2: error: ld returned 1 exit status
checked program was:
/* begin */
 1: #include "ruby.h"
 2: 
 3: #include <ruby/encoding.h>
 4: 
 5: /*top*/
 6: extern int t(void);
 7: int main(int argc, char **argv)
 8: {
 9:   if (argc > 1000000) {
10:     int (* volatile tp)(void)=(int (*)(void))&t;
11:     printf("%d", (*tp)());
12:   }
13: 
14:   return !!argv[argc];
15: }
16: int t(void) { void ((*volatile p)()); p = (void ((*)()))rb_enc_get; return !p; }
/* end */

followed by a whole bunch of other errors, too long to list.

Next, I tried earlier versions of ruby (2.6, 2.5, 2.4). With gcc-8 (as required to compile the RMagick 4.0.0 package), the last two versions produce slightly fewer errors. However, with gcc 6.5.0 I get exactly the same error with ruby 2.4 as with ruby 2.7.

Do you have any suggestions on how to solve this problem?

@matthieuheitz
Author

Hi!

I'm afraid I'm not going to be of much help, because it worked directly on my computer, and I know a lot less about Ruby than you seem to.
I can retry this script on my computer and tell you which version of Ruby I installed.

Maybe you could send a message to @zetah on this page: https://askubuntu.com/questions/46233/converting-djvu-to-pdf/122604#122604, because I wrote this script based on their answer.

@matthieuheitz
Author

matthieuheitz commented Feb 11, 2020

I have:
ocrodjvu v0.9.1-1
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
gem 2.5.2.1
pdfbeads 1.1.1

It seems that I was able to install pdfbeads with ruby 2.3; I didn't need ruby 2.7.
Can you show me the message that prompted you to install ruby 2.7?
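
For comparing environments, a version list like the one above could be collected with a small helper. This is only a sketch. The function names `report` and `print_versions` are made up, and it assumes each tool accepts the usual --version flag (ocrodjvu included) and that gem list prints the installed pdfbeads version.

```bash
#!/bin/bash
# Hypothetical helpers for gathering version information.

# report LABEL CMD [ARGS...]: print "LABEL: <first meaningful output line>",
# or "LABEL: not found" if CMD is not on PATH.
report() {
    local label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        # Skip blank lines and banners such as "*** LOCAL GEMS ***"
        printf '%s: %s\n' "$label" \
            "$("$@" 2>&1 | grep -v -e '^$' -e '^\*\*\*' | head -n 1)"
    else
        printf '%s: not found\n' "$label"
    fi
}

print_versions() {
    report ocrodjvu ocrodjvu --version   # assumed flag
    report gcc gcc --version
    report ruby ruby --version
    report gem gem --version
    report pdfbeads gem list pdfbeads
    report rmagick gem list rmagick
    report iconv gem list iconv
}

# Run with: print_versions
```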

@marinnen

marinnen commented Feb 11, 2020

At this point it would be quite a task to consistently roll back all the software updates I listed and recreate the environment I started with. My environment was quite similar to what you have listed above. However, it was missing rmagick and, if I remember correctly, that was what opened Pandora's box. The only option available with apt was adding the current version, RMagick 4.0.0, which required a newer version of ruby and gcc, and that led me down the garden path I described above. Had I known that would be the case, I would have made an effort to find a way to install an older version. In any case, I'm surprised that the installation of this ruby gem is so sensitive to software updates released after the versions that existed at the time of its release.
I wonder what version of rmagick you get when you run gem list rmagick?
Thanks for the suggestion to reach out to @zetah. However, that does not look promising: according to his profile, he was last active in April 2017, and his last contributions date from 2014.

@AntonIrish

FYI, the script worked almost perfectly[1] on my Ubuntu 18.04 box with the following package installation commands.

$ sudo apt-get install ruby-dev ruby-rmagick ocrodjvu
$ sudo gem install pdfbeads iconv
$ gem list rmagick iconv pdfbeads
rmagick (2.16.0)
iconv (1.0.8)
pdfbeads (1.1.1)

[1] The output pdf contains "W: `require 'RMagick'` is deprecated, please change to `require 'rmagick'`" at the beginning because pdfbeads contains require 'RMagick'. tail -n +2 output.pdf > fixed.pdf is necessary to delete the line.
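
A slightly more defensive variant of the tail workaround in [1] drops the first line only when the file does not already begin with the PDF magic bytes, so it is safe to run on an output that has no warning in it. This is a sketch. The function name `strip_leading_warning` is made up, and it assumes the warning occupies exactly one line at the very start of the file, as described above.

```bash
#!/bin/bash
# Hypothetical helper, not part of the gist or of pdfbeads.

# strip_leading_warning IN OUT: copy IN to OUT, removing the first line
# unless IN already starts with "%PDF".
strip_leading_warning() {
    if [ "$(head -c 4 "$1")" = "%PDF" ]; then
        cp -- "$1" "$2"
    else
        tail -n +2 -- "$1" > "$2"
    fi
}

# Example:
#   strip_leading_warning output.pdf fixed.pdf
```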

@matthieuheitz
Author

I get:

$ gem list rmagick
*** LOCAL GEMS ***
rmagick (4.0.0)

@rbrito

rbrito commented May 15, 2020

I packaged pdfbeads (with patches) to work on Debian without warnings (including the RMagick vs. rmagick thing) and with all the dependencies set to be pulled in. It should work on a sufficiently new Ubuntu version (I don't know how new, since I don't follow Ubuntu releases that closely). That being said, if I introduce that package in Debian, then getting it to work on Ubuntu should be relatively simple.

I can TRY TO provide a precompiled version of it on a PPA that I have (where I have other tools that I find useful).

In the meantime, the unfinished (but working) package is at: https://github.com/rbrito/pkg-pdfbeads

It works very well for me and I will try this script to see how well things go when we mix everything together.

@davidlieberman

Used it just now and everything worked perfectly. The only quirk was that I had to roll back RubyGems with gem update --system 3.0.8 to get rmagick to install properly and stop complaining that the constant Gem::ConfigMap is deprecated (issue and fix discussed here).

> gem list rmagick iconv pdfbeads

*** LOCAL GEMS ***
rmagick (4.2.5, 2.16.0)
iconv (1.0.8)
pdfbeads (1.1.3)

Thanks so much for this helpful script!!
