@parezcoydigo
Created December 12, 2012 00:22
Extract text layer from a pdf, clean it up with pandoc.

From Vim's command-line mode, entering this line gives me just the text of the PDF, cleaned up by pandoc, in a new unnamed buffer:

:read !pdf2txt <filename> | pandoc -t markdown

When I try to translate this into a line in .vimrc, I instead get a text version of the PDF's headers and metadata, with my new text appended to it:

au *.pdf read !pdf2txt % | pandoc -t markdown

If you just open a PDF in Vim, you get a text rendering of the binary, including the full text if the file has a text layer. Here's a snippet:

%PDF-1.4
%âãÏÓ
215 0 obj
<</Metadata 213 0 R/Names 217 0 R/OpenAction 216 0 R/Pages 209 0 R/Type/Catalog>>
endobj
213 0 obj
<</Length 1486/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.389687, 2009/06/02-13:20:35        ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

etc.


Using the autocmd, you'd get this:

%PDF-1.4
%âãÏÓ
The Potosı´ principle: religious prosociality fosters self-organization
of larger communities under extreme natural and economic conditions
................................................................................................
............................................................ Juan Luis
Sua´ rez and Shiddarta Va´ squez The CulturePlex Lab, Faculty of Arts
and Humanities, University of Western Ontario, 1151 Richmond Street,
London, ON, Canada N6A 3K7

Here, the new text has been appended to the PDF's contents in the buffer rather than replacing them.

wcaleb commented Dec 12, 2012

The problem may be with your autocmd syntax. You don't have a "trigger" event after autocmd.

http://vimdoc.sourceforge.net/htmldoc/autocmd.html#autocmd-events
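
For illustration, the general shape of an autocmd is `autocmd {event} {pattern} {command}`, so the event name comes before the `*.pdf` pattern. Below is a minimal sketch that only prints a message when a PDF is opened. The group name `pdf_event_example` is hypothetical, and `BufReadPost` is just one event that might suit this case.

```vim
" Sketch only: shows where the trigger event goes in an autocmd.
" The group name pdf_event_example is made up for this example.
augroup pdf_event_example
  " Clear earlier definitions so re-sourcing .vimrc does not stack duplicates.
  autocmd!
  " {event}=BufReadPost  {pattern}=*.pdf  {command}=echomsg ...
  autocmd BufReadPost *.pdf echomsg 'opened a PDF: ' . expand('<afile>')
augroup END
```

Once a message like this shows up for PDFs, the `echomsg` command can be swapped for the conversion command.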

@daveworth

After much futzing, and knowing why I never mess with my own Vim configs, I came to this:

au BufReadPost,FileReadPost *.pdf %!pdf2txt | pandoc -t markdown

I wish it didn't pause to mention the filtering but it seems closer.
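
One possible way around the pause is to prefix the filter with `:silent`, which should suppress the "lines filtered" message behind the prompt. The sketch below is untested against this exact setup and makes some assumptions: `pdf2txt` accepts the PDF path as an argument (as in the original `:read` line), Vim is new enough to support the `:S` filename modifier, and the group name `pdf_text` is hypothetical. It passes the file path to `pdf2txt` instead of piping the buffer contents through it, since the buffer may no longer match the binary file byte for byte.

```vim
" Sketch only: replace the raw PDF in the buffer with pandoc-cleaned text.
" Assumes pdf2txt and pandoc are on $PATH; pdf_text is a made-up group name.
augroup pdf_text
  autocmd!
  " %!  filters the whole buffer, replacing it with the command's output.
  " %:p:S expands to the shell-escaped full path of the current file.
  " The | here is passed to the shell as a pipe, not treated as a Vim separator.
  autocmd BufReadPost *.pdf silent %!pdf2txt %:p:S | pandoc -t markdown
  " Guard against accidentally writing the text back over the PDF.
  autocmd BufReadPost *.pdf setlocal readonly
augroup END
```

If the screen looks garbled after the external command runs, running `:redraw!` afterwards may help.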
