@IllDepence
Created October 7, 2018 22:22
existing software
  • plastex — largish, last commit 1y ago, python
    • in addition to requirements.txt also $ pip install Pillow Unidecode for less noisy output
    • TEST: neat output when it works, but failed on all tested arXiv papers and its own documentation
  • TexSoup — smallish, recent commits, python
    • doesn't handle definitions (\def) source
    • stackoverflow question specifically asking about arXiv TeX (by someone using TexSoup)
    • TEST: works on most tested; fails with e.g. 1607.00138
    • find citations
      from TexSoup import TexSoup
      soup = TexSoup(tex_str)  # tex_str: LaTeX source as a string
      c = soup.find_all(name='cite')
      
    • output text
      soup = TexSoup(tex_str)
      for cntnt in soup.document.contents:
          if isinstance(cntnt, str):  # print plain-text nodes only
              print(cntnt)
      
  • grabcite
    • installation
      • required additional packages libpcre++-dev, libpq-dev, libghc-hdbc-odbc-dev (+500 MB of dependencies)
    • unpack arXiv dump
      • file given for param --arxiv-meta-xml has to be in cwd (giving a path results in grabcite-datagen: InvalidRelFile "...")
      • arXiv: "Note: Many of the formats above are served gzipped (Content-Encoding: x-gzip). Your browser may silently uncompress after downloading so the files you see saved may appear uncompressed."
      • grabcite:
        • expects arXiv sources with .gz extension in input folder
        • has gzHandler and tarGzHandler (see src/GrabCite/Arxiv.hs)
        • → arXiv sources manually downloaded are mere tar archives w/o file extension → need to be gzipped and renamed to have .gz file extension (not .tar.gz)
    • generate data set
      • tries to connect to papergrep.com (domain registration at namecheap.com has expired; it probably once hosted the service)
  • opendetex — ?, recent commits, compiled
    • specifically for getting plain text
    • TEST: seems to leave in more control sequences than the system's detex
  • LaTeXML — very mature, recent commits, perl
  • Tralics — ?, 2015, C++
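
The grabcite notes above say that manually downloaded arXiv sources may arrive as bare tar archives without a file extension, while grabcite expects files ending in .gz. A minimal sketch of that preparation step follows, assuming all downloads sit in one flat folder. The function names and the input/output folder layout are hypothetical and not part of grabcite. The check for already gzipped files relies on the gzip magic bytes 0x1f 0x8b.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with path.open("rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def prepare_for_grabcite(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Write every file in src_dir to dst_dir as '<name>.gz' (hypothetical helper).

    Files that are already gzip streams are copied unchanged; anything else
    (e.g. a bare tar archive) is gzip-compressed first.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(p for p in src_dir.iterdir() if p.is_file()):
        # arXiv IDs such as 1607.00138 contain a dot, so only a literal
        # '.gz' suffix counts as "already has the right extension".
        name = src.name if src.suffix == ".gz" else src.name + ".gz"
        dst = dst_dir / name
        if is_gzipped(src):
            shutil.copyfile(src, dst)
        else:
            with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

Calling `prepare_for_grabcite(Path("downloads"), Path("grabcite_input"))` would leave the originals untouched and fill the second folder with .gz files. Whether grabcite then picks its gzHandler or tarGzHandler for each file has not been verified here.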