- plastex — largish, last commit 1y ago, python
  - in addition to `requirements.txt`, also run `$ pip install Pillow Unidecode` for less noisy output
  - TEST: neat output when it works, but failed on all tested arXiv papers and its own documentation
- TexSoup — smallish, recent commits, python
  - doesn't handle definitions (`\def`) (source)
  - stackoverflow question specifically asking about arXiv TeX (by someone using TexSoup)
  - TEST: works on most tested papers; fails with e.g. `1607.00138`
  - find citations

    ```python
    from TexSoup import TexSoup

    soup = TexSoup(tex_str)  # tex_str: LaTeX source as a string
    c = soup.find_all(name='cite')
    ```
  - output text

    ```python
    from TexSoup import TexSoup

    soup = TexSoup(tex_str)
    # print only the plain-text children of the document environment
    for cntnt in soup.document.contents:
        if isinstance(cntnt, str):  # also matches str subclasses
            print(cntnt)
    ```
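TexSoup is noted above as failing on some papers (e.g. `1607.00138`). For those cases a crude regex pass can still pull citation keys. The sketch below is a stdlib-only fallback and not part of TexSoup; the name `find_cite_keys` and the pattern are assumptions made for illustration, and it ignores TeX comments and nested braces.

```python
import re

# Matches \cite, \citep, \citet, ... with optional [..] arguments.
# Assumption: citation keys contain no closing brace.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?\s*(?:\[[^\]]*\]\s*)*\{([^}]*)\}")


def find_cite_keys(tex_str):
    """Return citation keys in order of first appearance, without duplicates."""
    keys = []
    for match in CITE_RE.finditer(tex_str):
        for key in match.group(1).split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys
```

This only recovers the keys, not a parse tree, so it is a stopgap for papers the parser rejects rather than a replacement for it.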
- grabcite
  - installation
    - required additional packages: `libpcre++-dev`, `libpq-dev`, `libghc-hdbc-odbc-dev` (+500 MB of dependencies)
  - unpack arXiv dump
    - file given for param `--arxiv-meta-xml` has to be in cwd (giving a path results in `grabcite-datagen: InvalidRelFile "..."`)
    - arXiv: "Note: Many of the formats above are served gzipped (Content-Encoding: x-gzip). Your browser may silently uncompress after downloading so the files you see saved may appear uncompressed."
    - grabcite:
      - expects arXiv sources with `.gz` extension in input folder
      - has `gzHandler` and `tarGzHandler` (see `src/GrabCite/Arxiv.hs`)
      - → arXiv sources manually downloaded are mere `tar` archives w/o file extension → need to be gzipped and renamed to have `.gz` file extension (not `.tar.gz`)
  - generate data set
    - tries to connect to papergrep.com (now an expired namecheap.com registration; the domain probably once hosted this)
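The gzip-and-rename step for manually downloaded arXiv sources can be scripted. The following is a minimal sketch using only the Python standard library; the function name `gzip_plain_tars` and the choice to leave the original files in place are assumptions, and whether grabcite then accepts the results rests on the notes above rather than on any further testing.

```python
import gzip
import shutil
import tarfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Check the magic bytes instead of trusting the file name."""
    with path.open("rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def gzip_plain_tars(input_dir):
    """Write <name>.gz next to every uncompressed tar archive in input_dir.

    arXiv IDs such as 1607.00138 contain a dot, so files are selected by
    content rather than by a missing file extension.
    """
    created = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file() or src.name.endswith(".gz"):
            continue
        if is_gzipped(src) or not tarfile.is_tarfile(str(src)):
            continue
        dst = src.with_name(src.name + ".gz")  # ".gz", not ".tar.gz"
        if dst.exists():
            continue
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        created.append(dst)
    return created
```

Calling `gzip_plain_tars("sources/")` (a hypothetical folder name) returns the list of newly written `.gz` paths; running it a second time writes nothing, since existing targets are skipped.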
- opendetex — ?, recent commits, compiled
  - specifically for getting plain text
  - TEST: seems to leave in more control sequences than the system's `detex`
- LaTeXML — very mature, recent commits, perl
  - LaTeX→XML
  - existing (dead?) project on arXiv data (active LaTeXML fork)
- Tralics — ?, 2015, C++
  - LaTeX→XML
  - apparently fast
Created October 7, 2018 22:22