Thibault Clérice PonteIneptique

import json
from collections import defaultdict
import os
from MyCapytain.common.utils import xmlparser
# Logging related dependencies
import logging
import time
import math
# Multi Proc
from multiprocessing import Pool
@PonteIneptique
PonteIneptique / FichesOutils.md
Last active October 14, 2016 09:13
Draft tool sheets for the Humanistica Tools Group

CTS API

  • Type: API standard
  • Language: URL, XML
  • Difficulty of use: Low
  • TaDiRAH:
    • Research Activities > 7_Dissemination > Sharing
    • Research Activities > 7_Dissemination > Publishing
    • Research Activities > 1_Capture > Discovering
  • Short description: A CTS API makes it possible to cite passages of texts using logical identifiers
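
To make the idea of a logical identifier concrete, here is a minimal sketch of how a CTS URN could be split into its parts. The `CtsUrn` type and the `parse_cts_urn` helper are hypothetical names introduced for illustration; they are not part of any CTS library, and the sketch only covers the common `urn:cts:namespace:work[:passage]` shape.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str            # e.g. "greekLit"
    work: str                 # textgroup.work.version, e.g. "tlg0012.tlg001.perseus-grc2"
    passage: Optional[str]    # e.g. "1.1", or None when the URN targets a whole text


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN on colons and check the fixed "urn:cts" prefix."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"Not a CTS URN: {urn!r}")
    # The passage component is optional; an empty one is treated as absent.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], parts[3], passage)
```

For example, `parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")` separates the text identifier from the passage reference `1.1`, which is the part a reader would cite.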
PonteIneptique / getvalidreff.cts.xml
Last active April 19, 2017 11:31
Expression and density of responses
This file has been truncated, but you can view the full file.
<GetValidReff xmlns="http://chs.harvard.edu/xmlns/cts">
<request>
<requestName>GetValidReff</requestName><requestUrn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2</requestUrn><requestLevel>2</requestLevel>
</request>
<reply>
<reff><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.2</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.3</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.4</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.5</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.6</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.7</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.8</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.9</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.10</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.11</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.12</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.13<
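
A reply like the one previewed above could be read with the Python standard library. This is only a sketch: the `SAMPLE` string is a shortened, hand-closed copy of the truncated preview, and `list_passages` is a hypothetical helper name, not something taken from the gist.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of the reply.
CTS_NS = {"cts": "http://chs.harvard.edu/xmlns/cts"}

# Shortened, well-formed sample modeled on the truncated preview (assumption).
SAMPLE = """<GetValidReff xmlns="http://chs.harvard.edu/xmlns/cts">
<request>
<requestName>GetValidReff</requestName><requestUrn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2</requestUrn><requestLevel>2</requestLevel>
</request>
<reply>
<reff><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1</urn><urn>urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.2</urn></reff>
</reply>
</GetValidReff>"""


def list_passages(xml_string: str) -> list:
    """Return the passage component of every <urn> found in the reply."""
    root = ET.fromstring(xml_string)
    urns = root.findall("./cts:reply/cts:reff/cts:urn", CTS_NS)
    # The passage reference is everything after the last colon of the URN.
    return [urn.text.rsplit(":", 1)[1] for urn in urns]
```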
PonteIneptique / bnfcrawler.py
Last active December 8, 2023 20:36
BNF Crawler
import requests
import os
import shutil
from argparse import ArgumentParser
parser = ArgumentParser(description="Download Full Quality sets of pages from the BNF")
parser.add_argument("text", type=str, help="ID of the text. In http://gallica.bnf.fr/ark:/12148/btv1b53084829z/, this would be btv1b53084829z")
parser.add_argument("--start", type=int, default=1, help="Page to start from")
parser.add_argument("--end", type=int, default=None, help="Page to end at")
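
The preview stops before the download logic, so what follows is not the gist's code. It is a hedged sketch of how the `text`, `--start`, and `--end` arguments could be turned into one image URL per page. The URL template follows the IIIF Image API layout and is an assumption; `IIIF_TEMPLATE` and `page_urls` are hypothetical names.

```python
from typing import Iterator

# Assumed URL template (IIIF Image API style); not taken from the gist itself.
IIIF_TEMPLATE = (
    "https://gallica.bnf.fr/iiif/ark:/12148/{text}/f{page}/full/full/0/native.jpg"
)


def page_urls(text: str, start: int = 1, end: int = 1) -> Iterator[str]:
    """Yield one image URL per page, from start to end inclusive."""
    for page in range(start, end + 1):
        yield IIIF_TEMPLATE.format(text=text, page=page)
```

Unlike the gist's `--end` option, which defaults to `None`, this sketch requires an explicit last page so that the generator always terminates.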
PonteIneptique / PerseusCitationConversionTable.md
Last active June 29, 2018 13:39
Building a Citation Conversion table for Mayhoff Perseus' Naturalis Historia

Sometimes a Perseus XML file contains two concurrent citation systems: one is used for passage matching on the web interface, and the other is marked up but not used. You may be working from a secondary source that uses that other citation scheme. Here is a simple XSL and an example output based on the Mayhoff edition of Pliny the Elder as digitized by Perseus.

To do that, I simply applied the XSL below to the XML file in Oxygen. The result is a simple CSV file listing every reference and its equivalent.
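
The same idea can be sketched in Python for readers without an XSLT processor. This is not the author's stylesheet: the sample markup is an assumption (nested `div` elements for the primary scheme, `milestone` elements for the secondary one), real Perseus files are structured differently, and `conversion_rows` and `to_csv` are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: <div> carries the primary scheme (book.chapter),
# <milestone> carries the secondary scheme. Real Perseus markup differs.
SAMPLE = """<text>
  <div type="book" n="1">
    <div type="chapter" n="1"><milestone unit="section" n="1"/>a<milestone unit="section" n="2"/>b</div>
    <div type="chapter" n="2"><milestone unit="section" n="3"/>c</div>
  </div>
</text>"""


def conversion_rows(xml_string: str) -> list:
    """Pair each primary reference with the secondary references it contains."""
    root = ET.fromstring(xml_string)
    rows = []
    for book in root.findall("div[@type='book']"):
        for chapter in book.findall("div[@type='chapter']"):
            primary = f"{book.get('n')}.{chapter.get('n')}"
            for milestone in chapter.iter("milestone"):
                rows.append((primary, milestone.get("n")))
    return rows


def to_csv(rows: list) -> str:
    """Serialize the pairs as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["primary", "secondary"])
    writer.writerows(rows)
    return buffer.getvalue()
```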

PonteIneptique / hocr_to_kraken_transcribe.xsl
Last active March 21, 2020 11:25
XSL (requires Saxon-EE > 9.8) for transforming hOCR output from Tesseract into a transcription file for Kraken (à la ketos prefill)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:my="foo.bar"
exclude-result-prefixes="xs my saxon uuid"
xpath-default-namespace="http://www.w3.org/1999/xhtml"
version="2.0"
xmlns:uuid="java:java.util.UUID">
PonteIneptique / fix.py
Created February 6, 2019 15:24
Attempt at a small function for the lxml parser that fixes ill-formed XML when possible
from lxml import etree as ET
import re
def fix_xml(xml_string: str) -> str:
    """ Given an ill-formed XML string, try to fix it

    :param xml_string: XML that is faulty
    :return: XML that should not be faulty
    """
ENDPOEM
Carminis incompti lusus lecture procaces,
conueniens Latio pone supercilium.
non soror hoc habitat Phoebi, non uesta sacello,
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://hipster-philology.github.io/protogenie/protogenie/schema.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<config>
<output column_marker="TAB">
<header name="order">
<key>token</key>
<key>lemma</key>
<key>pos</key>
<key>Dis</key>
PonteIneptique / viz.xsl
Created May 21, 2020 07:13
XSL for data visualization of TEI MSD information
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:tei="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="xs"
xpath-default-namespace="http://www.tei-c.org/ns/1.0"
version="2.0">
<xsl:output encoding="UTF-8" method="html" ></xsl:output>
<xsl:variable name="textchunk" select="'ab'"/>
<xsl:variable name="chunkTitle" select="'Priapea'"/>