- Sample: Curate a batch of document pages (say 100). The corpus should reflect common kinds of documents in the collection. Split into train and test sets.
- Predict: Auto-transcribe with the current best model using Trainer (starting with Vision or Araucania?)
- Upload: Upload the image files and transcriptions with Fetcher
- Correct the errors in eScriptorium
- Fine-tune the current best model on the new data
- Assess improvement using test data. Generate character error rate (CER) and word error rate (WER) metrics.
- Evaluate model transcriptions for research tasks. Record issues and areas that require improvement.
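The evaluation step above can be sketched in pure Python as follows. This is a minimal illustration of CER and WER via edit distance; in practice you would likely use a library such as jiwer or the reporting built into your HTR toolkit:

```python
# CER and WER via Levenshtein edit distance (pure-Python sketch).
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Comparing the corrected test-set transcriptions (reference) against the model output (hypothesis) before and after fine-tuning gives a direct measure of improvement.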
title: "Demo Project Workflow. From images to research data in Obsidian"
description: >
  This project offers a workflow to process historical documents from the Circuit Court of Istmina, Chocó, Colombia.
  https://eap.bl.uk/project/EAP1477
  In this project, we will:
  - Fetch the IIIF Images and metadata from the British Library
  - Segment the images with Kraken
  - Transcribe the images using Google Vision
  - Upload the images to eScriptorium where the transcriptions can be corrected
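The fetch step relies on the IIIF Image API, where each image request is a URL of the form `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. A minimal sketch of building such a URL (the base URL and identifier below are placeholders, not the actual British Library endpoints):

```python
# Build a IIIF Image API request URL. The base URL and identifier here
# are illustrative placeholders; "max" is the IIIF 3.0 full-size keyword
# (IIIF 2.x uses "full" instead).
from urllib.parse import quote

def iiif_image_url(base: str, identifier: str,
                   region: str = "full", size: str = "max",
                   rotation: str = "0", quality: str = "default",
                   fmt: str = "jpg") -> str:
    return (f"{base}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")

url = iiif_image_url("https://example.org/iiif", "EAP1477_page_001")
```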
Log in here with your PennKey (without @upenn.edu) and your PennKey password:
https://rdp-lab.library.upenn.edu/maps
On the remote computer, open the browser and search for ExcelAlmaLookup; it should give you this URL:
https://github.com/pulibrary/ExcelAlmaLookup/#readme | |
In the README, find the link to download the .exe file.
Tinker with Windows to download and open the .exe file to install the app.
September 2023
This is a tutorial on how to create a local web server to serve static websites. We will repurpose a Wi-Fi router to serve data over Wi-Fi to the browser on local devices such as phones and tablets. This is a great way to share a digital archive with people in locations with limited internet access.
At the end of this tutorial we will have:
- A working Wi-Fi router running OpenWrt (Linux)
- A static website with search using Pagefind
App URL: https://nexis.pennds.org/UpennWSK/homepage/
Repo: https://github.com/upenn-libraries/lexis-wsk
To run the app:
docker-compose up
To access logs:
docker logs app
- The main problem is that textFileDict (a dictionary that maps the human-readable titles of texts to the Python file for that text, in either data/Greek or data/Latin) gets deleted
- textFileDict is needed for the selection of text sections.
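A sketch of how such a dictionary might be rebuilt from the data directories. The directory layout matches the description above, but the naming convention (file stem to title) is an assumption for illustration, not the app's actual code:

```python
# Hypothetical reconstruction of textFileDict: map human-readable titles
# to their files under data/Greek and data/Latin. The stem-to-title
# convention is an assumption, not the app's real logic.
from pathlib import Path

def build_text_file_dict(data_dir: str = "data") -> dict:
    text_file_dict = {}
    for lang in ("Greek", "Latin"):
        for path in sorted(Path(data_dir, lang).glob("*.py")):
            # e.g. "Iliad_Book_1.py" -> "Iliad Book 1"
            title = path.stem.replace("_", " ")
            text_file_dict[title] = str(path)
    return text_file_dict
```

Rebuilding the dictionary at startup (rather than persisting it) would make the deletion bug harmless, at the cost of a directory scan on each launch.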
Good housekeeping
- To connect to the Bridge server, open terminal and enter:
ssh bmulliga@64.227.97.179
- The server is just like any other computer. You need to keep it up to date so that people can't hack into it and use it to mine bitcoin. Whenever you log in, it's good practice to run two commands:
sudo apt update
to refresh the package lists, and
sudo apt upgrade
to install the available updates.
This tutorial is a quick introduction to FastAPI, a simple Python web framework for creating REST APIs, static HTML pages, and many other web applications. Sebastián Ramírez, the creator of FastAPI, maintains excellent documentation and a Gitter forum.
FastAPI, in many respects, is an updated version of Flask. It's built with the features and capabilities of Python 3 in mind, particularly type hints for data validation. It also embraces asynchronous functions and other features of modern web design.
In the following sections, I'll share several use cases for FastAPI. I am particularly fond of FastAPI as a general toolkit that can be used for building simple static HTML or serving advanced machine learning models. It's minimal and simple, but capable of growing as your project evolves and becomes more complex.
I am currently working on issue 291 to highlight the search query in the search results. Search uses DocumentSearchView, which has a get_queryset method that returns the results and renders corpus/document_list.html.
The result's description is rendered at lines 27-28 of document_result.html:
{# description #}
<p class="description">{{ document.description.0|truncatewords:25 }}</p>
Given that document.description is just a string, the simplest solution would be to add `<mark>` tags around the query in the description.
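A sketch of that approach as a standalone helper (not the project's actual code; in Django this would likely be registered as a template filter, and the result marked safe only because the text is escaped first):

```python
# Wrap each occurrence of the query in <mark> tags. The text is
# HTML-escaped first so user content can't inject markup. Hypothetical
# helper for illustration, not the project's real implementation.
import html
import re

def highlight(description: str, query: str) -> str:
    escaped = html.escape(description)
    if not query:
        return escaped
    pattern = re.compile(re.escape(html.escape(query)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped)
```

Note the case-insensitive match preserves the original casing of each hit by wrapping the matched text itself rather than the query.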