@mromanello
Last active March 1, 2023 17:25
AjMC: demo of PageLayoutAnalysis with YOLOv5
.python-version
__pycache__/
images/
runs/
libgl1
git-lfs
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demo: page layout analysis of classical commentaries with YOLOv5"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this notebook is to demonstrate the usage of [pre-trained YOLOv5 models](https://github.com/AjaxMultiCommentary/layout-yolo-models) for classical commentaries described in the following publication: \n",
"\n",
"> Najem-Meyer, Sven, and Matteo Romanello. 2022. ‘Page Layout Analysis of Text-Heavy Historical Documents: A Comparison of Textual and Visual Approaches’. In *Proceedings of the Conference on Computational Humanities Research 2022*, 36–54. Antwerp: CEUR-WS. https://ceur-ws.org/Vol-3290/long_paper8670.pdf.\n",
"\n",
"It covers the necessary steps to extract page regions from any classical commentary available at the Internet Archive. \n",
"\n",
"🤓 In order to keep it readable, some functions were moved to a separate file, [lib.py](lib.py).\n",
"\n",
"🐍 Supported Python version: `3.8`. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load pre-trained models\n",
"\n",
"First thing first, we need to load the pre-trained YOLOv5 models. \n",
"\n",
"⚠️ If you're running this notebook via `myBinder` you don't have to worry about it, as the [`postBuild`](./postBuild) file already does that automatically for you. However, if you're running this notebook locally, you'll need to run `git clone https://github.com/AjaxMultiCommentary/layout-yolo-models` in order to get a local copy of the models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from pathlib import Path\n",
"from lib import *"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# change this path if you are running this notebook locally\n",
"models_basedir = Path(\"./layout-yolo-models/models/\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yolo_coarse_model_path = str(models_basedir / \"coarse_labels.pt\")\n",
"yolo_fine_model_path = str(models_basedir / \"fine_labels.pt\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yolo_coarse_model = torch.hub.load('ultralytics/yolov5', 'custom', path=yolo_coarse_model_path, force_reload=True)\n",
"yolo_fine_model = torch.hub.load('ultralytics/yolov5', 'custom', path=yolo_fine_model_path)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get commentary images via IIIF\n",
"\n",
"As a test, let's take a commentary from the Internet Archive that the models have not seen during training: it's the commentary by J. Cassen to Thucydides' *History*, books 7-8 ([IA link](https://archive.org/details/thukydides7v8thuc/page/n92/mode/1up)). \n",
"\n",
"The unique identifier for this book in the IA is `thukydides7v8thuc`, and we can use it to access the page images via the IA's IIIF API.\n",
"It's enough to plug this identifier in the following string template `https://iiif.archivelab.org/iiif/<identifier>/manifest.json` to infer its IIIF link: [https://iiif.archivelab.org/iiif/thukydides7v8thuc/manifest.json](https://iiif.archivelab.org/iiif/thukydides7v8thuc/manifest.json).\n",
"\n",
"At this point we can use IIIF, together with a few lines of code, to download all commentary images and then process them with YOLO."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if you don't care about Thuc., do feel free to pick another classical commentary\n",
"# and plug its IA identifier in the line below\n",
"book_id = 'thukydides7v8thuc'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# by removing the `sample` and `start_at` parameters, the entire\n",
"# book will be downloaded (it will take a bit longer, so just grab a coffee while you wait)\n",
"download_IA_book(book_id, './images/', sample=5, start_at=9)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The function we just executed also creates a zip archive of all page images here: `images/{book_id}.zip`; this archive will come in handy later when exporting the data to perform further manual annotation via the external tool [eScriptorium](https://gitlab.com/scripta/escriptorium/). "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform PageLayoutAnalysis"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a local copy of the images, we can run them through our model. Here below we use the fine-grained model (`yolo_fine_model`), but it's possible to use for prediction the coarse model (`yolo_coarse_model`), trained with coarser region classes (see our paper for the details). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"images_df, predictions_df = pla_process_images(f\"images/{book_id}\", yolo_fine_model, save_predictions=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"🤓 If you passed `save_predictions=True` to the function above, you'll find a folder containing the original images with highlighted the regions identified by YOLO. This can be very useful to get a quick idea about the model's performance on a certain commentary. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The first `DataFrame` returned by this function contains information about the input images:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"images_df.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The second `DataFrame`, instead, contains YOLO's predictions (region name, bounding box coordinates, etc.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictions_df.head(10)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export to Alto/XML"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, we may want to export the automatic annotations and to curate them in an external tool like [eScriptorium](https://gitlab.com/scripta/escriptorium/). To be able do so, we need to convert YOLO predictions into Alto/XML, one of the formats supported by eScriptorium. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alto_export(book_id, images_df, predictions_df)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The function we just executed creates a zip archive of all Alto/XML files here: `alto/{book_id}.zip`; this archive can be imported into eScriptorium."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Describing the steps to load the data into eScriptorium is beyond the scope of this notebook. But detailed instructions can be found in [this eScriptorium tutorial]: see section 1.2-1.3 (for the creation of a document and import of local images) and section 1.5.1 (for the import of annotations from Alto/XML documents). "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ajmc-yolo-pla-demo-py38",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "481c0f0aa31053a275fc3a78ee64bea404793d6587c1d754d22bb0117218f535"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
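The IIIF manifest link construction described in the notebook is simple string templating. Here is a minimal, self-contained sketch; `iiif_manifest_url` is a helper name introduced here for illustration, it is not part of `lib.py` (which inlines the same f-string in `download_IA_book`):

```python
def iiif_manifest_url(identifier: str) -> str:
    """Return the IIIF Presentation manifest URL for an Internet Archive item.

    Uses the template https://iiif.archivelab.org/iiif/<identifier>/manifest.json
    given in the notebook above.
    """
    return f"https://iiif.archivelab.org/iiif/{identifier}/manifest.json"


print(iiif_manifest_url("thukydides7v8thuc"))
# → https://iiif.archivelab.org/iiif/thukydides7v8thuc/manifest.json
```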
import os
import shutil
from pathlib import Path
from typing import Dict, List, Tuple, TypedDict

import cv2
import pandas as pd
from jinja2 import Environment, BaseLoader
from piffle.presentation import IIIFPresentation
from skimage import io
SEGMONTO_VALUE_IDS = {
    'MainZone:commentary': 'BT01',
    'MainZone:primaryText': 'BT02',
    'MainZone:preface': 'BT03',
    'MainZone:translation': 'BT04',
    'MainZone:introduction': 'BT05',
    'NumberingZone:textNumber': 'BT06',
    'NumberingZone:pageNumber': 'BT07',
    'MainZone:appendix': 'BT08',
    'MarginTextZone:criticalApparatus': 'BT09',
    'MainZone:bibliography': 'BT10',
    'MarginTextZone:footnote': 'BT11',
    'MainZone:index': 'BT12',
    'RunningTitleZone': 'BT13',
    'MainZone:ToC': 'BT14',
    'TitlePageZone': 'BT15',
    'MarginTextZone:printedNote': 'BT16',
    'MarginTextZone:handwrittenNote': 'BT17',
    'CustomZone:other': 'BT18',
    'CustomZone:undefined': 'BT19',
    'CustomZone:line_region': 'BT20',
    'CustomZone:weird': 'BT21',
}
class PageDict(TypedDict):
    filename: str
    width: int
    height: int
def predictions_to_alto(
    page_dict: PageDict,
    predictions: List[Dict],
    output_path: Path,
    segmonto_mappings: Dict[str, str] = SEGMONTO_VALUE_IDS
) -> None:
    """
    Serialize a list of YOLO predictions for a single page (one dict per region,
    with bounding box coordinates and region name) to the Alto/XML format.
    """
    template_string = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.loc.gov/standards/alto/ns-v4#"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
    <Description>
        <MeasurementUnit>pixel</MeasurementUnit>
        <sourceImageInformation>
            <fileName>{{ page_dict['filename'] }}</fileName>
        </sourceImageInformation>
    </Description>
    <Tags>
    {% for key in segmonto_value_ids.keys() %}
        <OtherTag ID="{{ segmonto_value_ids[key] }}" LABEL="{{ key }}" DESCRIPTION="block type {{ key }}"/>
    {% endfor %}
    </Tags>
    <Layout>
        <Page WIDTH="{{ page_dict['width'] }}"
              HEIGHT="{{ page_dict['height'] }}"
              ID="{{ page_dict['id'] }}"
              PHYSICAL_IMG_NR="">
            <PrintSpace HPOS="0" VPOS="0" WIDTH="{{ page_dict['width'] }}" HEIGHT="{{ page_dict['height'] }}">
            {% for pred in predictions %}
            {% set region_id = 'r_' ~ loop.index %}
                <TextBlock ID="{{ region_id }}"
                           HPOS="{{ pred['hpos'] }}" VPOS="{{ pred['vpos'] }}"
                           WIDTH="{{ pred['width'] }}" HEIGHT="{{ pred['height'] }}"
                           TAGREFS="{{ segmonto_value_ids[pred['class']] }}">
                </TextBlock>
            {% endfor %}
            </PrintSpace>
        </Page>
    </Layout>
</alto>
"""
    template = Environment(loader=BaseLoader()).from_string(template_string)
    alto_xml_data = template.render(
        page_dict=page_dict, predictions=predictions, segmonto_value_ids=segmonto_mappings
    )
    output_path.write_text(alto_xml_data, encoding='utf-8')
def pla_process_images(
    target_folder: str, yolo_model: object, save_predictions: bool = False
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Run a folder of page images through a YOLOv5 model.

    Returns two DataFrames: one with information about the processed images
    and one with the YOLO predictions.
    """
    # get the list of images to process (jpg/png)
    input_files = list(Path(target_folder).glob('*.[jp][pn][g]'))
    print(f"Found {len(input_files)} images to process")

    # gather the dimensions of each input image
    image_dimensions = []
    for inpfile in input_files:
        img = cv2.imread(str(inpfile))
        img_height, img_width = img.shape[:2]
        image_dimensions.append({
            'id': inpfile,
            'filename': os.path.basename(inpfile),
            'height': img_height,
            'width': img_width
        })
    images_df = pd.DataFrame(image_dimensions).set_index('id')

    # run the images through the model
    predictions = yolo_model(input_files)
    if save_predictions:
        predictions.save()

    # collect the per-image prediction DataFrames into a single one
    temp = []
    for image, image_preds_df in zip(input_files, predictions.pandas().xyxy):
        image_preds_df['image_filename'] = os.path.basename(image)
        image_preds_df['image_path'] = image
        temp.append(image_preds_df)
    predictions_df = pd.concat(temp).reset_index()
    return images_df, predictions_df
def download_IA_book(book_id: str, target_folder: str, sample: int = None, start_at: int = None) -> None:
    """
    Download the page images of an Internet Archive book via its IIIF manifest.

    If `sample` is provided, only that many pages are downloaded, starting at
    page `start_at` (when given); otherwise the entire book is downloaded.
    """
    book_target_folder = os.path.join(target_folder, book_id)
    Path(book_target_folder).mkdir(parents=True, exist_ok=True)
    iiif_manifest_link = f'https://iiif.archivelab.org/iiif/{book_id}/manifest.json'
    manifest = IIIFPresentation.from_url(iiif_manifest_link)

    # select the canvases (= pages) to download
    if sample is not None and start_at is not None:
        canvases = manifest.sequences[0].canvases[start_at:start_at + sample]
    elif sample is not None:
        canvases = manifest.sequences[0].canvases[:sample]
    else:
        canvases = manifest.sequences[0].canvases
    print(f"{len(canvases)} images will be downloaded...")

    for canvas in canvases:
        image_id = canvas.images[0].resource.id
        image_filename = f"{canvas.id.split('/')[-2]}.jpg"
        target_path = Path(book_target_folder) / image_filename
        img = io.imread(image_id)
        io.imsave(target_path, img)
        print(f"Image {image_id} was downloaded to {target_path}")
    print("Done.")

    shutil.make_archive(book_target_folder, 'zip', book_target_folder)
    print(f'A zip file containing {len(canvases)} image files was created at {book_target_folder}.zip')
def alto_export(book_id: str, images_df: pd.DataFrame, predictions_df: pd.DataFrame) -> None:
    """
    Convert YOLO predictions to Alto/XML (one file per page) and zip them up.
    """
    alto_basedir = Path(f'alto/{book_id}')
    alto_basedir.mkdir(parents=True, exist_ok=True)

    for _, image_row in images_df.reset_index().iterrows():
        page_dict = image_row.to_dict()
        predictions = []
        # keep only the predictions belonging to the current page
        page_preds = predictions_df[predictions_df.image_filename == page_dict['filename']]
        for _, pred_row in page_preds.iterrows():
            # convert YOLO's corner coordinates into ALTO's position + size
            hpos = int(pred_row['xmin'])
            vpos = int(pred_row['ymin'])
            width = int(pred_row['xmax']) - int(pred_row['xmin'])
            height = int(pred_row['ymax']) - int(pred_row['ymin'])
            predictions.append({
                "class": pred_row['name'],
                "hpos": hpos,
                "vpos": vpos,
                "width": width,
                "height": height
            })
        predictions_to_alto(
            page_dict, predictions, alto_basedir / page_dict['filename'].replace('.jpg', '.xml')
        )

    print('The following region types were recognised:')
    print("\n".join(predictions_df.name.unique().tolist()))
    shutil.make_archive(f'alto/{book_id}', 'zip', f'alto/{book_id}/')
    print(f'A zip file containing {images_df.shape[0]} Alto/XML files was created at alto/{book_id}.zip')
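The coordinate conversion performed in `alto_export`, from YOLO's corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`) to ALTO's position-plus-size attributes (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`), can be sketched in isolation. `yolo_box_to_alto` is an illustrative helper introduced here, not part of `lib.py`:

```python
def yolo_box_to_alto(xmin: float, ymin: float, xmax: float, ymax: float) -> dict:
    """Convert a YOLO xyxy bounding box to ALTO block attributes (in pixels)."""
    return {
        "hpos": int(xmin),               # HPOS: distance from the left page edge
        "vpos": int(ymin),               # VPOS: distance from the top page edge
        "width": int(xmax) - int(xmin),  # WIDTH/HEIGHT: box size, not corners
        "height": int(ymax) - int(ymin),
    }


print(yolo_box_to_alto(120.7, 80.2, 520.9, 300.5))
# → {'hpos': 120, 'vpos': 80, 'width': 400, 'height': 220}
```

Note that, as in `alto_export`, the floats are truncated before subtracting, so the resulting width/height can differ by one pixel from `int(xmax - xmin)`.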
git lfs install
git clone https://github.com/AjaxMultiCommentary/layout-yolo-models
absl-py==1.4.0
appnope==0.1.3
asttokens==2.2.1
attrdict==2.0.1
backcall==0.2.0
cached-property==1.5.2
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==3.0.1
comm==0.1.2
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.6
decorator==5.1.1
executing==1.2.0
fonttools==4.38.0
gitdb==4.0.10
GitPython==3.1.31
google-auth==2.16.1
google-auth-oauthlib==0.4.6
grpcio==1.51.3
idna==3.4
imageio==2.26.0
importlib-metadata==6.0.0
importlib-resources==5.12.0
ipdb==0.13.11
ipykernel==6.21.2
ipython==8.10.0
jedi==0.18.2
Jinja2==3.1.2
jupyter_client==8.0.3
jupyter_core==5.2.0
kiwisolver==1.4.4
lxml==4.9.2
Markdown==3.4.1
MarkupSafe==2.1.2
matplotlib==3.7.0
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
networkx==3.0
numpy==1.24.2
oauthlib==3.2.2
opencv-python==4.7.0.72
packaging==23.0
pandas==1.5.3
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
piffle==0.4.0
Pillow==9.4.0
platformdirs==3.0.0
prompt-toolkit==3.0.37
protobuf==4.22.0
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.14.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.7.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.0
requests==2.28.2
requests-oauthlib==1.3.1
rsa==4.9
scikit-image==0.19.3
scipy==1.10.1
seaborn==0.12.2
sentry-sdk==1.15.0
six==1.16.0
smmap==5.0.0
stack-data==0.6.2
tensorboard==2.12.0
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
thop==0.1.1.post2209072238
tifffile==2023.2.3
tomli==2.0.1
torch==1.13.1
torchvision==0.14.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
typing_extensions==4.5.0
ultralytics==8.0.47
urllib3==1.26.14
wcwidth==0.2.6
Werkzeug==2.2.3
zipp==3.15.0