@msoutopico
Last active April 11, 2024 08:36
PISA 2025 -- list of tech tasks

Period: 18 December 2023 -- 7 January 2024
Backups: Gergoe, Kos, Adrien
Updates: check revisions of this document

History

Date Task Comment
2024-01-29 Task 2a Updated instructions to update the en-ZZ base TM
2024-03-19 Task 2b Updated instructions to update the fr-ZZ base TM
2024-03-23 Task 2a Updated instructions to update the en-ZZ base TM

Table of contents

General info

Tasks

  1. Source files technical signoff -- Manuel
  2. Update base versions -- Adrien
  3. Source updates
  4. Initialization and setup of OmegaT projects -- Gergoe
  5. Testing MoM -- Manuel
  6. UI Translations -- Kos/Adrien
  7. Trend Transfer -- Kos
  8. Move target files to final repo -- Gergoe
  9. Helpdesk -- @all
  10. Set up reconciliation project -- Kos/Gergoe
  11. Add files to batch folders -- Manuel
  12. Arrange TMs after batch transition -- Gergoe

General info

Batches

Files are organized and travel through workflows in batches. Batches are defined in this monitoring sheet PISA2025ft-batches. This file must be considered the source of truth about how files and batches are named, as well as which batch each unit belongs to.

File source/files.yaml in the common repo should reflect the information in the monitoring sheet, and should be updated if any changes are made to the above (updates are done manually, for now -- script to automate it welcome).

Workflow steps

There are different workflow types and they have different steps. They are defined in either of these two:

Again, the file workflow_steps.yaml (which is a config file for our app) must be in sync with the monitoring sheet above.

Team projects and repos

Production (FT) team projects are hosted in AWS CodeCommit, in domain https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/. Repo names start with pisa_2025ft_ and file names start with PISA_2025FT_.

Testing (staging) team projects are hosted in AWS CodeCommit, in the same domain as production team projects. Unlike in production, repo names in staging start with pisa_2025stg_ and file names start with PISA_2025STG_. If any files from production (FT) must be used in staging for testing, they must be renamed accordingly.

There is one main repo for each step for each locale. They have the following URL template: pisa_2025ft_translation_{LOCALE}_{STEP}.git .
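As a rough illustration of the naming conventions above, the following Python sketch builds a repo URL from a locale and a step, and renames a production file for use in staging. The function names are hypothetical, and the staging repo template is an assumption: only the pisa_2025stg_ prefix is documented above.

```python
BASE_URL = "https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/"


def repo_url(locale: str, step: str, env: str = "ft") -> str:
    """Build the clone URL of the main repo for a locale and a step.

    env is "ft" (production) or "stg" (staging). Assumption: staging repos
    follow the same template as production, only with the other prefix.
    """
    if env not in ("ft", "stg"):
        raise ValueError(f"unknown environment: {env}")
    return f"{BASE_URL}pisa_2025{env}_translation_{locale}_{step}.git"


def staging_file_name(file_name: str) -> str:
    """Rename a production (FT) file so that it can be used in staging."""
    prefix = "PISA_2025FT_"
    if not file_name.startswith(prefix):
        raise ValueError(f"not a production file name: {file_name}")
    return "PISA_2025STG_" + file_name[len(prefix):]
```

For example, repo_url("ja-JP", "trend-prepare-files") gives the URL of the ja-JP repo used in the trend transfer task.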

Each main repository hosts an OmegaT team project, which pulls source files, config files and language assets from the common repo: pisa_2025ft_translation_common.git.

For the purposes of persistent previews, final target files are to be pushed to the final repo: pisa_2025ft_translation_final.git.

For the initial phase of the trend transfer task, we use our own team projects, hosted on Github (organization: capstanlqc-pisa).

PB adaptation project

This project is a bit special in a few regards:

  • the target language is not really a language (pb stands for "paper based")
  • the source text is the computer-based version (CBA) and the target version is the paper-based version (PBA)
  • files have been added one by one (one repository mapping per file) rather than by adding a full batch (in order to remove a file from the project, the repository mapping for that file must be removed or commented out in the project settings file, i.e. omegat.project)
  • Dara often requests that target files from this project be moved to the final repo.

Tasks

TASK 1. Source files technical signoff

Responsible person: Manuel

Whenever there are updates in the source folder of the common repo, a number of actions are required for the technical signoff of the source files. In this context, "updates" means files being pushed to that repo, under /source/batch1, and those files can be new files that are being released to the repo after being authored, or a new version of already released files.

The technical signoff involves the following steps:

  1. reviewing the new files or the new parts in re-released files
  2. fixing (or linting) the identified issues (scripts are ready for that)
  3. copying the files to the correct batch folder

We can skip step 1 in this handover, and be confident that there will be no more issues other than the ones that have been already identified.

Then, to fix the issues that have already been identified in previous reviews of already released files, we have a script that runs a series of string substitutions based on the selected configuration file. All the necessary code and config files are available here: https://github.com/capstanlqc/source-xml-linter

  • If there are updates in files belonging to batch 05_QQA_N, the linting script must be run with config config_qqa_zwsp.xlsx

  • If there are updates in files belonging to any "new" batch (i.e. any batch starting with 01 .. 06 and ending with _N), the normal config config.xlsx must be used.

  • If there are updates in "trend" files, the config config_trend.xlsx must be used. Trend files belong to batches ending with _T and can be recognized because they normally have a unit ID that follows pattern _P?[RMS]\d{3}

    where the optional P indicates that it's the PBA version and R/M/S is the initial of the domain (reading, math, science).
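The three rules above can be sketched as a small lookup in Python. This is only an illustration: the function names are hypothetical, and it assumes that batch names always carry the number prefix and the _N or _T suffix described above.

```python
import re

# Unit ID pattern for trend files, as given above: an optional P (PBA
# version), the domain initial (R, M or S) and three digits.
TREND_UNIT_ID = re.compile(r"_P?[RMS]\d{3}")


def linter_config(batch: str) -> str:
    """Return the config file name to use for files of the given batch."""
    if batch == "05_QQA_N":
        return "config_qqa_zwsp.xlsx"
    if batch.endswith("_T"):
        return "config_trend.xlsx"
    if re.fullmatch(r"0[1-6]_.*_N", batch):
        return "config.xlsx"
    raise ValueError(f"no linter config known for batch: {batch}")


def looks_like_trend_file(file_name: str) -> bool:
    """Check whether a file name contains a trend unit ID."""
    return TREND_UNIT_ID.search(file_name) is not None
```

The special case for 05_QQA_N is tested first because that batch also matches the general rule for new batches.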

What I normally do to run the script above is:

  • Create a folder and copy there only the files that I want to lint, e.g. tolint
  • Run the script using the path to the tolint folder as the input argument
  • As output argument, use the path to the folder where I want to write the linted files, e.g. linted
  • Review the action of the script just to make sure no unexpected damage happened (the best way to do this check is to open the file in OmegaT, and if it's a new version of an already released file, then it's handy to add the file to a project containing the translations of all segments in the previous version to see what changes and becomes untranslated -- other than that, a diff comparison is useful)
  • If everything is okay in the linted files, copy them or move them from the linted folder into the corresponding batch folder (according to the info in files.yaml).

In other words, after activating the virtual environment and installing dependencies, I do:

app=/path/to/local/repo
tolint=/path/to/the/files/tolint
linted=/path/to/the/files/linted
python $app/str_subs.py -i $tolint -o $linted -c $app/config.xlsx
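For the diff comparison mentioned in the review step, a minimal sketch with Python's standard difflib module could look like this. The function name is hypothetical; the tolint and linted labels simply mirror the folder names used above.

```python
import difflib


def diff_report(original: str, linted: str, name: str = "file.xml") -> str:
    """Return a unified diff between the original and the linted text.

    An empty string means that the linter did not change anything.
    """
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        linted.splitlines(keepends=True),
        fromfile=f"tolint/{name}",
        tofile=f"linted/{name}",
    )
    return "".join(diff)
```

To compare two files, read them first (for instance with Path.read_text) and pass the two strings to the function.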

Additionally, there's a separate script for a different kind of issue with named entities and escaped hex entity references:

  • If there are updates in "trend" files, any entity issues that may occur must be fixed with script decode_entities.sh (which uses entities.json as config)

In the video below, to avoid confusion please skip or ignore the part between 11:05 and 14:37. Sorry about that.

Watch the video


TASK 2. Update base versions

This task also includes generating the target files from the prepare-files step of adapting versions (en-*, fr-*, zh-*).

a) en-ZZ

This must be done when a new batch is released or the files in an already released batch are updated.

Action point for @Eli or @Tanya: mention in our Skype's PISA25 TWG chat group that a new batch is released and tag @Adrien and @Manuel

Some countries which have English locales adapt the English master, e.g. en-PS. en-* projects have a repository mapping in their settings that adds file tm/auto/base/en-ZZ.tmx. The remote version of the file is in the common repo, on path assets/base/en-ZZ.tmx.zip. This file needs to be updated with every new batch released.

So, every time a new batch is released to countries:

  1. Add the new batch (mapping) to the pisa_2025ft_translation_en-ZZ_prepare-files project (note: from the capstanlqc-pisa github organization).
  2. Pack the project as pisa_2025ft_translation_en-ZZ_prepare-files_OMT to have an offline version. Unpack the offline version of the project and close it.
  3. Run the following command:
    java -jar /path/to/omegat/build/install/OmegaT/OmegaT.jar /path/to/omegat/project --config-dir=/path/to/config/dir --mode=console-createpseudotranslatetmx --pseudotranslatetmx=/path/to/omegat/project/tm/auto/en-ZZ.tmx --pseudotranslatetype=equal
    
  4. Re-open the project to confirm that all segments are pre-translated with the source text. You can search for regex ^(.+)\ue000(?!\1).+$ in both source and target to find any segments where the translation is different from the source.
  5. Zip en-ZZ.tmx and commit the new base TM en-ZZ.tmx.zip to pisa_2025ft_translation_common/assets/base/en-ZZ.tmx.zip overwriting the file there (use commit message "Update English master base TM").
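As an alternative to the regex search in step 4, the generated TM can also be checked outside OmegaT. The following Python sketch lists the translation units whose target differs from the source. It assumes a TMX where each tu element holds exactly one source and one target tuv; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def mismatched_segments(tmx_text: str) -> list[tuple[str, str]]:
    """Return (source, target) pairs where the target differs from the source."""
    root = ET.fromstring(tmx_text)
    mismatches = []
    for tu in root.iter("tu"):
        # itertext() also collects text inside inline elements, if any.
        segs = ["".join(seg.itertext()) for seg in tu.iter("seg")]
        if len(segs) == 2 and segs[0] != segs[1]:
            mismatches.append((segs[0], segs[1]))
    return mismatches
```

An empty list means that every unit in the TM is pre-translated with the source text.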

Finally, run code/commit_target_files.sh for en-* locales and for the new/updated batch.

Caveats

b) fr-ZZ

  1. Add new batch to project: https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_fr-ZZ_signoff.git
  2. Make sure that all segments are pre-translated and press Ctrl+D to generate the master TM
  3. Remove any changeid, changedate, creationid and/or creationdate properties from entries in the new master TM.

    Tip: replace (<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+ with $1 (first captured group)

  4. Rename pisa_2025ft_translation_fr-ZZ_signoff-omegat.tmx as fr-ZZ.tmx and zip it. You need both fr-ZZ.tmx and fr-ZZ.tmx.zip.
  5. Replace both fr-ZZ.tmx and fr-ZZ.tmx.zip in pisa_2025ft_translation_common/assets/base/ with the files generated in the previous step.
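The replacement suggested in the tip of step 3 can also be applied with a script. A minimal Python sketch that uses the same regular expression (the function name is hypothetical):

```python
import re

# Same pattern as in the tip above: keep the opening of the target <tuv>
# tag and drop whatever other attributes follow it.
TUV_PROPERTIES = re.compile(r'(<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+')


def strip_tuv_properties(tmx_text: str) -> str:
    """Remove changeid, changedate, creationid and creationdate from target tuv tags."""
    return TUV_PROPERTIES.sub(r"\1", tmx_text)
```

Tags that carry no extra attributes, and tuv tags in other languages, are left untouched.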

Finally, run code/commit_target_files.sh for fr-* locales.

c) zh-Hant-ZZ

  1. Add new batch to project: https://github.com/capstanlqc-pisa/pisa_2025ft_translation_zh-Hant-ZZ_signoff.git
  2. Make sure that all segments are pre-translated and press Ctrl+D to generate the master TM
  3. Remove any changeid, changedate, creationid and/or creationdate properties from entries in the new master TM.

    Tip: replace (<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+ with $1 (first captured group)

  4. Rename pisa_2025ft_translation_zh-Hant-ZZ_signoff-omegat.tmx as zh-Hant-ZZ.tmx and zip it.
  5. Replace pisa_2025ft_translation_common/assets/base/zh-Hant-ZZ.tmx.zip with the version generated in the previous step.

Finally, run code/commit_target_files.sh for zh-* locales.

TASK 3. Source updates

Source files might be updated for whatever reason during the project -- that means that a new file is pushed from TAO to the common repo inside source/batch1, overwriting a previous version if it exists. For example, this may happen after errata are fixed.

Any new files need to be linted and signed off again as described in task 1 above, just as it was done with the original version. Then the base versions need to be updated too as explained in task 2.

  1. Signoff / lint source files [task #1]
  2. Update en-ZZ base version [task #2]
  3. Add batch again to fr-ZZ final-proofreading and zh-Hant-ZZ proofreading projects
  4. Update fr-ZZ and zh-Hant-ZZ base versions [task #2]

Step 3 above is done by adding the repository mapping in the project settings file (e.g. omegat.project) of those two projects and it's necessary if the batch containing the updated files was already proofread some time ago and therefore removed from those two projects. Only if the batch is added will the proofreader have access to the files and be able to edit the translations.
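Adding the batch back amounts to inserting one mapping element into the project settings file. Below is a minimal Python sketch, assuming the usual layout of OmegaT team project settings (a repositories element that contains repository elements with mapping children). The function name is hypothetical, and the paths to pass depend on the batch, so compare the result with an actual omegat.project before relying on it.

```python
import xml.etree.ElementTree as ET


def add_mapping(project_xml: str, repo_url: str, local: str, remote: str) -> str:
    """Add a mapping to the repository with the given URL, unless it already exists."""
    root = ET.fromstring(project_xml)
    repositories = root.find(".//repositories")
    if repositories is None:
        raise ValueError("no <repositories> element in the project settings")
    for repo in repositories.findall("repository"):
        if repo.get("url") == repo_url:
            break
    else:
        raise ValueError(f"repository not found: {repo_url}")
    exists = any(
        m.get("local") == local and m.get("repository") == remote
        for m in repo.findall("mapping")
    )
    if not exists:
        ET.SubElement(repo, "mapping", {"local": local, "repository": remote})
    return ET.tostring(root, encoding="unicode")
```

Removing or commenting out the same element takes the batch out of the project again.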

TASK 4. Initialization and setup of OmegaT projects

This is an application to create and/or set up OmegaT team projects according to the information indicated in the translation workflow monitoring sheet.

The readme file in the repo explains how to use it. The docs folder contains links that explain how to set up package git-remote-codecommit.

TASK 5. Testing MoM

Responsible: @Kos

If this task is requested, let's discuss it.

https://github.com/capstanlqc/its-filter-validation/

TASK 6. UI Translations

Responsible: @Kos

Nothing else to do unless there are issues or any unforeseen additional request.

https://rentry.org/ui_translation_repos

TASK 7. Trend transfer

Responsible: @Gergoe / @Kos?

Our linguists are working on our team projects hosted on Github. When the trend transfer is complete for one locale, we must transfer those translations to the AWS repos.

General info:

ACER has created the repos and we must create the OmegaT projects in them. That has been done only for 5 locales: ar-IL, de-AT, ja-JP, ru-UZ, th-TH.

  • pisa_2025ft_translation_ar-IL_trend-prepare-files
  • pisa_2025ft_translation_de-AT_trend-prepare-files
  • pisa_2025ft_translation_ja-JP_trend-prepare-files
  • pisa_2025ft_translation_ru-UZ_trend-prepare-files
  • pisa_2025ft_translation_th-TH_trend-prepare-files

When the trend transfer is completed for one version, the steps now are (with examples on the command line for ja-JP):

  1. Clone the target AWS repo:

    git clone https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_ja-JP_trend-prepare-files.git
    
  2. If the team project is not created in the AWS repo, create it with OmegaT CLI and set it up correctly (the 5 projects above can be used as templates/models):

    /opt/omegat/OmegaT_5.7.2/jre/bin/java -jar /opt/omegat/OmegaT_5.7.2/OmegaT.jar team init en ja-JP
    

    Then, make the appropriate changes in the omegat.project file (e.g. adding repository mappings, etc. just like in the five projects that were created first, mentioned above).

  3. Download the github repo

    gh repo clone capstanlqc-pisa/pisa_2025ft_transfer_ja-JP_trend-prepp
    
  4. Copy the working TM (omegat/project_save.tmx) from the github repo to the AWS project as tm/auto/PISA_{LOCALE}_MS2022_trend25.tmx and commit changes on the AWS repo:

    cp pisa_2025ft_transfer_ja-JP_trend-prepp/omegat/project_save.tmx pisa_2025ft_translation_ja-JP_trend-prepare-files/tm/auto/PISA_ja-JP_MS2022_trend25.tmx
    cd pisa_2025ft_translation_ja-JP_trend-prepare-files
    git add . && git commit -m "Added TM with transferred trend version" && git push
    
  5. Download the team project on AWS and commit target files:

  6. Copy target files to the final repo (see below how to do this)
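The copy in step 4 can be scripted for any locale. The sketch below, in Python, only covers the file copy (the git commands stay as shown above). The function names are hypothetical, and both arguments are assumed to be paths to local clones.

```python
import shutil
from pathlib import Path


def trend_tm_path(aws_project: Path, locale: str) -> Path:
    """Path of the transferred trend TM inside the AWS project."""
    return aws_project / "tm" / "auto" / f"PISA_{locale}_MS2022_trend25.tmx"


def copy_working_tm(github_project: Path, aws_project: Path, locale: str) -> Path:
    """Copy the working TM of the Github project into the AWS project."""
    source = github_project / "omegat" / "project_save.tmx"
    destination = trend_tm_path(aws_project, locale)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return destination
```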

TASK 8. Move target files to final repo

Steps:

  1. Go to your local copy of the project repo
  2. Sync with remote version (git pull)
  3. Make a copy of the target folder, e.g. final
  4. Change directory to that folder
  5. Flatten the structure of files so that all files are now at the same level
  6. Remove directories (which are now empty)
  7. Remove the locale extension from all files
  8. Move all the renamed files to the location /translations/{LOCALE}/batch1/ in the final repo
  9. Go to the final repo and push the new files

To move target files to the final repo, you can use this script: https://gist.github.com/msoutopico/4bbe0ac90b71f709a4f5d8fc3bdf91c1

For other language versions, you can adapt the script above accordingly, or just read through it to confirm what steps are necessary.
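For reference, steps 3 to 8 can be sketched in Python as follows. This is not the script linked above, only an illustration. It assumes that the locale extension is a _{LOCALE} suffix right before the file extension, which should be checked against real target files. The function names are hypothetical.

```python
import shutil
from pathlib import Path


def final_name(file_name: str, locale: str) -> str:
    """Drop the locale suffix (assumed to be _{LOCALE}) from a file name."""
    path = Path(file_name)
    stem = path.stem
    suffix = f"_{locale}"
    if stem.endswith(suffix):
        stem = stem[: -len(suffix)]
    return stem + path.suffix


def collect_final_files(target_dir: Path, final_repo: Path, locale: str) -> list[Path]:
    """Copy all target files, flattened and renamed, into the final repo."""
    destination = final_repo / "translations" / locale / "batch1"
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(target_dir.rglob("*")):
        if path.is_file():
            new_path = destination / final_name(path.name, locale)
            # Files with the same name in different subfolders overwrite each other.
            shutil.copy2(path, new_path)
            copied.append(new_path)
    return copied
```

Copying straight into the final repo makes the intermediate copy of the target folder unnecessary. Pulling and pushing are still done with git as described in the steps.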

TASK 9. Helpdesk

We might get tickets from users who get the upgrade wrong. Get familiar with the upgrading instructions: https://capstanlqc.github.io/omegat-guides/verification/install-and-setup/

New: no manual customization is needed for users who install OmegaT 5.7.2 from scratch. The configuration script is included in the installer and runs automatically when OmegaT is run if the scripts folder hasn't been customized yet and set to the user config folder.

TASK 10. Set up reconciliation project (zh-Hant-ZZ)

Responsible PM: Tanya Sonolenko
Responsible TT: @Kos (when he has creds, otherwise Gergoe to push the two TMs)

Two translators are producing zh-Hant-ZZ translations of a certain batch in offline projects. When they are done, they will hand back project packages.

Steps:

  1. Unpack those two projects
  2. In each of them, press Ctrl+D to produce the master TM
  3. Rename those two TMs as {BATCH}_zh-Hant-ZZ_T1.tmx and {BATCH}_zh-Hant-ZZ_T2.tmx respectively.
  4. Commit those two TMs to folder tm/rec/ in the reconciliation project: https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_zh-Hant-ZZ_reconciliation.git

TASK 11. Add files to batch folders

Responsible person: Manuel

This only needs to be done after there are source updates and they go through technical signoff. The script below can be run to sort files in their correct batch folders:

Script: https://gist.github.com/msoutopico/72cee9a221860fedb9f876372ffc8e80

Improvements todo:

  • Parameters source_dir, root and config are currently hardcoded. It would be nice to add source_dir as a CLI argument (the other two parameters are based on that one) so that the script doesn't need to be edited when it is run by different people.
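A possible starting point for that improvement, sketched with Python's argparse module. How root and config are derived from source_dir is an assumption here (the parent folder, and files.yaml inside the source folder), and it would need to be aligned with what the actual script does.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Read source_dir from the command line and derive the other two parameters."""
    parser = argparse.ArgumentParser(
        description="Sort source files into their batch folders."
    )
    parser.add_argument(
        "source_dir", type=Path, help="path to the source folder of the common repo"
    )
    args = parser.parse_args(argv)
    # Assumption: both parameters can be derived from source_dir like this.
    args.root = args.source_dir.parent
    args.config = args.source_dir / "files.yaml"
    return args
```

With this in place, each person would run the script with their own path, with no need to edit the code.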

TASK 12. Arrange TMs after batch transition

Responsible person: Gergoe

This is an action that is expected from ACER, but they don't have a working implementation yet. In the meantime, we can do this manually. The sequence of steps must be the following:

  1. A batch transition happens (a batch is added to or removed from a certain step)
  2. We (TTT) arrange TMs at that step according to the new batches at that step
  3. The user can then download the project with the new TM arrangement

To perform step 2 above, follow these steps:

  1. After the batch transition, clone or sync the repository where the omegat project for that step is hosted
  2. Run the script arrange_tmx_files_with_extension on the repository, and push changes.

The script is available in two flavours: Python and Node.js (both have been provided to ACER but James' team will base their feature on the Node.js version).

Run as:

python arrange_tmx_files_with_extension.py /path/to/local/clone

or

node arrange_tmx_files_with_extension.js /path/to/local/clone

Remember to install dependencies first.

@msoutopico

@kosivantsov @amathot @gergoe You can leave comments here ;)

@msoutopico

TODO: Add info about the common repo and assets in capstanlqc-pisa to trend transfer
