Period: 18 December 2023 -- 7 January 2024
Backups: Gergoe, Kos, Adrien
Updates: check revisions of this document
Date | Task | Comment |
---|---|---|
2024-01-29 | Task 2a | Updated instructions to update the en-ZZ base TM |
2024-03-19 | Task 2b | Updated instructions to update the fr-ZZ base TM |
2024-03-23 | Task 2a | Updated instructions to update the en-ZZ base TM |
- Source files technical signoff -- Manuel
- Update base versions -- Adrien
- Source updates
- Initialization and setup of OmegaT projects -- Gergoe
- Testing MoM -- Manuel
- UI Translations -- Kos/Adrien
- Trend Transfer -- Kos
- Move target files to final repo -- Gergoe
- Helpdesk -- @all
- Set up reconciliation project -- Kos/Gergoe
- Add files to batch folders -- Manuel
- Arrange TMs after batch transition -- Gergoe
Files are organized and travel through workflows in batches. Batches are defined in this monitoring sheet PISA2025ft-batches. This file must be considered the source of truth about how files and batches are named as well as what batch each unit belongs to.
File source/files.yaml in the common repo should reflect the information in the monitoring sheet, and should be updated if any changes are made to the sheet (updates are done manually, for now -- a script to automate it is welcome).
There are different workflow types and they have different steps. They are defined in either of these two:
- 220202_PISA25_Workflow_master_CONS
- https://github.com/capstanlqc/mk-omegat-team-projs/blob/master/config/workflow_steps.yaml
Again, the file workflow_steps.yaml (which is a config file for our app) must be in sync with the monitoring sheet above.
Production (FT) team projects are hosted in AWS CodeCommit, in domain https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/. Repo names start with pisa_2025ft_ and file names start with PISA_2025FT_.
Testing (staging) team projects are hosted in AWS CodeCommit, in the same domain as production team projects. Unlike in production, repo names in staging start with pisa_2025stg_ and file names start with PISA_2025STG_. If any files from production (FT) must be used in staging for testing, they must be renamed accordingly.
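The renaming rule above can be sketched as a small helper. This is only an illustration: the function name and the example file name are hypothetical, and only the two prefixes come from this document.

```python
PROD_PREFIX = "PISA_2025FT_"   # file name prefix in production (from this document)
STG_PREFIX = "PISA_2025STG_"   # file name prefix in staging (from this document)


def to_staging_name(filename: str) -> str:
    """Return the staging name for a production file name.

    Hypothetical helper: it only swaps the prefix and refuses names
    that do not look like production files.
    """
    if not filename.startswith(PROD_PREFIX):
        raise ValueError(f"Not a production file name: {filename}")
    return STG_PREFIX + filename[len(PROD_PREFIX):]


# Example with a made-up file name
print(to_staging_name("PISA_2025FT_example.xml"))
```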
There is one main repo for each step for each locale. They have the following URL template: pisa_2025ft_translation_{LOCALE}_{STEP}.git.
Each main repository hosts an OmegaT team project, which pulls source files, config files and language assets from the common repo: pisa_2025ft_translation_common.git.
For the purposes of persistent previews, final target files are to be pushed to the final repo: pisa_2025ft_translation_final.git.
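For scripting purposes, the repo naming template can be expanded as follows. This is a minimal sketch for production repos only; the function name is hypothetical, while the domain and the template are the ones given above.

```python
# CodeCommit domain for production team projects (from this document)
BASE_URL = "https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/"


def main_repo_url(locale: str, step: str) -> str:
    """Build the URL of the main repo for one locale and one workflow step."""
    return f"{BASE_URL}pisa_2025ft_translation_{locale}_{step}.git"


print(main_repo_url("fr-ZZ", "signoff"))
```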
For the initial phase in the trend transfer task, we use our own team projects, hosted on Github (organization: capstanlqc-pisa).
This project is a bit special in a few regards:
- the target language is not really a language (pb stands for "paper based")
- the source text is the computer-based version (CBA) and the target version is the paper-based version (PBA)
- files have been added one by one (one repository mapping per file) rather than by adding a full batch (in order to remove a file from the project, the repository mapping for that file must be removed or commented out in the project settings file, i.e. omegat.project)
- Dara often requests to move target files from this project to the final repo.
Responsible person: Manuel
Whenever there are updates in the source folder of the common repo, a number of actions are required for the technical signoff of the source files. In this context, "updates" means files being pushed to that repo, under /source/batch1, and those files can be new files that are being released to the repo after being authored, or a new version of already released files.
The technical signoff involves the following steps:
- reviewing the new files or the new parts in re-released files
- fixing (or linting) the identified issues (scripts are ready for that)
- copying the files to the correct batch folder
We can skip step 1 in this handover, and be confident that there will be no more issues other than the ones that have been already identified.
Then, to fix the issues that have already been identified in previous reviews of already released files, we have a script that runs a series of string substitutions based on the selected configuration file. All the necessary code and config files are available here: https://github.com/capstanlqc/source-xml-linter
- If there are updates in files belonging to batch 05_QQA_N, the linting script must be run with config config_qqa_zwsp.xlsx.
- If there are updates in files belonging to any "new" batch (i.e. any batch starting with 01 .. 06 and ending with _N), the normal config config.xlsx must be used.
- If there are updates in "trend" files, the config config_trend.xlsx must be used. Trend files belong to batches ending with _T and can be recognized because they normally have a unit ID that follows pattern _P?[RMS]\d{3} where the optional P indicates that it's the PBA version and R/M/S is the initial of the domain (reading, math, science).
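The config selection rules above could be sketched like this. It is not part of the linter repo: the function names and the example file names are hypothetical, and the sketch assumes that the 05_QQA_N rule takes precedence over the general rule for "new" batches, since that batch matches both.

```python
import re

# Unit ID pattern for trend files, as given above
TREND_UNIT_ID = re.compile(r"_P?[RMS]\d{3}")


def pick_config(batch: str) -> str:
    """Return the linter config file name for a batch name."""
    if batch == "05_QQA_N":                 # assumed to take precedence
        return "config_qqa_zwsp.xlsx"
    if re.fullmatch(r"0[1-6].*_N", batch):  # "new" batches: 01..06, ending in _N
        return "config.xlsx"
    if batch.endswith("_T"):                # trend batches
        return "config_trend.xlsx"
    raise ValueError(f"No config rule for batch: {batch}")


def looks_like_trend_file(filename: str) -> bool:
    """Heuristic: the file name contains a trend-style unit ID."""
    return TREND_UNIT_ID.search(filename) is not None


# Batch and file names below are made up for illustration
print(pick_config("05_QQA_N"))
print(pick_config("03_XYZ_N"))
print(looks_like_trend_file("PISA_2025FT_SCI_PS123-Example.xml"))
```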
What I normally do to run the script above is:
- Create a folder and copy there only the files that I want to lint, e.g. tolint
- Run the script using the path to the tolint folder as the input argument
- As output argument, use the path to the folder where I want to write the linted files, e.g. linted
- Review the action of the script just to make sure no unexpected damage happened (the best way to do this check is to open the file in OmegaT, and if it's a new version of an already released file, then it's handy to add the file to a project containing the translations of all segments in the previous version to see what changes and becomes untranslated -- other than that, a diff comparison is useful)
- If everything is okay in the linted files, copy them or move them from the linted folder into the corresponding batch folder (according to the info in files.yaml).
In other words, after activating the virtual environment and installing dependencies, I do:
app=/path/to/local/repo
tolint=/path/to/the/files/tolint
linted=/path/to/the/files/linted
python $app/str_subs.py -i $tolint -o $linted -c $app/config.xlsx
Additionally, there's a separate script for a different kind of issue with named entities and escaped hex entity references:
- If there are updates in "trend" files, any entity issues that may appear must be removed with script decode_entities.sh (which uses entities.json as config)
In the video below, to avoid confusion please skip or ignore the part between 11:05 and 14:37. Sorry about that.
This task also includes generating the target files from the prepare-files step of adapting versions (en-*, fr-*, zh-*).
This must be done when a new batch is released or the files in an already released batch are updated.
Action point for @Eli or @Tanya: mention in our Skype's PISA25 TWG chat group that a new batch is released and tag @Adrien and @Manuel
Some countries which have English locales adapt the English master, e.g. en-PS. en-* projects have a repository mapping in their settings that adds file tm/auto/base/en-ZZ.tmx. The remote version of the file is in the common repo, on path assets/base/en-ZZ.tmx.zip. This file needs to be updated with every new batch released.
So, every time a new batch is released to countries:
- Add the new batch (mapping) to the pisa_2025ft_translation_en-ZZ_prepare-files project (note: from the capstanlqc-pisa github organization).
- Pack the project as pisa_2025ft_translation_en-ZZ_prepare-files_OMT to have an offline version. Unpack the offline version of the project and close it.
- Run the following command:
java -jar /path/to/omegat/build/install/OmegaT/OmegaT.jar /path/to/omegat/project --config-dir=/path/to/config/dir --mode=console-createpseudotranslatetmx --pseudotranslatetmx=/path/to/omegat/project/tm/auto/en-ZZ.tmx --pseudotranslatetype=equal
- Re-open the project to confirm that all segments are pre-translated with the source text. You can search for regex ^(.+)\ue000(?!\1).+$ in both source and target to find any segments where the translation is different from the source.
- Zip en-ZZ.tmx and commit the new base TM en-ZZ.tmx.zip to pisa_2025ft_translation_common/assets/base/en-ZZ.tmx.zip, overwriting the file there (use commit message "Update English master base TM").
Finally, run code/commit_target_files.sh for en-* locales and for the new/updated batch.
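As a complement to the regex search in OmegaT, the "translation equals source" check on the generated en-ZZ.tmx could also be scripted. The sketch below is not an existing project script: the function name is hypothetical, and it assumes that each translation unit holds the source segment first and the target segment second.

```python
import xml.etree.ElementTree as ET


def find_unequal_units(tmx_text: str) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose texts differ.

    Assumption: in every <tu>, the first <seg> is the source and the
    second <seg> is the target.
    """
    root = ET.fromstring(tmx_text)
    mismatches = []
    for tu in root.iter("tu"):
        # itertext() also collects text nested inside inline tags
        segs = ["".join(seg.itertext()) for seg in tu.iter("seg")]
        if len(segs) >= 2 and segs[0] != segs[1]:
            mismatches.append((segs[0], segs[1]))
    return mismatches


# Tiny made-up TMX for illustration
SAMPLE = (
    '<tmx version="1.1"><header/><body>'
    '<tu><tuv lang="en"><seg>Hello</seg></tuv>'
    '<tuv lang="en-ZZ"><seg>Hello</seg></tuv></tu>'
    '<tu><tuv lang="en"><seg>Bye</seg></tuv>'
    '<tuv lang="en-ZZ"><seg>Goodbye</seg></tuv></tu>'
    '</body></tmx>'
)
print(find_unequal_units(SAMPLE))
```

An empty list would mean that every target matches its source, which is the expected state of the base TM.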
- This project is hosted in the capstanlqc-pisa github organization but is pulling all files (except QQ units) from the pisa_2025ft_translation_common repo on AWS.
- Questionnaire batches are added to pisa_2025ft_translation_en-ZZ_prepare-files directly in the source folder rather than through a mapping because the files in the pisa_2025ft_translation_common repo on AWS have filtering properties that would affect what is exposed for en-ZZ. If the en-ZZ base TM must be updated, the files in the common repo must be copied to the pisa_2025ft_translation_en-ZZ_prepare-files > source folder and modified to remove all filtering properties (e.g. remove everything matched by regex its:(localeFilterList|localeFilterType)="[^"]+").
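Removing the filtering properties can be scripted with the regex given above. A minimal sketch, with a hypothetical function name; the only addition to the documented regex is an optional leading whitespace match, so that no stray spaces are left behind in the tag.

```python
import re

# Regex from this document, preceded by optional whitespace (my addition)
ITS_FILTER_ATTR = re.compile(r'\s*its:(localeFilterList|localeFilterType)="[^"]+"')


def strip_locale_filters(xml_text: str) -> str:
    """Remove ITS locale filter attributes from the text of a source file."""
    return ITS_FILTER_ATTR.sub("", xml_text)


# Made-up snippet for illustration
print(strip_locale_filters(
    '<p its:localeFilterList="en-PS" its:localeFilterType="include">Text</p>'
))
```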
- Add new batch to project: https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_fr-ZZ_signoff.git
- Make sure that all segments are pre-translated and press Ctrl+D to generate the master TM
- Remove any changeid, changedate, creationid and/or creationdate properties from entries in the new master TM. Tip: replace (<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+ with $1 (first captured group)
- Rename pisa_2025ft_translation_fr-ZZ_signoff-omegat.tmx as fr-ZZ.tmx and zip it. You need both fr-ZZ.tmx and fr-ZZ.tmx.zip.
- Replace both fr-ZZ.tmx and fr-ZZ.tmx.zip in pisa_2025ft_translation_common/assets/base/ with the files generated in the previous step.
Finally, run code/commit_target_files.sh for fr-* locales.
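The search-and-replace tip for removing the change and creation properties can also be applied from a script. A minimal sketch using the same regex; the function name is hypothetical, and $1 in the tip corresponds to \1 in Python.

```python
import re

# Same pattern as in the tip above: keep the opening of the target <tuv>
# and drop every attribute after the lang attribute
TUV_ATTRS = re.compile(r'(<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+')


def strip_tuv_properties(tmx_text: str) -> str:
    """Drop changeid, changedate, creationid, creationdate from target tuv tags."""
    return TUV_ATTRS.sub(r"\1", tmx_text)


# Made-up entry for illustration
print(strip_tuv_properties(
    '<tuv lang="fr-ZZ" changeid="someone" changedate="20240319T000000Z">'
))
```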
- Add new batch to project: https://github.com/capstanlqc-pisa/pisa_2025ft_translation_zh-Hant-ZZ_signoff.git
- Make sure that all segments are pre-translated and press Ctrl+D to generate the master TM
- Remove any changeid, changedate, creationid and/or creationdate properties from entries in the new master TM. Tip: replace (<tuv lang="(?:en|fr|zh-Hant)-ZZ")[^>]+ with $1 (first captured group)
- Rename pisa_2025ft_translation_zh-Hant-ZZ_signoff-omegat.tmx as zh-Hant-ZZ.tmx and zip it.
- Replace pisa_2025ft_translation_common/assets/base/zh-Hant-ZZ.tmx.zip with the version generated in the previous step.
Finally, run code/commit_target_files.sh for zh-* locales.
Source files might be updated for whatever reason during the project -- that means that a new file is pushed from TAO to the common repo inside source/batch1, overwriting a previous version if it exists. For example, this may happen after errata are fixed.
Any new files need to be linted and signed off again as described in task 1 above, just as it was done with the original version. Then the base versions need to be updated too as explained in task 2.
- Signoff / lint source files [task #1]
- Update en-ZZ base version [task #2]
- Add batch again to fr-ZZ final-proofreading and zh-Hant-ZZ proofreading projects
- Update fr-ZZ and zh-Hant-ZZ base versions [task #2]
Step 3 above is done by adding the repository mapping in the project settings file (e.g. omegat.project) of those two projects. It is necessary if the batch containing the updated files was already proofread some time ago and therefore removed from those two projects. Only if the batch is added will the proofreader have access to the files and be able to edit the translations.
This is an application to create and/or set up OmegaT team projects according to the information indicated in the translation workflow monitoring sheet.
The readme file in the repo explains how to use it. The docs folder contains links that explain how to set up package git-remote-codecommit.
Responsible: @Kos
If this task is requested, let's discuss it.
https://github.com/capstanlqc/its-filter-validation/
Responsible: @Kos
Nothing else to do unless there are issues or any unforeseen additional request.
https://rentry.org/ui_translation_repos
Responsible: @Gergoe / @Kos?
Our linguists are working on our team projects hosted on Github. When the trend transfer is complete for one locale, we must transfer those translations to the AWS repos.
General info:
- The URLs of our repos on Github are listed here: https://rentry.org/github-trend-transfer-repos (created by Kos with this script).
- The URLs of the repos on AWS have the following name template: https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_{LOCALE}_trend-prepare-files.git
- The linguists follow these instructions: https://capps.capstan.be/doc/pisa2025_trend-transfer_guide.php
ACER has created the repos and we must create the OmegaT projects in them. That has been done only for 5 locales: ar-IL, de-AT, ja-JP, ru-UZ, th-TH.
- pisa_2025ft_translation_ar-IL_trend-prepare-files
- pisa_2025ft_translation_de-AT_trend-prepare-files
- pisa_2025ft_translation_ja-JP_trend-prepare-files
- pisa_2025ft_translation_ru-UZ_trend-prepare-files
- pisa_2025ft_translation_th-TH_trend-prepare-files
When the trend transfer is completed for one version, the steps now are (with examples on the command line for ja-JP):
- Clone the target AWS repo:
git clone https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_ja-JP_trend-prepare-files.git
- If the team project is not created in the AWS repo, create it with OmegaT CLI and set it up correctly (the 5 projects above can be used as templates/models):
/opt/omegat/OmegaT_5.7.2/jre/bin/java -jar /opt/omegat/OmegaT_5.7.2/OmegaT.jar team init en ja-JP
Then, make the appropriate changes in the omegat.project file (e.g. adding repository mappings, etc. just like in the five projects that were created first, mentioned above).
- Download the github repo:
gh repo clone capstanlqc-pisa/pisa_2025ft_transfer_ja-JP_trend-prepp
- Copy the working TM (omegat/project_save.tmx) from the github repo to the AWS project as tm/auto/PISA_{LOCALE}_MS2022_trend25.tmx and commit changes on the AWS repo:
cp pisa_2025ft_transfer_ja-JP_trend-prepp/omegat/project_save.tmx pisa_2025ft_translation_ja-JP_trend-prepare-files/tm/auto/PISA_ja-JP_MS2022_trend25.tmx
cd pisa_2025ft_translation_ja-JP_trend-prepare-files
git add . && git commit -m "Added TM with transferred trend version" && git push
- Download the team project on AWS and commit target files:
  - Download https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_ja-JP_trend-prepare-files.git in OmegaT
  - Commit target files
- Copy target files to the final repo (see below how to do this)
Steps:
- Go to your local copy of the project repo
- Sync with remote version (git pull)
- Make a copy of the target folder, e.g. final
- Change directory to that folder
- Flatten the structure of files so that all files are now at the same level
- Remove directories (which are now empty)
- Remove the locale extension from all files
- Move all the renamed files to the location /translations/{LOCALE}/batch1/ in the final repo
- Go to the final repo and push the new files
To move target files to the final repo, you can use this script: https://gist.github.com/msoutopico/4bbe0ac90b71f709a4f5d8fc3bdf91c1
For other language versions, you can adapt the script above accordingly, or just read through it to confirm what steps are necessary.
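For reference, the flattening and renaming steps can be sketched as below. This is not the gist linked above, just an illustration: the function name is hypothetical, and the exact form of the locale extension is an assumption passed in as a parameter (here, a marker at the end of the file name stem, such as _ja-JP). Check the real file names before relying on it.

```python
import shutil
from pathlib import Path


def flatten_target(target_dir: Path, final_dir: Path, locale_marker: str) -> list[Path]:
    """Copy all files under target_dir into final_dir at a single level,
    removing locale_marker from the end of each file name stem.

    Assumption: locale_marker (e.g. "_ja-JP") is how the locale shows up
    in target file names; adjust it to the real naming.
    """
    final_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(target_dir.rglob("*")):
        if not path.is_file():
            continue
        stem = path.stem
        if locale_marker and stem.endswith(locale_marker):
            stem = stem[: -len(locale_marker)]
        dest = final_dir / (stem + path.suffix)
        if dest.exists():
            # Two files from different subfolders would collide after flattening
            raise FileExistsError(f"Name collision after flattening: {dest.name}")
        shutil.copy2(path, dest)
        copied.append(dest)
    return copied
```

The flat folder can then be moved to /translations/{LOCALE}/batch1/ in the final repo and pushed as described in the steps above.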
We might get tickets from users who get the upgrade wrong. Get familiar with the upgrading instructions: https://capstanlqc.github.io/omegat-guides/verification/install-and-setup/
New: no manual customization is needed for users who install OmegaT 5.7.2 from scratch. The configuration script is included in the installer and runs automatically when OmegaT is run if the scripts folder hasn't been customized yet and set to the user config folder.
Responsible PM: Tanya Sonolenko
Responsible TT: @Kos (when he has creds, otherwise Gergoe to push the two TMs)
Two translators are producing zh-Hant-ZZ translations of a certain batch in offline projects. When they are done, they will hand back project packages.
Steps:
- Unpack those two projects
- In each of them, press Ctrl+D to produce the master TM
- Rename those two TMs as {BATCH}_zh-Hant-ZZ_T1.tmx and {BATCH}_zh-Hant-ZZ_T2.tmx respectively.
- Commit those two TMs to folder tm/rec/ in the reconciliation project: https://git-codecommit.eu-central-1.amazonaws.com/v1/repos/pisa_2025ft_translation_zh-Hant-ZZ_reconciliation.git
Responsible person: Manuel
This only needs to be done after there are source updates and they go through technical signoff. The script below can be run to sort files in their correct batch folders:
Script: https://gist.github.com/msoutopico/72cee9a221860fedb9f876372ffc8e80
Improvements todo:
- Parameters source_dir, root and config are currently hardcoded. It would be nice to add source_dir as a CLI argument (the other two parameters are based on that one) so that the script doesn't need to be edited when it is run by different people.
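One possible shape for that improvement, using argparse. This is a sketch and not the current script: the function names are hypothetical, and the way root and config are derived from source_dir (parent folder and files.yaml inside the source folder) is an assumption that must be checked against the gist.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Read source_dir from the command line instead of hardcoding it."""
    parser = argparse.ArgumentParser(
        description="Sort source files into their batch folders."
    )
    parser.add_argument(
        "source_dir", type=Path,
        help="path to the source folder of the local clone of the common repo",
    )
    return parser.parse_args(argv)


def derive_paths(source_dir: Path) -> dict:
    """Derive the other two parameters from source_dir.

    ASSUMPTION: root is the parent of source_dir and config is
    files.yaml inside source_dir.
    """
    return {
        "source_dir": source_dir,
        "root": source_dir.parent,
        "config": source_dir / "files.yaml",
    }


if __name__ == "__main__":
    print(derive_paths(parse_args().source_dir))
```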
Responsible person: Gergoe
This is an action that is expected from ACER, but they don't have a working implementation yet. In the meantime, we can do this manually. The sequence of steps must be the following:
- A batch transition happens (a batch is added to or removed from a certain step)
- We (TTT) arrange TMs at that step according to the new batches at that step
- The user can then download the project with the new TM arrangement
To perform step 2 above, follow these steps:
- After the batch transition, clone or sync the repository where the omegat project for that step is hosted
- Run the script arrange_tmx_files_with_extension on the repository, and push changes.
The script is available in two flavours: Python and Node.js (both have been provided to ACER but James' team will base their feature on the Node.js version).
Run as:
python arrange_tmx_files_with_extension.py /path/to/local/clone
or
node arrange_tmx_files_with_extension.js /path/to/local/clone
Remember to install dependencies first.
@kosivantsov @amathot @gergoe You can leave comments here ;)