@carestad
Last active January 13, 2023 16:24
TTML to SRT conversion. Written in PHP.

ttml2srt.php

This is a simple script for converting TTML subtitle files to SRT ones. Tested with TTML files on tv.nrk.no.

It assumes the data is structured like this:

<tt>
 <body>
  <div>
   <p>(...)</p>
   <p>(...)</p>
  </div>
 </body>
</tt>

Paragraphs might contain <span> elements that are used only for styling (I have only seen italic so far). As far as I know, SubRip doesn't support text alignment (center, left, right), so alignment is not handled for now.

The script also assumes the TTML uses the begin and dur attributes rather than the end attribute, so the end time of each cue is calculated in a function by adding dur to begin. Adding support for the end attribute should be easy though.
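
A minimal sketch of how the end attribute could be supported, assuming time values are clock times of the form HH:MM:SS.mmm. The helper names ttml_to_ms, ms_to_srt and srt_timecode are hypothetical and not part of the script below. Unlike the script's calc_timecode, this sketch treats the fraction as a decimal fraction of a second, so ".5" is read as 500 ms.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Convert a TTML clock time (HH:MM:SS.mmm or HH:MM:SS,mmm) to milliseconds.
function ttml_to_ms(string $time): int {
    $parts = preg_split('/[:.,]+/', trim($time));
    $h = (int) ($parts[0] ?? 0);
    $m = (int) ($parts[1] ?? 0);
    $s = (int) ($parts[2] ?? 0);
    // Pad or cut the fraction to exactly three digits so ".5" means 500 ms.
    $ms = (int) substr(str_pad($parts[3] ?? '0', 3, '0'), 0, 3);
    return (($h * 60 + $m) * 60 + $s) * 1000 + $ms;
}

// Format a millisecond count as an SRT time (HH:MM:SS,mmm).
function ms_to_srt(int $ms): string {
    $h = intdiv($ms, 3600000);
    $m = intdiv($ms % 3600000, 60000);
    $s = intdiv($ms % 60000, 1000);
    return sprintf('%02d:%02d:%02d,%03d', $h, $m, $s, $ms % 1000);
}

// Build the SRT time line from begin plus either end or dur.
// If end is given it wins; otherwise dur is added to begin.
function srt_timecode(string $begin, ?string $dur = null, ?string $end = null): string {
    $start = ttml_to_ms($begin);
    $stop = (null !== $end)
        ? ttml_to_ms($end)
        : $start + ttml_to_ms($dur ?? '00:00:00.000');
    return ms_to_srt($start) . ' --> ' . ms_to_srt($stop);
}

echo srt_timecode('00:07:40.133', null, '00:07:45.800'), "\n"; // 00:07:40,133 --> 00:07:45,800
echo srt_timecode('00:00:58.700', '00:00:03.500'), "\n";       // 00:00:58,700 --> 00:01:02,200
```

In the main loop this could be called with the begin attribute plus whichever of dur and end is present on the <p> element, each cast to string first.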

Usage

ttml2srt.php <infile>

Outputs SRT data to standard output. Redirect it to a file if you want to save it.

Inspired by: https://gist.github.com/jareware/7af17f2034931608e842 but gave up when libxmljs didn't install and everything went to shit with Node.

Integrate with youtube-dl

mkdir -p ~/.config/youtube-dl
nano ~/.config/youtube-dl/config

Add the following to the file:

--exec 'title=`echo {} | sed "s/\.[a-z0-9]\+$//"`; test -f "${title}"*.ttml && /path/to/ttml2srt.php "${title}"*.ttml > "${title}".srt;'

Replace /path/to/ttml2srt.php with the actual path to the file, and make sure it's executable!

Every time you run youtube-dl now, it will check whether a .ttml file with the same title exists in the folder and run the ttml2srt.php script on it. If you download subtitles with --all-subs and more than one is present, the above will likely only convert the last subtitle.

Requirements

sudo apt-get install php-{cli,xml}

Made by me.

#!/usr/bin/php
<?php
/**
 * TTML (XML) to SRT subtitle converter.
 * Only tested with subs from Akamai player really, but hopefully works on other stuff as well.
 * Author: Alexander Karlstad
 */
// TTML file to parse
if (!isset($argv[1])) {
    fwrite(STDERR, "Usage: ttml2srt.php <infile>\n");
    exit(1);
}
$in = $argv[1];
$cont = file_get_contents($in);
// Replace <br/>'s
$cont = str_replace(['<br/>', '<br />'], "\n", $cont);
$xml = simplexml_load_string($cont);
// Bail out on unreadable or invalid XML instead of failing further down
if (false === $xml) {
    fwrite(STDERR, "Could not parse {$in} as XML.\n");
    exit(1);
}
$subs = $xml->body->div->p;
$num = 1;
foreach ($subs as $sub) {
    $attrs = $sub->attributes();
    $begin = (string) $attrs['begin'];
    $dur = (string) $attrs['dur'];
    // Do we have spans within? Typically for adding bold/italic text.
    if ($sub->count() > 0) {
        $text = $sub->asXML();
        foreach ($sub->children() as $child) {
            $child_text = trim((string) $child);
            $child_attrs = $child->attributes();
            $child_style = isset($child_attrs['style']) ? $child_attrs['style'] : '';
            $child_xml = $child->asXML();
            if ('italic' == $child_style) {
                $t = "<i>{$child_text}</i>\n";
            }
            else if ('bold' == $child_style) {
                $t = "<b>{$child_text}</b>\n";
            }
            else {
                $t = "{$child_text}\n";
            }
            // Swap the child's raw XML for its SRT-friendly version
            $text = str_replace($child_xml, $t, $text);
        }
        // Only allow <b> and <i> for SRT compatibility. Don't mind <u>.
        $text = strip_tags($text, '<b><i>');
    }
    else {
        $text = (string) $sub;
    }
    $text = trim($text);
    // remove weird spacings that sometimes come after a newline
    // due to xml formatting and >1 newlines.
    $text = preg_replace([',\n+[ ]+,', ',\n+,'], "\n", $text);
    $timecode = calc_timecode($begin, $dur);
    // Output in Subrip format
    echo "{$num}\n";
    echo "{$timecode}\n";
    echo "{$text}\n\n";
    $num++;
}
// Subrip time code handling. God damn.
// Written for readability.
// Input: $orig - string with original start time from TTML (HH:MM:SS.XXX)
// Input: $add - string with duration from TTML, which will be added onto $orig (HH:MM:SS.XXX)
function calc_timecode($orig, $add) {
    // Split hours, minutes, seconds and ms
    $orig = preg_split('/[:.,]+/', $orig);
    $add = preg_split('/[:.,]+/', $add);
    // A variable for each unit, for readability
    $o_h = $orig[0];
    $o_m = $orig[1];
    $o_s = $orig[2];
    $o_ms = $orig[3];
    // A variable for each unit, for readability
    $a_h = $add[0];
    $a_m = $add[1];
    $a_s = $add[2];
    $a_ms = $add[3];
    // Combine them
    $r_h = $o_h + $a_h;
    $r_m = $o_m + $a_m;
    $r_s = $o_s + $a_s;
    $r_ms = $o_ms + $a_ms;
    // MS needs to be lt 1000, carry into $r_s if 1000 or more.
    if (1000 <= $r_ms) {
        $r_s += intdiv($r_ms, 1000);
        $r_ms = $r_ms % 1000;
    }
    // S needs to be lt 60, carry into $r_m if 60 or more.
    if (60 <= $r_s) {
        $r_m += intdiv($r_s, 60);
        $r_s = $r_s % 60;
    }
    // M needs to be lt 60, carry into $r_h if 60 or more.
    if (60 <= $r_m) {
        $r_h += intdiv($r_m, 60);
        $r_m = $r_m % 60;
    }
    // Start time is passed through as-is; end time is zero-padded to HH:MM:SS,mmm
    $o = "{$o_h}:{$o_m}:{$o_s},{$o_ms}";
    $r = sprintf('%02d:%02d:%02d,%03d', $r_h, $r_m, $r_s, $r_ms);
    return "{$o} --> {$r}";
}
@Diegus83

Diegus83 commented Jul 24, 2019

Thanks for the script, it saved me a lot of work, since I have some TTML files that use the 'begin' and 'end' attributes.

Found an issue when converting a subtitle that contained an & character, like this:
<p region="pop317" begin="00:07:40.133" end="00:07:45.800">BACK AT MASELLI & SONS.</span></p>

I solved it by escaping the & like this

$cont = str_replace('&', '&amp;', $cont);
$xml = simplexml_load_string($cont);

It produces the expected output
318
00:07:40,133 --> 00:07:45,800
BACK AT MASELLI & SONS.

@carestad
Author

Thanks for the script, it saved me a lot of work, since I have some TTML files that use the 'begin' and 'end' attributes.

Glad to hear it!

Found an issue when converting a subtitle that contained an & character, like this:
<p region="pop317" begin="00:07:40.133" end="00:07:45.800">BACK AT MASELLI & SONS.</span></p>

Interesting. So the & was not encoded as &amp; in the TTML file in the first place? Usually XML files want every & to be written as &amp;, if I'm not entirely wrong.

Do you have an example TTML file you could link to or attach here where this is the case?

@Diegus83

It wasn't encoded properly, which makes it an invalid XML file. It was downloaded by youtube-dl so I can't tell whether it was like that on the server side or it is a mistake by youtube-dl when it writes out the file.

It won't let me attach a ttml or zip file so I put it here https://pastebin.com/pVw1pb7p

Offending character is on line 2018 and interestingly, it breaks pastebin syntax highlighting.
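
One caveat with the plain str_replace fix discussed in this thread: it also rewrites ampersands that are already part of an entity, so a correctly encoded &amp; would become &amp;amp;. A minimal sketch of a narrower fix follows, assuming the only problem is bare & characters in the text. The function name escape_bare_ampersands is hypothetical, and the pattern does not special-case CDATA sections or comments.

```php
<?php
// Hypothetical helper, not part of the original script.
// Escapes "&" only when it does not already start an entity such as
// &amp;, &lt;, &#38; or &#x26;.
function escape_bare_ampersands(string $xml): string {
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/';
    // preg_replace returns null on a regex error; fall back to the input then.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$line = 'BACK AT MASELLI & SONS. Q&amp;A';
echo escape_bare_ampersands($line), "\n"; // BACK AT MASELLI &amp; SONS. Q&amp;A
```

It would be applied to $cont just before the simplexml_load_string call, in the same place as the str_replace version.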
