DAISY Pipeline Assignment

This page describes a work assignment based on the DAISY Pipeline framework and scripts.
It has been created by the DAISY Pipeline team as part of the recruitment process for new developers, in order to better understand the candidates’ technical approaches when facing a real DAISY Pipeline work task.
Completing the assignment described on this page is expected to take around 4 hours.
This duration is only an estimate, given for information; the actual time spent on the assignment will largely depend on your existing knowledge of the context, your technical familiarity with XSLT, the time you need for reflection, etc.
Introduction

EPUB 3 is now a widely adopted mainstream format. While it is perfectly possible to make accessible EPUB 3 content, older digital formats specialized for accessibility are still heavily used.
This is the case for the DAISY 2.02 format, which is used by many people around the world, especially by DAISY member organizations and their patrons.
The availability of a script to automatically convert from EPUB 3 to DAISY 2.02 is of high interest to the DAISY community.
The DAISY Pipeline project already contains an epub3-to-daisy202 conversion script.
However, this script is known to be limited and doesn’t work on as many EPUBs as one would hope.
One of these limitations is related to SMIL documents, i.e. the documents used to define text and audio synchronization.
When the epub3-to-daisy202 script is run on an EPUB that has Media Overlays, the SMIL 3.0 content (as used in Media Overlays) is converted to SMIL 1.0 (as used in DAISY 2.02).
But more needs to be done: DAISY 2.02 has some special requirements on the structure of SMIL content, and SMIL content coming from valid EPUB Media Overlays does not necessarily satisfy them.
The epub3-to-daisy202 script therefore needs to "fix" the converted SMIL documents to abide by the rules of DAISY 2.02.
The purpose of this assignment is to implement an XSLT conversion that takes a SMIL 1.0 document as input (along with its associated XHTML document available in the default collection of the XPath context), and produces a SMIL 1.0 document conforming to the rules established in the DAISY 2.02 specification.
Background information

This assignment is based on a real issue in the DAISY Pipeline project: see issue #86 of the pipeline-scripts project.
All EPUB 3 content consumed or produced by DAISY Pipeline scripts is currently expected to conform to the EPUB 3.0.1 specifications.
All DAISY 2.02 content produced by the DAISY Pipeline is expected to conform to the DAISY 2.02 specification and errata.
Getting started

Pre-requisites

You'll need recent versions of the following software:

Git
Maven

Checking out the source files


Clone the daisy/pipeline-scripts GitHub repository. The Git URL is git@github.com:daisy/pipeline-scripts.git.
Check out the super/epub3-to-daisy202 branch.

In the epub3-to-daisy202 directory, you will find:

an XSLT document at src/main/resources/xml/xslt/augment-smil.xsl, which is the skeleton of the XSLT conversion you need to implement.
an XProcSpec document at src/test/xprocspec/augment-smil.xprocspec, which describes test cases that can be used to verify your XSLT implementation.

Running tests


Run mvn clean test in the epub3-to-daisy202 directory. This runs the augment-smil.xprocspec test, which is "focused", meaning that all the other XProcSpec tests are skipped (see the focus attribute). If you remove this attribute, all the tests will run, but it will take longer. Once the augment-smil test passes, you may want to enable the create-ncc.xprocspec test too. Maven also runs the XSpec tests, but you can ignore these (unless you decide to add a new XSpec test, of course).
The XProcSpec test report can be viewed in target/xprocspec-reports/index.html.
If your XML editor does not highlight coding errors, you can look at the test report and the log file produced by the test run. For runtime errors, refer to the test report: it should include an error message and a stack trace at the very bottom. Compilation errors are currently not shown well in the test report: you will only get a message saying "Errors were reported during stylesheet compilation". In this case, refer to the log file at target/test.log: search for "Errors were reported during stylesheet compilation" and check the lines above it.
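
The log search described above can also be scripted. The following is a minimal sketch in Python, assuming the log is a plain-text file; the function name lines_before_marker and the default of 20 context lines are hypothetical choices, not part of the project.

```python
from pathlib import Path

# Message quoted in the test report when the stylesheet fails to compile.
MARKER = "Errors were reported during stylesheet compilation"


def lines_before_marker(log_path, context=20):
    """Return the first marker line and up to `context` lines above it.

    `context` is an arbitrary default (an assumption); increase it if the
    actual compiler messages sit further up in the log.
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if MARKER in line:
            return lines[max(0, index - context):index + 1]
    # The marker does not occur: no compilation error was logged.
    return []
```

For example, after a test run, print("\n".join(lines_before_marker("target/test.log"))) would show the compiler messages that precede the marker, if any.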

Don't worry if you don't get the tests to work. The most important thing for us is that your code makes some sense.
Tasks

You need to implement the augment-smil.xsl conversion so that:

new SMIL par elements are added for every heading (h1 - h6) in the HTML
the par elements are kept in reading order, i.e. in the document order of the corresponding HTML elements
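
To make the two requirements above concrete, here is a minimal sketch of the intended transformation, written in Python rather than XSLT and not meant as the expected solution. It assumes that every heading carries an id attribute, that each par has a text child whose src ends in "#" followed by that id, that the par elements sit directly under body/seq, and that a new par holds only a text reference with no audio. The name augment_smil, the html_name parameter, and the generated id values are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
HEADINGS = {f"{{{XHTML_NS}}}h{level}" for level in range(1, 7)}


def augment_smil(smil_root, html_root, html_name):
    """Add a par for each unreferenced heading, then sort pars in HTML order."""
    order = {}        # id -> position of the element in HTML document order
    heading_ids = []  # ids of h1-h6 elements (headings without id are skipped)
    for position, element in enumerate(html_root.iter()):
        element_id = element.get("id")
        if element_id is None:
            continue
        order[element_id] = position
        if element.tag in HEADINGS:
            heading_ids.append(element_id)

    # Assumption: all par elements are direct children of body/seq.
    seq = smil_root.find("./body/seq")
    pars = list(seq.findall("par"))

    def target_id(par):
        # "content.html#h1" -> "h1"; assumes each par has a text child.
        return par.find("text").get("src").partition("#")[2]

    referenced = {target_id(par) for par in pars}
    for heading_id in heading_ids:
        if heading_id not in referenced:
            # Hypothetical shape of a new par: a text reference only.
            par = ET.Element("par", {"endsync": "last", "id": f"par_{heading_id}"})
            ET.SubElement(par, "text", {"src": f"{html_name}#{heading_id}"})
            pars.append(par)

    # Stable sort: pars pointing at unknown ids keep their order at the end.
    pars.sort(key=lambda par: order.get(target_id(par), len(order)))
    for par in seq.findall("par"):
        seq.remove(par)
    seq.extend(pars)
    return smil_root
```

Sorting every par by the position of its target is one simple way to satisfy the reading-order requirement; if the existing par elements are already in order, it amounts to inserting each new par at the right place.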

Extras: depending on your fluency with XSLT, the above could be finished in less than 4 hours. If you have time remaining, consider looking at these extra tasks. Note that they are not yet covered by tests.

existing SMIL par elements that reference segments within the same heading are merged in one unique par element
new par elements are generated also for page numbers

Hints

augment-smil.xsl already imports the utility library http://www.daisy.org/pipeline/modules/file-utils/library.xsl. This was done for a reason: it defines some XSLT functions that might be useful. There is no API documentation for this library at the moment, but you can view the source code, and these XSpec tests demonstrate the functions' behavior.
Getting help

Questions are welcome by email at bertfrees@gmail.com (Bert Frees), with a cc to rdeltour@gmail.com (Romain Deltour). If you think you are stuck, don't waste your time: you can ask us for more hints. If needed, we will open ad hoc Slack channels for more direct communication.
References


EPUB 3.0.1 specifications
SMIL 1.0 specification
DAISY 2.02 specification
XProc 1.0 specification
XProcSpec project