tomcrane/av-api.md

## av-api.md

      
    Raw
  

              av-api.md
            
          
    This gist has been superseded by this Google Doc:

http://bit.ly/av-issues

What are Audio and Video Content APIs?

This document starts with a quick recap of IIIF Presentation and Image APIs, to help probe what is different for time-based media in the context of use cases and existing community practice for A/V today. It assumes that there is only one Presentation API, otherwise we can't have mixed-media manifests (a recurring use case and aspiration). It is based on the discussions in The Hague, in the AV Github issues and use cases, and on AV working group calls.
Introduction/recap - access to content via the Presentation API today

The Presentation API is how a viewer ensures a human sees the right content (image pixels, video bitstreams, text transcriptions and more) in the right place at the right time, to present the object described by the manifest. This is done by associating that content with one or more canvases in a manifest, via annotation. In the current Presentation API, a canvas is a 2D rectangular space with an aspect ratio. The height and width properties of a canvas define the aspect ratio and provide a simple coordinate space. This coordinate space allows the creator of the manifest to associate whole or parts of content with whole or parts of canvases, and for anyone else to make their own annotations.
An empty canvas...
{
  "@id": "http://example.org/iiif/paintings/breugel/babel/canvas/c0",
  "@type": "sc:Canvas",
  "label": "The Tower of Babel",
  // our coordinate space for annotations:
  "width": 10000, 
  "height": 7317    
}
...provides a surface for us and others to annotate:

We can annotate an image onto the canvas, filling the entire space by default because we have specified the whole canvas as a target (on) and not a particular region:
{
  "@id": "http://example.org/iiif/paintings/breugel/babel/canvas/c0",
  "@type": "sc:Canvas",
  "label": "The Tower of Babel",
  "width": 10000,
  "height": 7317,
  "images": [
      {
        "@id": "http://example.org/iiif/paintings/breugel/babel/annotations/a0",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            // A regular JPEG is the body of the annotation:
            "@id": "https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Pieter_Bruegel_the_Elder_-_The_Tower_of_Babel_%28Vienna%29_-_Google_Art_Project_-_edited.jpg/1280px-Pieter_Bruegel_the_Elder_-_The_Tower_of_Babel_%28Vienna%29_-_Google_Art_Project_-_edited.jpg",
            "@type": "dctypes:Image",
            "format": "image/jpeg",
            "width": 1280,
            "height": 937
        },
        "on": "http://example.org/iiif/paintings/breugel/babel/canvas/c0"
      }
   ]
}
In the above example, a JPEG at Wikimedia is painted onto the canvas.
We could even offer a choice of different size images for a client web application that maybe has a responsive layout, or allows the user to choose a zoom level:
{
  "@id": "http://example.org/iiif/paintings/breugel/babel/canvas/c0",
  "@type": "sc:Canvas",
  "label": "The Tower of Babel",
  "width": 10000,
  "height": 7317,
  "images": [
      {
        "@id": "http://example.org/iiif/paintings/breugel/babel/annotations/a0",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
          // this is completely valid but NOT a common pattern... people use Image API
          "@type": "oa:Choice",
          "default": "rdf:nil", // don't want to suggest a default size
          "item": [
            {
              "@id": "https://upload.wikimedia.org/...babel-30000...jpg",
              "@type": "dctypes:Image",
              "format": "image/jpeg",
              "width": 30000,
              "height": 21952,
              "label": "huge image"
            },
            {
              "@id": "https://upload.wikimedia.org/...babel-1280...jpg",
              "@type": "dctypes:Image",
              "format": "image/jpeg",
              "width": 1280,
              "height": 937,
              "label": "regular image"
            },
            {
              "@id": "https://upload.wikimedia.org/...babel-640...jpg",
              "@type": "dctypes:Image",
              "format": "image/jpeg",
              "width": 640,
              "height": 468,
              "label": "small image"
            }
          ]
        },
        "on": "http://example.org/iiif/paintings/breugel/babel/canvas/c0"
      }
   ]
}
We can also annotate a video onto a canvas, because a video can be the body of an annotation as easily as a JPEG:
{
  "@id": "http://example.org/iiif/films/john-ford/stagecoach/canvas/c0",
  "@type": "sc:Canvas",
  "label": "Stagecoach",
  // provide a good coordinate space for later annotation
  "width": 4000,
  "height": 3000,
  "otherContent": [
      // ...technically this should NOT be inline in v2.1
      "@type": "AnnotationList",
      "resources": [
        {
            "@id": "http://example.org/iiif/films/john-ford/stagecoach/annotations/a0",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {
                "@id": "http://example.org/iiif/films/john-ford/stagecoach/content/stagecoach.webm",
                "@type": "dctypes:Video",  //__| The body of this 
                "format": "video/webm",    //  | anotation is a video 
                "width": 720,
                "height": 540
            },
            "on": "http://example.org/iiif/films/john-ford/stagecoach/canvas/c0"
        }
      ]
   ]
}
However, viewers might not be expecting this if they are just looking for images. So far, the image example and the video example are equivalent, save for the "syntactic sugar" and special treatment of the Presentation API's images property on the canvas. The current IIIF specs offer no further specific support for video, but they do for images - the IIIF Image API.
IIIF Image API

Although the image examples above give us interoperable representations of digital objects, very few people publishing manifests at scale do it with just a static image resource annotating each canvas. Nearly everyone attaches a service to each image resource - a IIIF Image Service that the viewer uses for Deep Zoom and access to derivatives:

The IIIF Image API specifies a web service that returns an image in response to a standard HTTP or HTTPS request. The URI can specify the region, size, rotation, quality characteristics and format of the requested image. A URI can also be constructed to request basic technical information about the image to support client applications. This API was conceived of to facilitate systematic reuse of image resources in digital image repositories maintained by cultural heritage organizations. It could be adopted by any image repository or service, and can be used to retrieve static images in response to a properly constructed URI.

Image servers have been around for a long time, and hundreds of different ways have been invented in web applications large and small to supply parameters to a web service to return a particular size image, from Flickr on down. Protocols such as Internet Imaging Protocol (IIP) or Djatoka's API define query string parameters, and tile protocols such as Seadragon DZI or Zoomify have allowed for deep zoom backed by pregenerated image pyramids or dynamically generated tiles.
The IIIF Image API took on board all this experience and gives us an API for interoperability, so we can make requests for images from hundreds of different places and expect them to work. An image service can be backed by static files or a dynamic image server, because the API depends on inserting values into a URL template rather than appending query string parameters. It supports a great range of service capabilities, and the client can determine supported features from a service description (the info.json). It works well for both deep zoom and more general derivative generation.

This specification concerns image requests by a client, but not management of the images by the server. It covers how to respond to the requests given in a particular URI syntax, but does not cover methods of implementation such as rotation algorithms, transcoding, color management, compression, or how to respond to URIs that do not conform to the specified syntax. This allows flexibility for implementation in domains with particular constraints or specific community practices, while supporting interoperability in the general case.

As very few people have so far implemented the Presentation API without attaching image services to the images annotating their canvases, viewer applications usually expect or even require that image services are available - using the manifest for structure and display metadata, but stepping over the canvas abstraction and then the image annotation, and reaching directly for the image service:
tileSource = myCanvas.images[0].resource.service["@id"] + "/info.json";
{
    "@id": "http://wellcomelibrary.org/iiif/b11765446/canvas/c0",
    "@type": "sc:Canvas",
    "label": "Still life of leaves and flowers, Hong Kong. Photograph by John Thomson, 1868/1871.",
    "height": 9070,
    "width": 10777,
    "images": [
      {
        "@id": "http://wellcomelibrary.org/iiif/b11765446/imageanno/c9b520cc-5b7f-4e41-a975-5ecea31685df",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            // viewers usually ignore the "static" resource image itself
            "@id": "https://dlcs.io/iiif-img/2/1/c9b520cc-5b7f-4e41-a975-5ecea31685df/full/!1024,1024/0/default.jpg",
            "@type": "dctypes:Image",
            "format": "image/jpeg",
            // ********************************************************
            // **** This is the bit rich clients are interested in ****
            "service": { 
                "@context": "http://iiif.io/api/image/2/context.json",
                "@id": "https://dlcs.io/iiif-img/2/1/c9b520cc-5b7f-4e41-a975-5ecea31685df",
                "profile": "http://iiif.io/api/image/2/level2.json"
            }
            // ********************************************************
        },
        "on": "http://wellcomelibrary.org/iiif/b11765446/canvas/c0"
      }
    ]
}
The service description that the client retrieves for info.json then acts as a tileSource for the client's deep zoom component. It specifies what tile requests can be made:
{
  "@context" : "http://iiif.io/api/image/2/context.json",
  "@id" : "https://dlcs.io/iiif-img/2/1/c9b520cc-5b7f-4e41-a975-5ecea31685df",
  "protocol" : "http://iiif.io/api/image",
  "width" : 10777,
  "height" : 9070,
  "tiles" : [
     { "width" : 256, "height" : 256, "scaleFactors" : [ 1,2,4,8,16,32,64 ] }
  ],
  "profile" : [
     "http://iiif.io/api/image/2/level2.json",
     {
       "formats" : [ "jpg", "png" ],
       "qualities" : [ "default","color","gray" ],
       "supports" : ["rotationArbitrary","mirroring"]
     }
  ]
}
A component like OpenSeadragon or leaflet.js (with IIIF support) uses the information in this service description (the info.json) to construct tile requests that it knows the service supports. A client could be a cropping tool, that constructs a new image URI for a particular region at a particular size, or a rich viewer like the Universal Viewer or Mirador, or some other application. The Image API defines a URI pattern into which the client inserts parameter values:
.../{region}/{size}/{rotation}/{quality}.{format}
...and the service returns a derivative, a distinct image. Any path under the service @id that is valid for the service description must return binary content: an image response.
Parameter space

The profile information in the service description defines a parameter space - in this context, the set of all possible valid URLs a client could construct against that service, all of which the service must support and return an image for. In the above example, the parameter space is very large, but not all Image API service descriptions have to result in a large parameter space.
The Image API specification takes care to support the notion of a "level zero" implementation, one in which the parameter space is small enough that all possible images that could be returned by the service can be generated in advance and stored on disk. For level zero implementations that support tiles, the number of possible images for a single endpoint (i.e., single source) can be thousands or tens of thousands, but that is manageable and can lead to very simple, static deployments that still provide a tiling image service.
The parameter space can be reduced dramatically if tile support is dropped, and only sizes remains:
{
  "@context": "http://iiif.io/api/image/2/context.json",
  "@id": "https://example.org/iiif/breugel/babel/info.json",
  "protocol": "http://iiif.io/api/image",
  "width": 4096,
  "height": 2997,
  "profile": ["http://iiif.io/api/image/2/level0.json"],
  "sizes" : [
    {"width" : 320, "height" : 234},
    {"width" : 640, "height" : 468},
    {"width" : 1024, "height": 749},
    {"width" : 2048, "height": 1499},
    {"width" : 4096, "height" : 2997}
  ]
}
Assuming consistently applied canonicalisation rules, there are only five possible image URLs that can be constructed that comply with this profile. If we added "formats": [ "jpg", "png" ] to the profile (adding png support to the required jpg) we'd double this number to 10. If we then added "qualities": [ "default", "color", "gray", "bitonal" ] to the profile, we'd end up with 40 possible URLs in total.
Examples of the 40 possible URLs:

https://example.org/iiif/breugel/babel/full/1024,/0/default.jpg
https://example.org/iiif/breugel/babel/full/640,/0/bitonal.png
https://example.org/iiif/breugel/babel/full/640,/0/default.jpg

If we can arrange to have all these files on disk at these paths then we can fulfill this simplest of level zero service descriptions without needing to compute derivatives on the fly. We can't provide crops of the canvas, arbitrary regions at different sizes, deep zoom tiles, rotations and so on, but we are still providing a fully compliant image service. The client is indifferent as to whether we are generating derivatives on the fly, or just acting as a basic HTTP server. It has a service definition, and a service endpoint that is fulfilling that definition.
If, however, we only had the bitonal versions in png format and not jpg, or didn't have the three largest sizes in gray, we could not provide a compliant service. Our parameter space, a multidimensional jagged array of possible image URLs, would then be only sparsely populated and the Image API has no facility for specifying holes in this array. The rules a client follows for constructing valid URLs would be much more complicated.
In practice, this possible limitation simply doesn't come up for pre-generated level zero tile implementations. For most use cases, default.jpg as the one and only {quality}.{format} is just fine. You just need to generate lots of JPEGs at the different sizes.
Quite a few people deploy level zero static implementations, especially when the total number of source images is small. But fully dynamic implementations that use an image server like Loris or IIPImage are even more common, because Loris and IIPImage can generate derivatives from JPEG2000 (and pyramidal tiff in the case of IIPImage) very quickly, and it's easier to look after one source file instead of thousands of tiles, and you get support for all the many use cases that involve arbitrary crops, zooms and rotations of images. Image Servers are fast. Image formats are relatively simple and don't change much. And there's no other efficient way in a browser of extracting a small region of an image through a standard HTTP request.
If you already have static tilesets, or "legacy image pyramids" - a set of images at different resolutions - you can convert them to a level zero IIIF service by moving them to the expected paths and defining an info.json. If moving them is not practical, you can shim them, or proxy their real locations with an image service that responds to the expected paths, possibly by redirecting to their real locations.
The Image API is a success because it addresses common use cases very simply. It standardises how to ask for binary derivatives for most people's needs. It doesn't do everything that IIP does, by a long stretch, but you are free to extend it with query string parameters or other techniques, while retaining a common interoperable base:

To allow for extensions, this specification does not define the server behavior when it receives requests that do not match either the base URI or one of the described URI syntaxes...

Video and Audio

In the AV work so far, the following points have been raised as possible factors that might lead to different design goals for an AV API:

People already have web-ready video in a wide variety of formats and want to reuse it [50]
There is no universal, simple ubiquitous video format that we can assume all browsers support (equivalent to jpg support as a requirement of the Image API)
Content is often hosted on a CDN or specialist video service [50]
Video "formats" are not necessarily single-binary-file containers like a JPEG returned from an image service. Adaptive streaming over HTTP, such as Apple HLS and MPEG-DASH, starts with a manifest that provides metadata describing available segments/adaptations for the client to make ongoing adaptive decisions about what to request. These video "formats" are themselves service-like. [57]
A browser-based client can access arbitrary segments of a single video file via byte-range requests. You don't need to create a new derivative on the fly to view a time segment of a video or audio resource over HTTP. For AV, HTTP on its own lets you do some of the things that for images, an image server would be required. More can be done on the client [ronallo-cs-tricks]
An institution might have content hosted on third party providers like YouTube, Vimeo or SoundCloud, and want to annotate this onto a canvas, display in their viewer, mix up in manifests with other video and image resources held elsewhere. [27], [46]
Even for single-file derivatives, there can be more variables for a client to consider when deciding what to ask for from what's available. Sizes, formats and codecs all need to be evaluated. [58]
Video formats are accompanied by additional metadata files such as WebVTT that can be used by client side players [55]

And some assumed differences about use cases:

There is no direct equivalent of "Deep Zoom" tiling for videos, but no real demand for it either [25]. Video tiles seem like an advanced scenario to defer.
On-the-fly transcoding of video from sources is very expensive and unlikely to be feasible in the near term for most institutions [4]
Most presentation, interoperability, citation and reuse use cases for A/V do not require realtime transcoding

The parameter space of an AV API

What has emerged from the working group so far is that a Video or Audio API that addresses the first set of bullet points above does not feel like slotting parameter values into URLs in the style of the Image API. There has been general consensus that "level0" implementations will be much more common because people have existing video, where "level0" means "static, pre-generated content" rather than a precise analogy of the Image API, where level 0 means static pre-generated content filling the parameter space of a path-segment parameterised API.
Consider an invented but not unreaonable scenario:
Institution A has an extensive sound and video archive, that has been digitised over the years into a variety of formats. Recently they have been standardising somewhat, and now a typical video resource in the archive is available in the following formats:

An MPEG-DASH manifest and segment files, hosted in AWS S3 but proxied via Cloudfront, offering 420, 720p and 1080p video
A single 720p video file (webm, codecs=vp8,vorbis) hosted on the institution's own infrastructure
A version hosted on YouTube

All these versions are derived from the same source; they share (with tiny variations from encoding differences) the same running time and aspect ratio. When the institution publishes its IIIF resources (manifests), it would like clients to evaluate what sources are available and make a runtime decision about what to display for a particular canvas. The IIIF resources need to convey that these three options are available. At runtime, when actually playing the video, the client might be flipping back and forth between 720p and 1080p because it can play the MPEG-DASH resource, whereas another client goes for the YouTube version, and a third client settles on the webm.
Institution B doesn't have any adaptive streaming. It has webm and mp4 containers, each at two different sizes. All its webm files use VP8 for video and Vorbis for audio, and all its mp4 files are H.264 for video and AAC for audio. It is able to proxy access to the video files, so could implement a level0-style API endpoint that relies on parameterised URLs, and rewrite these URLs to the stored locations of the mp4 and webm files. It has some videos on YouTube but doesn't want to reference them in its IIIF resources.
An AV API that has a service description that could convey to a client how to make parameterised URI requests that cover all three formats for institution A is next to impossible to define. If it worked like the Image API but allowed variation sufficent to cover all 3 of Institution A's formats, the potential parameter space would be very large and very sparsely populated, and attempting to map URL schemes to MPEG-DASH or HLS for proxying seems like a very deep rabbit hole [57]. The adaptive formats are already level0-like services that each provide their own manifest of available sources/segments. They sit outside any possible AV API that works like the Image API; they are in fact alternatives to it, in the way that .dzi or Zoomify are alternatives to the Image API for particular scenarios.
For this reason, attention has focussed on how to convey a set of sources for the client to choose from, without having to use a URI scheme with positional parameters like the Image API. The sample info.json documents in [50] show this approach:
{
  "@context": "http://iiif.io/api/video/0/context.json",,
  "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets"
  "attribution": "Brought to you by NCSU Libraries",
  "tracks": [
    {
      "language": "en",
      "kind": "captions",
      "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets/pets-captions-en.vtt"
    },
    {
      "language": "nl",
      "kind": "subtitles",
      "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets/pets-subtitles-nl.vtt"
    }
  ],
  "profile": "http://iiif.io/api/video/0/level0.json",
  "sources": [
    {
      "format": "webm",
      "height": 480,
      "width": 720,
      "size": "3360808",
      "duration": 35.627000,
      "type": "video/webm; codecs=\"vp8,vorbis\"",
      "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets/pets-720x480.webm"
    },
    {
      "format": "mp4",
      "frames": "1067",
      "height": 480,
      "width": 720,
      "size": "2924836",
      "duration": 35.627000,
      "type": "video/mp4; codecs=\"avc1.42E01E,mp4a.40.2\"",
      "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets/pets-720x480.mp4"
    },
    {
      "format": "mp4",
      "frames": "1067",
      "height": 240,
      "width": 360,
      "size": "1075972",
      "duration": 35.648000,
      "type": "video/mp4; codecs=\"avc1.64000D,mp4a.40.2\"",
      "id": "https://iiif-staging02.lib.ncsu.edu/iiifv/pets/pets-360x240.mp4"
    }
  ]
}
Although Institution A can't do it, Institution B could provide a level 0 video API, with 4 possible URLs in the parameter space of each service. That would not allow for variation by codec; the codec could be stated in the info.json for this service so a client can see what is in use, but does not have a slot in the parameterised URL:
{
  "@context": "http://iiif.io/api/av/1/context.json",
  "id": "http://example.edu/iiif/identifier",
  "height": 960,
  "width": 1440,
  "duration": 35.6,

  "sizes": [{"width": 720, "height": 480}, {"width": 360, "height": 240}]

  "profile": [
     "http://iiif.io/api/av/1/level0.json",
     {
        "maxWidth": 720,
        "formats": [
           {"value": "webm", "contentType": 'video/webm; codecs="vp8,vorbis"'},
           {"value": "mp4", "contentType": 'video/mp4; codecs="avc1.64000D,mp4a.40.2"'}
        ],
        "qualities": ["color"]
    }
  ],
  "seeAlso": {
    "id": "http://example.edu/nonIIIF/vtt/identifier.vtt",
    "format": "application/webvtt",
  }
}
(adapted from comment on #50)
Institution A could provide such a parameterised service only for its single .webm derivative, with a parameter space of 1, which may seem pointless to assert as a service.
This leads to a situation where:

some common scenarios could be adapted to an AVI API that works in a similar way to the Image API, and easily extends to dynamic transcoding using the same syntax for advanced scenarios. The AV API would have an info.json that feels like the Image API info.json, and returns binary content for all valid URLs. This could even support a future IIIF adaptive client that could make byte range requests to different valid URLs in the parameter space depending on ambient bandwidth conditions.
some equally common scenarios (adaptive streaming via existing standards, third party providers like YouTube) cannot possibly fit this API model. They are outside the scope of it, because they are really alternative services themselves. YouTube via its API, MPEG-DASH via the .mpd manifest; they both offer alternative methods of getting access to the most appropriate video stream for the client's needs, capabilities and bandwidth. We can't layer an Image API style AV API on top of them.

Benefits of the Image API approach

By implementing the Image API, even at level 0, you guarantee that certain JPEGs are available at certain URLs for a given image service. As a consumer of your services, I know I can get a JPEG from your endpoint if I construct a URL the right way. This works as a mandatory requirement because JPEG is ubiquitous. Ancient browsers support JPEG. Browsers of the far future will support JPEG. It is the obvious choice for the minumum requirement for the Image API (it is a MUST for any image service). This allows level0 implementations to vary by size, and keeps the parameter space small.
For video, if you have a parameter space of 2, or 4, or 6 (e.g., two containers with a particular codec, each at 3 sizes) you could express what you have available in a succinct info.json and leave the door open to extend that in future, either with dynamic services or an additional universal video format and codec if one suddenly emerges.
The question is where does the AV API start? If the AV API is like the Image API, and must return binary content or an info.json, then Institution A's MPEG-DASH and YouTube resources are outside its scope. We still have the ability to assert the availability of all the resources listed in the proposed info.json example above through a list of annotations on a canvas without leaving the Presentation API. The Presentation API already has the vocabulary to do this, and we can make it prettier by adding more @context syntactic JSON-LD sugar.
Can both institutions publish their material via IIIF, with Institution A able to list its sources so that clients that are able to could play MPEG-DASH video, or render a YouTube player controlled by the canvas-level player, and Institution B able to present a parameterised URL AV service that doesn't need to list sources because it defines a parameter space? Yes - but the problem is deciding which API is describing these resources.
Pushing Institution A's resources outside of the AV API, a canvas in a manifest might look like this (with some assumptions about a likely v3 Presentation API):
{
  "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/canvas/c0",
  "type": "Canvas",
  "label": "Stagecoach from institution A",
  // provide a good coordinate space for later annotation
  "width": 4000,
  "height": 3000,
  "duration": 5757,
  "media": [
     // "media" could be an alias for an inline annotation list, like "images" currently
     {
        "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/annotations/a0",
        "type": "Annotation",
        "motivation": "painting",
        "body": {
            "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/media",
            "type": "Choice",
            "items": [
                {
                    "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/content/stagecoach.mpd",
                    "type": "Video",
                    "format": "application/dash+xml",
                    "duration": 5760
                },
                {
                    "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/content/stagecoach.webm",
                    "type": "Video",
                    "format": "video/webm; codecs=\"vp8,vorbis\"",
                    "width": 720,
                    "height": 540,
                    "duration": 5757
                    // we could have a level 0 service here, but we've already stated the only available version so a client won't gain anything
                },
                {
                    "id": "https://www.youtube.com/watch?v=HEuCMRRLts8",
                    "type": "Video",
                    "format": "text/html", // for iFrame embed; check for API! - see #27
                    "service": {
                        "id": "http://youtube.com/video/api",
                        "profile": "uri-for-youtube-api"
                    }
                    "duration": 5764
                }
            ],
            "seeAlso": {
                "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/vtt/subtitles-en.vtt",
                "format": "application/webvtt",
                "label": "subtitles"
            }
        },
        "target": "http://institution-a.org/iiif/films/john-ford/stagecoach/canvas/c0"
      }
   ]
}
Institution B's canvas in a manifest might look like this:
{
  "id": "http://institution-b.org/iiif/films/nicholas-ray/johnny-guitar/canvas/c0",
  "type": "Canvas",
  "label": "Johnny Guitar from Institution B",
  // provide a good coordinate space for later annotation
  "width": 4000,
  "height": 3000,
  "duration": 5757,
  "media": [
     // "media" could be an alias for an inline annotation list, like "images" currently
     {
        "id": "http://institution-b.org/iiif/films/nicholas-ray/johnny-guitar/annotations/a0",
        "type": "Annotation",
        "motivation": "painting",
        "body": {
            "id": "https://institution-b.org/iiif/av-service/johnny-guitar/full/full/max/max/0/color.webm",
            "type": "Video",
            "format": "video/webm; codecs=\"vp8,vorbis\"",
            "width": 720,
            "height": 540,
            "duration": 5757,
            "service": {
                // this is a link to the info.json shown earlier
                "@context": "http://iiif.io/api/av/1/context.json",
                "id": "https://institution-b.org/iiif/av-service/johnny-guitar",
                "profile": "http://iiif.io/api/av/1/level0.json"
            }
        }
        "seeAlso": {
            "id": "http://institution-b.org/iiif/films/nicholas-ray/johnny-guitar/vtt/subtitles-en.vtt",
            "format": "application/webvtt",
            "label": "subtitles"
        },
        "target": "http://institution-b.org/iiif/films/nicholas-ray/johnny-guitar/canvas/c0"
      }
   ]
}
What about the info.json? What about an AV API?

Institution A has published a list of its sources, but it hasn't ended up supplying an actual AV Service, with an info.json that can be shared. It's referencing them using the Presentation API, in the same way that current manifests have an image resource, or choice of image resources, outside of the Image API.
What current manifests (with image annotations) never do is attempt to define a service for an image resource that points to a Zoomify or Seadragon DZI tile source:
{
    "@id": "http://wellcomelibrary.org/iiif/b11765446/canvas/c0",
    "@type": "sc:Canvas",
    "label": "Still life of leaves and flowers, Hong Kong. Photograph by John Thomson, 1868/1871.",
    "height": 9070,
    "width": 10777,
    "images": [
      {
        "@id": "http://wellcomelibrary.org/iiif/b11765446/imageanno/c9b520cc-5b7f-4e41-a975-5ecea31685df",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            // viewers usually ignore the "static" resource image itself
            "@id": "https://dlcs.io/iiif-img/2/1/c9b520cc-5b7f-4e41-a975-5ecea31685df/full/!1024,1024/0/default.jpg",
            "@type": "dctypes:Image",
            "format": "image/jpeg",
            // ********************************************************
            // **** NOBODY HAS EVER DONE THIS IN A IIIF MANIFEST!! ****
            // ********************************************************
            "service": { 
                "@id": "http://wellcomelibrary.org/dz/b11765446/0/c9b520cc-5b7f-4e41-a975-5ecea31685df.jp2.dzi",
                "label": "Hey! This isn't IIIF! It's a DZI!!!",
                "profile": "http://schemas.microsoft.com/deepzoom/2008"
            }
            // ********************************************************
        },
        "on": "http://wellcomelibrary.org/iiif/b11765446/canvas/c0"
      }
    ]
}
Nobody has a IIIF viewer right now that would deal with that. In fact, it would be trivial to add support for this to a viewer that uses OpenSeadragon, because you'd just construct a DZI tilesource instead of a IIIF one. But nobody has had this need, because they adopted the IIIF Image API at the same time.
With the Image API, people weren't bothered about re-using their existing zoomify or dzi content from the Presentation API. They saw that the interoperability gains of the IIIF Image API meant that it was worth the trouble of moving over, even if that meant new work and throwing away the old services and/or derivatives.
This may be a slight red herring, because the decision to adopt IIIF Image API often came first with Presentation API following, or they were adopted simultaneously as if they were both required. The Presentation API came after the Image API; if the Presentation API had come first, and had offered easy ways of linking to Zoomify, IIP and DZI options and everyone had just settled for that lesser ambition, we wouldn't have content interoperability. Interoperable Deep Zoom drove everything else, which is why you never see the above.
Many institutions dip a toe into IIIF by first providing only the Image API - making some level0 tiles, or configuring an image server. They don't have manifests, but they can publish Image API endpoints. Institution A doesn't get the benefit of this toe-dipping in the above model. It doesn't have an AV API endpoint to share, it doesn't get a IIIF AV badge - but it does get a Presentation API badge. It can't send an info.json to Europeana, but it has provided a surface for annotation of its content to the web of linked data. And it can still share the annotation choice body list. It can publish this as a video source list, but it's not a video API service and it duplicates some of the function of the Presentation API. It is in fact simply a list of web resources, published as JSON-LD and using the Presentation API's @context. It can appear inline in a manifest, and be dereferenceable for publication and sharing.
If a service description is defined as a succinct definition of the parameter space of that service, then only institution B can really have a video service that continues the design principles of the existing IIIF specifications.
If a service description is defined such that Institution A could publish a valid info.json for an AV service that included its MPEG-DASH and YouTube resources, then the service is acting as a resource list rather than a provider of derivatives via binary responses. In this case, it is doing double-duty with the Presentation API, which can already describe lists of web resources as annotation bodies.
Maybe services instead?

Here's another attempt at Institution A's canvas, this time using services:
{
  "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/canvas/c0",
  "type": "Canvas",
  "label": "Stagecoach from institution A",
  // provide a good coordinate space for later annotation
  "width": 4000,
  "height": 3000,
  "duration": 5757,
  "media": [
     {
        "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/annotations/a0",
        "type": "Annotation",
        "motivation": "painting",
        "body": {
            "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/content/stagecoach.webm",
            "type": "Video",
            "format": "video/webm; codecs=\"vp8,vorbis\"",
            "width": 720,
            "height": 540,
            "duration": 5757,
            "service" : [
              {
                "profile": "urn:mpeg:dash:schema:mpd:2011",
                "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/content/stagecoach.mpd"
              },
              {
                // needs more work...
                "id": "https://www.youtube.com/watch?v=HEuCMRRLts8",
                "profile": "uri-for-youtube-api"
              }
            ]
        },
        "seeAlso": {
            "id": "http://institution-a.org/iiif/films/john-ford/stagecoach/vtt/subtitles-en.vtt",
            "format": "application/webvtt",
            "label": "subtitles"
        },
        "target": "http://institution-a.org/iiif/films/john-ford/stagecoach/canvas/c0"
      }
   ]
}
There's no reason a IIIF Presentation API spec can't provide assistance for linking to non-IIIF services, even gathering popular and useful ones together in a service annexe just as it does today for geoJSON and physical dimensions. This isn't seen for images today (it's analagous to the .dzi service example given earlier) because nobody needs it. But for AV, maybe they do need it.
Institution A wants to get the object interoperability and description benefits of publishing its resources as manifests (and the discovery benefits too, later). But it can't adopt a "parameter space" IIIF AV API, at least not for now, because it can't put up a service that will return video resources to fill the parameter space. That doesn't stop it releasing manifests with canvases like the above.
Today, a IIIF rich client like Mirador or Universal Viewer needs to support the IIIF Presentation API and the IIIF Image API at least, and preferably Search and Auth too. It doesn't need to support additional services to render images because people don't put non-IIIF image services into their manifests.
For AV, Mirador, UV and other "off the shelf" rich IIIF clients might choose to add support for a handful of non-IIIF profiles for content services (but helpfully defined in a IIIF services annexe) because it makes them more attractive to the likes of Institution A. They don't need this for images, because the vast majority of image annotations have an Image API service on the image recource. Naturally, they all implement the parameter space AV API, and institutions provide it when they can.
What next?

The AV Working Group needs to agree on and state its design principles and criteria, including a definition of what constitutes a IIIF Content Service. Video is complicated, and to take advantage of existing and complex AV solutions a client might need to handle resources outside the scope of AV rather than try to bring those resources under the umbrella of AV.

Should we stop at the Presentation API's links to non-IIIF AV content, if we can't provide an Image API-like service for AV?
Or should we adopt a different approach for AV, and call a resource list a service?
Or is there another approach that hasn't been tried or discussed yet?

In terms of what information a rich client ends up being given by combined IIIF APIs, or what an institution is capable of asserting as available, these options are really all the same - the client gets to find out about an MPEG-DASH .mpd file, or a particular mp4-H264+AAC derivative. The difficulty lies in deciding where things are said, and how.
What is the minimum scenario for interoperability?
The baseline is that there's an institution that has some video and wants it to be rendered along with the metadata by off-the-shelf clients. The Image API defines a mandatory default representation - default.jpg, at at least one size.