@j0sh
Last active May 9, 2021 19:46
LPMS Concurrency Report

Stream Concurrency Report

Some quick notes on the factors around stream concurrency that impact GPU transcoding time. See Test Methodology for details on how the tests were performed.

Takeaways


Impact of Concurrency on Session Initialization

Concurrent streams on a GPU have a disproportionate impact on stream startup time, taking up to a minute to start a new session on a GPU. Absent further investigation, it is unclear whether the issue lies with FFmpeg or the NVIDIA drivers. The issue is not likely to be in LPMS itself (TODO sanity check this claim).

[chart: first-segment startup times]

Impact of Concurrency on Steady-State Performance

Aside from stream startup time, the per-stream processing time is roughly proportional to the number of concurrent streams running on a GPU [1]. This indicates that any load allocation strategy should be weighted heavily towards sessions that have already been established on GPUs, rather than creating new sessions whenever the instantaneous load on a given card happens to be low.

[1] E.g., if a GPU with a single session can process a given segment in realtime, then adding another stream with the same parameters will cause segments for both streams to take roughly 2x realtime, assuming each segment is submitted to the GPU immediately.

image

It is more difficult to tell whether the FFmpeg CLI also suffers from the first-segment slowdown, but as a general matter, FFmpeg runs slower than LPMS when processing the same input.

[chart: FFmpeg CLI segment processing times]

Test Methodology

Goal: Judge the impact that concurrent streams have on GPU processing time.

Use the hls-bench example program. Run 16 streams on either:

  • Single GPU
  • Eight GPUs

Save results to CSV, import into sqlite for analysis and Google Sheets for visualization.
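
For reference, the sqlite import and the CSV export for Google Sheets can also be scripted non-interactively. The sketch below reuses the file, table, and column names from the steps that follow; the stats/bench.db path and the output filenames are arbitrary, and a header row (stream, segment, length, ...) in the benchmark CSVs is assumed.

#!/usr/bin/env bash
# Sketch: load the benchmark CSVs into sqlite, then dump per-stream summaries
# back out as CSV for pasting into Google Sheets.

db=stats/bench.db

# Import the raw benchmark output into one table per test run.
# Assumes the CSVs carry a header row naming the columns.
sqlite3 "$db" <<'SQL'
.mode csv
.import stats/16s_1g.csv gpu1
.import stats/16s_8g.csv gpu8
SQL

# First-segment timings and steady-state averages for the single-GPU run;
# repeat with gpu8 for the eight-GPU run.
sqlite3 -header -csv "$db" \
  "select stream, length from gpu1 where segment = 0 order by cast(stream as integer);" \
  > stats/16s_1g_first_segment.csv

sqlite3 -header -csv "$db" \
  "select stream, avg(length) as runtime from gpu1 where segment > 0 group by stream order by cast(stream as integer);" \
  > stats/16s_1g_steady_state.csv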

Steps

  1. Run 16 streams on a single GPU, save results to csv.
 ./hls-bench in/short_testharness.m3u8 out/16s1g_ 16 400 P720p30fps16x9,P576p30fps16x9,P360p30fps16x9,P240p30fps16x9 nv 0 | tee stats/16s_1g.csv
  2. Run 16 streams on 8 GPUs, save results to csv.
 ./hls-bench in/short_testharness.m3u8 out/16s8g_ 16 400 P720p30fps16x9,P576p30fps16x9,P360p30fps16x9,P240p30fps16x9 nv 0,1,2,3,4,5,6,7 | tee stats/16s_8g.csv
  3. Import results into sqlite.
sqlite> .mode csv
sqlite> .import stats/16s_1g.csv gpu1
sqlite> .import stats/16s_8g.csv gpu8
  4. Extract timings for the first segment.
sqlite> select stream, length from gpu1 where segment = 0 order by cast(stream as integer);
sqlite> select stream, length from gpu8 where segment = 0 order by cast(stream as integer);
  5. Extract average timings, excluding the first segment.
sqlite> select stream, avg(length) as runtime from gpu1 where segment > 0 group by stream order by cast(stream as integer);
sqlite> select stream, avg(length) as runtime from gpu8 where segment > 0 group by stream order by cast(stream as integer);
  6. Import into Google Sheets for visualization.

  7. FFmpeg CLI script used for the test. Start as many processes as needed, preferably using a driver script (a rough sketch follows the script below), and time the results.

#!/usr/bin/env bash

# This program takes the nvidia device as a cli arg
if [ -z "$1" ]
then
  echo "Expecting nvidia device id"
  exit 1
fi

trap exit SIGINT

inp=in/short_testharness.m3u8

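# Decode on the given GPU and transcode into four renditions
# (720p/576p/360p/240p) with h264_nvenc, discarding output via -f null.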
ffmpeg -hwaccel_device $1 -hwaccel cuvid -c:v h264_cuvid -i $inp \
  -vf fps=30,scale_cuda=w=1280:h=720 -b:v 6000k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=1024:h=576 -b:v 1500k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=640:h=360 -b:v 1200k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=426:h=240 -b:v 600k -c:v h264_nvenc -an -f null - \
  -loglevel warning -hide_banner -y
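
No driver script was included in the writeup. As a rough sketch (the ffmpeg_bench.sh filename, the stats/ output paths, and the use of the shell's time builtin are assumptions, not part of the original harness), something like the following could launch N concurrent copies of the script above on a single device and record each run's wall-clock time:

#!/usr/bin/env bash
# Rough sketch of a driver: launch N copies of the FFmpeg script above
# (saved here as ffmpeg_bench.sh) on one GPU and time each run.
# Usage: ./driver.sh <nvidia device id> [process count]

device=${1:?Expecting nvidia device id}
count=${2:-16}

mkdir -p stats

for i in $(seq 1 "$count"); do
  # `time` reports to stderr; capture it in a per-process file.
  { time ./ffmpeg_bench.sh "$device" ; } 2> "stats/ffmpeg_${device}_${i}.time" &
done

wait

# Print the wall-clock time of each process.
grep real stats/ffmpeg_"${device}"_*.time

The per-process timings can then be collated alongside the hls-bench CSV output.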
@ggnull35

Which version of ffmpeg do you use? What is your command line?

I suggest you try with:
Cuda version 10.1
Nvidia Driver 418.88
FFmpeg 4.1.x

@j0sh (Author) commented Nov 20, 2019

@ggnull35 Thanks for the questions!

Which version of ffmpeg do you use?

Using a fairly recent version forked off master with some changes here: https://github.com/livepeer/FFMpeg

$ ffmpeg -version
ffmpeg version N-95800-ge83a785229 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 7 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
configuration: --prefix=/home/josh/compiled --disable-static --enable-shared --enable-gpl --enable-libx264 --enable-cuda --enable-cuvid --enable-nvenc --enable-gnutls --disable-stripping

What is your command line?

Updated the writeup; the Test Methodology section now has the FFmpeg command line that was used. But the CLI program itself is not really at issue here, because LPMS (the transcoding library under consideration) uses the libav APIs directly.

If there's a way to use the CLI to further confirm the main issue here - that the first segment of a stream is extremely slow to transcode - then I'd love to hear it.

I suggest you try with:
Cuda version 10.1
Nvidia Driver 418.88

The CUDA version that I have here does seem slightly out of date, so I'll try with that. However, I believe we still see this issue consistently with later drivers.

$ nvidia-smi
Wed Nov 20 17:03:19 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+

@j0sh (Author) commented Nov 20, 2019

@ggnull35 Some quick tests with CUDA 10.2 (released last week) are showing essentially no change. Screenshots follow:

$ nvidia-smi
Wed Nov 20 18:37:57 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+

[screenshots: CUDA 10.2 test results]

@ggnull35 commented May 9, 2021

Dear Friend, have you repeated this test with the 460.32 or later driver? I see improvements. FYI.
