@j0sh
Last active May 9, 2021 19:46
LPMS Concurrency Report

Stream Concurrency Report

Some quick notes on the factors around stream concurrency that impact GPU transcoding time. See Test Methodology for details on how the tests were performed.

Takeaways


Impact of Concurrency on Session Initialization

Concurrent streams on a GPU have a disproportionate impact on stream startup time, taking up to a minute to start a new session on a GPU. Absent further investigation, it is unclear whether the issue lies with FFmpeg or the NVIDIA drivers. The issue is not likely to be in LPMS itself (TODO sanity check this claim).

[chart: first-segment startup times]

Impact of Concurrency on Steady-State Performance

Aside from stream startup time, the per-stream processing time is roughly proportional to the number of concurrent streams running on a GPU [1]. This indicates that any load allocation strategy should be weighted heavily towards sessions that have already been established on GPUs, rather than creating new sessions whenever the instantaneous load on a given card happens to be low.

[1] E.g., if a GPU with a single session can process a given segment in realtime, then adding another stream with the same parameters will cause segments for both streams to take roughly 2x realtime, assuming each segment is submitted to the GPU immediately.

image

It is more difficult to tell whether the FFmpeg CLI also suffers from the first-segment slowdown, but as a general matter, FFmpeg runs slower than LPMS when processing the same input.

[chart: FFmpeg CLI segment processing times]

Test Methodology

Goal: Judge the impact that concurrent streams have on GPU processing time.

Use the hls-bench example program. Run 16 streams on either:

  • Single GPU
  • Eight GPUs

Save results to CSV, import into sqlite for analysis and Google Sheets for visualization.
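
For reference, the sqlite import and the CSV export for Google Sheets can also be scripted non-interactively. The sketch below reuses the file, table, and column names from the steps that follow; the stats/bench.db path and the output filenames are arbitrary, and a header row (stream, segment, length, ...) in the benchmark CSVs is assumed.

#!/usr/bin/env bash
# Sketch: load the benchmark CSVs into sqlite, then dump per-stream summaries
# back out as CSV for pasting into Google Sheets.

db=stats/bench.db

# Import the raw benchmark output into one table per test run.
# Assumes the CSVs carry a header row naming the columns.
sqlite3 "$db" <<'SQL'
.mode csv
.import stats/16s_1g.csv gpu1
.import stats/16s_8g.csv gpu8
SQL

# First-segment timings and steady-state averages for the single-GPU run;
# repeat with gpu8 for the eight-GPU run.
sqlite3 -header -csv "$db" \
  "select stream, length from gpu1 where segment = 0 order by cast(stream as integer);" \
  > stats/16s_1g_first_segment.csv

sqlite3 -header -csv "$db" \
  "select stream, avg(length) as runtime from gpu1 where segment > 0 group by stream order by cast(stream as integer);" \
  > stats/16s_1g_steady_state.csv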

Steps

  1. Run 16 streams on a single GPU, save results to csv.
 ./hls-bench in/short_testharness.m3u8 out/16s1g_ 16 400 P720p30fps16x9,P576p30fps16x9,P360p30fps16x9,P240p30fps16x9 nv 0 | tee stats/16s_1g.csv
  2. Run 16 streams on 8 GPUs, save results to csv.
 ./hls-bench in/short_testharness.m3u8 out/16s8g_ 16 400 P720p30fps16x9,P576p30fps16x9,P360p30fps16x9,P240p30fps16x9 nv 0,1,2,3,4,5,6,7 | tee stats/16s_8g.csv
  3. Import results into sqlite.
sqlite> .mode csv
sqlite> .import stats/16s_1g.csv gpu1
sqlite> .import stats/16s_8g.csv gpu8
  4. Extract timings for the first segment.
sqlite> select stream, length from gpu1 where segment = 0 order by cast(stream as integer);
sqlite> select stream, length from gpu8 where segment = 0 order by cast(stream as integer);
  5. Extract average timings, excluding the first segment.
sqlite> select stream, avg(length) as runtime from gpu1 where segment > 0 group by stream order by cast(stream as integer);
sqlite> select stream, avg(length) as runtime from gpu8 where segment > 0 group by stream order by cast(stream as integer);
  6. Import into Google Sheets for visualization.

  7. FFmpeg CLI script used for the test. Start as many processes as needed, preferably using a driver script (a rough sketch follows the script below), and time the results.

#!/usr/bin/env bash

# This program takes the nvidia device as a cli arg
if [ -z "$1" ]
then
  echo "Expecting nvidia device id"
  exit 1
fi

trap exit SIGINT

inp=in/short_testharness.m3u8

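# Decode on the given GPU and transcode into four renditions
# (720p/576p/360p/240p) with h264_nvenc, discarding output via -f null.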
ffmpeg -hwaccel_device $1 -hwaccel cuvid -c:v h264_cuvid -i $inp \
  -vf fps=30,scale_cuda=w=1280:h=720 -b:v 6000k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=1024:h=576 -b:v 1500k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=640:h=360 -b:v 1200k -c:v h264_nvenc -an -f null - \
  -vf fps=30,scale_cuda=w=426:h=240 -b:v 600k -c:v h264_nvenc -an -f null - \
  -loglevel warning -hide_banner -y
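
No driver script was included in the writeup. As a rough sketch (the ffmpeg_bench.sh filename, the stats/ output paths, and the use of the shell's time builtin are assumptions, not part of the original harness), something like the following could launch N concurrent copies of the script above on a single device and record each run's wall-clock time:

#!/usr/bin/env bash
# Rough sketch of a driver: launch N copies of the FFmpeg script above
# (saved here as ffmpeg_bench.sh) on one GPU and time each run.
# Usage: ./driver.sh <nvidia device id> [process count]

device=${1:?Expecting nvidia device id}
count=${2:-16}

mkdir -p stats

for i in $(seq 1 "$count"); do
  # `time` reports to stderr; capture it in a per-process file.
  { time ./ffmpeg_bench.sh "$device" ; } 2> "stats/ffmpeg_${device}_${i}.time" &
done

wait

# Print the wall-clock time of each process.
grep real stats/ffmpeg_"${device}"_*.time

The per-process timings can then be collated alongside the hls-bench CSV output.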
@ggnull35

Which version of ffmpeg do you use? What is your command line?

I suggest you try with:
Cuda version 10.1
Nvidia Driver 418.88
FFmpeg 4.1.x

@j0sh (Author) commented Nov 20, 2019

@ggnull35 Thanks for the questions!

Which version of ffmpeg do you use?

Using a fairly recent version forked off master with some changes here: https://github.com/livepeer/FFMpeg

$ ffmpeg -version
ffmpeg version N-95800-ge83a785229 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 7 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
configuration: --prefix=/home/josh/compiled --disable-static --enable-shared --enable-gpl --enable-libx264 --enable-cuda --enable-cuvid --enable-nvenc --enable-gnutls --disable-stripping

What is your command line?

Updated the writeup; the Test Methodology section now has the FFmpeg command line that was used. But the CLI program itself is not really at issue here, because LPMS (the transcoding library under consideration) uses the libav APIs directly.

If there's a way to use the CLI to further confirm the main issue here - that the first segment of a stream is extremely slow to transcode - then I'd love to hear it.

I suggest you try with:
Cuda version 10.1
Nvidia Driver 418.88

The CUDA version that I have here does seem slightly out of date, so I'll try with that. However, I believe we still see this issue consistently with later drivers.

$ nvidia-smi
Wed Nov 20 17:03:19 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+

@j0sh (Author) commented Nov 20, 2019

@ggnull35 Some quick tests with CUDA 10.2 (released last week) are showing essentially no change. Screenshots follow:

$ nvidia-smi
Wed Nov 20 18:37:57 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+

[screenshots: CUDA 10.2 test results]

@ggnull35 commented May 9, 2021

Dear Friend, have you repeated this test with the 460.32 or later driver? I see improvements. FYI.
