This assumes you're using a Linux system (or WSL) with modern python3 and sox.
-
First, download the stream(s) you want to use for the dataset with yt-dlp. The audio must be as clean as possible and should consist primarily of the target voice, talking as much as possible. Minecraft streams tend to be ideal. Use the command below to download the stream(s):
yt-dlp -f ba -x --audio-format "wav" --audio-quality 0 --embed-metadata https://youtu.be/omegalul
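If you have several streams to fetch, the same command can be repeated per URL. The sketch below is one way to do that from python3, assuming yt-dlp is on your PATH; the function names `build_command` and `download_all` are hypothetical helpers made up for this example, and the flags are copied unchanged from the command above.

```python
import subprocess


def build_command(url):
    """Return the yt-dlp argument list from the guide for one stream URL."""
    return [
        "yt-dlp",
        "-f", "ba",                # same format selector as the command above
        "-x",                      # extract audio only
        "--audio-format", "wav",
        "--audio-quality", "0",
        "--embed-metadata",
        url,
    ]


def download_all(urls):
    """Run yt-dlp once per URL, stopping at the first failed download."""
    for url in urls:
        # check=True raises CalledProcessError if yt-dlp exits with an error.
        subprocess.run(build_command(url), check=True)
```

Calling `download_all([...])` with a list of stream URLs runs the downloads one after another; nothing is downloaded until you call it.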
-
Once you have the stream, you'll need to split it into multiple files, because processing a 3-hour stream as a single file is too computationally expensive. Use the command below to split the file wherever sox detects silence. This may produce longer files if the background sounds or music are as loud as the speech in the source. This command assumes the input stream is input.wav:
sox -V3 input.wav split.wav silence 1 5.0 0.1% 1 0.3 0.1% : newfile : restart
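Since loud background audio can leave some chunks much longer than others, it may help to check the durations after splitting. The sketch below uses only the python3 standard library and assumes the chunks are PCM WAV files matching `split*.wav` (for example split001.wav, split002.wav) in one directory. The function names and the pattern are assumptions made for this example, not part of sox.

```python
import wave
from pathlib import Path


def wav_duration(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_long_chunks(directory, max_seconds, pattern="split*.wav"):
    """List (filename, seconds) for every chunk longer than max_seconds."""
    long_chunks = []
    for path in sorted(Path(directory).glob(pattern)):
        seconds = wav_duration(path)
        if seconds > max_seconds:
            long_chunks.append((path.name, seconds))
    return long_chunks
```

For example, `find_long_chunks(".", 30.0)` lists every chunk in the current directory longer than 30 seconds; the 30-second cutoff is only an illustrative value, so pick whatever limit suits your dataset. Chunks that show up here are likely the ones where music or background noise hid the silences.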