This assumes you're using a Linux system (or WSL) with a modern `python3` and `sox`.
- First, download the stream(s) you want to use for the dataset. The requirements: the audio should be as clean as possible and consist primarily of the target voice talking a lot. Streams like Minecraft streams tend to be ideal. (This uses `yt-dlp`.) Use the command below to download the stream(s):

```
yt-dlp -f ba -x --audio-format "wav" --audio-quality 0 --embed-metadata https://youtu.be/omegalul
```
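If you're collecting several streams, a short wrapper can repeat the same `yt-dlp` call for each one. A minimal sketch, assuming a hypothetical `urls.txt` with one stream URL per line (not something this guide provides):

```python
# Optional batch downloader (an assumption, not part of the guide): reads one
# URL per line from a hypothetical urls.txt and runs the same yt-dlp command
# shown above for each stream.
import subprocess
from pathlib import Path

urls = [line.strip() for line in Path("urls.txt").read_text().splitlines() if line.strip()]

for url in urls:
    subprocess.run(
        [
            "yt-dlp",
            "-f", "ba",               # best audio-only stream
            "-x",                     # extract audio
            "--audio-format", "wav",
            "--audio-quality", "0",
            "--embed-metadata",
            url,
        ],
        check=True,
    )
```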
- Once you have the stream, you'll need to split it into multiple files, because processing a 3-hour stream in one piece is too computationally expensive. Use the command below to split the file based on the silence it detects. This may result in longer files if the background sounds/music are as loud as the speech in the source. This command assumes the input stream is `input.wav`:

```
sox -V3 input.wav split.wav silence 1 5.0 0.1% 1 0.3 0.1% : newfile : restart
```
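Because loud background audio can defeat the silence detection, it's worth checking how long the resulting chunks actually are before moving on. A quick sketch, assuming `sox` numbered the output files as `split001.wav`, `split002.wav`, and so on:

```python
# Optional sanity check (an assumption, not part of the guide): print the
# duration of every chunk so you can spot pieces that are still too long.
import wave
from pathlib import Path

for path in sorted(Path(".").glob("split*.wav")):
    with wave.open(str(path), "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    print(f"{path.name}: {seconds / 60:.1f} min")
```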
- Next, you need to run this through UVR5 to remove all background sounds. Since you already have the files, you can simply select all of them as input in the software and make a folder for the output. Use the following settings once you've set the input and output paths:
  - Choose Process Method: VR Architecture
  - Window Size: 512
  - Aggression Setting: 10
  - Choose VR Model (you download these in the settings of the program): 6_HP-Karaoke-UVR
  - Vocals Only, GPU Conversion

Then click Start Processing to begin. This can take anywhere from a few minutes to several hours depending on the GPU you have! My 3070 took about 6.5 minutes to process ~2.5 hours of Minecraft stream audio. It may occasionally fail on some files during this step; just ignore those failures.
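Since the occasional failure is simply skipped, it's worth confirming that every input file actually produced a processed output. A rough sketch, where `uvr_in` and `uvr_out` stand in for whatever folders you picked in the GUI, and which assumes UVR5 keeps the original file name's stem somewhere in its output names:

```python
# Rough completeness check (an assumption, not part of the guide): verify
# that every input wav produced at least one processed output file.
from pathlib import Path

in_dir = Path("uvr_in")    # hypothetical input folder chosen in the UVR5 GUI
out_dir = Path("uvr_out")  # hypothetical output folder chosen in the UVR5 GUI

out_names = [p.name for p in out_dir.glob("*.wav")]
missing = [p.name for p in in_dir.glob("*.wav")
           if not any(p.stem in name for name in out_names)]

print(f"{len(missing)} input file(s) with no processed output")
for name in missing:
    print("  ", name)
```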
- Once you have a directory with all the processed files, we need to run them through `sox` to remove the silence left behind where the background music and sound effects used to be. When you're at this step, make sure you're in the folder with the processed wav files:

```
for file in *.wav; do sox "$file" "desilenced_${file}" silence -l 1 0.1 1% -1 0.1 1%; done
```
- Now that we have a bunch of audio files of speech, we need to run them through OpenAI's Whisper to transcribe them into usable data. To do this, install Whisper on the system you're using (guide on this page). I've written a python script, `whisper_transcript.py`, that will do this (a sketch of what such a script might look like is shown below).
- In the directory, run this command (assuming `whisper_transcript.py` is in the same directory):

```
python3 whisper_transcript.py
```
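For reference, here is a minimal sketch of what a transcription script of this kind could look like. It is not the actual `whisper_transcript.py`, and the `out_` copy step plus the pipe-separated `transcript.txt` layout are assumptions based on the outputs mentioned at the end of this guide:

```python
# Minimal sketch only; NOT the actual whisper_transcript.py from this guide.
# The out_ prefix and the "filename|text" log layout are assumptions.
import shutil
from pathlib import Path

import whisper

model = whisper.load_model("medium")  # pick a size that fits your VRAM (see the table below)

with open("transcript.txt", "a", encoding="utf-8") as log:
    for path in sorted(Path(".").glob("desilenced_*.wav")):
        result = model.transcribe(str(path))
        out_name = f"out_{path.name}"
        shutil.copy(path, out_name)                        # keep a copy with the out_ prefix
        log.write(f"{out_name}|{result['text'].strip()}\n")
```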
- Note: In the python file, change the model according to the VRAM you have:
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small               | ~2 GB        | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
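If you'd rather not guess, the thresholds in the table can be turned into a small helper. This is just an optional sketch using `torch` (which the Whisper install pulls in), not something the guide requires:

```python
# Optional helper (an assumption, not part of the guide): choose a Whisper
# model size from the free VRAM torch reports, using the table's thresholds.
import torch

def pick_model() -> str:
    if not torch.cuda.is_available():
        return "base"                      # CPU fallback
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024 ** 3
    if free_gb >= 10:
        return "large"
    if free_gb >= 5:
        return "medium"
    if free_gb >= 2:
        return "small"
    return "base"

print(pick_model())
```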
- Cool, now you have a dataset with the audio files beginning with `out_` and a transcript log in `transcript.txt`.
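As a final optional step, a quick count can confirm the clips and the log line up. This sketch only assumes one transcript line per clip, since the exact log format comes from `whisper_transcript.py`:

```python
# Small sanity check on the finished dataset (an assumption, not part of the
# guide): compares file and log-line counts rather than parsing the log.
from pathlib import Path

clips = sorted(Path(".").glob("out_*"))
lines = Path("transcript.txt").read_text(encoding="utf-8").splitlines()

print(f"{len(clips)} audio clips, {len(lines)} transcript lines")
if len(clips) != len(lines):
    print("Counts differ -- some clips may be missing a transcription.")
```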