Transcribing Thai YouTube video using Google Cloud

How to transcribe Thai speech in videos into text. You will need:


  • Google Cloud or Firebase project with billing enabled.

  • gcloud command line tool installed.

  • ffmpeg or Docker.

  • youtube-dl to download YouTube videos.

  • 30 Baht per hour of input audio.
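
Before starting, it can help to confirm that the command-line tools listed above are actually on your PATH. This is only a minimal sketch: missing_tools is a hypothetical helper name, not part of any of these tools, and the list of commands passed to it is just an example.

```shell
missing_tools() {
  # Print the name of every given command that cannot be found on PATH.
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools used in this guide (swap docker for ffmpeg if you run ffmpeg directly).
missing_tools gcloud youtube-dl docker
```

If nothing is printed, every tool was found.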

Step 1: Grab the audio track

For example, to download the audio from YouTube with youtube-dl:

youtube-dl -f bestaudio '<VIDEO URL>'

Step 2: Convert

We need to convert the audio into a format that is supported by the Google Cloud APIs. We will use OGG Opus.

docker run -v "$PWD:/data" jrottenberg/ffmpeg -i "/data/<FILENAME>.m4a" -c:a libopus -ar 16000 -ac 1 "/data/<FILENAME>.ogg"

To cut out a portion of the audio, put -ss <START TIME> -t <DURATION> before -i. For example, -ss 01:38:23 -t 00:30:00.
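
Putting the conversion and the cut together, the full command could look like the sketch below. convert_clip is a hypothetical wrapper name introduced here for illustration. It assumes the input file sits in the current directory and writes the .ogg next to it.

```shell
convert_clip() {
  # $1 = input file name in the current directory (e.g. the .m4a from Step 1)
  # $2 = start time, $3 = duration (both in ffmpeg's HH:MM:SS form)
  in="$1"; start="$2"; duration="$3"
  out="${in%.*}.ogg"   # same name, .ogg extension

  # -ss/-t go before -i so that only the requested portion is converted.
  docker run -v "$PWD:/data" jrottenberg/ffmpeg \
    -ss "$start" -t "$duration" -i "/data/$in" \
    -c:a libopus -ar 16000 -ac 1 "/data/$out"
}

# Usage: convert 30 minutes starting at 01:38:23
#   convert_clip "<FILENAME>.m4a" 01:38:23 00:30:00
```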

Step 3: Recognize

  1. Upload the ogg file to Google/Firebase Cloud Storage. After uploading, you will get a <STORAGE LOCATION> such as gs://<PROJECT><FILENAME>.ogg.

  2. Start the transcription:

    gcloud ml speech recognize-long-running "<STORAGE LOCATION>" --language-code=th --encoding=ogg-opus --include-word-time-offsets --sample-rate=16000 --async

    It will print out:

      "name": "5766027198115285298"

    This is your <OPERATION ID>.

  3. Wait for the operation to finish and write the results to a file:

    gcloud ml speech operations wait "<OPERATION ID>" > "<FILENAME>.json"
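
The three steps above can also be chained in one small script. This is a sketch under some assumptions: transcribe is a hypothetical function name, gsutil (which normally ships with the Google Cloud SDK) is used for the upload, and gcloud's general --format='value(name)' option is used to print only the operation ID instead of the full JSON.

```shell
transcribe() {
  # $1 = local .ogg file, $2 = your <STORAGE LOCATION> (a gs:// URL)
  local_file="$1"
  storage_location="$2"

  # 1. Upload the audio to Cloud Storage.
  gsutil cp "$local_file" "$storage_location" || return 1

  # 2. Start the transcription and capture the operation ID.
  operation_id=$(gcloud ml speech recognize-long-running "$storage_location" \
    --language-code=th --encoding=ogg-opus --include-word-time-offsets \
    --sample-rate=16000 --async --format='value(name)') || return 1

  # 3. Wait for the operation to finish and save the result as <name>.json.
  gcloud ml speech operations wait "$operation_id" > "${local_file%.*}.json"
}

# Usage:
#   transcribe "<FILENAME>.ogg" "<STORAGE LOCATION>"
```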

View the JSON file.
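
To read just the recognized text, you can pull the transcript lines out of the file. The sketch below assumes the JSON is pretty-printed with one "transcript": "..." field per line, which is what gcloud appears to produce, and that the text contains no escaped quotes. print_transcripts is a hypothetical helper name, and a real JSON parser would be more robust than this sed pattern.

```shell
print_transcripts() {
  # $1 = the JSON file written in Step 3.
  # Prints the value of every line of the form:   "transcript": "...",
  sed -n 's/^[[:space:]]*"transcript": "\(.*\)",\{0,1\}[[:space:]]*$/\1/p' "$1"
}

# Usage:
#   print_transcripts "<FILENAME>.json"
```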
