Satish Chandra Gupta scgupta

Beyond LibriSpeech: the amount of spoken content stored in LibriVox

Overview

Given that LibriVox contains enough English content for a speech processing corpus, LibriSpeech, to be built from it, I wondered how much content LibriVox has in languages other than English.

I downloaded the audiobook metadata from the LibriVox JSON API, grouped the audiobooks by language, and summed their durations to obtain a per-language breakdown of spoken time.

The result is over 60 thousand hours of English, thousands of hours each of German, Dutch, French, and Spanish, and hundreds of hours for other languages.
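
The grouping and summing step described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used here: it assumes each audiobook record is a dict with a `language` field and a `totaltimesecs` field holding the duration in seconds (these field names are assumptions about the API output), and the `hours_by_language` helper and the sample records are hypothetical.

```python
from collections import defaultdict


def hours_by_language(books):
    """Sum audiobook durations per language and return hours, largest first.

    Assumes each record has "language" and "totaltimesecs" keys
    (hypothetical field names; adjust to the actual JSON layout).
    """
    totals = defaultdict(float)
    for book in books:
        # Treat missing or empty language labels as "Unknown".
        language = (book.get("language") or "Unknown").strip() or "Unknown"
        try:
            seconds = float(book.get("totaltimesecs") or 0)
        except (TypeError, ValueError):
            # Skip durations that cannot be parsed as a number.
            seconds = 0.0
        totals[language] += seconds / 3600.0
    # Sort languages by total hours, descending.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up records, for illustration only.
sample = [
    {"language": "English", "totaltimesecs": 7200},
    {"language": "German", "totaltimesecs": "3600"},
    {"language": "English", "totaltimesecs": 1800},
]
breakdown = hours_by_language(sample)
for language, hours in breakdown.items():
    print(f"{language}: {hours:.1f} h")
```

Durations are parsed defensively because metadata records may carry the value as a string or leave it empty; such records contribute zero hours rather than aborting the run.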