@cedrickchee
Last active May 18, 2024 15:29
Voice AI Research

Goal: make computers talk like humans.

My notes on audio (voice + speech) AI research, started in 2023.

Emerging Research

Audio-to-audio Models

Audio-to-audio (or, more precisely, speech-to-speech) models are trained end-to-end.

It's a multimodal model = Automatic Speech Recognition (ASR)/STT + TTS + Language Model

TTS

Classic and modern Text-To-Speech (TTS) models and technologies.

STT

Classic and modern Speech-To-Text (STT) models and technologies.

Voice Platform

User Interface, Cloud service providers.

Vapi

Open Source Software

Real-time Human Conversational AI Systems

  • Fixie.ai - scaling real-time human conversational AI systems. CTO: Justin Uberti, ex-Google, created WebRTC and Google Duo; tech lead for Stadia, Hangouts.

    Human conversations are fast, typically around 200ms between turns, and we think LLMs should be just as quick. This site, https://thefastest.ai/ provides reliable measurements for the performance of popular models.
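One way to put a number on "just as quick" is to time how long a streaming model takes to emit its first chunk and compare that with the roughly 200 ms turn gap quoted above. The sketch below is a minimal illustration, not the methodology of any particular benchmark: `measure_stream` and `fake_model_stream` are hypothetical names, and the fake stream only simulates a model with `time.sleep`.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

# Typical gap between human conversational turns, as quoted above (assumption: used as a budget).
TURN_GAP_MS = 200.0


def measure_stream(make_stream: Callable[[], Iterable[str]]) -> Tuple[float, float, List[str]]:
    """Return (time to first chunk in ms, total time in ms, chunks) for a streamed response."""
    start = time.perf_counter()
    first_ms = None
    chunks: List[str] = []
    for chunk in make_stream():
        if first_ms is None:
            # Latency until the first piece of output arrives.
            first_ms = (time.perf_counter() - start) * 1000.0
        chunks.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000.0
    if first_ms is None:
        # Empty stream: nothing arrived before the stream ended.
        first_ms = total_ms
    return first_ms, total_ms, chunks


def fake_model_stream(first_delay_s: float = 0.05, chunk_delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM or TTS response."""
    time.sleep(first_delay_s)  # simulated time before the first chunk
    yield "Hello"
    for piece in (",", " world"):
        time.sleep(chunk_delay_s)  # simulated inter-chunk delay
        yield piece


if __name__ == "__main__":
    first_ms, total_ms, chunks = measure_stream(fake_model_stream)
    verdict = "within" if first_ms <= TURN_GAP_MS else "over"
    print(f"first chunk after {first_ms:.0f} ms ({verdict} the {TURN_GAP_MS:.0f} ms turn gap)")
    print(f"full response after {total_ms:.0f} ms: {''.join(chunks)!r}")
```

To time a real service, replace `fake_model_stream` with a function that returns that service's streaming iterator; in a full voice pipeline the STT, LLM, and TTS stages each add their own first-chunk latency.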

Applications

Voice Cloning (Synthetic Voices)

Text-To-Audio

Music Generative AI:

Uncategorized

Communities

  • Are you using Voice AI?

    the tech is very good, but not quite there yet I don't think, it's extremely close though.

    While ElevenLabs seems the best, it's a shame it lacks the ability to edit the clips a little more like some of the other tools have, for speeding up certain words, making them louder or adding in some emotion. The other tools do this far better, however they sound robotic, i'm exploring if this could be achieved with some manual editing. I'd go over quota pretty quickly. I imagine the cost will come down.

    I’d like a TTS which is emotionally expressive and can be used for video game characters.

  • Are Voice AI Pipeline Platforms a Race to the Bottom?

    "VoiceAI" platform examples are Vapi.ai (the best in my opinion), Bland.ai, Toma.so, Retell AI, Infer.so, Marr Labs, Elto.ai. Voice examples are playht, elevenlabs, amazon polly.

  • PlayHT2.0: State-of-the-Art Generative Voice AI Model for Conversational Speech (play.ht)

    Realtime Speech Generation, Instant Voice Cloning, Cross-language and Accent Cloning, Directing Emotions

    • Self-proclaimed state of the art. A year ago, i would have been blown away, today, this is dramatically worse than Eleven Labs. Lower quality audio, strange cadence, pretty monotonic. It’s not what people sound like. I think it’s impressive, but i wouldn’t call it state of the art.
  • Looking for a 24-7 Real-Time Voice Transcription Tool

    if you do decide to, start with ggml/whisper

  • New models and developer products (openai.com) (Nov 2023)

    • As for all the surrounding stuff like detecting speech starting and stopping and listening for interruptions while talking, give my voice AI a try. It has a rough first pass at all that stuff, and it needs a lot of work but it's a start and it's fun to play with. Ultimately the answer is end-to-end speech-to-speech models, but you can get pretty far with what we have now in open source!

    • A few notes on pricing:

      • ElevenLabs is $0.24 per 1K characters while OpenAI TTS HD is $0.03 per 1K characters. Elevenlabs still has voice copying but for many use-cases it's no longer competitive.
    • The new TTS is much cheaper than eleven labs and better too. I don't know how the model works so maybe what i'm asking isn't even feasible but i wish they gave the option of voice cloning or something similar or at least had a lot more voices for other languages. The default voices tend to make other language output have an accent.

      • I'm not sure if the tts is better than eleven labs. English audio sounded really good, but the Spanish samples I've generated are off a bit. It definitely sounds human, but it sounds like an English native speaker speaking Spanish. Also I've noticed on inputs just a few sentences long, it will sometimes repeat, drop, or replace a word. The accent part I'm okay with, but the missing words is a big issue.
    • The TTS seems really nice, though still relatively expensive, and probably limited to English (?). I can’t wait until that level of TTS will become available basically for free, and/or self-hosted, with multi-language support, and ubiquitous on mobile and desktop.

      • It's not limited to English. The model at least. Doubt the API will be too. Expensive ? Compared to what? Eleven labs costs an arm and a leg in comparison.
  • Hume – voice AI with emotional intelligence (hume.ai)

    • I've been playing around with it for 15 minutes or so. It's like having a conversation with five or six different people. It's pretty awesome!
    • This should rank higher. Absolutely mind-blowing stuff
  • Universal Speech Model (research.google)

  • Reddit
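The per-character prices quoted in the pricing note above (ElevenLabs at $0.24 and OpenAI TTS HD at $0.03 per 1K characters) work out to an 8x difference. The sketch below runs that arithmetic for a given amount of text. The two prices come from the quoted comment and may be out of date; `tts_cost` and the 100,000-character workload are illustrative assumptions.

```python
# Prices in USD per 1,000 characters, as quoted in the comment above (may be outdated).
PRICE_PER_1K_CHARS = {
    "ElevenLabs": 0.24,
    "OpenAI TTS HD": 0.03,
}


def tts_cost(num_chars: int, price_per_1k: float) -> float:
    """Cost in USD of synthesizing num_chars characters at a per-1K-character price."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1000 * price_per_1k


if __name__ == "__main__":
    num_chars = 100_000  # hypothetical workload, roughly a short book chapter's worth of text
    for provider, price in PRICE_PER_1K_CHARS.items():
        print(f"{provider}: ${tts_cost(num_chars, price):.2f} for {num_chars:,} characters")
```

At these prices, 100,000 characters cost $24.00 with ElevenLabs and $3.00 with OpenAI TTS HD.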

Tweets

Funding

  • https://voice-ai-newsletter.krisp.ai/p/8-predictions-for-voice-ai-in-2024

    With the fast advancements in on-device STT and LLM technologies, it’s apparent that this technology will become part of our daily routine beyond call centers.

    We predict that the "second brain" sitting next to you and helping/coaching you during your meetings will already be a reality in 2024.

    Cloud Speech-to-text will get 2x cheaper: the launch of Whisper disrupted the Speech-to-text market.

Stack

Design and systems architecture:
