WhisperSpeech: An Overview

WhisperSpeech is an Open Source text-to-speech project whose stated goal is to build a model that is to speech what Stable Diffusion is to images: powerful and easy to customize. The project commits to Open Source code and properly licensed speech recordings, making its models safe to use in commercial applications.

Key Features and Updates

WhisperSpeech currently trains its models on the English LibriLight dataset and aims to support multiple languages in a forthcoming release, building on the fact that both of its underlying components, Whisper and EnCodec, are multilingual.
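
For context, the project is driven through a Pipeline class; below is a minimal generation sketch based on the project's published quick-start. The checkpoint reference 'collabora/whisperspeech:s2a-q4-tiny-en+pl.model' is taken from those examples (the en+pl suffix reflects the English and Polish support described above); exact names may differ between releases.

```python
# pip install whisperspeech
from whisperspeech.pipeline import Pipeline

# Checkpoint reference taken from the project's published examples;
# the "en+pl" suffix marks the bilingual English+Polish S2A model.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesize a sentence and write the resulting waveform to disk.
pipe.generate_to_file("output.wav", "WhisperSpeech is an open text-to-speech system.")
```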

Progress Report as of January 18, 2024

The project showcases the ability to mix languages within a single sentence, with English project names flowing smoothly into Polish speech. The demo highlights:

  • WhisperSpeech
  • Collabora
  • LAION
  • JUWELS

Additionally, the team provides a voice-cloning sample that uses a speech by Winston Churchill as the reference audio, demonstrating how far the technology has come; a hedged usage sketch follows.
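
As a rough illustration of that workflow, here is a hedged sketch using the same Pipeline API: the speaker argument (a path or URL to a reference recording) follows the project's published voice-cloning examples, and 'churchill.ogg' is a hypothetical placeholder, not a file shipped with the project.

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Speak new text in the voice of the reference recording.
# "churchill.ogg" is a hypothetical placeholder file.
pipe.generate_to_file(
    "cloned.wav",
    "We shall defend our island, whatever the cost may be.",
    speaker="churchill.ogg",
)
```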

Progress as of January 10, 2024

The team reports a new SD S2A model that is notably faster while still producing high-quality speech, and includes a voice-cloning example built on a reference audio file.

Progress as of December 10, 2023

The update included samples of English speech with a female voice and a Polish speech sample with a male voice.

Older updates have been moved to an archive, reflecting the project's steady release cadence.

Downloads and Roadmap

Downloads available include pre-trained models and converted datasets. The roadmap proposes gathering a more extensive emotive speech dataset, exploring generation conditioning on emotions and prosody, establishing a community-driven collection of freely licensed multilingual speech, and training finalized multi-language models.
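
The pre-trained models are hosted on the Hugging Face Hub under collabora/whisperspeech (the repository that the Pipeline checkpoint references point at), so they can also be fetched directly. A minimal sketch with the standard huggingface_hub client, assuming the filename from the checkpoint reference used above:

```python
from huggingface_hub import hf_hub_download

# Download one pre-trained checkpoint from the project's Hub repository.
# The filename is the part after ':' in the Pipeline reference
# 'collabora/whisperspeech:s2a-q4-tiny-en+pl.model' (an assumption here).
path = hf_hub_download(
    repo_id="collabora/whisperspeech",
    filename="s2a-q4-tiny-en+pl.model",
)
print(path)  # local cache path of the downloaded file
```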

Architecture

The general architecture is similar to Google's AudioLM and SPEAR TTS and to Meta's MusicGen, and it builds on several concrete components:

  • AudioLM: Google's audio language model, whose two-stage semantic-then-acoustic token design WhisperSpeech follows.
  • SPEAR TTS: Google's text-to-speech system, another architectural influence.
  • MusicGen: Meta's music generation model, whose design also informed the project.
  • Whisper: OpenAI's Whisper encoder block is used to model the semantic tokens.
  • EnCodec: models the acoustic tokens, delivering reasonable audio quality at low bitrates.
  • Vocos: a vocoder pretrained on EnCodec tokens, used to enhance the final audio quality.

A block diagram in the project documentation shows how EnCodec fits into the overall architecture.
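
To make the acoustic-token interface concrete, here is a minimal sketch that round-trips a waveform through Meta's standalone encodec package, producing the kind of discrete tokens the S2A stage predicts; 'speech.wav' is a placeholder input, and WhisperSpeech wraps this step in its own pipeline rather than calling the library exactly this way.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model; a low target bandwidth keeps the
# number of codebooks (parallel token streams) small.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)  # kbps

# "speech.wav" is a placeholder input file.
wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    # encode() returns a list of (codes, scale) frames;
    # codes has shape (batch, n_codebooks, time).
    frames = model.encode(wav.unsqueeze(0))
    tokens = torch.cat([codes for codes, _ in frames], dim=-1)
    print(tokens.shape)  # the discrete acoustic tokens an S2A model predicts

    # Decoding the same frames reconstructs an audible waveform.
    audio = model.decode(frames)
```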

Acknowledgments and Citations

WhisperSpeech thanks its sponsors: Collabora, LAION, the Jülich Supercomputing Centre, and the Gauss Centre for Supercomputing (www.gauss-centre.eu). Individual contributors such as 'inevitable-2031' and 'qwerty_qwer' are also credited for their help with the model's development.

The citations are listed without full details, but they point to the many Open Source projects and research papers on which WhisperSpeech builds; through these placeholder entries the project acknowledges its debt to the broader research community.

WhisperSpeech presents itself not only as a technical endeavor but also as a community-focused initiative promoting openness and collaboration, with an active presence on the LAION Discord server.


Tags

  • #WhisperSpeech
  • #SpeechSynthesis
  • #OpenSource
  • #TextToSpeech

https://github.com/collabora/WhisperSpeech