Stars
Text-to-Music Generation with Rectified Flow Transformers
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversational use.
A minimal codebase for fine-tuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v, etc.
SpeechGPT Series: Speech Large Language Models
Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization
Implementation of Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt (NAACL'24).
Localized watermarking for AI-generated speech audio, with state-of-the-art robustness and a very fast detector
ChatTTS: stability scores for 2,000 speaker voices 🥇, categorized by gender and age 👧, with online audio previews 🔈
The roadmap of generative AI: use cases and applications
High-quality multilingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
Bark Voice Cloning and Voice Cloning for Chinese Speech
Suno AI's Bark model in C/C++ for fast text-to-speech
Barkify: an unofficial training implementation of Bark TTS by suno-ai
A multilingual (97-language) tool for automatic recognition and segmentation of mixed-language text content for TTS.
Automatic speech annotator that processes speech with voice activity detection, overlapping speech detection, speaker diarization, and automatic speech recognition
Inference and training library for high-quality TTS models.
Awesome speech/audio LLMs, representation learning, and codec models
Experimental implementation of a sparse-dictionary-based version of the VQ-VAE2 paper
Zero-Shot Speech Editing and Text-to-Speech in the Wild
Foundational model for human-like, expressive TTS
Unified speech language model from the paper "SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models" (ICLR 2024)
Instant voice cloning by MIT and MyShell.
VoicePAT is a modular and efficient toolkit for voice privacy research, with a main focus on speaker anonymization.
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation