Kyutai releases Hibiki: A 2.7B real-time speech-to-speech and speech-to-text translation model with near-human quality and voice transfer

Real-time speech translation is a complex challenge that requires seamless integration of speech recognition, machine translation, and text-to-speech synthesis. Traditional cascaded approaches often introduce compounding errors, fail to preserve speaker identity, and suffer from slow processing, making them poorly suited to real-time applications such as live interpretation. In addition, existing simultaneous translation models struggle to balance accuracy and latency, and they depend on complex inference mechanisms that are difficult to scale. A significant barrier remains the lack of large-scale, well-aligned speech datasets, which limits the ability to train models that can generate contextually accurate and natural translations with minimal delay.

Kyutai has developed Hibiki, a 2.7-billion-parameter decoder-only model designed for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation. Operating at a 12.5 Hz frame rate with a 2.2 kbps bitrate, Hibiki currently supports French-to-English translation and is designed to preserve voice characteristics in the translated output. A distilled version, Hibiki-M (1.7B parameters), is optimized for real-time performance on smartphones, making on-device translation more accessible.
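The frame rate and bitrate quoted above imply a fixed bit budget for every audio frame. The short calculation below works that budget out. The codebook size of 2048 entries (11 bits per token) is an assumption made for illustration; it is not stated in this article.

```python
import math

FRAME_RATE_HZ = 12.5   # frame rate quoted in the article
BITRATE_BPS = 2200     # 2.2 kbps, quoted in the article
CODEBOOK_SIZE = 2048   # ASSUMPTION for illustration: 2**11 entries per codebook


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Number of bits available to encode one audio frame."""
    return bitrate_bps / frame_rate_hz


def tokens_per_frame(bitrate_bps: float, frame_rate_hz: float, codebook_size: int) -> float:
    """Number of codebook tokens per frame, given an assumed codebook size."""
    bits_per_token = math.log2(codebook_size)
    return bits_per_frame(bitrate_bps, frame_rate_hz) / bits_per_token


# Duration of a single frame in milliseconds.
frame_ms = 1000 / FRAME_RATE_HZ

print(f"frame duration: {frame_ms} ms")
print(f"bits per frame: {bits_per_frame(BITRATE_BPS, FRAME_RATE_HZ)}")
print(f"tokens per frame: {tokens_per_frame(BITRATE_BPS, FRAME_RATE_HZ, CODEBOOK_SIZE)}")
```

At 12.5 Hz each frame covers 80 ms of audio and carries 176 bits. Under the assumed 11-bit codebooks, that corresponds to 16 audio tokens per frame.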

Technical approach and benefits

Hibiki's decoder-only architecture enables simultaneous speech processing using a multistream language model that predicts both text and audio tokens. It uses a neural audio codec (Mimi) to compress audio while maintaining fidelity, which keeps translation generation efficient. A key aspect of its design is contextual alignment, a method that uses a text translation model's perplexity to determine the optimal timing for generating speech, allowing Hibiki to adjust translation delays dynamically while maintaining context. In addition, Hibiki supports batch inference, processing up to 320 sequences in parallel on H100 GPUs, which makes it viable for large-scale applications. The model is trained on 7M hours of English audio, 450K hours of French audio, and 40K hours of synthetic parallel data, which contributes to its robustness across different speech patterns.
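The timing method described above can be illustrated with a small sketch. It assumes we already have, for each target word, the log-probability a text translation model assigns to it after seeing 0, 1, 2, ... source words (lower perplexity corresponds to higher log-probability). Each target word is then scheduled after the source word whose arrival raised its log-probability the most. The function name, the input layout, and the toy numbers are hypothetical, and this is one plausible reading of the idea rather than Kyutai's implementation.

```python
from typing import List


def schedule_target_words(logp: List[List[float]]) -> List[int]:
    """Return, for each target word, how many source words to wait for.

    logp[j][i] is the (hypothetical) log-probability of target word j given
    the previous target words and the first i source words, for i = 0..n.
    """
    schedule = []
    latest = 0
    for row in logp:
        # Gain in log-probability obtained when source word i (1-based) arrives.
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        # The source word with the largest gain is treated as the one the
        # target word depends on most.
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        # Output is produced in order, so the delay may never decrease.
        latest = max(latest, best)
        schedule.append(latest)
    return schedule


# Toy scores: 3 target words, 3 source words (4 columns: 0..3 words seen).
toy_logp = [
    [-5.0, -0.5, -0.4, -0.4],  # becomes likely after source word 1
    [-6.0, -5.8, -5.5, -0.3],  # becomes likely after source word 3
    [-4.0, -3.9, -0.6, -0.5],  # depends on source word 2, but must follow word above
]

print(schedule_target_words(toy_logp))  # [1, 3, 3]
```

The running maximum keeps the schedule monotonic: a target word cannot be emitted before the one that precedes it, even if its own dependency arrives earlier.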

Performance and evaluation

Hibiki has shown strong performance in translation quality and speaker fidelity. It achieves an ASR-BLEU score of 30.5, outperforming existing baselines, including offline models. Human evaluations rate its naturalness at 3.73/5, approaching the 4.12/5 score of professional human interpreters. The model also performs well on speaker similarity, with a similarity score of 0.52 compared to 0.43 for Seamless. Compared to Seamless and StreamSpeech, Hibiki consistently delivers higher translation quality and better voice transfer while maintaining competitive latency. The distilled Hibiki-M variant, although slightly lower in speaker similarity, remains effective for real-time on-device use.
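Speaker similarity scores like those quoted above are commonly computed as the cosine similarity between speaker embeddings of the source and translated audio. The article does not say how its scores were obtained, so the sketch below is an assumption about the general technique, and the embedding vectors are made-up toy values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 2-D "speaker embeddings" (hypothetical values, for illustration only).
source_voice = [3.0, 4.0]
translated_voice = [4.0, 3.0]

print(round(cosine_similarity(source_voice, translated_voice), 2))  # 0.96
```

A score of 1.0 means the two embeddings point in the same direction, and 0.0 means they are orthogonal, so a higher score indicates that the translated voice is closer to the original speaker.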

Conclusion

Hibiki provides a practical approach to real-time speech translation, integrating contextual alignment, efficient compression, and real-time inference to improve translation quality while preserving natural speech characteristics. By offering an open-source release under a permissive CC-BY license, Hibiki has the potential to contribute significantly to progress in multilingual communication.


Check out the paper, the models on Hugging Face, the GitHub page, and the Colab notebook. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. His latest endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million monthly views, illustrating its popularity among readers.
