Azure AI Speech text to speech Feb 2025 updates: New HD voices and more
By Garfield He

Our dedication to enhancing Azure AI Speech voices remains steadfast, as we continually strive to make them more expressive and engaging. We are pleased to announce an upgraded HD version of our neural text to speech service for selected voices. This latest iteration further enhances expressiveness by incorporating emotion detection based on the input context. Our advanced technology employs acoustic and linguistic features to produce speech with rich, natural variations. It effectively detects emotional cues within the text and autonomously adjusts the voice's tone and style. With this enhancement, users can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotional expression.

What is new?

- Public preview: 13 HD voices updated to support multilingual input
- Public preview: 12 new HD voices
- GA: super-realistic Indian voices Aarti and Arjun
- GA: AOAI turbo voices
- GA: embedded voice support with emotions
- GA: other quality improvements with regular updates

Voice demos

Public preview: 13 HD voices updated to support multilingual input

The latest model of each voice below has been updated to support multilingual and more versatile capabilities.

| Voice name | Script |
| --- | --- |
| de-DE-Seraphina:DragonHDLatestNeural | Ich kann dir Der Alchimist von Paulo Coelho empfehlen. Es ist eine inspirierende Geschichte über einen jungen Hirten namens Santiago, der auf der Suche nach einem Schatz seine Träume verfolgt. Das Buch ist voller Weisheit und lehrt, dass der Weg oft wichtiger ist als das Ziel. Es ist eine wunderbare Lektüre für alle, die nach Motivation und Lebenssinn suchen. |
| en-US-Brian:DragonHDLatestNeural | Hey again, Elizabeth. Amazing! Well I’m happy to hear you liked my suggestions for 4 star hotels in Florence. Three hotels in that list are under 300 euros a night that week: the Hotel Orto, The Grand Hotel Cavour, and Hotel degli Orafi. |
| en-US-Davis:DragonHDLatestNeural | Oh no, I didn't mean to disappoint! How about this: If you want something quirky and fun with lots of laughs, go for Blue Man Group. If you're looking for something awe-inspiring and magical, Cirque du Soleil is your best bet. Both will make for a memorable first date, so you can't go wrong. |
| en-US-Ava:DragonHDLatestNeural | I'm really sorry to hear that you're feeling this way. Finding a therapist can be a great step towards feeling better. |
| en-US-Andrew:DragonHDLatestNeural | That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. |
| en-US-Andrew2:DragonHDLatestNeural | You're very welcome. I'm glad I could assist you. If there's ever anything you want to learn more about, just let me know. I enjoy answering questions and sharing knowledge, and if you ever find yourself bored again, we can always play another game. |
| en-US-Emma:DragonHDLatestNeural | Though I’m really sorry you’re having trouble with your mother-in-law. Have you talked to your husband about the situation? |
| en-US-Emma2:DragonHDLatestNeural | Oh no, I didn't mean to disappoint! How about this: If you want something quirky and fun with lots of laughs, go for Blue Man Group. If you're looking for something awe-inspiring and magical, Cirque du Soleil is your best bet. Both will make for a memorable first date, so you can't go wrong. |
| en-US-Steffan:DragonHDLatestNeural | "I don't care what they say. I'll find a way to break the curse," declared Mia, determination shining in her eyes as she gazed at the ancient tome. "But it's too dangerous," protested Alex, worry etched on his face. "I have to try," Mia replied, her voice firm with resolve. |
| en-US-Aria:DragonHDLatestNeural | Hey, Seth. How's it going? What can I help you with today? |
| en-US-Jenny:DragonHDLatestNeural | That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. |
| ja-JP-Masaru:DragonHDLatestNeural | 最近、新しいプロジェクト任されてさ、毎日残業で忙しいけど、けっこうチームの雰囲気もいい感じで充実してるよ。そっちは最近どう? |
| zh-CN-Xiaochen:DragonHDLatestNeural | 她给我们发了一张照片,呃,在一个满是山、山珍海味婚礼上她拿了一个巨小的饭盒在吃,反正就一个特别清淡的啊,减脂营备的餐,然后她呢当时在群里发这个是,嗯,为了求表扬,哈哈哈! |

Public preview: 12 new HD voices with v1.1 support, adding more variety to the HD voice offering

| Voice name | Script |
| --- | --- |
| de-DE-Florian:DragonHDLatestNeural | Mein Lieblingsmusikgenre ist Jazz. Ich liebe die Improvisation und die Vielfalt der Klänge, die Jazz bietet. Die Energie und Emotionen, die durch die Musik transportiert werden, sind einfach einzigartig. Besonders faszinierend finde ich die Kombination aus traditionellen und modernen Elementen, die Jazz so zeitlos und spannend macht. |
| en-US-Adam:DragonHDLatestNeural | That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. |
| en-US-Phoebe:DragonHDLatestNeural | I'm really sorry to hear that you're feeling this way. Finding a therapist can be a great step towards feeling better. |
| en-US-Serena:DragonHDLatestNeural | I’ll escalate this issue to our technical team right away. They’ll contact you within 24 hours with a solution. I won’t stop until this is fixed. |
| en-US-Alloy:DragonHDLatestNeural | ...and that’s when I realized how much living abroad teaches you outside the classroom. Oh, and if you’re just joining us, welcome! We’ve been talking about studying abroad, and I was just sharing this one story—my first week in Spain, I thought I had the language down, but when I tried ordering lunch, I panicked and ended up with callos, which are tripe. Not what I expected! But those little missteps really helped me get more comfortable with the language and culture. Anyway, stick around, because next I’ll be sharing some tips for adjusting to life abroad! |
| en-US-Nova:DragonHDLatestNeural | Imagine waking up to the sound of gentle waves and the warm Italian sun kissing your skin. At Bella Vista Resort, your dream holiday awaits! Nestled along the stunning Amalfi Coast, our luxurious beachfront resort offers everything you need for the perfect getaway. Indulge in spacious, elegantly designed rooms with breathtaking sea views, relax by our infinity pool, or savor authentic Italian cuisine at our on-site restaurant. Explore picturesque villages, soak up the sun on pristine sandy beaches, or enjoy thrilling water sports—there’s something for everyone! Join us for unforgettable sunsets and memories that will last a lifetime. Book your stay at Bella Vista Resort today and experience the ultimate sunny beach holiday in Italy! |
| es-ES-Ximena:DragonHDLatestNeural | Las luces del carnaval brillaban contra el cielo nocturno, atrayendo a los visitantes. Jake y Emma estaban muy emocionados. “¡Vamos a la rueda de la fortuna, Emma!” “Está bien, pero solo si me ganas un premio.” Pasaron la tarde montando montañas rusas y disfrutando de algodón de azúcar. |
| es-ES-Tristan:DragonHDLatestNeural | “A visitar a mis abuelos en el campo. ¿Y tú?” “Regreso a casa con mi familia.” El resto del viaje, compartieron historias y risas mientras el tren avanzaba. |
| fr-FR-Vivienne:DragonHDLatestNeural | En entrant dans le vieux manoir, Antoine ressentit un frisson le long de son dos. "On devrait vraiment faire demi-tour", dit Clara, les yeux écarquillés. "Mais j’ai entendu parler de trésors cachés ici", répondit Antoine, déterminé. Juste à ce moment-là, une porte claqua derrière eux. "On y va !" s’écria Clara, effrayée. |
| fr-FR-Remy:DragonHDLatestNeural | En explorant une grotte mystérieuse, Léo et Mia découvrirent des cristaux lumineux. "C’est magnifique !", s’écria Mia, éblouie. "Regarde, il y a un passage là-bas", dit Léo, pointant du doigt. En s’enfonçant dans la grotte, ils rencontrèrent un dragon endormi. "Nous devons être prudents", murmura Léo, réalisant l’aventure qui les attendait. |
| ja-JP-Nanami:DragonHDLatestNeural | えっと、プロジェクトの進捗ですが、予定通り進んでいます。まあ、いくつか問題もありましたが、既に対処済みです。順調に進めています。 |
| zh-CN-Yunfan:DragonHDLatestNeural | 每个人都可以在日常生活中采取一些简单的环保行动。我开始减少一次性塑料的使用,进行垃圾分类,并尽量节约能源。这些小措施虽然看似微不足道,但积累起来对环境的保护却能产生积极影响。我相信,关注环保不仅是为了现在的生活,也为未来的子孙着想。你们在环保方面有什么实用的建议吗? |

GA: Super-Realistic Indian Voices: Aarti & Arjun

Aarti (female) and Arjun (male) are natural, conversational voices designed for the Indian market, offering a soft, soothing, and empathetic tone in both Hindi and English. Trained with professional voice artists and advanced SOTA modeling techniques, they excel in handling speech imperfections like pauses and interjections. Their realistic expressions and human-like intonation make them ideal for applications such as customer support, digital assistants, e-learning, and entertainment, ensuring dynamic, engaging, and clear communication in real-time interactions.

| Voice | Domain | Script |
| --- | --- | --- |
| en-IN-AartiNeural | Conversational | Hmm, I’m not sure what to make for dinner tonight. I want to try something new, but I’m not sure what. Oh no, I hope I don’t end up making something that tastes terrible. Maybe I should look up some recipes online. |
| en-IN-AartiNeural | Neutral | In the depths of the ocean, a curious mermaid named Leela explored a forgotten shipwreck. Amongst treasures, she found a mysterious locket. When she opened it, a holographic image of a distant world appeared. |
| en-IN-AartiNeural | Call center | Ugh, that's really unfortunate. But don't worry, I'm here to help you block your VodaTel number right away. Can you please verify your identity by providing some personal details, like your full name, and maybe the last transaction or recharge you did? This will ensure we can proceed quickly and securely. |
| en-IN-ArjunNeural | Conversational | Raju, listen carefully. For question five, write the formula I told you yesterday. For question six, focus on the main point only—don’t write extra. Remember the example we practiced. And for the last question, mention the key date. Keep it simple, no need for long explanations. |
| en-IN-ArjunNeural | Neutral | Cooking is an art that brings people together. I love experimenting with new recipes, especially traditional Indian dishes like biryani and samosas. The aroma of masaale, the joy of sharing a homemade meal, and the satisfaction of a well-cooked dish are truly amazing. |
| en-IN-ArjunNeural | Call center | Mr. Patel, your order (No. 789456123) included 3 paint brushes: Fine Tip Brush (Size 0), Flat Brush (Size 4), and Round Brush (Size 6). Please confirm if the damage is on the Round Brush or the Flat Brush handle. We’ll arrange a replacement as soon as the verification is done. |
| hi-IN-AartiNeural | Conversational | मुझे समझ नहीं आ रहा है की आज रात खाने में क्या बनाया जाए। मैं कुछ नया, ट्राई करना चाहती हूं, लेकिन मुझे पता नहीं है कि क्या। शायद मुझे कुछ रेसिपी ऑनलाइन देखनी चाहिए। |
| hi-IN-AartiNeural | Neutral | सुदर्शन क्रिया आपको अपने हृदय और अंतर्ज्ञान के माध्यम से ब्रह्मांड का ज्ञान और मार्गदर्शन प्राप्त करने की अनुमति देती है। श्वास लें और छोड़ें। आइए अब अपना ध्यान अपने हृदय पर केंद्रित करें। |
| hi-IN-AartiNeural | Call center | ज़रूर, एक मिनट, मैं अभी टाइमिंग्स की जानकारी देती हूं... हांजी, मिल गयी! तो, सालारजंग संग्रहालय रविवार को सुबह 9 बजे खुलता है। इसका मतलब है कि आपके पास सभी चीज़ो को देखने और वास्तव में इतिहास को समझने के लिए पर्याप्त समय होगा। कोई और जानकारी चाहिए? |
| hi-IN-ArjunNeural | Conversational | हम्म, सेमिनार सुबह 10 बजे शुरू होगा। साढ़े नौ तक पहुंचना चाहिए। क्या तुमने एजेंडा प्रिंट कर लिया? चलो बढ़िया। और हाँ, अपनी नोटबुक मत भूलना। सारे डॉक्युमेंट्स भी तैयार रखना। चलो फिर कल मिलते हैं! |
| hi-IN-ArjunNeural | Neutral | मैं मानसिक स्वास्थ्य के महत्व पर विचार कर रहा था। जिस तरह हम अपने शरीर का ख्याल रखते हैं, उसी तरह अपने दिमाग का भी ख्याल रखना बेहद जरूरी है। नियमित ब्रेक, ध्यान और प्रियजनों से बात करने से मदद मिल सकती है। |
| hi-IN-ArjunNeural | Call center | हेलो सर! आपका "ऐक्टिव फिटनेस ट्रैकर" जो 5 नवंबर 2024 को डेलिवर हुआ था, रिप्लेसमेंट के लिए योग्य है। आपकी शिकायत ID FT90987 पर दर्ज है। नया प्रोडक्ट 25 नवंबर 2024 तक डेलिवर होगा। अधिक जानकारी के लिए हमें 1800-000-123 पर कॉल करें। |

GA: AOAI turbo voices

Turbo versions of the AOAI voices, which have the same personas and support SSML like other Azure voices, are now available in all Speech regions:

- AlloyTurboMultilingualNeural
- EchoTurboMultilingualNeural
- FableTurboMultilingualNeural
- NovaTurboMultilingualNeural
- OnyxTurboMultilingualNeural
- ShimmerTurboMultilingualNeural

GA: Other quality improvements

The voices below have received quality improvements with the latest recipe:

| Locale | Voice name | Gender |
| --- | --- | --- |
| ar-EG | ShakirNeural | Male |
| bg-BG | KalinaNeural | Female |
| ca-ES | EnricNeural | Male |
| ca-ES | JoanaNeural | Female |
| da-DK | JeppeNeural | Male |
| el-GR | NestorasNeural | Male |
| en-IE | EmilyNeural | Female |
| fi-FI | HarriNeural | Male |
| fi-FI | SelmaNeural | Female |
| fr-CH | FabriceNeural | Male |
| fr-CH | ArianeNeural | Female |
| he-IL | HilaNeural | Female |
| he-IL | AvriNeural | Male |
| hr-HR | GabrijelaNeural | Female |
| id-ID | ArdiNeural | Male |
| ms-MY | YasminNeural | Female |
| nb-NO | PernilleNeural | Female |
| nb-NO | FinnNeural | Male |
| nl-NL | MaartenNeural | Male |
| pt-PT | RaquelNeural | Female |
| ro-RO | AlinaNeural | Female |
| ro-RO | EmilNeural | Male |
| ru-RU | SvetlanaNeural | Female |
| sv-SE | MattiasNeural | Male |
| sv-SE | SofieNeural | Female |
| vi-VN | HoaiMyNeural | Female |
| vi-VN | NamMinhNeural | Male |
| zh-HK | HiuMaanNeural | Female |
| zh-HK | WanLungNeural | Male |

GA: Embedded voice support with emotions

Besides the general style, the embedded JennyNeural voice now supports 14 additional styles: angry, assistant, cheerful, chat, customerservice, excited, friendly, hopeful, newscast, sad, shouting, terrified, unfriendly, and whispering.

Get started

In our ongoing quest to enhance multilingual capabilities in text to speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication. Microsoft offers over 500 neural voices covering more than 140 languages and locales.
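As a quick illustration of how these voices are consumed, here is a minimal sketch that synthesizes speech with one of the HD voices through the Speech SDK for Python; the key, region, and example voice name are placeholders to swap for your own values.

```python
# pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

# Placeholder key and region; use your own Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")
# Pick any of the HD voices announced above.
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"

# Synthesizes to the default speaker; pass AudioConfig(filename=...) to write a WAV file instead.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "I'm really sorry to hear that you're feeling this way. "
    "Finding a therapist can be a great step towards feeling better."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
else:
    print("Synthesis failed:", result.reason)
```

The HD voices pick up emotional cues from the input text automatically, so no style markup is needed; for the embedded JennyNeural styles listed above, a style such as cheerful or sad is requested through SSML instead.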
These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information:

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us at ttsvoicefeedback@microsoft.com

Numonix supercharges their value to clients with multimodality using Azure AI Content Understanding
Numonix is a compliance recording company that specializes in capturing, analyzing, and managing customer interactions across various modalities. They’re committed to revolutionizing how businesses extract value from customer interactions, offering solutions that empower businesses to make informed decisions while accelerating revenue growth, enhancing customer experiences, and maintaining regulatory compliance. By leveraging state-of-the-art technology, they provide powerful tools that help organizations ensure compliance, mitigate risk, and discover actionable insights from communications data. Numonix has many call center clients for whom regulatory compliance is crucial. They needed a way to help their clients monitor calls and customer interactions, solve call-compliance issues, and gather and extract valuable insights from their data. They were further challenged by the need to process and extract insights from different formats, including audio, images, video, and text, while improving data quality and accuracy and streamlining workflows. However, the manual process of evaluating calls was cumbersome, inaccurate, inefficient, and resource intensive. In addition, their legacy media deployments required a lot of complex on-premises hardware, which hindered their ability to react quickly: if a client suddenly ran out of storage, Numonix had to scramble to scale up and provide the needed capacity, and customers had to pay upfront for potential scale that they might want to reach in the future. To solve these issues, they partnered with Microsoft to leverage Azure’s flexible cloud services and updated their service to better manage multimodal content. To do this, they leveraged Azure AI Content Understanding, which helps businesses and developers create multimodal AI apps from varied data types and unifies the separate workflows tied to individual modalities. It offers prebuilt templates, a streamlined workflow, and opportunities to customize outputs for use cases such as call center analytics, marketing automation, and content search, without needing specialized AI skills, all while maintaining robust security and accuracy. Now, Numonix has the ability to capture insights from all recorded call data in multiple modes, including audio, video, text, and images. They can transcribe and analyze content from calls and meetings, understand context, watch videos of call interactions, and ensure compliance across all conversations. With the transition to Azure, the challenges associated with on-premises server environments, including costs and ongoing maintenance, have been eliminated. Now, when it’s determined that extra space is needed, the service seamlessly scales to the increased volume of content being processed. “We have successfully delivered seamless scalability for our customers, including the capability to integrate their Azure Blob storage accounts with our platform,” said Michael Levy, Founder and CEO. “Our platform offers a combination of robust functionality, exceptional flexibility, and comprehensive security—key advantages that we are proud to provide to our customers.” “Adopting Azure has been a transformative decision, enabling us to deliver a cloud-native solution that facilitates faster deployment for our customers while ensuring long-term scalability, technology advancements, and robust security. Scaling from 1,000 to 10,000 users is now a seamless license adjustment, with no need for complex backend modifications or DevOps intervention."
Empowered by Azure AI Content Understanding, Numonix now offers industry-leading quality management. They’ve been able to take their customers’ call coverage from roughly 3% all the way up to 100%, at a lower cost because the process is far less resource intensive. “It’s been a productivity multiplier,” says Evan Kahan, CTO of Numonix. “Leveraging Azure AI Content Understanding across multiple modalities has allowed us to supercharge the value of recorded data Numonix captures on behalf of our customers.” At the same time, they’ve expanded their business capabilities to extend even more value to their clients. As a result of implementing Azure AI Content Understanding, they’ve grown from offering only audio to also offering video, screen sharing, document sharing, chat, live interaction, and document intelligence. In addition, they can leverage multiple tools for their customers: secure and compliant meeting insights, PII redaction, and automated risk alerts all come together to help clients gather and unlock their data's full potential, drive efficiency, and innovate. Says Evan Kahan: “We’re able to bring that all together…to make sure that the audio and video that you have provides the most value that you could possibly get out of it, which we really didn’t have access to before. Everything we got out of it before was a small piece of the picture. Now, with Azure AI Content Understanding, we’re really able to leverage all of these Microsoft tools to bring this full picture to our customers.”

Get started:

- Learn more about Azure AI Content Understanding.
- Try Azure AI Content Understanding in Azure AI Foundry.

Our commitment to Trustworthy AI

Organizations across industries are leveraging Azure AI and Copilot capabilities to drive growth, increase productivity, and create value-added experiences. We’re committed to helping organizations use and build AI that is trustworthy, meaning it is secure, private, and safe. We bring best practices and learnings from decades of researching and building AI products at scale to provide industry-leading commitments and capabilities that span our three pillars of security, privacy, and safety. Trustworthy AI is only possible when you combine our commitments, such as our Secure Future Initiative and our Responsible AI principles, with our product capabilities to unlock AI transformation with confidence.

From Foundry to Fine-Tuning: Topics you Need to Know in Azure AI Services
With so many new features arriving in Azure, and so many new ways of developing, especially in generative AI, you may be wondering what you need to know and where to start in Azure AI. Whether you're a developer or an IT professional, this guide will help you understand the key features, use cases, and documentation links for each service. Let's explore how Azure AI can transform your projects and drive innovation in your organization. Stay tuned for more details!

| Term | Description | Use Case | Azure Resource |
| --- | --- | --- | --- |
| Azure AI Foundry | A comprehensive platform for building, deploying, and managing AI-driven applications. | Customizing, hosting, running, and managing AI applications. | Azure AI Foundry |
| AI Agent | Within Azure AI Foundry, an AI Agent acts as a "smart" microservice that can be used to answer questions (RAG), perform actions, or completely automate workflows. | Can be used in a variety of applications to automate tasks, improve efficiency, and enhance user experiences. | Link |
| AutoGen | An open-source framework designed for building and managing AI agents, supporting workflows with multiple agents. | Developing complex AI applications with multiple agents. | AutoGen |
| Multi-Agent | AI systems where multiple AI agents collaborate to solve complex tasks. | Managing energy in smart grids, coordinating drones. | Link |
| Model as a Platform | A business model leveraging digital infrastructure to facilitate interactions between user groups. | Social media channels, online marketplaces, crowdsourcing websites. | Link |
| Azure OpenAI Service | Provides access to OpenAI’s powerful language models integrated into the Azure platform. | Text generation, summarization, translation, conversational AI. | Azure OpenAI Service |
| Azure AI Services | A suite of APIs and services designed to add AI capabilities like image analysis, speech-to-text, and language understanding to applications. | Image analysis, speech-to-text, language understanding. | Link |
| Azure Machine Learning (Azure ML) | A cloud-based service for building, training, and deploying machine learning models. | Creating models to predict sales, detect fraud. | Azure Machine Learning |
| Azure AI Search | An AI-powered search service that enhances information to facilitate exploration. | Enterprise search, e-commerce search, knowledge mining. | Azure AI Search |
| Azure Bot Service | A platform for developing intelligent, enterprise-grade bots. | Creating chatbots for customer service, virtual assistants. | Azure Bot Service |
| Deep Learning | A subset of ML using neural networks with many layers to analyze complex data. | Image and speech recognition, natural language processing. | Link |
| Multimodal AI | AI that integrates and processes multiple types of data, such as text and images (including input and output). | Describing images, answering questions about pictures. | Azure OpenAI Service, Azure AI Services |
| Unimodal AI | AI that processes a single type of data, such as text or images (including input and output). | Writing text, recognizing objects in photos. | Azure OpenAI Service, Azure AI Services |
| Fine-Tuning Models | Adapting pre-trained models to specific tasks or datasets for improved performance. | Customizing models for specific industries like healthcare. | Azure AI Foundry |
| Model Catalog | A repository of pre-trained models available for use in AI projects. | Discovering, evaluating, fine-tuning, and deploying models. | Model Catalog |
| Capacity & Quotas | Limits and quotas for using Azure AI services, ensuring optimal resource allocation. | Managing resource usage and scaling AI applications. | Link |
| Tokens | Units of text processed by language models, affecting cost and performance. | Managing and optimizing text processing tasks. | Link |
| TPM (Tokens per Minute) | A measure of the rate at which tokens are processed, impacting throughput and performance. | Allocating and managing processing capacity for AI models. | Link |
| PTU (Provisioned Throughput) | The provisioned throughput capability allows you to specify the amount of throughput you require in a deployment. | Ensuring predictable performance for AI applications. | Link |
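To make a few of these terms concrete (Azure OpenAI Service, deployments, and tokens), here is a minimal sketch that calls a chat model deployed in Azure and prints its token usage; the endpoint, key, API version, and deployment name are placeholder assumptions for your own resource.

```python
# pip install openai
from openai import AzureOpenAI

# Placeholder endpoint, key, and API version; use your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR_API_KEY",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="my-gpt-4o-deployment",  # the deployment name you chose, not the model family
    messages=[{"role": "user", "content": "Explain Azure AI Foundry in one sentence."}],
)

print(response.choices[0].message.content)
# Tokens consumed by this call count against your TPM quota (or PTU capacity).
print("Total tokens:", response.usage.total_tokens)
```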
Unlock Multimodal Data Insights with Azure AI Content Understanding: New Code Samples Available

We are excited to share code samples that leverage the Azure AI Content Understanding service to help you extract insights from your images, documents, videos, and audio content. These code samples are available on GitHub and cover the following:

Azure AI integrations

- Visual Document Search: Leverage Azure Document Intelligence, Content Understanding, Azure Search, and Azure OpenAI to unlock natural language search of document contents for a complex document with pictures of charts and diagrams.
- Video Chapter Generation: Generate video chapters using Azure Content Understanding and Azure OpenAI. This allows you to break long videos into smaller, labeled parts with key details, making it easier to find, share, and access the most relevant content.
- Video Content Discovery: Learn how to use Content Understanding, Azure Search, and Azure OpenAI models to process videos and create a searchable index for AI-driven content discovery.

Content Understanding operations

- Analyzer Templates: An analyzer enables you to tailor Content Understanding to extract valuable insights from your content based on your specific needs. Start quickly with these ready-made templates.
- Content Extraction: Learn how the Content Understanding API can extract semantic information from various files, including performing OCR to recognize tables in documents, transcribing audio files, and analyzing faces in videos.
- Field Extraction: This example demonstrates how to extract specific fields from your content. For instance, you can identify the invoice amount in a document, capture names mentioned in an audio file, or generate a summary of a video.
- Analyzer Training: For document scenarios, you can further enhance field extraction performance by providing a few labeled samples.
- Analyzer Management: Create a minimal analyzer, list all analyzers in your resource, and delete any analyzers you no longer need.

Azure AI Content Understanding: Turn Multimodal Content into Structured Data

Azure AI Content Understanding is a cutting-edge Azure AI offering designed to help businesses seamlessly extract insights from various content types. Built with and for generative AI, it empowers organizations to develop GenAI solutions using the latest models without needing advanced AI expertise. Content Understanding simplifies the processing of unstructured stores of documents, images, videos, and audio, transforming them into structured, actionable insights. It is versatile and adaptable across numerous industries and use-case scenarios, offering customization and support for input from multiple data types. Here are a few example use cases:

- Retrieval Augmented Generation (RAG): Enhance and integrate content from any format to power effective content searches or provide answers to frequent questions in scenarios like customer service or enterprise-wide data retrieval.
- Post-call analytics: Organizations use Content Understanding to analyze call center or meeting recordings, extracting insights like sentiment, speaker details, and topics discussed, including names, companies, and other relevant data.
- Insurance claims processing: Automate time-consuming processes like analyzing and handling insurance claims or other low-latency batch processing tasks.
- Media asset management and content creation: Extract essential features from images and videos to streamline media asset organization and enable entity-based searches for brands, settings, key products, and people.
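If you want a feel for the API before opening the samples, here is a rough sketch of calling an analyzer over REST; the route, API version, and analyzer name below are assumptions based on the preview service, so verify them against the current Content Understanding documentation before use.

```python
import time
import requests

endpoint = "https://YOUR-RESOURCE.cognitiveservices.azure.com"  # placeholder
headers = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",  # placeholder
    "Content-Type": "application/json",
}

# Submit a file to a prebuilt analyzer (assumed route and api-version).
resp = requests.post(
    f"{endpoint}/contentunderstanding/analyzers/prebuilt-documentAnalyzer:analyze",
    params={"api-version": "2024-12-01-preview"},
    headers=headers,
    json={"url": "https://example.com/sample-invoice.pdf"},  # hypothetical input document
)
resp.raise_for_status()

# Analysis is asynchronous: poll the returned operation until it finishes.
operation_url = resp.headers["Operation-Location"]
while True:
    result = requests.get(operation_url, headers=headers).json()
    if result.get("status") in ("Succeeded", "Failed"):
        break
    time.sleep(2)

print(result)  # structured content and fields extracted from the input
```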
Resources & Documentation

To begin extracting valuable insights from your multimodal content, explore the following resources:

- Azure Content Understanding Overview
- Azure Content Understanding in Azure AI Foundry
- FAQs

Want to get in touch? We’d love to hear from you! Send us an email at cu_contact@microsoft.com.

Create a custom text to speech avatar through self-service
AI avatars are revolutionizing the way we interact with technology. They serve as virtual sales agents assisting customers, personalized service assistants providing 24/7 support, digital teachers bringing lessons to life, and brand representatives in advertising. Today we are excited to announce that the Azure AI Speech service has released a self-service portal, in public preview, for custom training of text to speech avatars. Creating an avatar for your business that supports both real-time live chats and video generation is now easier than ever. All you need to get started is a consent video and a minimum of a few minutes of video recordings in total as training data. A state-of-the-art avatar model is just a click away. Check out the video below for an overview of the public preview of the custom text to speech avatar self-service portal. In this article, we provide a comprehensive step-by-step guide for developing a custom text to speech avatar tailored to your business needs.

Steps in the journey

Before we begin, here is a step-by-step overview of creating a custom text to speech avatar. We explain each of these steps in detail below.

1. Prepare
   - Meet Responsible AI requirements: read and fill out the custom avatar application.
   - Cast an avatar performer: define the avatar persona and find an avatar performer.
   - Record the performer: record a permission statement and the training videos.
2. Create in Speech Studio
   - Start a new project: log into Speech Studio using your Azure account and create a new custom avatar project.
   - Upload video data: upload the permission statement recording and the video training data.
   - Train the avatar model: confirm there is enough data to train and check the quality.
   - Deploy your avatar model: deploy the trained model.
3. Integrate
   - Use your avatar via the TTS avatar tool for video generation, or
   - Build your own app with your avatar using the Speech SDK.

Responsible AI

Prioritizing responsible AI is fundamental to our text to speech avatar capability. Custom avatar was developed in strict adherence to our responsible AI principles and is offered as a limited-access service with eligibility and use-case requirements through a controlled registration and review process. To learn more about the responsible AI considerations in the development and usage of this service, review our Azure Text to Speech Transparency Note. The first step to creating your own custom avatar is filling out a registration form to gain access to the technology. Please make sure to read through the registration form and fill it out completely. Once you’ve registered, your eligibility for access is confirmed, and you have committed to using the feature in alignment with our responsible AI principles, you will be granted access. At this stage we are not offering the service to individual users for personal use.

Persona design

Persona refers to the attributes that make your imaginary character come to life in a way that will resonate with your customers. For example, you may want a 40ish-year-old female who performs with authority and confidence, is directly engaging, thoughtful, and unbiased. Think carefully about your persona, because this will be a representation of your company when speaking to customers. Once the persona is defined, you can cast your performer (avatar talent) for data collection. Make sure your avatar talent has experience in the persona and is comfortable with the gestures or movements you would like to capture.
Most importantly, once you have chosen an avatar talent, make sure the talent is willing to sign a contract with you stating that they will offer their likeness to create an avatar, and a synthetic voice if you would like to pair the avatar with a voice that sounds like them. Keep in mind that the look and feel of the avatar created depends heavily on the persona you have designed.

Recording

When choosing an avatar talent, it’s a good idea to consider where your recording will take place. We recommend recording in a professional video recording studio or a well-lit place. If you need a commercial, multi-scene avatar, the background of the video should be clean, smooth, and pure-colored, and a green screen is the best choice. If your avatar only needs to be used in a single scene, you can record in a specific scene (such as in your office), but the background can't be subtracted and changed. The custom text to speech avatar doesn't support customization of clothes or looks. Therefore, it's essential to carefully design and prepare the avatar's appearance when recording the training data.

At least three video clips are required:

- Consent video: The consent video must show the same avatar talent speaking, following the requirements of the consent statement.
- Naturally speaking: The actor speaks in status 0 but with natural hand gestures from time to time. Minimum 5 minutes, maximum 30 minutes in total.
- Silent status: A 1-minute video clip of the actor maintaining status 0 without speaking, but relaxed. This clip is used as the main template for both the speaking and listening status of a chatbot.

If you would like to add custom gestures, prepare two additional video clips:

- Gestures: One 10-second video clip for each gesture. Each custom avatar model can support no more than 10 gestures.
- Status 0 speaking: A video clip of the performer speaking for 3 to 5 minutes, representing the posture that the performer can naturally maintain most of the time while speaking. For example, arms crossed in front of the body or hanging down naturally at the sides.

The quality of your avatar model depends heavily on the quality of the recorded videos used for training. It’s critical that you make sure the data is collected following the requirements. For more detailed instructions, best practices, and sample data, check this document.

Uploading data

Go to the Speech Studio portal and log in with your Azure account. Select Custom avatar (preview) and then Create a project. Go to the project, select Set up avatar talent, then Upload consent video. Navigate to Prepare training data and Upload data that you’ve prepared in the previous steps. You can choose to upload data from local files on your computer or provide access to Azure Blob storage. Data files are automatically validated when you select Submit. Data validation includes a series of checks on the video files to verify their file format, size, and total volume. If there are any errors, fix them and submit again. After you upload the data, you can check the data overview, which indicates whether you have provided enough data to start training. Below is an example of enough data added for training an avatar without additional gestures.

Training your avatar model

Once you have enough data uploaded, you can start to train a model. Enter a Name to help you identify the model. Choose the name carefully: the model name is used as the avatar name in your synthesis request from the SDK and in SSML input.
Only letters, numbers, hyphens, and underscores are allowed, and you should use different names for different models. It’s important to note that your avatar model name must be unique: no duplicate names are allowed under the same Speech or Azure AI resource. Training duration varies depending on how much data you use; it normally takes 20-40 compute hours on average to train a custom avatar. Check the pricing note on how training is charged.

Deploying your avatar model

After you've successfully created your avatar model, you deploy it to your endpoint. When a model is deployed, you pay for continuous uptime of the endpoint regardless of your interaction with that endpoint. Check the pricing note on how model deployment is charged. You can delete a deployment when the model is not in use to reduce spending and conserve resources. Custom avatar training is currently only available in some regions. After your avatar model is trained in a supported region, you can copy it to a Speech resource in another region for deployment as needed. For more information, see the Speech service regions.

Integrate your avatar model into your chosen platform

After you deploy your custom avatar, it's available to use in Speech Studio or via the API:

- The avatar appears in the avatar list of the text to speech avatar tool in Speech Studio.
- The avatar appears in the avatar list of the live chat avatar tool in Speech Studio.
- You can call the avatar from the API by specifying the avatar model name.

Check out the sample code on GitHub for integrating your avatar with the latest generative AI models, such as Azure OpenAI GPT-4o or the real-time API. If you're also creating a custom neural voice for the actor, the avatar can be highly realistic. For more information, see the custom neural voice overview. Note that custom neural voice and custom text to speech avatar are separate features; you can use them independently or together.

Customer cases

Custom text to speech avatar has enabled many customers and partners around the world to develop engaging customer service solutions across a variety of industries. These include KPMG, Fujifilm, MAPFRE, Dentsu Digital, Bank SinoPac, Herbalife, Coca-Cola, and more. (Check out their testimonials here.) In addition, read the story of how CDW is leveraging Azure text to speech avatar in their business solutions.

Get started

Azure text to speech (TTS) avatar is a powerful tool for developers looking to enhance customer engagement and improve the overall experience. With a variety of use cases and customer references, it's clear that Azure TTS avatar is paving the way for a new era of customer engagement and innovation. As developers, you can use Azure TTS avatar to create personalized and engaging experiences for your customers and employees with a rich choice of prebuilt avatars and voices available. You can also leverage custom avatar and custom neural voice to create custom synthetic voices and images that represent your brand. With responsible AI features that promote transparency and fairness, Azure TTS avatar helps you create inclusive and ethical applications that serve a diverse range of users. For more basics of the text to speech avatar service and its responsible AI considerations, check out this blog.
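For a sense of what calling the avatar API looks like, here is a rough sketch that submits a batch avatar video synthesis job over REST; the route, API version, property names, region, and style shown are assumptions drawn from the batch avatar synthesis preview, so confirm them against the text to speech avatar documentation.

```python
import requests

region = "westus2"                    # assumed avatar-enabled region; check the docs
speech_key = "YOUR_SPEECH_KEY"        # placeholder
job_id = "my-custom-avatar-job-001"   # hypothetical job name

# Create a batch synthesis job that renders text as a talking-avatar video.
response = requests.put(
    f"https://{region}.api.cognitive.microsoft.com/avatar/batchsyntheses/{job_id}",
    params={"api-version": "2024-08-01"},
    headers={"Ocp-Apim-Subscription-Key": speech_key, "Content-Type": "application/json"},
    json={
        "inputKind": "PlainText",
        "inputs": [{"content": "Hello! I am your brand's custom avatar."}],
        "synthesisConfig": {"voice": "en-US-AvaMultilingualNeural"},
        "avatarConfig": {
            # For a custom avatar, this is the model name you chose at training time.
            "talkingAvatarCharacter": "my-custom-avatar-model",
            "talkingAvatarStyle": "graceful-sitting",  # hypothetical style name
        },
    },
)
response.raise_for_status()
print(response.json()["status"])  # poll this job until it reports Succeeded
```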
Learn more:

- Create a video using prebuilt avatars
- Try our live chat demo with prebuilt avatars
- Learn how to create a custom avatar
- Try our TTS voice demo
- Apply for access to custom avatar and custom neural voice

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today’s landscape, speech recognition technologies play a critical role across various industries, improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:

- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet for certain highly specialized domains, such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature, off-the-shelf recognition models may fall short. To achieve the best possible performance, you’ll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The data challenge: When training datasets lack sufficient diversity or volume, especially in niche domains or underrepresented speech patterns, model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft’s Phi-3.5 model and Azure’s pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale, with no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It’s especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:

- Privacy compliance: Train models without handling personal or sensitive data.
- Filling data gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (speech to text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide, supported by code samples, to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture:

[Figure: End-to-End Custom Speech-to-Text Model Fine-Tuning Process]

Custom Speech with Synthetic Data hands-on labs: GitHub repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You’ll need to:

- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focused on creating a custom Italian model:

```
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
```

Key settings explained:

- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: The language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configuration for your Azure Storage account, where training and evaluation data will be stored.

[Screenshot: Azure AI Speech Studio > Voice Gallery]

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

```python
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required.
use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys
```

Sample output (Contoso Electronics in Italian):

```
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
```

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech’s TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core function:

```python
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                   <voice name='{default_tts_voice}'>
                       {html.escape(text)}
                   </voice>
               </speak>"""
    speech_sythesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_sythesis_result)
    stream.save_to_wav_file(file_path)
```

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you’re mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
Code snippet (illustrative):

```python
import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)
print(f"Created zip file: {zip_filename}")

shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = {os.path.join(train_dataset_dir, zip_filename)}
%store train_dataset_path
```

You’ll also similarly create evaluation data, using a different TTS voice than the ones used for training to ensure a meaningful evaluation scenario.

Example snippet to create the synthetic evaluation data:

```python
import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
```

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you’ll interact with Azure’s Speech-to-Text APIs:

1. Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
2. Register your dataset as a Custom Speech dataset.
3. Create a Custom Speech model using that dataset.
4. Create evaluations using that custom model, polling with asynchronous calls until they complete.

You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on lab we'll use the Azure Speech-to-Text REST APIs to drive the entire process.

Key APIs & references:

- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts the API calls for convenience.
Example snippet to create the training dataset:

```python
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
```

You can monitor training progress using the monitor_training_status function, which polls the model’s status and notifies you once training completes.

Core function:

```python
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")
```

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by word error rate, WER) against the base model’s WER. Key steps:

- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example snippet to create an evaluation:

```python
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
```

You can also compute a simple WER summary with the code below, which you can use in 4_evaluate_custom_model.ipynb.

Example snippet to create the WER DataFrame:

```python
# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
```

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words, so a lower WER signifies better accuracy; for example, 3 errors against a 30-word reference gives a WER of 10%. Synthetic data can help reduce WER by introducing more domain-specific terms during training. You'll also similarly create a WER result markdown file using the md_table_scoring_result method below.
Core function:

```python
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
```

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft’s Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:

- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure’s AI and speech services, you’ll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions, without the overhead of large-scale data collection efforts. 🙂

Reference

- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

My Journey of Building a Voice Bot from Scratch
My Journey in Building a Voice Bot for Production

The world of artificial intelligence is buzzing with innovations, and one of its most captivating branches is the development of voice bots. These digital entities have the power to transform user interactions, making them more natural and intuitive. In this blog post, I want to take you on a journey through my experience of building a voice bot from scratch using Azure's cutting-edge technologies: OpenAI GPT-4o-Realtime, Azure Text-to-Speech (TTS), and Speech-to-Text (STT).

Key Features of an Effective Voice Bot

- Natural Interaction: A voice agent's ability to converse naturally is paramount. The goal is to create interactions that mirror human conversation, avoiding robotic or scripted responses. This naturalism fosters user comfort and leads to a more seamless, engaging experience.
- Context Awareness: True sophistication in a voice agent comes from its ability to understand context and retain information. This capability allows it to provide tailored responses and actions based on user history, preferences, and specific queries.
- Multi-Language Support: One of the significant hurdles in developing a comprehensive voice agent is multi-language support. As brands cater to diverse markets, ensuring clear and contextually accurate communication across languages is vital.
- Real-Time Processing: The real-time capabilities of voice agents allow for immediate responses, enhancing the customer experience. This is crucial for time-sensitive tasks like booking, purchasing, and inquiries.

The opportunities are immense. Implemented well, a robust voice agent can revolutionize customer engagement. Consider a business that uses an AI-driven voice agent in an outbound marketing campaign: it can manage high volumes of prospects efficiently, delivering a far better return on investment than traditional methods.

Before diving into the technicalities, it's crucial to have a clear vision of what you want to achieve with your voice bot. For me, the goal was to create a bot that could engage users in seamless conversations, understand their needs, and provide timely responses. I envisioned a bot that could be integrated into various platforms, offering flexibility and adaptability.

Azure provides a robust suite of tools for AI development, and choosing it was an easy decision given its comprehensive offerings and strong integration capabilities. Here's how I began:

- Text-to-Speech (TTS): This service converts the bot's text responses into human-like speech. Azure TTS offers a range of customizable voices, allowing me to choose one that matched the bot's personality.
- Speech-to-Text (STT): To understand user input, the bot needs to convert spoken language into text. Azure STT was instrumental here, providing real-time transcription with high accuracy.
- Foundational Model: A large language model (LLM) that powers the bot's understanding of language and generation of text responses. An example of a foundational model is GPT-4, a powerful LLM developed by OpenAI, capable of generating human-quality text, translating languages, writing different kinds of creative content, and answering questions in an informative way.
- Foundation Speech-to-Speech Model: A model that translates or generates speech directly from speech, without text as an intermediate step. Such a model can be used for real-time translation or for generating speech in a language different from the input language.

As voice technology continues to evolve, different types of voice bots have emerged to cater to varying user needs. In this analysis, we will explore three prominent types: Voice Bot Duplex, GPT-4o-Realtime, and GPT-4o-Realtime + TTS. This comparison covers their architecture, strengths, weaknesses, best practices, challenges, and potential opportunities for implementation.

Type 1: Voice Bot Duplex

Duplex Bot is an advanced AI system that conducts phone conversations and completes tasks using Voice Activity Detection (VAD), Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). Azure's automatic speech recognition (ASR) technology turns spoken language into text; that text is analyzed by an LLM to generate a response, which is then converted back to speech by Azure Text-to-Speech (TTS). Duplex Bot can listen and respond simultaneously, improving interaction fluidity and reducing response time. This integration enables Duplex to autonomously manage tasks like booking appointments with minimal human intervention. (A sketch of this pipeline appears after the Type 2 section below.)

Strengths:
- Low operational cost.
- Suitable for straightforward use cases with basic conversational requirements.
- Easily customizable on both the STT and TTS side.

Weaknesses:
- Complex architecture with multiple processing hops, making it difficult to implement.
- Higher latency compared to advanced models, limiting real-time capabilities.
- Limited ability to perform complex actions or maintain context over longer conversations.
- Does not capture human emotion from the speech.
- Switching between languages mid-conversation is difficult; you have to choose the language beforehand for better output.

Type 2: GPT-4o-Realtime

GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model: a model that takes speech directly as input and generates speech as output, with no text as an intermediate step. The architecture is very simple: the speech byte array goes directly to the foundational speech model, which processes it, reasons over it, and responds with speech as a byte array.

Strengths:
- Simplest architecture with no processing hops, making it easy to implement.
- Low latency and high reliability.
- Suitable for straightforward use cases with complex conversational requirements.
- Switching between languages is very easy.
- Captures the user's emotion.

Weaknesses:
- High operational cost.
- You cannot customize the synthesized voice.
- You cannot teach the model business-specific abbreviations to handle separately.
- Hallucinates on numeric input; for example, if you say 123456, the model sometimes hears 123435.
- Support for some languages may be an issue, as there is no official documentation of language-specific support.
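As referenced above, here is a minimal, hedged sketch of the Type 1 duplex loop, assuming the azure-cognitiveservices-speech package and a Speech resource key and region. The `generate_reply` function is a hypothetical placeholder for your LLM call (for example, an Azure OpenAI chat completion); a real deployment would add VAD, barge-in handling, and streaming.

```python
# Sketch of the Type 1 pipeline: STT -> LLM -> TTS, one turn at a time.
import azure.cognitiveservices.speech as speechsdk

def generate_reply(history: list) -> str:
    # Placeholder: swap in your LLM call here (e.g., Azure OpenAI chat completion)
    return f"You said: {history[-1]['content']}. How else can I help?"

speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

history = [{"role": "system", "content": "You are a helpful phone assistant."}]
while True:  # stop with Ctrl+C
    # 1) Listen: transcribe one user turn with Azure STT
    user_text = recognizer.recognize_once_async().get().text
    if not user_text:
        continue
    history.append({"role": "user", "content": user_text})

    # 2) Reason: the LLM produces the next turn from the running transcript
    reply = generate_reply(history)
    history.append({"role": "assistant", "content": reply})

    # 3) Speak: Azure TTS renders the reply back to the caller
    synthesizer.speak_text_async(reply).get()
```

The multiple hops visible here (STT, then LLM, then TTS) are exactly what the weaknesses above describe: each hop adds latency, and emotion in the original audio is lost once speech becomes text.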
Type 3: GPT-4o-Realtime + TTS

This variant starts from the same foundational speech-to-speech architecture as Type 2. But if you want to customize the synthesized speech, the realtime model offers no fine-tuning options. Hence, we came up with an option where we plug GPT-4o-Realtime into Azure TTS, taking advantage of advanced voice modulation such as built-in neural voices with a range of Indic languages; you can also fine-tune a custom neural voice (CNV).

Custom neural voice (CNV) is a text-to-speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data. Out of the box, text-to-speech can be used with prebuilt neural voices for each supported language; the prebuilt neural voices work well in most text-to-speech scenarios if a unique voice isn't required. Custom neural voice is based on neural text-to-speech technology and the multilingual, multi-speaker universal model. You can create synthetic voices that are rich in speaking styles, or adaptable across languages. The realistic and natural-sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice.

Strengths:
- Simple architecture with only one processing hop, making it easy to implement. (A sketch of this wiring appears at the end of this post.)
- Low latency and high reliability.
- Suitable for use cases with complex conversational requirements and a customized voice.
- Switching between languages is very easy.
- Captures the user's emotion.

Weaknesses:
- High operational cost, though still lower than GPT-4o-Realtime alone.
- You cannot teach the model business-specific abbreviations to handle separately.
- Hallucinates on numeric input; for example, if you say 123456, the model sometimes hears 123435.
- Does not support custom phrases.

Conclusion

Building a voice bot is an exciting yet challenging journey. As we've seen, leveraging Azure's advanced tools like GPT-4o-Realtime, Text-to-Speech, and Speech-to-Text can provide the foundation for creating a voice bot that understands, engages, and responds with human-like fluency. Throughout this journey, key aspects like natural interaction, context awareness, multi-language support, and real-time processing were vital in ensuring the bot's effectiveness across various scenarios.

While each voice bot model, from Voice Bot Duplex to GPT-4o-Realtime and GPT-4o-Realtime + TTS, offers its own strengths and weaknesses, they all highlight the importance of carefully considering the specific needs of the application. Whether aiming for simple conversations or more sophisticated interactions, the choice of model will directly impact the bot's performance, cost, and overall user satisfaction.

Looking ahead, the potential for AI-driven voice bots is immense. With ongoing advancements in AI, voice bots are bound to become even more integrated into our daily lives, transforming the way we interact with technology. As this field continues to evolve, the combination of innovative tools and strategic thinking will be key to developing voice bots that not only meet but exceed user expectations.

My Previous Blog: From Zero to Hero: Building Your First Voice Bot with GPT-4o Real-Time API using Python
GitHub Link: https://github.com/monuminu/rag-voice-bot
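As promised above, here is a heavily hedged sketch of the Type 3 wiring: GPT-4o-Realtime produces the text of each reply, and Azure TTS renders it with a neural or custom neural voice. The `next_text_reply` function is a hypothetical stand-in for consuming text output from a GPT-4o-Realtime session (see the realtime API documentation for the actual event handling), and the `endpoint_id` usage for a CNV deployment is an assumption based on the Speech SDK's custom-voice support.

```python
# Sketch only: realtime model for reasoning, Azure TTS for the voice.
import azure.cognitiveservices.speech as speechsdk

def next_text_reply() -> str:
    # Placeholder: aggregate text output from your GPT-4o-Realtime session here
    return "Your appointment is confirmed for Tuesday at 10 AM."

speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")

# Option A: a prebuilt neural voice (including Indic languages)
speech_config.speech_synthesis_voice_name = "hi-IN-SwaraNeural"

# Option B (assumption): a deployed custom neural voice (CNV).
# Point the config at your CNV deployment and use your custom voice's name:
# speech_config.endpoint_id = "YOUR_CNV_DEPLOYMENT_ID"
# speech_config.speech_synthesis_voice_name = "YourBrandVoiceNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async(next_text_reply()).get()
```

The design trade-off is the one described in the weaknesses above: you regain full control of the voice at the cost of one extra processing hop.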
Azure AI voices in Arabic improved pronunciation

This blog introduces our work on improving Arabic TTS (text-to-speech) pronunciation with Azure AI Speech. A key component in Arabic TTS is the diacritic model, which addresses a challenging task: in written Arabic, the diacritics that indicate vowel sounds are typically omitted, so the model must predict the diacritic for each Arabic character in the written form. We enhanced diacritic prediction by utilizing a base model pre-trained on machine translation and other NLP tasks, then fine-tuning it on a comprehensive diacritics corpus. This approach reduced word-level pronunciation errors by 78%. Additionally, we improved the reading of English words in Arabic texts: English words transcribed using the Arabic alphabet can now be read as standard English.

Pronunciation improvement

The table below shows the diacritic improvement on the Microsoft ar-SA HamedNeural voice. Other ar-SA and ar-EG voices also benefit, and this improvement is now online for all ar-SA and ar-EG voices. (Baseline and improved audio samples accompany each script in the original post.)

| Script | Baseline | Improved |
|---|---|---|
| Proper noun: الهيئة الوطنية للامن الالكتروني نيسا | (audio) | (audio) |
| Short sentence: ويحتل بطل أفريقيا المركز الثالث في المجموعة، وسيلتقي مع الإكوادور في آخر مبارياته بالمجموعة، يوم الثلاثاء المقبل. | (audio) | (audio) |
| Long sentence: العالم كله أدرك أن التغيرات المناخية قريبة وأضاف خلال مداخلة هاتفية لبرنامج "في المساء مع قصواء"، مع الإعلامية قصواء الخلالي، والمذاع عبر فضائية CBC، أن العالم كله أدرك أن التغيرات المناخية قريبة من كل فرد على وجه الكرة الأرضية، مشيرًا إلى أن مصر تستغل الزخم الموجود حاليا، وبخاصة أنها تستضيف قمة المناخ COP 27 في شرم الشيخ بنوفمبر المقبل. | (audio) | (audio) |

Our service was compared with two other popular services (referred to as Company A and Company B) using 400 general scripts, measuring word-level pronunciation accuracy. The results indicate that the HamedNeural voice outperforms Company A by 1.49% and Company B by 3.88%. Below are some of the sample scripts, each synthesized by Azure (ar-SA HamedNeural), Company A, and Company B for comparison (audio in the original post):

- أوتوفيستر: أتذكر العديد من لاعبي نادي الزمالك وتابع: "بالتأكيد أتذكر العديد من لاعبي نادي الزمالك في ذلك التوقيت، عبد الواحد السيد، وبشير التابعي، وحسام حسن، وشقيقه إبراهيم، حازم إمام، ميدو، ومباراة الإسماعيلي التي شهدت 7 أهداف".
- ويشار إلى أن جرعات اللقاح وأعداد السكان الذين يتم تطعيمهم هي تقديرات تعتمد على نوع اللقاح الذي تعطيه الدولة، أي ما إذا كان من جرعة واحدة أو جرعتين.
- وتتكامل هذه الخطوة مع جهود إدارة البورصة المستمرة لرفع مستويات وعي ومعرفة المجتمع المصري، وخاصة فئة الشباب منهم، بأساسيات الاستثمار والادخار من خلال سوق الأوراق المالية، وذلك بالتوازي مع جهود تعريف الكيانات الاقتصادية العاملة بمختلف القطاعات الإنتاجية بإجراءات رحلة القيد والطرح والتداول بسوق الأوراق المالية، وذلك للوصول إلى التمويل اللازم للتوسع والنمو ومن ثم التشغيل وزيادة الإنتاجية، ذات مستهدفات خطط الحكومة المصرية التنموية.

English word reading

The samples below demonstrate the enhancement in reading English words (transcribed using the Arabic alphabet) with the Microsoft ar-SA HamedNeural voice. This feature will be available online soon.

| Script | Baseline | Improved |
|---|---|---|
| هبتُ إلى كوفي شوب مع أصدقائي لتناول القهوة والتحدث. | (audio) | (audio) |
| اشترى أخي هاتفًا جديدًا من هواوي تك لأنه يحتوي على ميزات متقدمة. | (audio) | (audio) |

Get started

Microsoft offers over 600 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice.
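If you want to hear the improved pronunciation in your own application, here is a minimal sketch using the Speech SDK. It assumes the azure-cognitiveservices-speech package and a Speech resource key and region; the voice name and the sample script are taken from the tables above.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "ar-SA-HamedNeural"  # voice discussed above

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
# One of the English-word-reading samples from the table above
result = synthesizer.speak_text_async(
    "اشترى أخي هاتفًا جديدًا من هواوي تك لأنه يحتوي على ميزات متقدمة.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
```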
With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback

Inuktitut: A Milestone in Indigenous Language Preservation and Revitalization via Technology
Project Overview

The Power of Indigenous Languages

Inuktitut, an official language of Nunavut and a cornerstone of Inuit identity, is now at the forefront of technological innovation. This project demonstrates the resilience and adaptability of Indigenous languages in the digital age. By integrating Inuktitut into modern technology, we affirm its relevance and vitality in contemporary Canadian society.

Collaboration with Inuit Communities

Central to this project is the partnership between the Government of Nunavut and Microsoft. This collaboration exemplifies the importance of Indigenous leadership in technological advancements. The Government of Nunavut, representing Inuit interests, has been instrumental in guiding this project to ensure it authentically serves the Inuit community.

Inuktitut by the Numbers

Inuktitut is the language of many Inuit communities and foundational to their way of life. Approximately 24,000 Inuit speak Inuktitut, with 80% using it as their primary language. The 2016 Canadian census reported around 37,570 individuals identifying Inuktitut as their mother tongue, highlighting its significance in Canada's linguistic landscape.

New Features Honoring Inuktitut

We're excited to introduce two neural voices, "SiqiniqNeural" and "TaqqiqNeural," supporting both Roman and Syllabic orthography. These voices, developed with careful consideration of Inuktitut's unique sounds and rhythms, are now available across various Microsoft applications (Microsoft Translator app, Bing Translator, Clipchamp, Edge Read Aloud, and more to come). You can also integrate these voices into your own application through Azure AI Speech services. You can listen to these voices in the samples below:

| Voice name | Text | English translation |
|---|---|---|
| iu-Cans-CA-SiqiniqNeural / iu-Latn-CA-SiqiniqNeural | ᑕᐃᒫᒃ ᐅᒥᐊᓪᓘᓐᓃᑦ ᑲᓅᓪᓘᓐᓃᑦ, ᐊᖁᐊᓂ ᑕᕝᕙᓂ ᐊᐅᓚᐅᑏᑦ ᐊᑕᖃᑦᑕᕐᖓᑕ,ᖃᐅᔨᒪᔭᐃᓐᓇᕆᒐᔅᓯᐅᒃ. (Taimaak umialluunniit kanuulluunniit, aquani tavvani aulautiit ataqattarngata, qaujimajainnarigassiuk.) | The boat or the canoes, the outboard motors, are attached to the motors. |
| iu-Cans-CA-TaqqiqNeural / iu-Latn-CA-TaqqiqNeural | ᑐᓴᐅᒪᔭᑐᖃᕆᓪᓗᒋᑦ ᓇᓄᐃᑦ ᐃᓄᑦᑎᑐᒡᒎᖅ ᐃᓱᒪᓖᑦ ᐅᑉᐱᓕᕆᐊᒃᑲᓐᓂᓚᐅᖅᓯᒪᕗᖓ ᑕᐃᔅᓱᒪᓂ. (Tusaumajatuqarillugit nanuit inuttitugguuq isumaliit uppiliriakkannilauqsimavunga taissumani.) | I have heard that the polar bears have Inuit ideas and I re-believed in them at that time. |

Preserving Language Through Technology

The Government of Nunavut has generously shared an invaluable collection of linguistic data, forming the foundation of our text-to-speech models. This rich repository includes 11,300 audio files from multiple speakers, totaling approximately 13 hours of content. These recordings capture a diverse range of Inuktitut expression, from the Bible to traditional stories, and even some contemporary novels written by Inuktitut speakers.

Looking Forward

This project is more than a technological advancement; it's a step towards digital Reconciliation. By ensuring Inuktitut's presence in the digital realm, we're supporting the language's vitality and accessibility for future generations of Inuit. If you want to try the new voices in your own application, a short sketch follows.
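Here is a minimal sketch of using one of the new voices with the Speech SDK, this time selecting the voice via SSML. It assumes the azure-cognitiveservices-speech package and a Speech resource key and region; the voice name and sample sentence come from the table above.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Select the Syllabic-orthography voice through SSML
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='iu-Cans-CA'>
  <voice name='iu-Cans-CA-SiqiniqNeural'>
    ᑕᐃᒫᒃ ᐅᒥᐊᓪᓘᓐᓃᑦ ᑲᓅᓪᓘᓐᓃᑦ, ᐊᖁᐊᓂ ᑕᕝᕙᓂ ᐊᐅᓚᐅᑏᑦ ᐊᑕᖃᑦᑕᕐᖓᑕ,ᖃᐅᔨᒪᔭᐃᓐᓇᕆᒐᔅᓯᐅᒃ.
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
```

For the Roman-orthography voice, swap in `iu-Latn-CA-SiqiniqNeural` and the romanized text.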
Global Indigenous Language Revitalization

The groundbreaking work with Inuktitut has paved the way for a broader, global initiative to support Indigenous languages worldwide. This expansion reflects Microsoft's commitment to Reconciliation and positions us as a leader in combining traditional knowledge with cutting-edge technology.

While efforts began here in Canada with Inuktitut, Microsoft recognizes the global need for Indigenous language revitalization. We're now working with more Indigenous communities across the world, from Māori in New Zealand to Cherokee in North America, always guided by the principle of Indigenous-led collaboration that was fundamental to the success of the Inuktitut project.

Our aim is to co-create AI tools that not only translate languages but truly capture the essence of each Indigenous culture. This means working closely with elders, language keepers, and community leaders to ensure our technology respects and accurately reflects the unique linguistic features, cultural contexts, and traditional knowledge systems of each language.

These AI tools are designed to empower Indigenous communities in their own language revitalization efforts. From interactive language-learning apps to advanced text-to-speech systems, we're providing technological support that complements grassroots language programs and traditional teaching methods.

Conclusion

We are particularly proud to celebrate this milestone in Indigenous language revitalization in partnership with the Government of Nunavut. This project stands as a testament to what can be achieved when Indigenous knowledge and modern technology come together in a spirit of true partnership and respect, fostering the continued growth and use of Indigenous languages.

Find more information about the project in the video below:

Press release from the Government of Nunavut: Language Preservation and Promotion Through Technology: MS Translator Project | Government of Nunavut

Get started

In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.

Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
🎄✨ Boost Your Holiday Spirit with Azure AI! ✨🎄

As we gear up for the holiday season, what better way to bring innovation to your business than by using cutting-edge Azure AI technologies? From personalized customer experiences to festive-themed data insights, here's how Azure AI can help elevate your holiday initiatives:

🎅 1. Azure OpenAI Service for Creative Content

Kickstart the holiday cheer by using Azure OpenAI to create engaging holiday content. From personalized greeting messages to festive social media posts, the GPT models can assist you in generating creative text in a snap.

🎨 Step by step:
- Use GPT to draft festive email newsletters, promotions, or customer-facing messages.
- Train models on your specific brand voice for customized holiday greetings.

🎁 2. Azure AI Services for Image Recognition and Generation

Enhance your holiday product offerings by leveraging image recognition to identify and categorize holiday-themed products. Additionally, create stunning holiday-themed visuals with DALL-E: generate unique images from text descriptions to make your holiday marketing materials stand out.

📸 Step by step:
- Use Azure Computer Vision to analyze product images and automatically categorize seasonal items.
- Implement the AI model in e-commerce platforms to help customers find holiday-specific products faster.
- Use DALL-E to generate holiday-themed images based on your descriptions.
- Customize and refine the images to fit your brand's style.
- Incorporate these visuals into your marketing campaigns.

✨ 3. Azure AI Speech Services for Holiday Customer Interaction and Audio Generation

Transform your customer service experience with Azure's Speech-to-Text and Text-to-Speech services. You can create festive voice assistants or add holiday-themed voices to your customer support lines for a warm, personalized experience. Additionally, add a festive touch to your audio content: models like Whisper provide high-quality speech-to-text, and neural Text-to-Speech can turn text into holiday-themed audio messages and voice assistants.

🎙️ Step by step:
- Use Speech-to-Text to transcribe customer feedback or support requests in real time.
- Build a holiday-themed voice model using Text-to-Speech for interactive voice assistants.
- Use Whisper to transcribe holiday messages, and neural TTS to convert text to festive audio.
- Customize the audio to match your brand's tone and style.
- Implement these audio clips in customer interactions or marketing materials.

🎄 4. Azure Machine Learning for Predictive Holiday Trends

Stay ahead of holiday trends with Azure ML models. Use AI to analyze customer behavior, forecast demand for holiday products, and manage stock levels efficiently. Predict what your customers need before they even ask!

📊 Step by step:
- Use Azure ML to train models on historical sales data to predict trends in holiday shopping.
- Build dashboards using Power BI integrated with Azure for real-time tracking of holiday performance metrics.

🔔 5. Azure AI for Sentiment Analysis

Understand the holiday mood of your customers by implementing sentiment analysis on social media, reviews, and feedback. Gauge the public sentiment around your brand during the festive season and respond accordingly. (A quick code sketch follows this section.)

📈 Step by step:
- Use Text Analytics for sentiment analysis on customer feedback, reviews, or social media posts.
- Generate insights and adapt your holiday marketing based on customer sentiment trends.
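As referenced above, here is a minimal sketch of holiday-review sentiment analysis with the Azure AI Language Text Analytics client. It assumes the azure-ai-textanalytics package; the endpoint and key values are placeholders for your own Language resource.

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://YOUR-LANGUAGE-RESOURCE.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("YOUR_LANGUAGE_KEY"),
)

# A couple of holiday-season reviews to score
reviews = [
    "The holiday gift-wrap service was wonderful, my order arrived early!",
    "Shipping was late and support never answered during the holiday rush.",
]

for doc in client.analyze_sentiment(reviews):
    if doc.is_error:
        continue
    scores = doc.confidence_scores
    # Overall label plus per-class confidence, useful for dashboards
    print(f"{doc.sentiment}: pos={scores.positive:.2f} "
          f"neu={scores.neutral:.2f} neg={scores.negative:.2f}")
```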
🌟 6. Latest Azure AI Open Models

Explore the newest Azure AI models to bring even more innovation to your holiday projects:
- GPT-4o and GPT-4 Turbo: These models offer enhanced capabilities for understanding and generating natural language and code, perfect for creating sophisticated holiday content.
- Embeddings: Use these models to convert holiday-related text into numerical vectors for improved text similarity and search capabilities.

🔧 7. Azure AI Foundry

Leverage Azure AI Foundry to build, deploy, and scale AI-driven applications. This platform provides everything you need to customize, host, run, and manage AI applications, ensuring your holiday projects are innovative and efficient.

🎉 Conclusion

With Azure AI, the possibilities to brighten your business this holiday season are endless! Whether it's automating your operations or delivering personalized customer experiences, Azure's AI models can help you stay ahead of the game and spread holiday joy. Wishing everyone a season filled with innovation and success! 🎄✨