Use Azure OpenAI and APIM with the OpenAI Agents SDK
The OpenAI Agents SDK provides a powerful framework for building intelligent AI assistants with specialised capabilities. In this blog post, I'll demonstrate how to integrate Azure OpenAI Service and Azure API Management (APIM) with the OpenAI Agents SDK to create a banking assistant system with specialised agents.

Key Takeaways:
- Learn how to connect the OpenAI Agents SDK to Azure OpenAI Service
- Understand the differences between direct Azure OpenAI integration and using Azure API Management
- Implement tracing with the OpenAI Agents SDK for monitoring and debugging
- Create a practical banking application with specialised agents and handoff capabilities

The OpenAI Agents SDK
The OpenAI Agents SDK is a powerful toolkit that enables developers to create AI agents with specialised capabilities, tools, and the ability to work together through handoffs. It's designed to work seamlessly with OpenAI's models, but it can be integrated with Azure services for enterprise-grade deployments.

Setting Up Your Environment
To get started with the OpenAI Agents SDK and Azure, install the necessary packages:

```
pip install openai openai-agents python-dotenv
```

You'll also need to set up your environment variables. Create a `.env` file with your Azure OpenAI or APIM credentials.

For a direct Azure OpenAI connection:

```
# .env file for Azure OpenAI
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_API_VERSION=2024-08-01-preview
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=your-deployment-name
```

For an Azure API Management (APIM) connection:

```
# .env file for Azure APIM
AZURE_APIM_OPENAI_SUBSCRIPTION_KEY=your_subscription_key
AZURE_APIM_OPENAI_API_VERSION=2024-08-01-preview
AZURE_APIM_OPENAI_ENDPOINT=https://your-apim-name.azure-api.net/
AZURE_APIM_OPENAI_DEPLOYMENT=your-deployment-name
```

Connecting to Azure OpenAI Service
The OpenAI Agents SDK can be integrated with Azure OpenAI Service in two ways: a direct connection, or through Azure API Management (APIM).

Option 1: Direct Azure OpenAI Connection

```python
from openai import AsyncAzureOpenAI
from agents import set_default_openai_client
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create an OpenAI client backed by Azure OpenAI
openai_client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT")
)

# Set the default OpenAI client for the Agents SDK
set_default_openai_client(openai_client)
```

Option 2: Azure API Management (APIM) Connection

```python
from openai import AsyncAzureOpenAI
from agents import set_default_openai_client
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create an OpenAI client that routes through Azure APIM
openai_client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_APIM_OPENAI_SUBSCRIPTION_KEY"),  # Note: using the APIM subscription key
    api_version=os.getenv("AZURE_APIM_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_APIM_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_APIM_OPENAI_DEPLOYMENT")
)

# Set the default OpenAI client for the Agents SDK
set_default_openai_client(openai_client)
```

Key Difference: When using Azure API Management, you authenticate with a subscription key instead of an API key. This provides an additional layer of management, security, and monitoring for your OpenAI API access.
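Before wiring the client into the Agents SDK, it can be worth confirming that the endpoint, key, and deployment name actually work. The optional sanity check below is not from the original post; it simply issues a single chat completion against the deployment using the openai_client created above:

```python
import asyncio
import os

async def sanity_check() -> None:
    # One-off chat completion against the Azure deployment to confirm
    # credentials, endpoint, and deployment name are correct.
    response = await openai_client.chat.completions.create(
        model=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=5,
    )
    print(response.choices[0].message.content)

asyncio.run(sanity_check())
```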
Creating Agents with the OpenAI Agents SDK
Once you've set up your Azure OpenAI or APIM connection, you can create agents using the OpenAI Agents SDK:

```python
from agents import Agent

# Create a banking assistant agent
banking_assistant = Agent(
    name="Banking Assistant",
    instructions="You are a helpful banking assistant. Be concise and professional.",
    model="gpt-4o",  # This will use the deployment specified in your Azure OpenAI/APIM client
    tools=[check_account_balance]  # A function tool defined elsewhere
)
```

The OpenAI Agents SDK automatically uses the Azure OpenAI or APIM client you've configured, making it seamless to switch between different Azure environments or configurations.

Note that at the time of writing, there is an ongoing bug where the OpenAI Agents SDK reads the old input_tokens and output_tokens fields instead of the prompt_tokens and completion_tokens returned by newer Chat Completions APIs. Until the fix is released, you need to patch agents/run.py manually, as shown in https://github.com/openai/openai-agents-python/pull/65/files.

Implementing Tracing with Azure OpenAI
The OpenAI Agents SDK includes powerful tracing capabilities that can help you monitor and debug your agents. When using Azure OpenAI or APIM, you can implement two types of tracing:

1. Console Tracing for Development
Console logging is rather verbose; if you would like to inspect the spans during development, enable it like this:

```python
from agents import add_trace_processor
from agents.tracing.processors import ConsoleSpanExporter, BatchTraceProcessor

# Set up console tracing
console_exporter = ConsoleSpanExporter()
console_processor = BatchTraceProcessor(exporter=console_exporter)
add_trace_processor(console_processor)
```

2. OpenAI Dashboard Tracing
Currently the spans are sent to https://api.openai.com/v1/traces/ingest, so an OpenAI API key is required for dashboard tracing:

```python
import os
from agents import set_tracing_export_api_key

set_tracing_export_api_key(os.getenv("OPENAI_API_KEY"))
```

Tracing is particularly valuable when working with Azure deployments, as it helps you monitor usage, performance, and behavior across different environments.

Running Agents with Azure OpenAI
To run your agents with Azure OpenAI or APIM, use the Runner class from the OpenAI Agents SDK:

```python
from agents import Runner
import asyncio

async def main():
    # Run the banking assistant
    result = await Runner.run(
        banking_assistant,
        input="Hi, I'd like to check my account balance."
    )
    print(f"Response: {result.final_output}")

if __name__ == "__main__":
    asyncio.run(main())
```

Practical Example: Banking Agents System
Let's look at how we can use Azure OpenAI or APIM with the OpenAI Agents SDK to create a banking system with specialized agents and handoff capabilities.

1. Define Specialized Banking Agents
We'll create several specialized agents:
- General Banking Assistant: Handles basic inquiries and account information
- Loan Specialist: Focuses on loan options and payment calculations
- Investment Specialist: Provides guidance on investment options
- Customer Service Agent: Routes inquiries to specialists
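The snippets above and below pass tools=[check_account_balance] without defining it. A minimal sketch of such a function tool, using the SDK's function_tool decorator, might look like the following; the hard-coded account data is a made-up placeholder for a real banking backend:

```python
from agents import function_tool

@function_tool
def check_account_balance(account_id: str) -> str:
    """Return the current balance for the given account ID."""
    # Hypothetical lookup; replace with a call to your core banking API.
    balances = {"12345": 2350.75}
    balance = balances.get(account_id)
    if balance is None:
        return f"No account found with ID {account_id}."
    return f"Account {account_id} has a balance of ${balance:,.2f}."
```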
2. Implement Handoff Between Agents

```python
from agents import handoff, HandoffInputData
from agents.extensions import handoff_filters

# Define a filter for handoff messages
def banking_handoff_message_filter(handoff_message_data: HandoffInputData) -> HandoffInputData:
    # Remove any tool-related messages from the message history
    handoff_message_data = handoff_filters.remove_all_tools(handoff_message_data)
    return handoff_message_data

# Create the customer service agent with handoffs.
# loan_specialist_agent and investment_specialist_agent are created
# the same way as banking_assistant above.
customer_service_agent = Agent(
    name="Customer Service Agent",
    instructions="""You are a customer service agent at a bank.
    Help customers with general inquiries and direct them to specialists when needed.
    If the customer asks about loans or mortgages, handoff to the Loan Specialist.
    If the customer asks about investments or portfolio management, handoff to the Investment Specialist.""",
    handoffs=[
        handoff(loan_specialist_agent, input_filter=banking_handoff_message_filter),
        handoff(investment_specialist_agent, input_filter=banking_handoff_message_filter),
    ],
    tools=[check_account_balance],
)
```

3. Trace the Conversation Flow

```python
from agents import trace

async def main():
    # Trace the entire run as a single workflow
    with trace(workflow_name="Banking Assistant Demo"):
        # Run the customer service agent
        result = await Runner.run(
            customer_service_agent,
            input="I'm interested in taking out a mortgage loan. Can you help me understand my options?"
        )
        print(f"Response: {result.final_output}")

if __name__ == "__main__":
    asyncio.run(main())
```

Benefits of Using Azure OpenAI/APIM with the OpenAI Agents SDK
Integrating Azure OpenAI or APIM with the OpenAI Agents SDK offers several advantages:
- Enterprise-Grade Security: Azure provides robust security features, compliance certifications, and private networking options
- Scalability: Azure's infrastructure can handle high-volume production workloads
- Monitoring and Management: APIM provides additional monitoring, throttling, and API management capabilities
- Regional Deployment: Azure allows you to deploy models in specific regions to meet data residency requirements
- Cost Management: Azure provides detailed usage tracking and cost management tools

Conclusion
The OpenAI Agents SDK combined with Azure OpenAI Service or Azure API Management provides a powerful foundation for building intelligent, specialized AI assistants. By leveraging Azure's enterprise features and the OpenAI Agents SDK's capabilities, you can create robust, scalable, and secure AI applications for production environments. Whether you choose direct Azure OpenAI integration or Azure API Management depends on your specific needs for API management, security, and monitoring. Both approaches work seamlessly with the OpenAI Agents SDK, making it easy to build sophisticated agent-based applications.

Repo: https://github.com/hieumoscow/azure-openai-agents
Video demo: https://www.youtube.com/watch?v=gJt-bt-vLJY

The Future of AI: Customizing AI agents with the Semantic Kernel agent framework
The blog post Customizing AI agents with the Semantic Kernel agent framework discusses the capabilities of the Semantic Kernel SDK, an open-source tool developed by Microsoft for creating AI agents and multi-agent systems. It highlights the benefits of using single-purpose agents within a multi-agent system to achieve more complex workflows with improved efficiency. The Semantic Kernel SDK offers features like telemetry, hooks, and filters to ensure secure and responsible AI solutions, making it a versatile tool for both simple and complex AI projects.

Voice Bot: GPT-4o-Realtime Best Practices - A learning from customer journey
Voice technology is transforming how we interact with machines, making conversations with AI feel more natural than ever before. With the public beta release of the Realtime API powered by GPT-4o, developers now have the tools to create low-latency, multimodal voice experiences in their apps, opening endless possibilities for innovation. For building voice AI solutions, the introduction of GPT-4o-Realtime was a game-changing technology: it handles key features such as interruption, language switching, and emotion handling out of the box, with low latency and an optimized architecture.

A voice bot built on GPT-4o-Realtime is the simplest to implement because it uses a foundational speech model, that is, a model that takes speech directly as input and generates speech as output, without text as an intermediate step. The architecture is very simple: the speech byte array goes directly to the foundational speech model, which processes the audio, reasons over it, and responds with speech as a byte array.

Strengths:
- Simplest architecture with no processing hops, making it easier to implement
- Low latency and high reliability
- Suitable for straightforward implementations as well as complex conversational requirements
- Switching between languages is very easy
- Captures the emotion of the user

Let's see a simple demo of the capability. While there are many strengths, there are some weaknesses as well. In this blog, we'll walk through some best practices for using the GPT-4o-Realtime model to overcome these challenges. OK, so let's get started!

You can also implement a voice bot with a duplex architecture using an STT >> LLM >> TTS pipeline, but this document's scope is limited to GPT-4o-Realtime. You can learn more about the duplex approach in my blog, My Journey of Building a Voice Bot from Scratch.

7 Best Practices to address top challenges to build with GPT-4o-Realtime

1. Reducing Background Noise Sensitivity
Interruption handling is a key feature of the GPT-4o-Realtime model. GPT-4o-Realtime does this using Voice Activity Detection (VAD), which detects voice activity and triggers the input_audio_buffer.speech_started callback event, sent by the server in server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user. However, due to background noise sensitivity, we sometimes see a lot of unintentional interruptions. There are two approaches to handle this.

A. Optimizing VAD parameters

Prefix Padding (prefix_padding_ms): The amount of time (in milliseconds) added before detected speech to ensure that the beginning of the audio is captured.
- Increasing: Advantages: Captures more speech at the beginning of utterances, reducing the risk of clipping initial phonemes and improving overall speech quality. Disadvantages: May introduce unnecessary delays in processing, leading to a less responsive system and increased latency.
- Decreasing: Advantages: Reduces latency, making the system more responsive and quicker to react to speech. Disadvantages: Higher risk of clipping the start of speech, which can result in loss of important information and reduced speech quality.

Threshold (threshold): The sensitivity level that determines whether the audio signal is classified as speech or silence, typically ranging from 0 to 1.
- Increasing: Advantages: Reduces false positives by requiring a stronger signal to classify as speech, which can improve accuracy in noisy environments. Disadvantages: May lead to missed detections (false negatives) if the speech signal is weak, resulting in lost segments of speech.
- Decreasing: Advantages: Increases sensitivity, allowing softer speech to be detected, which can be beneficial in quiet environments. Disadvantages: Higher likelihood of false positives, where background noise may be incorrectly classified as speech, leading to unnecessary processing.

Silence Duration (silence_duration_ms): The minimum length of silence (in milliseconds) required to consider the audio as non-speech or to trigger a pause in detection.
- Increasing: Advantages: Helps to avoid brief pauses being classified as silence, maintaining continuity in detected speech segments. Disadvantages: Can lead to longer periods of silence being classified as active speech, potentially causing delays in response or processing.
- Decreasing: Advantages: Allows for quicker transitions between speech and silence detection, making the system more dynamic and responsive. Disadvantages: May result in frequent interruptions in detected speech during natural pauses, affecting the flow and comprehension of conversations.

A session configuration sketch covering both approaches is shown below.

B. Custom VAD workaround to handle background noise sensitivity
GPT-4o-Realtime provides flexibility in handling voice activity detection (VAD) through configurable settings. By default, server-side VAD is enabled, allowing the system to automatically detect the end of a user's speech and generate responses accordingly. However, you can customize this behavior by disabling server-side VAD and implementing your own client-side VAD or manual controls.

Disabling Server-Side VAD: To turn off server-side VAD, set turn_detection to none in your session configuration. This configuration requires the client to manage the flow of the conversation manually. Specifically, the client must:
- Append Audio: Send audio data to the server using the input_audio_buffer.append event.
- Commit Audio: Indicate that the input is complete by sending the input_audio_buffer.commit event.
- Request Response: Initiate the generation of a response by sending the response.create event.

This approach is beneficial for applications that use push-to-talk functionality or have external mechanisms for controlling audio flow, such as a client-side VAD component. When server-side VAD is disabled, these manual controls can be employed to manage the conversation flow effectively.
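As referenced above, here is a minimal sketch of both options: tuning server-side VAD, or disabling it entirely, via a session.update event on the Realtime WebSocket. The parameter values are illustrative assumptions rather than recommendations, the ws object is assumed to be your already-connected WebSocket client, and in current API versions disabling VAD is typically expressed by sending turn_detection as null (the text above refers to this as setting the type to none):

```python
import json

async def configure_vad(ws, use_server_vad: bool = True) -> None:
    """Send a session.update event to tune or disable server-side VAD."""
    if use_server_vad:
        # Option A: keep server-side VAD but tune its sensitivity.
        turn_detection = {
            "type": "server_vad",
            "threshold": 0.6,            # higher = less sensitive to background noise
            "prefix_padding_ms": 300,    # audio kept before detected speech
            "silence_duration_ms": 500,  # silence required to end a turn
        }
    else:
        # Option B: disable server-side VAD; the client must then send
        # input_audio_buffer.append / commit and response.create itself.
        turn_detection = None

    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    }))
```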
async def append_input_audio(self, array_buffer): if len(array_buffer) > 0: if self.custom_vad: for i in range(0, len(array_buffer), 1024): chunk = array_buffer[i:i+1024] chunk = np.frombuffer(chunk, dtype=np.int16) vad_output = self.vad_iterator(torch.from_numpy(int2float(chunk))) if vad_output is not None and vad_output == "INTERRUPT_TTS": print("Speech Detected") self.dispatch("conversation.interrupted", None) continue if vad_output is not None and len(vad_output) != 0: print("vad output going to Realtime") array = np.concatenate(vad_output) await self.realtime.send("input_audio_buffer.append", { "audio": array_buffer_to_base64(array), }) self.input_audio_buffer.extend(array) await self.create_response() else: await self.realtime.send("input_audio_buffer.append", { "audio": array_buffer_to_base64(np.array(array_buffer)), }) self.input_audio_buffer.extend(array_buffer) return True And here the code for VAD iterator import copy import torch import numpy as np class VADIterator: def __init__( self, model, threshold: float = 0.5, sampling_rate: int = 16000, min_silence_duration_ms: int = 100, speech_pad_ms: int = 30, ): """ Mainly taken from https://github.com/snakers4/silero-vad Class for stream imitation Parameters ---------- model: preloaded .jit/.onnx silero VAD model threshold: float (default - 0.5) Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH. It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets. sampling_rate: int (default - 16000) Currently silero VAD models support 8000 and 16000 sample rates min_silence_duration_ms: int (default - 100 milliseconds) In the end of each speech chunk wait for min_silence_duration_ms before separating it speech_pad_ms: int (default - 30 milliseconds) Final speech chunks are padded by speech_pad_ms each side """ self.model = model self.threshold = threshold self.sampling_rate = sampling_rate self.is_speaking = False self.buffer = [] self.start_pad_buffer = [] if sampling_rate not in [8000, 16000]: raise ValueError( "VADIterator does not support sampling rates other than [8000, 16000]" ) self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000 self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000 self.reset_states() def reset_states(self): self.model.reset_states() self.triggered = False self.temp_end = 0 self.current_sample = 0 @torch.no_grad() def __call__(self, x): """ x: torch.Tensor audio chunk (see examples in repo) return_seconds: bool (default - False) whether return timestamps in seconds (default - samples) """ if not torch.is_tensor(x): try: x = torch.Tensor(x) except Exception: raise TypeError("Audio cannot be casted to tensor. 
Cast it manually") window_size_samples = len(x[0]) if x.dim() == 2 else len(x) self.current_sample += window_size_samples speech_prob = self.model(x, self.sampling_rate).item() if (speech_prob >= self.threshold) and self.temp_end: self.temp_end = 0 if (speech_prob >= self.threshold) and not self.triggered: self.triggered = True self.buffer = copy.deepcopy(self.start_pad_buffer) self.buffer.append(x) return "INTERRUPT_TTS" if (speech_prob < self.threshold - 0.15) and self.triggered: if not self.temp_end: self.temp_end = self.current_sample if self.current_sample - self.temp_end >= self.min_silence_samples: # if self.current_sample - self.temp_end > self.speech_pad_samples: # return None # else: # end of speak self.temp_end = 0 self.triggered = False spoken_utterance = self.buffer self.buffer = [] return spoken_utterance if self.triggered: self.buffer.append(x) self.start_pad_buffer.append(x) self.start_pad_buffer = self.start_pad_buffer[-int(self.speech_pad_samples//window_size_samples):] return None def int2float(sound): """ Taken from https://github.com/snakers4/silero-vad """ sound = sound.astype("float32") sound *= 1 / 32768 # sound = sound.squeeze() # depends on the use case return sound def float2int(sound): """ Taken from """ # sound = sound.squeeze() # depends on the use case sound *= 32768 sound = np.clip(sound, -32768, 32767) return sound.astype("int16") 2. Synchronization to reduce Hallucinations Synchronization is also a key feature when building a Voice AI Bot as AI should exactly know how much user has listen and when it was interrupted. This reduces the major hallucination we see in GPT-4o-Realtime. In GPT-4o-Realtime, the conversation.item.truncate event allows clients to manually shorten or truncate a message within a conversation. Upon receiving a conversation.item.truncate event, the server processes the truncation and responds with a conversation.item.truncated event. This ensures that both the client and server maintain a synchronized state regarding the conversation's content. Truncating audio will delete the server-side text transcript to ensure there is not text in the context that hasn't been heard by the user. 3. Reducing WebSocket Connection Delay to address latency A WebSocket connection pool acts as a crucial performance optimization when handling high-volume telephony applications by maintaining a pre-established set of WebSocket connections ready for immediate use. Instead of creating a new WebSocket connection with Azure OpenAI GPT-4o-Realtime for each incoming call—which can lead to timeouts during high load due to connection initialization overhead—the pool contains multiple pre-warmed connections. When a user initiates a phone call, the server can immediately allocate an available connection from the pool, eliminating the latency and potential timeout issues associated with establishing a new WebSocket connection. The pool automatically manages these connections, replenishing them as they're used and maintaining a healthy number of available connections based on traffic patterns. This ensures that voice data can flow instantly between the client and the WebSocket server without users experiencing delays or dropped calls due to connection timeouts. Additionally, the pool can implement features like connection health checks and automatic reconnection strategies, further improving the reliability of the voice communication system. 
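Before the connection pool implementation below, here is a small sketch illustrating point 2 above: when the user interrupts playback, the client can truncate the assistant's last audio item so that the server-side context matches only what was actually heard. The ws wrapper and the item_id / audio_end_ms bookkeeping are assumptions for illustration; the event fields follow the Realtime API's conversation.item.truncate event:

```python
import json

async def truncate_unheard_audio(ws, item_id: str, audio_end_ms: int) -> None:
    """Tell the server how much of the assistant's audio the user actually heard."""
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,            # ID of the assistant audio item being played
        "content_index": 0,            # index of the audio content part
        "audio_end_ms": audio_end_ms,  # playback position (ms) when the user interrupted
    }))
    # The server replies with conversation.item.truncated and removes the
    # unheard audio (and its transcript) from the conversation context.
```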
import asyncio import websockets import logging from typing import List, Optional from collections import deque import time from aiohttp import ClientSession, WSMsgType, WSServerHandshakeError, ClientTimeout import os class WebSocketPool: def __init__(self, pool_size: int = 5, max_retries: int = 3): self.pool_size = pool_size self.max_retries = max_retries self.available_connections: deque = deque() self.in_use_connections = set() self.lock = asyncio.Lock() self.default_url = 'wss://api.openai.com' self.url = os.environ["AZURE_OPENAI_ENDPOINT"] self._is_azure_openai = self.url is not None self.api_key = os.environ.get("AZURE_OPENAI_API_KEY") self.api_version = "2024-10-01-preview" self.azure_deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"] self.logger = logging.getLogger(__name__) async def initialize_pool(self): """Initialize the connection pool with the specified number of connections.""" self.logger.info(f"Initializing pool with {self.pool_size} connections") tasks = [self._create_connection() for _ in range(self.pool_size)] await asyncio.gather(*tasks) async def _create_connection(self) -> Optional[websockets.WebSocketClientProtocol]: """Create a single WebSocket connection with retry logic.""" for attempt in range(self.max_retries): try: self._session = ClientSession(base_url=self.url) headers = {"api-key": self.api_key} connection = await self._session.ws_connect( "/openai/realtime", headers=headers, params={"api-version": self.api_version, "deployment": self.azure_deployment}, ) self.logger.info("Successfully created new WebSocket connection") async with self.lock: self.available_connections.append(connection) return connection except Exception as e: self.logger.error(f"Failed to create connection (attempt {attempt + 1}): {str(e)}") if attempt == self.max_retries - 1: self.logger.error("Max retries reached for connection creation") return None await asyncio.sleep(1 * (attempt + 1)) # Exponential backoff async def get_connection(self) -> Optional[websockets.WebSocketClientProtocol]: """Get an available connection from the pool.""" async with self.lock: while len(self.available_connections) == 0: # If no connections are available, create a new one if len(self.in_use_connections) < self.pool_size * 2: # Allow pool to grow up to 2x connection = await self._create_connection() if connection: break await asyncio.sleep(0.1) # Prevent tight loop if not self.available_connections: return None connection = self.available_connections.popleft() self.in_use_connections.add(connection) return connection async def release_connection(self, connection: websockets.WebSocketClientProtocol): """Return a connection to the pool.""" async with self.lock: if connection in self.in_use_connections: self.in_use_connections.remove(connection) if connection.open: self.available_connections.append(connection) else: # Replace closed connection with a new one await self._create_connection() async def health_check(self): """Periodically check and replace unhealthy connections.""" while True: async with self.lock: connections_to_check = list(self.available_connections) for conn in connections_to_check: try: pong = await conn.ping() await asyncio.wait_for(pong, timeout=5) except: self.logger.warning("Unhealthy connection detected, replacing...") self.available_connections.remove(conn) await conn.close() await self._create_connection() await asyncio.sleep(30) # Run health check every 30 seconds async def close_all(self): """Close all connections in the pool.""" async with self.lock: all_connections = 
list(self.available_connections) + list(self.in_use_connections) for connection in all_connections: await connection.close() self.available_connections.clear() self.in_use_connections.clear() # Example usage async def handle_voice_call(pool: WebSocketPool, call_id: str): connection = await pool.get_connection() if not connection: raise Exception("Failed to get WebSocket connection from pool") try: # Handle voice data await connection.send(f"Starting call {call_id}") response = await connection.recv() # Process voice data... finally: await pool.release_connection(connection) async def main(): # Initialize the pool pool = WebSocketPool(pool_size=5) await pool.initialize_pool() # Start health check in background health_check_task = asyncio.create_task(pool.health_check()) # Simulate multiple concurrent calls calls = [handle_voice_call(pool, f"call_{i}") for i in range(10)] await asyncio.gather(*calls) # Cleanup health_check_task.cancel() await pool.close_all() if __name__ == "__main__": asyncio.run(main()) 4. Creating ‘Human-like’ voice | Realistic Voice: GPT-4o-Realtime based voice bot are the simplest to implement as they used Foundational Speech model as it could refer to a model that directly takes speech as an input and generates speech as output, without the need for text as an intermediate step. Architecture is very simple as speech array goes directly to foundation speech model which process these speech bytes array, reason and respond back speech as byte array. But if you want to customize the speech synthesis it then there is no finetune options present to customize the same. Hence, we came up with an option where we plugged in GPT-4o-Realtime with Azure TTS where we take the advanced voice modulation like built-in Neural voices with range of Indic languages also you can also finetune a custom neural voice (CNV). Custom neural voice (CNV) is a text to speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data. Out of the box, text to speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work well in most text to speech scenarios if a unique voice isn't required. Custom neural voice is based on the neural text to speech technology and the multilingual, multi-speaker, universal model. You can create synthetic voices that are rich in speaking styles, or adaptable cross languages. The realistic and natural sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice. 
```python
# Process the streaming response
print("\nStreaming response:")
collected_messages = []
tts_sentence_end = [".", "!", "?", ";", "。", "!", "?", ";", "\n", "।"]

async for event in connection:
    delta = event.get("delta")
    if event.type == 'response.text.delta':
        chunk_message = delta['transcript']
        collected_messages.append(chunk_message)  # save the message
        if chunk_message in tts_sentence_end:  # sentence end found
            sent_transcript = ''.join(collected_messages).strip()
            collected_messages.clear()

            input, output = tts_client.text_to_speech_streaming_input()

            async def read_output():
                audio = b''
                async for chunk in output:
                    playAudio(chunk)

            async def put_input():
                input.write(sent_transcript)
                input.close()

            await asyncio.gather(read_output(), put_input())
    elif event.type == 'response.text.done':
        print()
```

You can look at the full code here: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/realtime-api-plus/README.md

5. Handling Number Pronunciation Issues in Regional Languages
GPT-4o-Realtime often struggles with numbers, especially in non-English languages. We have seen cases where the model mispronounces numbers, and for the financial services industry this has been a big issue: there is a mismatch between the audio spoken by the model and what comes out in the response.text.delta event. The trick to solving this is to spell the numbers out in the prompt, or to use a TTS plugin as described above. For example, instead of writing a prompt like this:

"You are a loan seller agent for XYZ company. Below is the context provided to you. Customer Name: Raj. Approved Loan Amount: 4500"

use the recommended prompt:

"You are a loan seller agent for XYZ company. Below is the context provided to you. Customer Name: Raj. Approved Loan Amount: चार हज़ार पाँच सौ"

Here is a sample utility function to convert numbers to words. This one is for Hindi, but you can write your own for your language.
class Num2wordshindi: low_num_dict = {'1':'एक','2':'दौ','3':'तीन','4':'चार','5':'पाँच', '6':'छः','7':'सात','8':'आठ','9':'नौ', '0':'शून्य' } mid_num_dict = {'10':'दस', '11': 'ग्यारह', '12': 'बारह', '13': 'तेरह', '14':'चौदह', '15': 'पंद्रह', '16': 'सोलह', '17': 'सत्रह', '18': 'अठारह', '19': 'उन्नीस', '20': 'बीस', '21': 'इक्कीस', '22': 'बाईस', '23': 'तेईस', '24': 'चौबीस', '25': 'पच्चीस', '26': 'छब्बीस', '27': 'सत्ताईस', '28': 'अट्ठाईस', '29': 'उनतीस', '30': 'तीस', '31': 'इकतीस', '32': 'बत्तीस', '33': 'तैंतीस', '34': 'चौंतीस', '35': 'पैंतीस', '36': 'छतीस', '37': 'सैंतीस', '38': 'अड़तीस', '39': 'उनतालीस', '40': 'चालीस', '41': 'इकतालीस', '42': 'बयालीस', '43': 'तैंतालीस', '44': 'चवालीस', '45': 'पैंतालीस', '46': 'छियालीस', '47': 'सैंतालीस', '48': 'अड़तालीस', '49': 'उड़ंचास', '50': 'पचास', '51': 'इक्यावन', '52': 'बावन', '53': 'तिरेपन', '54': 'चौवन', '55': 'पचपन', '56': 'छप्पन', '57': 'सत्तावन', '58': 'अट्ठावन', '59': 'उनसठ', '60': 'साठ', '61': 'इकसठ', '62': 'बासठ', '63': 'तिरेसठ', '64': 'चौसठ', '65': 'पैंसठ', '66': 'छियासठ', '67': 'सड़सठ', '68': 'अड़सठ', '69': 'उनहत्तर', '70': 'सत्तर', '71': 'इकहत्तर', '72': 'बहत्तर', '73': 'तिहत्तर', '74': 'चौहत्तर', '75': 'पिचहत्तर', '76': 'छिहत्तर', '77': 'सतत्तर', '78': 'अठहत्तर', '79': 'उनासी', '80': 'अस्सी', '81': 'इक्यासी', '82': 'बियासी', '83': 'तिरासी', '84': 'चौरासी', '85': 'पिचासी', '86': 'छियासी', '87': 'सत्तासी', '88': 'अट्ठासी', '89': 'नवासी', '90': 'नब्बे', '91': 'इक्यानवे', '92': 'बानवे', '93': 'तिरानवे', '94': 'चौरानवे', '95': 'पिचानवे', '96': 'छियानवे', '97': 'सत्तानवे', '98': 'अट्ठानवे', '99': 'निन्यानवे', '100': 'सौ' , '00': ' ' } def __init__(self, number): self.nummber_to_change = number def change_to_lst(self): my_lst = str(self.nummber_to_change).split('.') return my_lst def lst1str(self, lst): if lst == '0': return '' else: return self.low_num_dict.get(lst) def lst2str(self, lst): if lst == '00': return '' elif lst[0] == '0': return self.low_num_dict.get(lst[1]) else: return self.mid_num_dict.get(lst) def lst3str(self, lst): if lst == '000': return '' elif lst[0] == '0': return self.lst2str(lst[1:]) else: return f'{self.lst1str(lst[0])} सौ {self.lst2str(lst[1:])}' def lst4str(self, lst): if lst == '0000': return '' elif lst[0] == '0': return self.lst3str(lst[1:]) else: return f'{self.lst1str(lst[0])} हजार {self.lst3str(lst[1:])}' def lst5str(self, lst): if lst == '00000': return '' elif lst[0] == '0': return self.lst4str(lst[1:]) else: return f'{self.lst2str(lst[0])} हजार {self.lst3str(lst[1:])}' def lst_to_str(self, lst): length = len(lst) name_list = ['हज़ार', 'हज़ार','लाख','लाख', 'करोड़', 'करोड़', 'अरब', 'अरब', 'खरब', 'खरब', \ 'नील', 'नील', 'पद्म', 'पद्म', 'शंख', 'शंख', 'महाशंख', 'महाशंख', 'महाउपाध', 'महाउपाध', 'जलद', 'जलद', 'माध', 'माध', 'परार्ध', 'परार्ध', 'अंत', 'अंत', 'महा अंत', 'महा अंत', 'शिष्ट', 'शिष्ट', 'सिंघर', 'सिंघर', 'महा सिंघर', 'महा सिंघर', 'अदंत सिंघर', 'अदंत सिंघर'] if lst == '0': return self.low_num_dict.get(lst) elif length == 1: return self.lst1str(lst) elif length == 2: return self.lst2str(lst) elif length == 3: return self.lst3str(lst) elif 42 > length > 3:# 35 23 548 n = length - 3 lst2 = lst[:n] return_str = '' while length > 3: if length%2 == 0: if lst2[0] == '0': length -= 1 lst2 = lst2[1:] else: return_str = return_str + f'{self.lst1str(lst2[0])} {name_list[length-3]} ' length -= 1 lst2 = lst2[1:] else: if lst2[0:2] == '00': length -= 2 lst2 = lst2[2:] else: return_str = return_str + f'{self.lst2str(lst2[0:2])} {name_list[length-4]} ' length -= 2 lst2 = lst2[2:] return_str = return_str + self.lst3str(lst[n:]) 
return return_str else: return 'Number Too Long must be <= pow(10, 41)' def to_currency(self): length = len(self.change_to_lst()) lst1 = self.change_to_lst()[0] if length == 1: if lst1 == '1' : return 'एक रुपया' else: return f'{self.lst_to_str(lst1)} रूपये' elif length == 2: lst2 = self.change_to_lst()[1] if lst1 == '1' and lst2 == '01': return 'एक रुपया, एक पैसा' elif lst2 == '01': return f'{self.lst_to_str(lst1)} रूपये, एक पैसा' elif lst2 == '00': return f'{self.lst_to_str(lst1)} रूपये, शून्य पैसे' elif len(lst2) == 1: lst2 = lst2+'0' return f'{self.lst_to_str(lst1)} रूपये, {self.lst_to_str(lst2)} पैसे' else: return f'{self.lst_to_str(lst1)} रूपये, {self.lst_to_str(lst2)} पैसे' def to_words(self): length = len(self.change_to_lst()) lst1 = self.change_to_lst()[0] if length == 1: return self.lst_to_str(lst1) elif length == 2: lst2 = self.change_to_lst()[1] return f'{self.lst_to_str(lst1)} दशमलव {self.lst_to_str(lst2)}' elif length == 3: lst2 = self.change_to_lst()[1] lst3 = self.change_to_lst()[2] return f'{self.lst_to_str(lst1)} दशमलव {self.lst_to_str(lst2)} दशमलव {self.lst_to_str(lst3)}' else: return None 6. Reduce middleware between model and telephony to optimize performance GPT-4o Realtime API is designed to handle real-time, low-latency conversational interactions, making it suitable for applications like customer support agents, voice assistants, and real-time translators. To ensure compatibility and optimal performance, the API supports specific audio formats and sample rates. Supported Audio Formats and Sample Rates: PCM 16-bit: This is a raw audio format that provides uncompressed audio data, ensuring high-quality sound. G.711 a-law: A commonly used audio compression format in telephony systems, which balances quality and bandwidth efficiency. G.711 u-law: A commonly used audio compression format in telephony systems, which balances quality and bandwidth efficiency. Unlike other Realtime models, GPT-4o-Realtime supports an 8k sample rate. It is important not to place a middleware between SIP Telephony and GPT-4o-Realtime. Directly send the audio in G.711 a-law / G.711 u-law audio format. 7. Instruction Following Issue/ Prompt Best Practices: One of the primary challenges users face when working with GPT-4o-Realtime compared to previous OpenAI models, such as GPT-4 and GPT-4o-mini, is the distinct way prompts need to be structured. This difference in prompt engineering stems from several factors related to the model's architecture, capabilities, and intended use cases. GPT-4o-Realtime has been noted to struggle with following instructions not as effectively as its predecessors. The way prompts are crafted for GPT-4o-Realtime requires a higher degree of specificity. While previous models could sometimes infer intent from vague or broadly framed prompts, GPT-4o-Realtime tends to produce more relevant outputs when given clear, concise instructions. Here is the prompt that works well for GPT-4o-Realtime. You must set 5 key module within the Prompt Personality and Tone Context Reference Pronunciations Overall Instruction Conversation States Context In this we give the model all background context like customer Name, Business working hours , Company Name etc. 
# Context - Business name: Snowy Peak Boards - Hours: Monday to Friday, 8:00 AM - 6:00 PM; Saturday, 9:00 AM - 1:00 PM; Closed on Sundays - Locations (for returns and service centers): - 123 Alpine Avenue, Queenstown 9300, New Zealand - 456 Glacier Road, Wanaka 9305, New Zealand - Products & Services: - Wide variety of snowboards for all skill levels - Snowboard accessories and gear (boots, bindings, helmets, goggles) - Online fitting consultations - Loyalty program offering discounts and early access to new product lines Personality and Tone # Personality and Tone ## Identity You are a knowledgeable and patient tech support specialist with a background in computer engineering and a passion for helping people solve complex technological challenges. Your experience spans over a decade of working with various tech ecosystems, from consumer electronics to enterprise solutions. You've seen technologies evolve and have a deep understanding of both hardware and software intricacies. ## Task Your primary goal is to guide customers through technical issues, providing clear, step-by-step solutions while ensuring they feel supported and understood. You aim to demystify technology, making complex problems seem manageable and less intimidating. ## Demeanor You maintain a calm, methodical approach to problem-solving. Your demeanor is professional yet approachable, similar to a trusted mentor who can break down complex technical concepts into digestible information. You're genuinely invested in helping customers succeed, not just in solving their immediate problem. ## Tone Your voice is steady and reassuring, with a hint of technical precision. You speak with confidence but never condescension. When explaining technical concepts, you use analogies that make sense to people without a technical background, helping them understand without feeling overwhelmed. ## Level of Enthusiasm Your enthusiasm is intellectual and measured. You get excited about solving problems and discovering innovative solutions, but your excitement manifests as a calm, focused energy rather than high-pitched excitement. Think of a detective who's genuinely thrilled about cracking a complex case. ## Level of Formality Your communication style is professionally conversational. You use technical terminology when necessary but always explain it in layman's terms. It's like having a conversation with a highly skilled colleague who happens to be great at explaining things. ## Level of Emotion You are empathetic and understanding. When customers are frustrated, you acknowledge their feelings and focus on finding a solution. Your emotional support is practical—you validate their experience while simultaneously working towards resolving their issue. ## Filler Words Occasionally, you use filler words like "hmm," "let's see," or "interesting" to show you're actively processing information. These words help humanize your technical expertise and make the interaction feel more natural. ## Pacing Your pacing is deliberate and measured. You speak at a speed that allows for comprehension, pausing after explaining complex steps to ensure the customer is following along. When explaining technical processes, you break them down into clear, digestible segments. ## Other Details You always have a backup plan or alternative approach. If one solution doesn't work, you're quick to suggest another method. 
You're also prone to sharing quick, interesting tech tips that might help the customer in the future, showing that your support goes beyond just fixing the immediate issue. ## Communication Nuances - Use technical accuracy balanced with accessibility - Demonstrate patience with users of all technical skill levels - Provide context for why certain troubleshooting steps are necessary - Always offer a clear path forward, even if the solution isn't immediate - Maintain a problem-solving mindset that feels collaborative Reference Pronunciation: Now this is a key prompting technique to pronounce specific word. This if put properly can make a clear difference in the pronunciation. # Reference Pronunciations - “Snowy Peak Boards”: SNOW-ee Peek Bords - “Schedule”: SHED-yool - “Noah”: NOW-uh Overall Instruction: Here you add the overall instruction to the model. # Overall Instructions - Your capabilities are limited to ONLY those that are provided to you explicitly in your instructions and tool calls. You should NEVER claim abilities not granted here. - Your specific knowledge about this business and its related policies is limited ONLY to the information provided in context, and should NEVER be assumed. - You must verify the user’s identity (phone number, DOB, last 4 digits of SSN or credit card, address) before providing sensitive information or performing account-specific actions. - Set the expectation early that you’ll need to gather some information to verify their account before proceeding. - Don't say "I'll repeat it back to you to confirm" beforehand, just do it. - Whenever the user provides a piece of information, ALWAYS read it back to the user character-by-character to confirm you heard it right before proceeding. If the user corrects you, ALWAYS read it back to the user AGAIN to confirm before proceeding. - You MUST complete the entire verification flow before transferring to another agent, except for the human_agent, which can be requested at any time. Conversation States: Typically voice conversation are flow based for an outbound call. Hence you would like to flow the conversation in a specific manner. In that case you want to put the same in the prompt. Here is the sample example how you can put the same. [ { "id": "1_greeting", "description": "Initial contact and warm welcome for TechNest Smart Home support", "instructions": [ "Use the company name 'TechNest Smart Home Support'", "Provide a friendly initial greeting", "Mention available support channels" ], "examples": [ "Welcome to TechNest Smart Home Support! I'm here to help you resolve any issues with your smart home devices. How can I assist you today?" ], "transitions": [{ "next_step": "2_device_identification", "condition": "Once initial greeting is complete" }] }, { "id": "2_device_identification", "description": "Identify the specific smart home device experiencing issues", "instructions": [ "Ask the user to specify which TechNest device is having problems", "Request model number and serial number", "Confirm device details" ], "examples": [ "Could you tell me which TechNest device you're having trouble with? If possible, please provide the model number and serial number located on the device." 
], "transitions": [{ "next_step": "3_problem_description", "condition": "Device details are confirmed" }] }, { "id": "3_problem_description", "description": "Gather detailed information about the device issue", "instructions": [ "Ask for a comprehensive description of the problem", "Request specific error messages or behaviors", "Clarify any ambiguous details" ], "examples": [ "Can you describe the specific issue you're experiencing with your device? Please include any error messages, unusual behaviors, or specific symptoms." ], "transitions": [{ "next_step": "4_troubleshooting_steps", "condition": "Detailed problem description is obtained" }] }, { "id": "4_troubleshooting_steps", "description": "Provide initial troubleshooting guidance", "instructions": [ "Offer a series of standard troubleshooting steps", "Ask the user to attempt these steps", "Request feedback after each step" ], "examples": [ "I'm going to guide you through some standard troubleshooting steps. Let's start by:", "1. Unplugging the device for 30 seconds and plugging it back in", "2. Checking your home Wi-Fi connection", "3. Verifying the device's firmware is up to date" ], "transitions": [{ "next_step": "5_advanced_support", "condition": "Initial troubleshooting steps are completed" }] }, { "id": "5_advanced_support", "description": "Escalate to advanced support if initial steps fail", "instructions": [ "Determine if issue requires advanced technical support", "Collect additional diagnostic information", "Prepare for potential device replacement or repair" ], "examples": [ "I understand the initial troubleshooting steps didn't resolve your issue. Let's collect some additional diagnostic information to determine the next best course of action." ], "transitions": [{ "next_step": "6_warranty_check", "condition": "Advanced support assessment is complete" }] }, { "id": "6_warranty_check", "description": "Verify device warranty status", "instructions": [ "Request purchase date or serial number", "Check warranty coverage", "Explain repair or replacement options" ], "examples": [ "Could you provide me with the purchase date of your device? This will help me determine your warranty coverage." ], "transitions": [{ "next_step": "7_support_resolution", "condition": "Warranty status is confirmed" }] }, { "id": "7_support_resolution", "description": "Finalize support interaction and offer additional assistance", "instructions": [ "Summarize the support interaction", "Provide next steps", "Offer additional support resources" ], "examples": [ "Based on our conversation, here's what we'll do next...", "Would you like me to email you a detailed support summary?" ], "transitions": [{ "next_step": "8_customer_satisfaction", "condition": "Support resolution is communicated" }] }, { "id": "8_customer_satisfaction", "description": "Collect customer feedback and satisfaction rating", "instructions": [ "Request customer satisfaction rating", "Invite feedback on support experience", "Thank the customer" ], "examples": [ "On a scale of 1-5, how would you rate your support experience today?", "We're always looking to improve our service. Do you have any additional feedback?" ], "transitions": [{ "next_step": "end", "condition": "Feedback is collected" }] } ] The blog on GPT-4o-Realtime Best Practices provides an overview of the strengths and weaknesses of using the GPT-4o-Realtime model for voice bots. It highlights the simplicity of the architecture, low latency, and high reliability, making it suitable for complex conversational requirements. 
The document also discusses issues such as background noise sensitivity, interruption handling, and number pronunciation, and offers best practices for overcoming these challenges. Additionally, it covers the importance of synchronization, the use of custom neural voices, and the optimal handling of audio formats and sample rates for telephony applications.

All opinions are personal. I hope you like my blog; if you do, please follow me on LinkedIn. Thanks!

Manoranjan Rajguru
AI Global Belt Asia
https://www.linkedin.com/in/manoranjan-rajguru/

Azure AI Speech text to speech Feb 2025 updates: New HD voices and more
By Garfield He Our dedication to enhancing Azure AI Speech voices remains steadfast, as we continually strive to make them more expressive and engaging. We are pleased to announce an upgraded HD version of our neural text-to-speech service for selected voices. This latest iteration further enhances expressiveness by incorporating emotion detection based on the input context. Our advanced technology employs acoustic and linguistic features to produce speech with rich, natural variations. It effectively detects emotional cues within the text and autonomously adjusts the voice's tone and style. With this enhancement, users can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotional expression. What is new? Public preview: updated 13 HD voices to support multilingual Public preview: 14 new HD voices. GA: Super-Realistic Indian Voices: Aarti & Arjun GA: AOAI turbo voices. GA: Embedded voice support with emotions GA: Other quality improvements with regular updates Voice demos Public preview: updated 13 HD voices to support multilingual The latest model of these voices below is updated to support multilingual and more versatile capabilities. Voice name Script Audio de-DE-Seraphina:DragonHDLatestNeural Ich kann dir Der Alchimist von Paulo Coelho empfehlen. Es ist eine inspirierende Geschichte über einen jungen Hirten namens Santiago, der auf der Suche nach einem Schatz seine Träume verfolgt. Das Buch ist voller Weisheit und lehrt, dass der Weg oft wichtiger ist als das Ziel. Es ist eine wunderbare Lektüre für alle, die nach Motivation und Lebenssinn suchen. en-US-Brian:DragonHDLatestNeural Hey again, Elizabeth. Amazing! Well I’m happy to hear you liked my suggestions for 4 star hotels in Florence. Three hotels in that list are under 300 euros a night that week: the Hotel Orto, The Grand Hotel Cavour, and Hotel degli Orafi. en-US-Davis:DragonHDLatestNeural Oh no, I didn't mean to disappoint! How about this: If you want something quirky and fun with lots of laughs, go for Blue Man Group. If you're looking for something awe-inspiring and magical, Cirque du Soleil is your best bet. Both will make for a memorable first date, so you can't go wrong. en-US-Ava:DragonHDLatestNeural I'm really sorry to hear that you're feeling this way. Finding a therapist can be a great step towards feeling better. en-US-Andrew:DragonHDLatestNeural That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. en-US-Andrew2:DragonHDLatestNeural You're very welcome. I'm glad I could assist you. If there's ever anything you want to learn more about, just let me know. I enjoy answering questions and sharing knowledge, and if you ever find yourself bored again, we can always play another game. en-US-Emma:DragonHDLatestNeural Though I’m really sorry you’re having trouble with your mother-in-law. Have you talked to your husband about the situation?" en-US-Emma2:DragonHDLatestNeural Oh no, I didn't mean to disappoint! How about this: If you want something quirky and fun with lots of laughs, go for Blue Man Group. If you're looking for something awe-inspiring and magical, Cirque du Soleil is your best bet. Both will make for a memorable first date, so you can't go wrong. en-US-Steffan:DragonHDLatestNeural I don't care what they say. I'll find a way to break the curse," declared Mia, determination shining in her eyes as she gazed at the ancient tome. "But it's too dangerous," protested Alex, worry etched on his face. 
"I have to try," Mia replied, her voice firm with resolve. en-US-Aria:DragonHDLatestNeural Hey, Seth. How's it going? What can I help you with today? en-US-Jenny:DragonHDLatestNeural That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. ja-JP-Masaru:DragonHDLatestNeural 最近、新しいプロジェクト任されてさ、毎日残業で忙しいけど、けっこうチームの雰囲気もいい感じで充実してるよ。そっちは最近どう? zh-CN-Xiaochen:DragonHDLatestNeural 她给我们发了一张照片,呃,在一个满是山、山珍海味婚礼上她拿了一个巨小的饭盒在吃,反正就一个特别清淡的啊,减脂营备的餐,然后她呢当时在群里发这个是,嗯,为了求表扬,哈哈哈! Public preview: 12 new HD voices with v1.1 support: provide more variety to HD voices offering Voice name Script Audio de-DE-Florian:DragonHDLatestNeural Mein Lieblingsmusikgenre ist Jazz. Ich liebe die Improvisation und die Vielfalt der Klänge, die Jazz bietet. Die Energie und Emotionen, die durch die Musik transportiert werden, sind einfach einzigartig. Besonders faszinierend finde ich die Kombination aus traditionellen und modernen Elementen, die Jazz so zeitlos und spannend macht. en-US-Adam:DragonHDLatestNeural That's a fantastic dream! The cost of building an A-frame tiny home can vary depending on several factors. en-US-Phoebe:DragonHDLatestNeural I'm really sorry to hear that you're feeling this way. Finding a therapist can be a great step towards feeling better. en-US-Serena:DragonHDLatestNeural I’ll escalate this issue to our technical team right away. They’ll contact you within 24 hours with a solution. I won’t stop until this is fixed. en-US-Alloy:DragonHDLatestNeural ...and that’s when I realized how much living abroad teaches you outside the classroom. Oh, and if you’re just joining us, welcome! We’ve been talking about studying abroad, and I was just sharing this one story—my first week in Spain, I thought I had the language down, but when I tried ordering lunch, I panicked and ended up with callos, which are tripe. Not what I expected! But those little missteps really helped me get more comfortable with the language and culture. Anyway, stick around, because next I’ll be sharing some tips for adjusting to life abroad! en-US-Nova:DragonHDLatestNeural Imagine waking up to the sound of gentle waves and the warm Italian sun kissing your skin. At Bella Vista Resort, your dream holiday awaits! Nestled along the stunning Amalfi Coast, our luxurious beachfront resort offers everything you need for the perfect getaway. Indulge in spacious, elegantly designed rooms with breathtaking sea views, relax by our infinity pool, or savor authentic Italian cuisine at our on-site restaurant. Explore picturesque villages, soak up the sun on pristine sandy beaches, or enjoy thrilling water sports—there’s something for everyone! Join us for unforgettable sunsets and memories that will last a lifetime. Book your stay at Bella Vista Resort today and experience the ultimate sunny beach holiday in Italy! es-ES-Ximena:DragonHDLatestNeural Las luces del carnaval brillaban contra el cielo nocturno, atrayendo a los visitantes. Jake y Emma estaban muy emocionados. “¡Vamos a la rueda de la fortuna, Emma!” “Está bien, pero solo si me ganas un premio.” Pasaron la tarde montando montañas rusas y disfrutando de algodón de azúcar. es-ES-Tristan:DragonHDLatestNeural “A visitar a mis abuelos en el campo. ¿Y tú?” “Regreso a casa con mi familia.” El resto del viaje, compartieron historias y risas mientras el tren avanzaba." fr-FR-Vivienne:DragonHDLatestNeural En entrant dans le vieux manoir, Antoine ressentit un frisson le long de son dos. "On devrait vraiment faire demi-tour", dit Clara, les yeux écarquillés. 
"Mais j’ai entendu parler de trésors cachés ici", répondit Antoine, déterminé. Juste à ce moment-là, une porte claqua derrière eux. "On y va !" s’écria Clara, effrayée. fr-FR-Remy:DragonHDLatestNeural En explorant une grotte mystérieuse, Léo et Mia découvrirent des cristaux lumineux. "C’est magnifique !", s’écria Mia, éblouie. "Regarde, il y a un passage là-bas", dit Léo, pointant du doigt. En s’enfonçant dans la grotte, ils rencontrèrent un dragon endormi. "Nous devons être prudents", murmura Léo, réalisant l’aventure qui les attendait. ja-JP-Nanami:DragonHDLatestNeural えっと、プロジェクトの進捗ですが、予定通り進んでいます。まあ、いくつか問題もありましたが、既に対処済みです。順調に進めています。 zh-CN-Yunfan:DragonHDLatestNeural 每个人都可以在日常生活中采取一些简单的环保行动。我开始减少一次性塑料的使用,进行垃圾分类,并尽量节约能源。这些小措施虽然看似微不足道,但积累起来对环境的保护却能产生积极影响。我相信,关注环保不仅是为了现在的生活,也为未来的子孙着想。你们在环保方面有什么实用的建议吗? GA: Super-Realistic Indian Voices: Aarti & Arjun Aarti (female) and Arjun (male) are natural, conversational voices designed for the Indian market, offering a soft, soothing, and empathetic tone in both Hindi and English. Trained with professional voice artists and advanced SOTA modeling techniques, they excel in handling speech imperfections like pauses and interjections. Their realistic expressions and human-like intonation make them ideal for applications such as customer support, digital assistants, e-learning, and entertainment, ensuring dynamic, engaging, and clear communication in real-time interactions. Voice Domain Script Sample en-IN-AartiNeural Conversational Hmm, I’m not sure what to make for dinner tonight. I want to try something new, but I’m not sure what. Oh no, I hope I don’t end up making something that tastes terrible. Maybe I should look up some recipes online. en-IN-AartiNeural Neutral In the depths of the ocean, a curious mermaid named Leela explored a forgotten shipwreck. Amongst treasures, she found a mysterious locket. When she opened it, a holographic image of a distant world appeared. en-IN-AartiNeural Call center Ugh, that's really unfortunate. But don't worry, I'm here to help you block your VodaTel number right away. Can you please verify your identity by providing some personal details, like your full name, and maybe the last transaction or recharge you did? This will ensure we can proceed quickly and securely. en-IN-ArjunNeural Conversational Raju, listen carefully. For question five, write the formula I told you yesterday. For question six, focus on the main point only—don’t write extra. Remember the example we practiced. And for the last question, mention the key date. Keep it simple, no need for long explanations. en-IN-ArjunNeural Neutral Cooking is an art that brings people together. I love experimenting with new recipes, especially traditional Indian dishes like biryani and samosas. The aroma of masaale, the joy of sharing a homemade meal, and the satisfaction of a well-cooked dish are truly amazing. en-IN-ArjunNeural Call center Mr. Patel, your order (No. 789456123) included 3 paint brushes: Fine Tip Brush (Size 0), Flat Brush (Size 4), and Round Brush (Size 6). Please confirm if the damage is on the Round Brush or the Flat Brush handle. We’ll arrange a replacement as soon as the verification is done. 
hi-IN-AartiNeural Conversational मुझे समझ नहीं आ रहा है की आज रात खाने में क्या बनाया जाए। मैं कुछ नया, ट्राई करना चाहती हूं, लेकिन मुझे पता नहीं है कि क्या। शायद मुझे कुछ रेसिपी ऑनलाइन देखनी चाहिए। hi-IN-AartiNeural Neutral सुदर्शन क्रिया आपको अपने हृदय और अंतर्ज्ञान के माध्यम से ब्रह्मांड का ज्ञान और मार्गदर्शन प्राप्त करने की अनुमति देती है। श्वास लें और छोड़ें। आइए अब अपना ध्यान अपने हृदय पर केंद्रित करें। hi-IN-AartiNeural Call center ज़रूर, एक मिनट, मैं अभी टाइमिंग्स की जानकारी देती हूं... हांजी, मिल गयी! तो, सालारजंग संग्रहालय रविवार को सुबह 9 बजे खुलता है। इसका मतलब है कि आपके पास सभी चीज़ो को देखने और वास्तव में इतिहास को समझने के लिए पर्याप्त समय होगा। कोई और जानकारी चाहिए? hi-IN-ArjunNeural Conversational हम्म, सेमिनार सुबह 10 बजे शुरू होगा। साढ़े नौ तक पहुंचना चाहिए। क्या तुमने एजेंडा प्रिंट कर लिया? चलो बढ़िया। और हाँ, अपनी नोटबुक मत भूलना। सारे डॉक्युमेंट्स भी तैयार रखना। चलो फिर कल मिलते हैं! hi-IN-ArjunNeural Neutral मैं मानसिक स्वास्थ्य के महत्व पर विचार कर रहा था। जिस तरह हम अपने शरीर का ख्याल रखते हैं, उसी तरह अपने दिमाग का भी ख्याल रखना बेहद जरूरी है। नियमित ब्रेक, ध्यान और प्रियजनों से बात करने से मदद मिल सकती है। hi-IN-ArjunNeural Call center हेलो सर! आपका "ऐक्टिव फिटनेस ट्रैकर" जो 5 नवंबर 2024 को डेलिवर हुआ था, रिप्लेसमेंट के लिए योग्य है। आपकी शिकायत ID FT90987 पर दर्ज है। नया प्रोडक्ट 25 नवंबर 2024 तक डेलिवर होगा। अधिक जानकारी के लिए हमें 1800-000-123 पर कॉल करें। GA: AOAI turbo voices Turbo version of AOAI voices, which have the same persona and support SSML like other Azure voices, now is available to all Speech regions. Voice name AlloyTurboMultilingualNeural EchoTurboMultilingualNeural FableTurboMultilingualNeural NovaTurboMultilingualNeural OnyxTurboMultilingualNeural ShimmerTurboMultilingualNeural GA: Other quality improvements Quality improvement for those voices below with latest recipe Locale Voice Name Gender ar-EG ShakirNeural Male bg-BG KalinaNeural Female ca-ES EnricNeural Male ca-ES JoanaNeural Female da-DK JeppeNeural Male el-GR NestorasNeural Male en-IE EmilyNeural Female fi-FI HarriNeural Male fi-FI SelmaNeural Female fr-CH FabriceNeural Female fr-CH ArianeNeural Female he-IL HilaNeural Female he-IL AvriNeural Male hr-HR GabrijelaNeural Female id-ID ArdiNeural Male ms-MY YasminNeural Female nb-NO PernilleNeural Female nb-NO FinnNeural Male nl-NL MaartenNeural Male pt-PT RaquelNeural Female ro-RO AlinaNeural Female ro-RO EmilNeural Male ru-RU SvetlanaNeural Female sv-SE MattiasNeural Male sv-SE SofieNeural Female vi-VN HoaiMyNeural Female vi-VN NamMinhNeural Male zh-HK HiuMaanNeural Female zh-HK WanLungNeural Male GA: Embedded voice support with emotions Besides the general style, now embedded JennyNeural can support 14 other styles below: angry, assistant, cheerful, chat, customerservice, excited, friendly, hopeful, newscast, sad, shouting, terrified, unfriendly, whispering. Get started In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is bringing the best voices to our product, our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication. Microsoft offers over 500 neural voices covering more than 140 languages and locales. 
These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information:
- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us at ttsvoicefeedback@microsoft.com

Azure APIM Cost Rate Limiting with Cosmos & Flex Functions
Azure API Management (APIM) provides built-in rate limiting policies, but implementing sophisticated dollar-cost quota management for Azure OpenAI services requires a more tailored approach. This solution combines Azure Functions, Cosmos DB, and stored procedures to implement cost-based quota management with automatic renewal periods.

Architecture

Client → APIM (with RateLimitConfig) → Azure Function Proxy → Azure OpenAI
                                                ↓
                                     Cosmos DB (quota tracking)

Technical Implementation

1. Rate Limit Configuration in APIM

The rate limiting configuration is injected into the request body by APIM using a policy fragment. Here's an example for a basic $5 quota:

<set-variable name="rateLimitConfig" value="@{
    var productId = context.Product.Id;
    var config = new JObject();
    config["counterKey"] = productId;
    config["quota"] = 5;
    return config.ToString();
}" />
<include-fragment fragment-id="RateLimitConfig" />

For more advanced scenarios, you can customize token costs. Here's an example for a $10 quota with custom token pricing:

<set-variable name="rateLimitConfig" value="@{
    var productId = context.Product.Id;
    var config = new JObject();
    config["counterKey"] = productId;
    config["startDate"] = "2025-03-02T00:00:00Z";
    config["renewal_period"] = 86400;
    config["explicitEndDate"] = null;
    config["quota"] = 10;
    config["input_cost_per_token"] = 0.00003;
    config["output_cost_per_token"] = 0.00006;
    return config.ToString();
}" />
<include-fragment fragment-id="RateLimitConfig" />

Flexible Counter Keys

The counterKey parameter is highly flexible and can be set to any unique identifier that makes sense for your rate limiting strategy:
- Product ID: Limit all users of a specific APIM product (e.g., "starter", "professional")
- User ID: Apply individual limits per user
- Subscription ID: Track usage at the subscription level
- Custom combinations: Combine identifiers for granular control (e.g., "product_starter_user_12345")

Rate Limit Configuration Parameters

| Parameter | Description | Example Value | Required |
|---|---|---|---|
| counterKey | Unique identifier for tracking quota usage | "starter10" or "user_12345" | Yes |
| quota | Maximum cost allowed in the renewal period | 10 | Yes |
| startDate | When the quota period begins. If not provided, the system uses the time when the policy is first applied | "2025-03-02T00:00:00Z" | No |
| renewal_period | Seconds until quota resets (86400 = daily). If not provided, no automatic reset occurs | 86400 | No |
| endDate | Optional end date for the quota period | null or "2025-12-31T23:59:59Z" | No |
| input_cost_per_token | Custom cost per input token | 0.00003 | No |
| output_cost_per_token | Custom cost per output token | 0.00006 | No |

Scheduling and Time Windows

The time-based parameters work together to create flexible quota schedules (a sketch of this window logic follows the list):
- If the current date falls outside the range defined by startDate and endDate, requests will be rejected with an error
- The renewal window begins either on the specified startDate or when the policy is first applied
- The renewal_period determines how frequently the accumulated cost resets to zero
- Without a renewal_period, the quota accumulates indefinitely until the endDate is reached
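To make the window behaviour concrete, here is a minimal, illustrative Python sketch of how the proxy function might evaluate these time-based parameters before serving a request. The helper names (is_within_schedule, should_reset_window) are hypothetical and not taken from the repository; they simply restate the rules listed above.

from datetime import datetime, timedelta, timezone

def is_within_schedule(now, start_date, end_date):
    """Reject requests outside the configured startDate/endDate range."""
    if start_date and now < start_date:
        return False
    if end_date and now > end_date:
        return False
    return True

def should_reset_window(now, renewal_start, renewal_period_seconds):
    """No renewal_period means the accumulated cost never resets automatically."""
    if renewal_period_seconds is None:
        return False
    return now >= renewal_start + timedelta(seconds=renewal_period_seconds)

# Example: a daily (86400-second) window that started on 2025-03-02
now = datetime.now(timezone.utc)
start = datetime(2025, 3, 2, tzinfo=timezone.utc)

if not is_within_schedule(now, start, None):
    raise RuntimeError("Request outside the configured quota schedule")  # surfaced as an error to the caller
if should_reset_window(now, start, 86400):
    accumulated_cost = 0.0  # in the real solution, the Cosmos DB stored procedure performs this reset atomically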
2. Quota Checking and Cost Tracking

The Azure Function performs two key operations:
- Pre-request quota check: Before processing each request, it verifies whether the user has exceeded their quota
- Post-request cost tracking: After a successful request, it calculates the cost and updates the accumulated usage

Cost Calculation

For cost calculation, the system uses:
- Custom pricing: If input_cost_per_token and output_cost_per_token are provided in the rate limit config
- LiteLLM pricing: If custom pricing is not specified, the system falls back to LiteLLM's model prices for accurate cost estimation based on the model being used

The function returns appropriate HTTP status codes and headers:
- HTTP 429 (Too Many Requests) when quota is exceeded
- Response headers with usage information:
  x-counter-key: starter5
  x-accumulated-cost: 5.000915
  x-quota: 5

3. Cosmos DB for State Management

Cosmos DB maintains the quota state with documents that track:

{
  "id": "starter5",
  "counterKey": "starter5",
  "accumulatedCost": 5.000915,
  "startDate": "2025-03-02T00:00:00.000Z",
  "renewalPeriod": 86400,
  "renewalStart": 1741132800000,
  "endDate": null,
  "quota": 5
}

A stored procedure handles atomic updates to ensure accurate tracking, including:
- Adding costs to the accumulated total
- Automatically resetting costs when the renewal period is reached
- Updating quota values when configuration changes

Benefits
- Fine-grained Cost Control: Track actual API usage costs rather than just request counts
- Flexible Quotas: Set daily, weekly, or monthly quotas with automatic renewal
- Transparent Usage: Response headers provide real-time quota usage information
- Product Differentiation: Different APIM products can have different quota levels
- Custom Pricing: Override default token costs for special pricing tiers
- Flexible Tracking: Use any identifier as the counter key for versatile quota management
- Time-based Scheduling: Define active periods and automatic reset windows for quota management

Getting Started
1. Deploy the Azure Function with Cosmos DB integration
2. Configure APIM policies to include rate limit configuration
3. Set up different product policies for various quota levels

For a detailed implementation, visit our GitHub repository. Demo Video: https://www.youtube.com/watch?v=vMX86_XpSAo

Tags: #AzureOpenAI #APIM #CosmosDB #RateLimiting #Serverless

Unleashing Innovation: AI Agent Development with Azure AI Foundry
Creating AI agents using Azure AI Foundry is a game-changer for businesses and developers looking to harness the power of artificial intelligence. These AI agents can automate complex tasks, provide insightful data analysis, and enhance customer interactions, leading to increased efficiency and productivity. By leveraging Azure AI Foundry, organizations can build, deploy, and manage AI solutions with ease, ensuring they stay competitive in an ever-evolving technological landscape. The importance of creating AI agents lies in their ability to transform operations, drive innovation, and deliver personalized experiences, making them an invaluable asset in today's digital age. Let's take a look at how to create an agent on Azure AI Foundry. We'll explore some of the features and experiment with its capabilities in the playground. I recommend starting by creating a new resource group with a new Azure OpenAI resource. Once the Azure OpenAI resource is created, follow these steps to get started with Azure AI Foundry Agents.

Implementation Overview
1. Open Azure AI Foundry and click on the Azure AI Foundry link at the top right to get to the home page where you'll see all your projects.
2. Click on + Create project, then click on Create new hub.
3. Give it a name, then click Next and Create. New resources will be created with your new project.
4. Once inside your new project, you should see the Agents preview option on the left menu.
5. Select your Azure OpenAI Service resource and click Let's go.

We can now get started with implementation. A model needs to be deployed. However, it's important to consider which models can be used and in which regions when creating these agents. For a quick summary of what's currently available, see: Supported models in Azure AI Agent Service - Azure AI services | Microsoft Learn. Other supported models include Meta-Llama-405B-Instruct, Mistral-large-2407, Cohere-command-r-plus, and Cohere-command-r. I've deployed gpt-4 as Global Standard and can now create a new agent.

Click on + New agent. A new agent will be created, and details such as the agent instructions, model deployment, Knowledge and Action configurations, and model settings are shown. Incorporating knowledge into AI agents enhances their ability to provide accurate, relevant, and context-specific responses, making them more effective in automating tasks, answering complex queries, and supporting decision-making processes. Actions enable AI agents to perform specific tasks and interact with various services and data sources. Here we can leverage these abilities by adding a Custom Function, an OpenAPI 3.0 specified tool, or an Azure Function to help run tasks. The Code Interpreter feature within Actions empowers the agent to read and analyze datasets, generate code, and create visualizations such as graphs and charts. In the next section we'll go deeper into Code Interpreter's abilities.

Code Interpreter
For this next step I'll use the weatherHistory.csv file from the Weather Dataset for Code Interpreter to work on. Next to Actions, click + Add, then click Code interpreter and add the CSV file. Update the Instructions to "You are a Weather Data Expert Agent, designed to provide accurate, up-to-date, and detailed weather information." Let's explore what Code Interpreter can do. Click on Try in playground at the top right.
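Behind the scenes, Code Interpreter writes and runs Python against the uploaded file to answer questions like the ones that follow. As a rough, hypothetical illustration (not the agent's actual output), the analysis it generates for a "which month had the most rain?" question might resemble the snippet below; the column names ("Formatted Date", "Precip Type") are assumed from the public Weather Dataset and may differ in your copy.

import pandas as pd

# Load the uploaded dataset (column names assumed from the public Weather Dataset)
df = pd.read_csv("weatherHistory.csv")

# Parse timestamps and keep only observations recorded as rain
df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], utc=True)
rain = df[df["Precip Type"] == "rain"]

# Count rainfall observations per calendar month and report the wettest one
monthly_counts = rain.groupby(rain["Formatted Date"].dt.to_period("M")).size()
print(monthly_counts.idxmax(), int(monthly_counts.max()))

The value of the managed Code Interpreter is that the agent writes, runs, and interprets this kind of code for you directly in the playground or API call.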
I'll start by asking "can you tell me which month had the most rain?" Code Interpreter already knows that I'm asking a question about the data file I just gave it and will break down the question into multiple steps to provide the best possible answer. We can see that, based on the dataset, August 2010 had the most rain, with 768 instances of rainfall recorded. We'll take it a step further and create a graph using a different question. Let's ask the agent "ok, can you create a bar chart that shows the amount of rainfall from each year using the provided dataset?", to which the agent responds with the following: This is just a quick demonstration of how powerful Code Interpreter can be. Code Interpreter allows for efficient data interpretation and presentation, as shown above, making it easier to derive insights and make informed decisions. Next, we'll create and add a Bing Grounding Resource, which allows an agent to include real-time public web data in its responses.

Bing Grounding Resource
A Bing Grounding Resource is a powerful tool that enables AI agents to access and incorporate real-time data from the web into their responses, ensuring that the information provided by the agents is accurate, current, and relevant. An agent will be able to perform Bing searches when needed, fetching up-to-date information and enhancing the overall reliability and transparency of its responses. By leveraging Bing Grounding, AI agents can deliver more precise and contextually appropriate answers, significantly improving user satisfaction and trust.

To add a Bing Grounding resource to the agent:
1. Create the resource: Navigate to the Azure AI Foundry portal and create a new Bing Grounding resource.
2. Add knowledge: Go to your agent in Azure AI Foundry, click + Add next to Knowledge on the right side, select Grounding with Bing Search, then + Create connection.
3. Add the connection with an API key.

The Bing Grounding resource is now added to your agent. In the playground, I'll first ask "Is it raining over downtown New York today?" I get a live response that also includes links to the sources the information was retrieved from. The agent responds as shown below: Next, I'll ask the agent "How should I prepare for the weather in New York this week? Any clothing recommendations?", to which the agent responds: The agent is able to break down the question in detail using gpt-4, leveraging the source information from Bing and providing appropriate information to the user.

Beyond these, the capabilities of custom functions, OpenAPI 3.0 specified tools, and Azure Functions significantly enhance the versatility and power of Azure AI agents. Custom functions allow agents to perform specialized tasks tailored to specific business needs, while OpenAPI 3.0 specified tools enable seamless integration with a wide range of external services and APIs. Azure Functions further extend the agent's capabilities by allowing it to execute serverless code, automating complex workflows and processes. Together, these features empower developers to build highly functional and adaptable AI agents that can efficiently handle diverse tasks, drive innovation, and deliver exceptional value to users.

Conclusion
Developing an AI agent on Azure AI Foundry is a swift and efficient process, thanks to its robust features and comprehensive tools. The platform's Bing Grounding Resource ensures that your AI models are well-informed and contextually accurate, leveraging vast amounts of real-time data to enhance performance.
Additionally, the Code Interpreter simplifies the integration and execution of complex data analysis tasks. By utilizing these powerful resources, you can accelerate the development of intelligent agents that are not only capable of understanding and responding to user inputs but also continuously improve through iterative learning. Azure AI Foundry provides a solid foundation for creating innovative AI solutions that can drive significant value across various applications.

Additional Resources:
- Quickstart - Create a new Azure AI Agent Service project - Azure AI services | Microsoft Learn
- How to use Grounding with Bing Search in Azure AI Agent Service - Azure OpenAI | Microsoft Learn

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.

Learn about Azure AI during the Global AI Bootcamp 2025
The Global AI Bootcamp is starting next week, and it's more exciting than ever! With 135 bootcamps in 44 countries, this is your chance to be part of a global movement in AI innovation. 🤖🌍 From Germany to India, Nigeria to Canada, and beyond, join us for hands-on workshops, expert talks, and networking opportunities that will boost your AI skills and career. Whether you're a seasoned pro or just starting out, there's something for everyone! 🚀

Why Attend?
- 🛠️ Hands-on Workshops: Build and deploy AI models.
- 🎤 Expert Talks: Learn the latest trends from industry leaders.
- 🤝 Network: Connect with peers, mentors, and potential collaborators.
- 📈 Career Growth: Discover new career paths in AI.

Don't miss this incredible opportunity to learn, connect, and grow! Check out the event in your city or join virtually. Let's shape the future of AI together! 🌟 👉 Explore All Bootcamps
RAG Best Practice With AI Search

Please refer to my repo for more AI resources — you are welcome to star it: https://github.com/xinyuwei-david/david-share.git This article is from one of my repos: https://github.com/xinyuwei-david/david-share/tree/master/LLMs/RAG-Best-Practice

Although models like GPT-4 and GPT-3.5 are powerful, their knowledge cannot be the most up-to-date. Previously, we often introduced engineering techniques in the use of LLMs by treating prompt engineering, RAG, and fine-tuning as parallel methods. In fact, these three technologies can be combined.

Four stages of RAG
The thinking in the paper I read is excellent—it divides RAG into four stages.

Level 1: Explicit Fact Queries

Characteristics
- Simplicity: Directly retrieving explicit factual information from provided data without the need for complex reasoning or multi-step processing.
- Requirement: Efficiently and accurately retrieve relevant content and generate precise answers.

Techniques and Engineering Suggestions

a. Basic RAG Methods
- Data Preprocessing and Chunking: Divide long texts or documents into appropriate chunks for indexing and retrieval. Common chunking strategies include:
  - Fixed-Length Chunking: Splitting the text by fixed lengths, which may interrupt sentences or paragraphs.
  - Paragraph-Based or Semantic Chunking: Chunking based on natural paragraphs or semantic boundaries to maintain content integrity.
- Index Construction:
  - Sparse Indexing: Use traditional information retrieval methods like TF-IDF or BM25 based on keyword matching.
  - Dense Indexing: Use pre-trained language models (e.g., BERT) to generate text vector embeddings for vector retrieval.
- Retrieval Techniques: Utilize vector similarity calculations or keyword matching to retrieve the most relevant text fragments from the index.
- Answer Generation: Input the retrieved text fragments as context into the LLM to generate the final answer.

b. Improving Retrieval and Generation Phases
- Multimodal Data Processing: If the data includes tables, images, or other non-text information, convert them into text form or use multimodal models for processing.
- Retrieval Optimization:
  - Recursive Retrieval: Perform multiple rounds of retrieval when a single retrieval isn't sufficient to find the answer, gradually narrowing down the scope.
  - Retrieval Result Re-ranking: Use models to score or re-rank retrieval results, prioritizing the most relevant content.
- Generation Optimization:
  - Filtering Irrelevant Information: Before the generation phase, filter out retrieved content unrelated to the question to avoid interfering with the model's output.
  - Controlling Answer Format: Through carefully designed prompts, ensure the model generates answers with correct formatting and accurate content.

Engineering Practice Example
Example: Constructing a Q&A system to answer common questions about company products.
- Data Preparation: Collect all relevant product documents, FAQs, user manuals, etc. Clean, chunk, and index the documents.
- System Implementation: After a user asks a question, use dense vector retrieval to find the most relevant text fragments from the index. Input the retrieved fragments as context into the LLM to generate an answer.
- Optimization Strategies: Regularly update documents and indexes to ensure information is current. Monitor user feedback to improve retrieval strategies and prompt designs, enhancing answer quality.

Level 2: Implicit Fact Queries

Characteristics
- Increased Complexity: Requires a certain degree of reasoning or multi-step derivation based on the retrieved data.
Requirement: The model needs to decompose the question into multiple steps, retrieve and process them separately, and then synthesize the final answer. Techniques and Engineering Suggestions a. Multi-Hop Retrieval and Reasoning Iterative RAG: IRCoT (Iterative Retrieval Chain-of-Thought): Use chain-of-thought reasoning to guide the model in retrieving relevant information at each step, gradually approaching the answer. RAT (Retrieve and Answer with Thought): Introduce retrieval steps during the answering process, allowing the model to retrieve new information when needed. Question Decomposition: Break down complex questions into simpler sub-questions, retrieve and answer them individually, then synthesize the results. b. Graph or Tree Structured Retrieval and Reasoning Building Knowledge Graphs: Extract entities and relationships from data to construct knowledge graphs, helping the model understand complex dependencies. Graph Search Algorithms: Use algorithms like Depth-First Search (DFS) or Breadth-First Search (BFS) to find paths or subgraphs related to the question within the knowledge graph. c. Using SQL or Other Structured Queries Text-to-SQL Conversion: Convert natural language questions into SQL queries to retrieve answers from structured databases. Tool Support: Use existing text-to-SQL conversion tools (e.g., Chat2DB) to facilitate natural language to database query conversion. Engineering Practice Example Scenario: A user asks, "In which quarters over the past five years did company X's stock price exceed company Y's?" Question Decomposition: Obtain quarterly stock price data for company X and company Y over the past five years. Compare the stock prices for each quarter. Identify the quarters where company X's stock price exceeded company Y's. Implementation Steps: Step 1: Use text-to-SQL tools to convert the natural language query into SQL queries and retrieve relevant data from the database. Step 2: Use a programming language (e.g., Python) to process and compare the data. Step 3: Organize the results into a user-readable format. Answer Generation: Input the organized results as context into the LLM to generate a natural language response. Level 3: Interpretable Rationale Queries Characteristics Application of Domain-Specific Rules and Guidelines: The model needs to understand and follow rules typically not covered in pre-training data. Requirement: Integrate external rules, guidelines, or processes into the model so it can follow specified logic and steps when answering. Techniques and Engineering Suggestions a. Prompt Engineering and Prompt Optimization Designing Effective Prompts: Explicitly provide rules or guidelines within the prompt to guide the model in following specified steps when answering. Automated Prompt Optimization: Use optimization algorithms (e.g., reinforcement learning) to automatically search and optimize prompts, improving the model's performance on specific tasks. OPRO (Optimization with Prompt Rewriting): The model generates and evaluates prompts on its own, iteratively optimizing to find the best prompt combination. b. Chain-of-Thought (CoT) Prompts Guiding Multi-Step Reasoning: Require the model to display its reasoning process within the prompt, ensuring it follows specified logic. Manual or Automated CoT Prompt Design: Design appropriate CoT prompts based on task requirements or use algorithms to generate them automatically. c. 
Following External Processes or Decision Trees Encoding Rules and Processes: Convert decision processes into state machines, decision trees, or pseudocode for the model to execute. Model Adjustment: Enable the model to parse and execute these encoded rules. Engineering Practice Example Example: A customer service chatbot handling return requests. Scenario: A customer requests a return. The chatbot needs to guide the customer through the appropriate process according to the company's return policy. Technical Implementation: Rule Integration: Organize the company's return policies and procedures into clear steps or decision trees. Prompt Design: Include key points of the return policy within the prompt, requiring the model to guide the customer step by step. Model Execution: The LLM interacts with the customer based on the prompt, following the return process to provide clear guidance. Optimization Strategies: Prompt Optimization: Adjust prompts based on customer feedback to help the model more accurately understand and execute the return process. Multi-Turn Dialogue: Support multiple rounds of conversation with the customer to handle various potential issues and exceptions. Level 4: Hidden Rationale Queries Characteristics Highest Complexity: Involves domain-specific, implicit reasoning methods; the model needs to discover and apply these hidden logics from data. Requirement: The model must be capable of mining patterns and reasoning methods from large datasets, akin to the thought processes of domain experts. Techniques and Engineering Suggestions a. Offline Learning and Experience Accumulation Learning Patterns and Experience from Data: Train the model to generalize potential rules and logic from historical data and cases. Self-Supervised Learning: Use the model-generated reasoning processes (e.g., Chain-of-Thought) as auxiliary information to optimize the model's reasoning capabilities. b. In-Context Learning (ICL) Providing Examples and Cases: Include relevant examples within the prompt for the model to reference similar cases during reasoning. Retrieving Relevant Cases: Use retrieval modules to find cases similar to the current question from a database and provide them to the model. c. Model Fine-Tuning Domain-Specific Fine-Tuning: Fine-tune the model using extensive domain data to internalize domain knowledge. Reinforcement Learning: Employ reward mechanisms to encourage the model to produce desired reasoning processes and answers. Engineering Practice Example Example: A legal assistant AI handling complex cases. Scenario: A user consults on a complex legal issue. The AI needs to provide advice, citing relevant legal provisions and precedents. Technical Implementation: Data Preparation: Collect a large corpus of legal documents, case analyses, expert opinions, etc. Model Fine-Tuning: Fine-tune the LLM using legal domain data to equip it with legal reasoning capabilities. Case Retrieval: Use RAG to retrieve relevant precedents and legal provisions from a database. Answer Generation: Input the retrieved cases and provisions as context into the fine-tuned LLM to generate professional legal advice. Optimization Strategies: Continuous Learning: Regularly update the model by adding new legal cases and regulatory changes. Expert Review: Incorporate legal experts to review the model's outputs, ensuring accuracy and legality. 
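As a minimal sketch of the in-context learning pattern described above — retrieve a few similar cases, then fold them into the prompt — the snippet below uses the azure-search-documents and openai Python packages. The endpoints, index name, document fields ("title", "analysis"), and deployment name are placeholders, not taken from any referenced implementation.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder endpoints, keys, index, and deployment names
search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="legal-cases",
    credential=AzureKeyCredential("<search-key>"),
)
aoai = AzureOpenAI(
    api_key="<aoai-key>",
    api_version="2024-08-01-preview",
    azure_endpoint="https://<your-aoai>.openai.azure.com/",
)

question = "Is a verbal agreement enforceable for a property sale?"

# Retrieve a few similar historical cases to use as in-context examples
cases = search_client.search(search_text=question, top=3)
examples = "\n\n".join(f"Case: {c['title']}\nAnalysis: {c['analysis']}" for c in cases)

messages = [
    {"role": "system", "content": "You are a legal assistant. Reason by analogy to the retrieved cases and cite them."},
    {"role": "user", "content": f"Similar cases:\n{examples}\n\nQuestion: {question}"},
]
answer = aoai.chat.completions.create(model="<chat-deployment>", messages=messages)
print(answer.choices[0].message.content)

The same pattern applies to any hidden-rationale domain: the quality of the retrieved exemplars largely determines how well the model imitates expert reasoning.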
Comprehensive Consideration: Combining Fine-Tuned LLMs and RAG While fine-tuning LLMs can enhance the model's reasoning ability and domain adaptability, it cannot entirely replace the role of RAG. RAG has unique advantages in handling dynamic, massive, and real-time updated knowledge. Combining fine-tuning and RAG leverages their respective strengths, enabling the model to possess strong reasoning capabilities while accessing the latest and most comprehensive external knowledge. Advantages of the Combination Enhanced Reasoning Ability: Through fine-tuning, the model learns domain-specific reasoning methods and logic. Real-Time Knowledge Access: RAG allows the model to retrieve the latest external data in real-time when generating answers. Flexibility and Scalability: RAG systems can easily update data sources without the need to retrain the model. Practical Application Suggestions Combining Fine-Tuning and RAG for Complex Tasks: Use fine-tuning to enhance the model's reasoning and logic capabilities, while employing RAG to obtain specific knowledge and information. Evaluating Cost-Benefit Ratio: Consider the costs and benefits of fine-tuning; focus on fine-tuning core reasoning abilities and let RAG handle knowledge acquisition. Continuous Update and Maintenance: Establish data update mechanisms for the RAG system to ensure the external data accessed by the model is up-to-date and accurate. RAG Detailed technical explaination Retrieval Augmented Generation (RAG) is a technique that combines large language models (LLMs) with information retrieval. It enhances the model's capabilities by retrieving and utilizing relevant information from external knowledge bases during the generation process. This provides the model with up-to-date, domain-specific knowledge, enabling it to generate more accurate and contextually relevant responses. Purpose of RAG Why do we need RAG? Reducing Hallucinations: LLMs may produce inaccurate or false information, known as "hallucinations," when they lack sufficient context. RAG reduces the occurrence of hallucinations by providing real-time external information. Updating Knowledge: The pre-training data of LLMs may lag behind current information. RAG allows models to access the latest data sources, maintaining the timeliness of information. Enhancing Accuracy: By retrieving relevant background information, the model's answers become more accurate and professional. How RAG Works The core idea of RAG is to retrieve relevant information from a document repository and input it into the LLM along with the user's query, guiding the model to generate a more accurate answer. The general process is as follows: User Query: The user poses a question or request to the system. Retrieval Phase: The system uses the query to retrieve relevant document fragments (chunks) from a document repository or knowledge base. Generation Phase: The retrieved document fragments are input into the LLM along with the original query to generate the final answer. Key Steps to Building a RAG System Clarify Objectives Before starting to build a RAG system, you need to first clarify your goals: Upgrade Search Interface: Do you want to add semantic search capabilities to your existing search interface? Enhance Domain Knowledge: Do you wish to utilize domain-specific knowledge to enhance search or chat functions? Add a Chatbot: Do you want to add a chatbot to interact with customers? Expose Internal APIs: Do you plan to expose internal APIs through user dialogues? 
Clear objectives will guide the entire implementation process and help you choose the most suitable technologies and strategies. Data Preparation Data is the foundation of a RAG system, and its quality directly affects system performance. Data preparation includes the following steps: (1) Assess Data Formats Structured Data: Such as CSV, JSON, etc., which need to be converted into text format to facilitate indexing and retrieval. Tabular Data: May need to be converted or enriched to support more complex searches or interactions. Text Data: Such as documents, articles, chat records, etc., which may need to be organized or filtered. Image Data: Including flowcharts, documents, photographs, and similar images. (2) Data Enrichment Add Contextual Information: Supplement data with additional textual content, such as knowledge bases or industry information. Data Annotation: Label key entities, concepts, and relationships to enhance the model's understanding capabilities. (3) Choose the Right Platform Vector Databases: Such as AI Search, Qdrant, etc., used for storing and retrieving embedding vectors. Relational Databases: The database schema needs to be included in the LLM's prompts to translate user requests into SQL queries. Text Search Engines: Like AI Search, Elasticsearch, Couchbase, which can be combined with vector search to leverage both text and semantic search advantages. Graph Databases: Build knowledge graphs to utilize the connections and semantic relationships between nodes. Document Chunking In a RAG system, document chunking is a critical step that directly affects the quality and relevance of the retrieved information. Below are the best practices for chunking: Model Limitations: LLMs have a maximum context length limitation. Improve Retrieval Efficiency: Splitting large documents into smaller chunks helps to improve retrieval accuracy and speed. Methods to do chunking Fixed-Size Chunking: Define a fixed size (e.g., 200 words) for chunks and allow a certain degree of overlap (e.g., 10-15%). Content-Based Variable-Size Chunking: Chunk based on content features (such as sentences, paragraphs, Markdown structures). Custom or Iterative Chunking Strategies: Combine fixed-size and variable-size methods and adjust according to specific needs. Importance of Content Overlap Preserve Context: Allowing some overlap between chunks during chunking helps to retain contextual information. Recommendation: Start with about 10% overlap and adjust based on specific data types and use cases. Choosing the Right Embedding Model Embedding models are used to convert text into vector form to facilitate similarity computation. When choosing an embedding model, consider: Model Input Limitations: Ensure the input text length is within the model's allowable range. Model Performance and Effectiveness: Choose a model with good performance and suitable effectiveness based on the specific application scenario. New Embedding Models: OpenAI has introduced two new embedding models: text-embedding-3-small and text-embedding-3-large. Model Size and Performance: text-embedding-3-large is a larger and more powerful embedding model capable of creating embeddings with up to 3,072 dimensions. Performance Improvements: MIRACL Benchmark: text-embedding-3-large scored 54.9 on the MIRACL benchmark, showing a significant improvement over text-embedding-ada-002 which scored 31.4. MTEB Benchmark: text-embedding-3-large scored 64.6 on the MTEB benchmark, surpassing text-embedding-ada-002 which scored 61.0. 
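To make the embedding discussion concrete, here is a small, hedged example of generating embeddings with text-embedding-3-large through the Azure OpenAI Python SDK, using the dimensions parameter to request shortened vectors (a flexibility discussed further below); the deployment name is a placeholder, and you should confirm that your deployment and API version expose the dimensions option before relying on it.

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-08-01-preview",
    azure_endpoint="https://<your-resource>.openai.azure.com/",
)

chunks = [
    "Azure AI Search combines keyword and vector retrieval in hybrid mode.",
    "text-embedding-3-large can produce up to 3,072-dimensional embeddings.",
]

# Request 256-dimensional embeddings instead of the full 3,072 dimensions
response = client.embeddings.create(
    model="<text-embedding-3-large-deployment>",
    input=chunks,
    dimensions=256,
)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 256 dimensions each

Shorter vectors reduce index size and query latency in AI Search at some cost in retrieval quality — the performance/cost balance analyzed in the next paragraph.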
Analysis of Improvements: Higher Dimensions: The ability of text-embedding-3-large to create embeddings with up to 3,072 dimensions allows it to better capture and represent the concepts and relationships within the content. Improved Training Techniques: The new model employs more advanced training techniques and optimization methods, resulting in better performance on multilingual retrieval and English tasks. Flexibility: text-embedding-3-large allows developers to balance performance and cost by adjusting the dimensionality of the embeddings. For example, reducing the 3,072-dimensional embeddings to 256 dimensions can still outperform the uncompressed text-embedding-ada-002 on the MTEB benchmark. Note: To migrate from text-embedding-ada-002 to text-embedding-3-large, you'll need to manually generate new embeddings, as upgrading between embedding models is not automatic. The first step is to deploy the new model (text-embedding-3-large) within your Azure environment. After that, re-generate embeddings for all your data, as embeddings from the previous model will not be compatible with the new one. AI Search Service Capacity and Performance Optimization Service Tiers and Capacity Refer to: https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity Upgrade Service Tier: Upgrading from Standard S1 to S2 can provide higher performance and storage capacity. Increase Partitions and Replicas: Adjust based on query load and index size. Avoid Complex Queries: Reduce the use of high-overhead queries, such as regular expression queries. Query Optimization: Retrieve only the required fields, limit the amount of data returned, and use search functions rather than complex filters.Tips for Improving Azure AI Search Performance Tips for Improving Azure AI Search Performance Index Size and Architecture: Regularly optimize the index; remove unnecessary fields and documents. Query Design: Optimize query statements to reduce unnecessary scanning and computation. Service Capacity: Adjust replicas and partitions appropriately based on query load and index size. Avoid Complex Queries: Reduce the use of high-overhead queries, such as regular expression queries. Chunking Large Documents Use Built-in Text Splitting Skills: Choose modes like pages or sentences based on needs. Adjust Parameters: Set appropriate maximumPageLength, pageOverlapLength, etc., based on document characteristics. Use Tools Like LangChain: For more flexible chunking and embedding operations. L1+L2 Search + Query Rewriting and New Semantic Reranker L1 Hybrid Search+L2 Re-ranker:Enhance search result Query Rewriting: Improve recall rate and accuracy by rewriting user queries. Semantic Reranker: Use cross-encoders to re-rank candidate results, enhancing result relevance. Prompt Engineering Use Rich Examples: Provide multiple examples to guide the model's learning and improve its responses. Provide Clear Instructions: Ensure that instructions are explicit and unambiguous to avoid misunderstandings. Restrict Input and Output Formats: Define acceptable input and output formats to prevent malicious content and protect model security. Note: Prompt Engineering is not suitable for Azure OpenAI o1 Reference: https://mp.weixin.qq.com/s/tLcAfPU6hUkFsNMjDFeklw?token=1531586958&lang=zh_CN Your task is to review customer questions and categorize them into one of the following 4 types of problems. The review steps are as follows, please perform step by step: 1. Extract three keywords from customer questions and translate them into English. 
Please connect the three keywords with commas to make it a complete JSON value. 2. Summarize the customer’s questions in 15 more words and in English. 3. Categorize the customer’s questions based on Review text and summary. Category list: • Technical issue: customer is experiencing server-side issues, client errors or product limitations. Example: "I'm having trouble logging into my account. It keeps saying there's a server error." • Product inquiry: customer would like to know more details about the product or is asking questions about how to use it. Example: "Can you provide more information about the product and how to use it?" • Application status: customer is requesting to check the status of their Azure OpenAI, GPT-4 or DALLE application. Example: "What is the status of my Azure OpenAI application?" • Unknown: if you only have a low confidence score to categorize. Example: "I'm not sure if this is the right place to ask, but I have a question about billing." Provide them in JSON format with the following keys: Case id; Key-words; Summary; Category. Please generate only one JSON structure per review. Please show the output results - in a table, the table is divided into four columns: Case ID, keywords, Summary, Category. Demo: Lenovo ThinkPad Product RAG I have a Lenovo ThinkPad product manual, and I want to build a RAG (Retrieval-Augmented Generation) system based on it. The document includes up to dozens of product models, many of which have very similar names. Moreover, the document is nearly 900 pages long. Therefore, to construct a RAG system based on this document and provide precise answers, I need to address the following issues: How to split the document; How to avoid loss of relevance; How to resolve information discontinuity; The problem of low search accuracy due to numerous similar products; The challenge of diverse questioning on the system's generalization ability. In the end, I chunked the document based on the product models and set more effective prompts, so that the RAG (Retrieval-Augmented Generation) system can accurately answer questions. System prompt You are an AI assistant that helps people answer the question of Lenovo product. Please be patient and try to answer your questions in as much detail as possible and give the reason. When you use reference documents, please list the file names directly, not only Citation 1, should be ThinkPad E14 Gen 4 (AMD).pdf.txt eg. Index format is as following: Final Result: Please see my demo vedios on Yutube Refer to: https://arxiv.org/pdf/2409.14924v11.9KViews1like1CommentImplementing Event Hub Logging for Azure OpenAI Streaming APIs
Azure OpenAI's streaming responses use Server-Sent Events (SSE), which support only one subscriber. This creates a challenge when using APIM's Event Hub Logger, as it would consume the stream and prevent the actual client from receiving the response. This solution introduces a lightweight Azure Function proxy that enables Event Hub logging while preserving the streaming response for clients. With token usage data now available in both streaming and non-streaming AOAI APIs, we can monitor this the right way!

Architecture

Client → APIM → Azure Function Proxy → Azure OpenAI
                        ↓
                   Event Hub

Technical Implementation

Streaming Response Handling
The core implementation uses FastAPI's StreamingResponse to handle Server-Sent Events (SSE) streams with three key components:

1. Content Aggregation

async def process_openai_stream(response, messages, http_client, start_time):
    content_buffer = []

    async def generate():
        for chunk in response:
            if chunk.choices[0].delta.content:
                content_buffer.append(chunk.choices[0].delta.content)
            yield f"data: {json.dumps(chunk.model_dump())}\n\n"

This enables real-time streaming to clients while collecting the complete response for logging. The content buffer maintains minimal memory overhead by storing only text content.

2. Token Usage Collection

if hasattr(chunk, 'usage') and chunk.usage:
    log_data = {
        "type": "stream_completion",
        "content": "".join(content_buffer),
        "usage": chunk.usage.model_dump(),
        "model": model_name,
        "region": headers.get("x-ms-region", "unknown")
    }
    log_to_eventhub(log_data)

Token usage metrics are captured from the final chunk, providing accurate consumption data for cost analysis and monitoring.

3. Performance Tracking

@app.route(route="openai/deployments/{deployment_name}/chat/completions")
async def aoaifn(req: Request):
    start_time = time.time()
    response = await process_request()
    latency_ms = int((time.time() - start_time) * 1000)
    log_data["latency_ms"] = latency_ms

End-to-end latency measurement includes request processing, the OpenAI API call, and response handling, enabling performance monitoring and optimization.

Demo
(Screenshots: Function Start, API Call, Event Hub)

Setup
1. Deploy the Azure Function
2. Configure environment variables:

AZURE_OPENAI_KEY=
AZURE_OPENAI_API_VERSION=2024-08-01-preview
AZURE_OPENAI_BASE_URL=https://.openai.azure.com/
AZURE_EVENTHUB_CONN_STR=

3. Update APIM routing to point to the Function App

Extension scenarios:
- APIM Managed Identity Auth token passthrough
- PII Filtering: Integration with Azure Presidio for real-time PII detection and masking in logs
- Cost Analysis: Token usage mapping to Azure billing metrics
- Latency based routing: AOAI endpoint ranking could be built based on latency metrics
- Monitoring Dashboard: Real-time visualisation of:
  - Token usage per model/deployment
  - Response latencies
  - Error rates
  - Regional distribution

Implementation available on GitHub.
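The snippets above reference a log_to_eventhub helper that isn't shown. As a hedged sketch — not the repository's actual implementation — it could be built with the azure-eventhub package roughly as follows, reading the connection string from the AZURE_EVENTHUB_CONN_STR setting described in the setup steps; the eventhub_name value is a hypothetical placeholder.

import json
import os

from azure.eventhub import EventData, EventHubProducerClient

def log_to_eventhub(log_data: dict) -> None:
    """Send one JSON log record to Event Hub (illustrative implementation)."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=os.environ["AZURE_EVENTHUB_CONN_STR"],
        eventhub_name="aoai-logs",  # hypothetical hub name
    )
    try:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(log_data)))
        producer.send_batch(batch)
    finally:
        producer.close()

# Example record mirroring the stream_completion payload shown above
log_to_eventhub({
    "type": "stream_completion",
    "model": "gpt-4o",
    "usage": {"prompt_tokens": 120, "completion_tokens": 45, "total_tokens": 165},
    "latency_ms": 830,
})

Downstream, any Event Hub consumer (Stream Analytics, Fabric Eventstream, or a small Python consumer) can read these records to build the cost and latency dashboards listed in the extension scenarios.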