Azure OpenAI Service
Use Azure OpenAI and APIM with the OpenAI Agents SDK
The OpenAI Agents SDK provides a powerful framework for building intelligent AI assistants with specialised capabilities. In this blog post, I'll demonstrate how to integrate Azure OpenAI Service and Azure API Management (APIM) with the OpenAI Agents SDK to create a banking assistant system with specialised agents. Key Takeaways: Learn how to connect the OpenAI Agents SDK to Azure OpenAI Service Understand the differences between direct Azure OpenAI integration and using Azure API Management Implement tracing with the OpenAI Agents SDK for monitoring and debugging Create a practical banking application with specialized agents and handoff capabilities The OpenAI Agents SDK The OpenAI Agents SDK is a powerful toolkit that enables developers to create AI agents with specialised capabilities, tools, and the ability to work together through handoffs. It's designed to work seamlessly with OpenAI's models, but can be integrated with Azure services for enterprise-grade deployments. Setting Up Your Environment To get started with the OpenAI Agents SDK and Azure, you'll need to install the necessary packages: pip install openai openai-agents python-dotenv You'll also need to set up your environment variables. Create a `.env` file with your Azure OpenAI or APIM credentials: For Direct Azure OpenAI Connection: # .env file for Azure OpenAI AZURE_OPENAI_API_KEY=your_api_key AZURE_OPENAI_API_VERSION=2024-08-01-preview AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/ AZURE_OPENAI_DEPLOYMENT=your-deployment-name For Azure API Management (APIM) Connection: # .env file for Azure APIM AZURE_APIM_OPENAI_SUBSCRIPTION_KEY=your_subscription_key AZURE_APIM_OPENAI_API_VERSION=2024-08-01-preview AZURE_APIM_OPENAI_ENDPOINT=https://your-apim-name.azure-api.net/ AZURE_APIM_OPENAI_DEPLOYMENT=your-deployment-name Connecting to Azure OpenAI Service The OpenAI Agents SDK can be integrated with Azure OpenAI Service in two ways: direct connection or through Azure API Management (APIM). Option 1: Direct Azure OpenAI Connection from openai import AsyncAzureOpenAI from agents import set_default_openai_client from dotenv import load_dotenv import os # Load environment variables load_dotenv() # Create OpenAI client using Azure OpenAI openai_client = AsyncAzureOpenAI( api_key=os.getenv("AZURE_OPENAI_API_KEY"), api_version=os.getenv("AZURE_OPENAI_API_VERSION"), azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"), azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT") ) # Set the default OpenAI client for the Agents SDK set_default_openai_client(openai_client) Option 2: Azure API Management (APIM) Connection from openai import AsyncAzureOpenAI from agents import set_default_openai_client from dotenv import load_dotenv import os # Load environment variables load_dotenv() # Create OpenAI client using Azure APIM openai_client = AsyncAzureOpenAI( api_key=os.getenv("AZURE_APIM_OPENAI_SUBSCRIPTION_KEY"), # Note: Using subscription key api_version=os.getenv("AZURE_APIM_OPENAI_API_VERSION"), azure_endpoint=os.getenv("AZURE_APIM_OPENAI_ENDPOINT"), azure_deployment=os.getenv("AZURE_APIM_OPENAI_DEPLOYMENT") ) # Set the default OpenAI client for the Agents SDK set_default_openai_client(openai_client) Key Difference: When using Azure API Management, you use a subscription key instead of an API key. This provides an additional layer of management, security, and monitoring for your OpenAI API access. 
Creating Agents with the OpenAI Agents SDK Once you've set up your Azure OpenAI or APIM connection, you can create agents using the OpenAI Agents SDK: from agents import Agent from openai.types.chat import ChatCompletionMessageParam # Create a banking assistant agent banking_assistant = Agent( name="Banking Assistant", instructions="You are a helpful banking assistant. Be concise and professional.", model="gpt-4o", # This will use the deployment specified in your Azure OpenAI/APIM client tools=[check_account_balance] # A function tool defined elsewhere (a sketch follows below) ) The OpenAI Agents SDK automatically uses the Azure OpenAI or APIM client you've configured, making it seamless to switch between different Azure environments or configurations. Note that at the time of writing this article, there is an ongoing bug where the OpenAI Agents SDK fetches the old input_tokens and output_tokens fields instead of the prompt_tokens and completion_tokens returned by newer Chat Completions APIs. You therefore need to manually update the agents/run.py file to make this work, per https://github.com/openai/openai-agents-python/pull/65/files Implementing Tracing with Azure OpenAI The OpenAI Agents SDK includes powerful tracing capabilities that can help you monitor and debug your agents. When using Azure OpenAI or APIM, you can implement two types of tracing: 1. Console Tracing for Development Console logging is rather verbose; if you would like to explore the spans, enable console tracing as shown below: from agents import Agent, HandoffInputData, Runner, function_tool, handoff, trace, set_default_openai_client, set_tracing_disabled, OpenAIChatCompletionsModel, set_tracing_export_api_key, add_trace_processor from agents.tracing.processors import ConsoleSpanExporter, BatchTraceProcessor # Set up console tracing console_exporter = ConsoleSpanExporter() console_processor = BatchTraceProcessor(exporter=console_exporter) add_trace_processor(console_processor) 2. OpenAI Dashboard Tracing Currently, the spans are sent to https://api.openai.com/v1/traces/ingest from agents import Agent, HandoffInputData, Runner, function_tool, handoff, trace, set_default_openai_client, set_tracing_disabled, OpenAIChatCompletionsModel, set_tracing_export_api_key, add_trace_processor set_tracing_export_api_key(os.getenv("OPENAI_API_KEY")) Tracing is particularly valuable when working with Azure deployments, as it helps you monitor usage, performance, and behavior across different environments. Running Agents with Azure OpenAI To run your agents with Azure OpenAI or APIM, use the Runner class from the OpenAI Agents SDK: from agents import Runner import asyncio async def main(): # Run the banking assistant result = await Runner.run( banking_assistant, input="Hi, I'd like to check my account balance." ) print(f"Response: {result.response.content}") if __name__ == "__main__": asyncio.run(main()) Practical Example: Banking Agents System Let's look at how we can use Azure OpenAI or APIM with the OpenAI Agents SDK to create a banking system with specialized agents and handoff capabilities. 1. Define Specialized Banking Agents We'll create several specialized agents: General Banking Assistant: Handles basic inquiries and account information Loan Specialist: Focuses on loan options and payment calculations Investment Specialist: Provides guidance on investment options Customer Service Agent: Routes inquiries to specialists
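The snippets above and the handoff example in the next section reference a check_account_balance function tool and two specialist agents that are defined elsewhere in the accompanying repository. As a rough sketch only (the account data, tool logic, and specialist instructions below are illustrative assumptions, not the repository's actual definitions), they might look like this:

```python
from agents import Agent, function_tool

# Hypothetical in-memory account data, for illustration only
_ACCOUNTS = {"12345": 2543.75}

@function_tool
def check_account_balance(account_id: str) -> str:
    """Return the current balance for the given account ID."""
    balance = _ACCOUNTS.get(account_id)
    if balance is None:
        return f"No account found with ID {account_id}."
    return f"The balance for account {account_id} is ${balance:,.2f}."

# Specialist agents used as handoff targets in the next section
loan_specialist_agent = Agent(
    name="Loan Specialist",
    instructions="You are a loan specialist. Explain loan options and payment terms clearly.",
    model="gpt-4o",
)

investment_specialist_agent = Agent(
    name="Investment Specialist",
    instructions="You are an investment specialist. Provide guidance on investment options.",
    model="gpt-4o",
)
```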
2. Implement Handoff Between Agents from agents import handoff, HandoffInputData from agents.extensions import handoff_filters # Define a filter for handoff messages def banking_handoff_message_filter(handoff_message_data: HandoffInputData) -> HandoffInputData: # Remove any tool-related messages from the message history handoff_message_data = handoff_filters.remove_all_tools(handoff_message_data) return handoff_message_data # Create customer service agent with handoffs customer_service_agent = Agent( name="Customer Service Agent", instructions="""You are a customer service agent at a bank. Help customers with general inquiries and direct them to specialists when needed. If the customer asks about loans or mortgages, handoff to the Loan Specialist. If the customer asks about investments or portfolio management, handoff to the Investment Specialist.""", handoffs=[ handoff(loan_specialist_agent, input_filter=banking_handoff_message_filter), handoff(investment_specialist_agent, input_filter=banking_handoff_message_filter), ], tools=[check_account_balance], ) 3. Trace the Conversation Flow from agents import trace async def main(): # Trace the entire run as a single workflow with trace(workflow_name="Banking Assistant Demo"): # Run the customer service agent result = await Runner.run( customer_service_agent, input="I'm interested in taking out a mortgage loan. Can you help me understand my options?" ) print(f"Response: {result.response.content}") if __name__ == "__main__": asyncio.run(main()) Benefits of Using Azure OpenAI/APIM with the OpenAI Agents SDK Integrating Azure OpenAI or APIM with the OpenAI Agents SDK offers several advantages: Enterprise-Grade Security: Azure provides robust security features, compliance certifications, and private networking options Scalability: Azure's infrastructure can handle high-volume production workloads Monitoring and Management: APIM provides additional monitoring, throttling, and API management capabilities Regional Deployment: Azure allows you to deploy models in specific regions to meet data residency requirements Cost Management: Azure provides detailed usage tracking and cost management tools Conclusion The OpenAI Agents SDK combined with Azure OpenAI Service or Azure API Management provides a powerful foundation for building intelligent, specialized AI assistants. By leveraging Azure's enterprise features and the OpenAI Agents SDK's capabilities, you can create robust, scalable, and secure AI applications for production environments. Whether you choose direct Azure OpenAI integration or Azure API Management depends on your specific needs for API management, security, and monitoring. Both approaches work seamlessly with the OpenAI Agents SDK, making it easy to build sophisticated agent-based applications. Repo: https://github.com/hieumoscow/azure-openai-agents Video demo: https://www.youtube.com/watch?v=gJt-bt-vLJY
The Future of AI: Customizing AI agents with the Semantic Kernel agent framework
The blog post Customizing AI agents with the Semantic Kernel agent framework discusses the capabilities of the Semantic Kernel SDK, an open-source tool developed by Microsoft for creating AI agents and multi-agent systems. It highlights the benefits of using single-purpose agents within a multi-agent system to achieve more complex workflows with improved efficiency. The Semantic Kernel SDK offers features like telemetry, hooks, and filters to ensure secure and responsible AI solutions, making it a versatile tool for both simple and complex AI projects.
Azure APIM Cost Rate Limiting with Cosmos & Flex Functions
Azure API Management (APIM) provides built-in rate limiting policies, but implementing sophisticated dollar-cost quota management for Azure OpenAI services requires a more tailored approach. This solution combines Azure Functions, Cosmos DB, and stored procedures to implement cost-based quota management with automatic renewal periods. Architecture Client → APIM (with RateLimitConfig) → Azure Function Proxy → Azure OpenAI ↓ Cosmos DB (quota tracking) Technical Implementation 1. Rate Limit Configuration in APIM The rate limiting configuration is injected into the request body by APIM using a policy fragment. Here's an example for a basic $5 quota: <set-variable name="rateLimitConfig" value="@{ var productId = context.Product.Id; var config = new JObject(); config["counterKey"] = productId; config["quota"] = 5; return config.ToString(); }" /> <include-fragment fragment-id="RateLimitConfig" /> For more advanced scenarios, you can customize token costs. Here's an example for a $10 quota with custom token pricing: <set-variable name="rateLimitConfig" value="@{ var productId = context.Product.Id; var config = new JObject(); config["counterKey"] = productId; config["startDate"] = "2025-03-02T00:00:00Z"; config["renewal_period"] = 86400; config["explicitEndDate"] = null; config["quota"] = 10; config["input_cost_per_token"] = 0.00003; config["output_cost_per_token"] = 0.00006; return config.ToString(); }" /> <include-fragment fragment-id="RateLimitConfig" /> Flexible Counter Keys The counterKey parameter is highly flexible and can be set to any unique identifier that makes sense for your rate limiting strategy: Product ID: Limit all users of a specific APIM product (e.g., "starter", "professional") User ID: Apply individual limits per user Subscription ID: Track usage at the subscription level Custom combinations: Combine identifiers for granular control (e.g., "product_starter_user_12345") Rate Limit Configuration Parameters counterKey: Unique identifier for tracking quota usage (example: "starter10" or "user_12345"). Required. quota: Maximum cost allowed in the renewal period (example: 10). Required. startDate: When the quota period begins; if not provided, the system uses the time when the policy is first applied (example: "2025-03-02T00:00:00Z"). Optional. renewal_period: Seconds until the quota resets (86400 = daily); if not provided, no automatic reset occurs (example: 86400). Optional. endDate: Optional end date for the quota period (example: null or "2025-12-31T23:59:59Z"). Optional. input_cost_per_token: Custom cost per input token (example: 0.00003). Optional. output_cost_per_token: Custom cost per output token (example: 0.00006). Optional. Scheduling and Time Windows The time-based parameters work together to create flexible quota schedules: If the current date falls outside the range defined by startDate and endDate, requests will be rejected with an error The renewal window begins either on the specified startDate or when the policy is first applied The renewal_period determines how frequently the accumulated cost resets to zero Without a renewal_period, the quota accumulates indefinitely until the endDate is reached
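To make the scheduling behaviour concrete, here is a minimal Python sketch (not the repository's actual implementation) of how the proxy function might interpret these parameters against a stored counter; the counter shape (accumulatedCost plus renewalStart in epoch milliseconds) is an assumption that mirrors the state document shown in the next section:

```python
from datetime import datetime, timedelta, timezone

def parse_utc(value: str) -> datetime:
    # Accept the trailing "Z" used in the config examples above
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def is_request_allowed(config: dict, state: dict, now: datetime) -> bool:
    """Decide whether a request fits in the current quota window.

    config carries the fields injected by the APIM policy fragment; state is the
    stored counter (accumulatedCost plus renewalStart in epoch milliseconds).
    """
    start = parse_utc(config["startDate"]) if config.get("startDate") else now
    end = parse_utc(config["endDate"]) if config.get("endDate") else None

    if now < start or (end is not None and now > end):
        return False  # outside the active period -> reject the request

    renewal = config.get("renewal_period")  # seconds, e.g. 86400 for a daily reset
    window_start = datetime.fromtimestamp(state["renewalStart"] / 1000, tz=timezone.utc)
    if renewal and now >= window_start + timedelta(seconds=renewal):
        # A new renewal window has begun: reset the accumulated cost
        state["accumulatedCost"] = 0.0
        state["renewalStart"] = int(now.timestamp() * 1000)

    return state["accumulatedCost"] < config["quota"]

# Example usage with the $10 config shown above (values are illustrative)
config = {"counterKey": "starter10", "quota": 10,
          "startDate": "2025-03-02T00:00:00Z", "renewal_period": 86400}
state = {"accumulatedCost": 4.2, "renewalStart": 1741132800000}
print(is_request_allowed(config, state, datetime.now(timezone.utc)))
```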
2. Quota Checking and Cost Tracking The Azure Function performs two key operations: Pre-request quota check: Before processing each request, it verifies if the user has exceeded their quota Post-request cost tracking: After a successful request, it calculates the cost and updates the accumulated usage Cost Calculation For cost calculation, the system uses: Custom pricing: If input_cost_per_token and output_cost_per_token are provided in the rate limit config LiteLLM pricing: If custom pricing is not specified, the system falls back to LiteLLM's model prices for accurate cost estimation based on the model being used The function returns appropriate HTTP status codes and headers: HTTP 429 (Too Many Requests) when quota is exceeded Response headers with usage information: x-counter-key: starter5 x-accumulated-cost: 5.000915 x-quota: 5 3. Cosmos DB for State Management Cosmos DB maintains the quota state with documents that track: { "id": "starter5", "counterKey": "starter5", "accumulatedCost": 5.000915, "startDate": "2025-03-02T00:00:00.000Z", "renewalPeriod": 86400, "renewalStart": 1741132800000, "endDate": null, "quota": 5 } A stored procedure handles atomic updates to ensure accurate tracking, including: Adding costs to the accumulated total Automatically resetting costs when the renewal period is reached Updating quota values when configuration changes Benefits Fine-grained Cost Control: Track actual API usage costs rather than just request counts Flexible Quotas: Set daily, weekly, or monthly quotas with automatic renewal Transparent Usage: Response headers provide real-time quota usage information Product Differentiation: Different APIM products can have different quota levels Custom Pricing: Override default token costs for special pricing tiers Flexible Tracking: Use any identifier as the counter key for versatile quota management Time-based Scheduling: Define active periods and automatic reset windows for quota management Getting Started Deploy the Azure Function with Cosmos DB integration Configure APIM policies to include rate limit configuration Set up different product policies for various quota levels For a detailed implementation, visit our GitHub repository. Demo Video: https://www.youtube.com/watch?v=vMX86_XpSAo Tags: #AzureOpenAI #APIM #CosmosDB #RateLimiting #Serverless
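For illustration, a minimal Python sketch of the post-request cost tracking step might look like the following; the helper names and fallback per-token rates are assumptions, and the real solution can fall back to LiteLLM's model price list instead of hard-coded defaults:

```python
def calculate_request_cost(usage: dict, config: dict) -> float:
    # Prefer custom pricing from rateLimitConfig; fall back to assumed example rates
    input_rate = config.get("input_cost_per_token", 0.00003)    # assumed fallback rate
    output_rate = config.get("output_cost_per_token", 0.00006)  # assumed fallback rate
    return (usage.get("prompt_tokens", 0) * input_rate
            + usage.get("completion_tokens", 0) * output_rate)

def usage_headers(counter_key: str, accumulated_cost: float, quota: float) -> dict:
    # Mirrors the response headers listed above
    return {
        "x-counter-key": counter_key,
        "x-accumulated-cost": f"{accumulated_cost:.6f}",
        "x-quota": str(quota),
    }

# Example: a gpt-4o response that used 1200 prompt and 350 completion tokens
cost = calculate_request_cost({"prompt_tokens": 1200, "completion_tokens": 350},
                              {"input_cost_per_token": 0.00003})
print(cost, usage_headers("starter5", 5.000915, 5))
```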
The Future of AI: Reduce AI Provisioning Effort - Jumpstart your solutions with AI App Templates
In the previous post, we introduced Contoso Chat – an open-source RAG-based retail chat sample for Azure AI Foundry that serves as both an AI App template (for builders) and the basis for a hands-on workshop (for learners). And we briefly talked about five stages in the developer workflow (provision, setup, ideate, evaluate, deploy) that take developers from the initial prompt to a deployed product. But how can that sample help you build your app? The answer lies in developer tools and AI App templates that jumpstart productivity by giving you a fast start and a solid foundation to build on. In this post, we answer that question with a closer look at Azure AI App templates - what they are, and how we can jumpstart our productivity with a reuse-and-extend approach that builds on open-source samples for core application architectures.
The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.
Prompt Engineering for OpenAI's O1 and O3-mini Reasoning Models
Important Attempting to extract the model's internal reasoning is prohibited, as it violates the acceptable use guidelines. This section explores how O1 and O3-mini differ from GPT-4o in input handling, reasoning capabilities, and response behavior, and outlines prompt engineering best practices to maximize their performance. Finally, we apply these best practices to a legal case analysis scenario. Differences Between O1/O3-mini and GPT-4o Input Structure and Context Handling Built-in Reasoning vs. Prompted Reasoning: O1-series models have built-in chain-of-thought reasoning, meaning they internally reason through steps without needing explicit coaxing from the prompt. In contrast, GPT-4o often benefits from external instructions like “Let’s think step by step” to solve complex problems, since it doesn’t automatically engage in multi-step reasoning to the same extent. With O1/O3, you can present the problem directly; the model will analyze it deeply on its own. Need for External Information: GPT-4o has a broad knowledge base and access to tools (e.g. browsing, plugins, vision) in certain deployments, which helps it handle a wide range of topics. By comparison, the O1 models have a narrower knowledge base outside their training focus. For example, O1-preview excelled at reasoning tasks but couldn’t answer questions about itself due to limited knowledge context. This means when using O1/O3-mini, important background information or context should be included in the prompt if the task is outside common knowledge – do not assume the model knows niche facts. GPT-4o might already know a legal precedent or obscure detail, whereas O1 might require you to provide that text or data. Context Length: The reasoning models come with very large context windows. O1 supports up to 128k tokens of input, and O3-mini accepts up to 200k tokens (with up to 100k tokens output), exceeding GPT-4o’s context length. This allows you to feed extensive case files or datasets directly into O1/O3. For prompt engineering, structure large inputs clearly (use sections, bullet points, or headings) so the model can navigate the information. Both GPT-4o and O1 can handle long prompts, but O1/O3’s higher capacity means you can include more detailed context in one go, which is useful in complex analyses. Reasoning Capabilities and Logical Deduction Depth of Reasoning: O1 and O3-mini are optimized for methodical, multi-step reasoning. They literally “think longer” before answering, which yields more accurate solutions on complex tasks. For instance, O1-preview solved 83% of problems on a challenging math exam (AIME), compared to GPT-4o’s 13% – a testament to its superior logical deduction in specialized domains. These models internally perform chain-of-thought and even self-check their work. GPT-4o is also strong but tends to produce answers more directly; without explicit prompting, it might not analyze as exhaustively, leading to errors in very complex cases that O1 could catch. Handling of Complex vs. Simple Tasks: Because O1-series models default to heavy reasoning, they truly shine on complex problems that have many reasoning steps (e.g. multi-faceted analyses, long proofs). In fact, on tasks requiring five or more reasoning steps, a reasoning model like O1-mini or O3 outperforms GPT-4 by a significant margin (16%+ higher accuracy). 
However, this also means that for very simple queries, O1 may “overthink.” Research found that on straightforward tasks (fewer than 3 reasoning steps), O1’s extra analytical process can become a disadvantage – it underperformed GPT-4 in a significant portion of such cases due to excessive reasoning. GPT-4o might answer a simple question more directly and swiftly, whereas O1 might generate unnecessary analysis. The key difference is O1 is calibrated for complexity, so it may be less efficient for trivial Q&A. Logical Deduction Style: When it comes to puzzles, deductive reasoning, or step-by-step problems, GPT-4o usually requires prompt engineering to go stepwise (otherwise it might jump to an answer). O1/O3 handle logical deduction differently: they simulate an internal dialogue or scratchpad. For the user, this means O1’s final answers tend to be well-justified and less prone to logical gaps. It will have effectively done a “chain-of-thought” internally to double-check consistency. From a prompt perspective, you generally don’t need to tell O1 to explain or check its logic – it does so automatically before presenting the answer. With GPT-4o, you might include instructions like “first list the assumptions, then conclude” to ensure rigorous logic; with O1, such instructions are often redundant or even counterproductive. Response Characteristics and Output Optimization Detail and Verbosity: Because of their intensive reasoning, O1 and O3-mini often produce detailed, structured answers for complex queries. For example, O1 might break down a math solution into multiple steps or provide a rationale for each part of a strategy plan. GPT-4o, on the other hand, may give a more concise answer by default or a high-level summary, unless prompted to elaborate. In terms of prompt engineering, this means O1’s responses might be longer or more technical. You have more control over this verbosity through instructions. If you want O1 to be concise, you must explicitly tell it (just as you would GPT-4) – otherwise, it might err on the side of thoroughness. Conversely, if you want a step-by-step explanation in the output, GPT-4o might need to be told to include one, whereas O1 will happily provide one if asked (and has likely done the reasoning internally regardless). Accuracy and Self-Checking: The reasoning models exhibit a form of self-fact-checking. OpenAI notes that O1 is better at catching its mistakes during the response generation, leading to improved factual accuracy in complex responses. GPT-4o is generally accurate, but it can occasionally be confidently wrong or hallucinate facts if not guided. O1’s architecture reduces this risk by verifying details as it “thinks.” In practice, users have observed that O1 produces fewer incorrect or nonsensical answers on tricky problems, whereas GPT-4o might require prompt techniques (like asking it to critique or verify its answer) to reach the same level of confidence. This means you can often trust O1/O3 to get complex questions right with a straightforward prompt, whereas with GPT-4 you might add instructions like “check your answer for consistency with the facts above.” Still, neither model is infallible, so critical factual outputs should always be reviewed. Speed and Cost: A notable difference is that O1 models are slower and more expensive in exchange for their deeper reasoning. O1 Pro even includes a progress bar for long queries. GPT-4o tends to respond faster for typical queries. 
O3-mini was introduced to offer a faster, cost-efficient reasoning model – it’s much cheaper per token than O1 or GPT-4o and has lower latency. However, O3-mini is a smaller model, so while it’s strong in STEM reasoning, it might not match full O1 or GPT-4 in general knowledge or extremely complex reasoning. When prompt engineering for optimal response performance, you need to balance depth vs. speed: O1 might take longer to answer thoroughly. If latency is a concern and the task isn’t maximal complexity, O3-mini (or even GPT-4o) could be a better choice. OpenAI’s guidance is that GPT-4o “is still the best option for most prompts,” using O1 primarily for truly hard problems in domains like strategy, math, and coding. In short, use the right tool for the job – and if you use O1, anticipate longer responses and plan for its slower output (possibly by informing the user or adjusting system timeouts). Prompt Engineering Techniques to Maximize Performance Leveraging O1 and O3-mini effectively requires a slightly different prompting approach than GPT-4o. Below are key prompt engineering techniques and best practices to get the best results from these reasoning models: Keep Prompts Clear and Minimal Be concise and direct with your ask. Because O1 and O3 perform intensive internal reasoning, they respond best to focused questions or instructions without extraneous text. OpenAI and recent research suggest avoiding overly complex or leading prompts for these models. In practice, this means you should state the problem or task plainly and provide only necessary details. There is no need to add “fluff” or multiple rephrasing of the query. For example, instead of writing: “In this challenging puzzle, I’d like you to carefully reason through each step to reach the correct solution. Let’s break it down step by step...”, simply ask: “Solve the following puzzle [include puzzle details]. Explain your reasoning.” The model will naturally do the step-by-step thinking internally and give an explanation. Excess instructions can actually overcomplicate things – one study found that adding too much prompt context or too many examples worsened O1’s performance, essentially overwhelming its reasoning process. Tip: For complex tasks, start with a zero-shot prompt (just the task description) and only add more instruction if you find the output isn’t meeting your needs. Often, minimal prompts yield the best results with these reasoning models. Avoid Unnecessary Few-Shot Examples Traditional prompt engineering for GPT-3/4 often uses few-shot examples or demonstrations to guide the model. With O1/O3, however, less is more. The O1 series was explicitly trained to not require example-laden prompts. In fact, using multiple examples can hurt performance. Research on O1-preview and O1-mini showed that few-shot prompting consistently degraded their performance – even carefully chosen examples made them do worse than a simple prompt in many cases. The internal reasoning seems to get distracted or constrained by the examples. OpenAI’s own guidance aligns with this: they recommend limiting additional context or examples for reasoning models to avoid confusing their internal logic. Best practice: use zero-shot or at most one example if absolutely needed. If you include an example, make it highly relevant and simple. For instance, in a legal analysis prompt, you generally would not prepend a full example case analysis; instead, just ask directly about the new case. 
The only time you might use a demonstration is if the task format is very specific and the model isn’t following instructions – then show one brief example of the desired format. Otherwise, trust the model to figure it out from a direct query. Leverage System/Developer Instructions for Role and Format Setting a clear instructional context can help steer the model’s responses. With the API (or within a conversation’s system message), define the model’s role or style succinctly. For example, a system message might say: “You are an expert scientific researcher who explains solutions step-by-step”. O1 and O3-mini respond well to such role instructions and will incorporate them in their reasoning. However, remember that they already excel at understanding complex tasks, so your instructions should focus on what kind of output you want, not how to think. Good uses of system/developer instructions include: Defining the task scope or persona: e.g. “Act as a legal analyst” or “Solve the problem as a math teacher explaining to a student.” This can influence tone and the level of detail. Specifying the output format: If you need the answer in a structured form (bullet points, a table, JSON, etc.), explicitly say so. O1 and especially O3-mini support structured output modes and will adhere to format requests. For instance: “Provide your findings as a list of key bullet points.” Given their logical nature, they tend to follow format instructions accurately, which helps maintain consistency in responses Setting boundaries: If you want to control verbosity or focus, you can include something like “Provide a brief conclusion after the detailed analysis” or “Only use the information given without outside assumptions.” The reasoning models will respect these boundaries, and it can prevent them from going on tangents or hallucinating facts. This is important since O1 might otherwise produce a very exhaustive analysis – which is often great, but not if you explicitly need just a summary. Ensure any guidance around tone, role, format is included each time. Control Verbosity and Depth Through Instructions While O1 and O3-mini will naturally engage in deep reasoning, you have control over how much of that reasoning is reflected in the output. If you want a detailed explanation, prompt for it (e.g. “Show your step-by-step reasoning in the answer”). They won’t need the nudge to do the reasoning, but they do need to be told if you want to see it. Conversely, if you find the model’s answers too verbose or technical for your purposes, instruct it to be more concise or to focus only on certain aspects. For example: “In 2-3 paragraphs, summarize the analysis with only the most critical points.” The models are generally obedient to such instructions about length or focus. Keep in mind that O1’s default behavior is to be thorough – it’s optimized for correctness over brevity – so it may err on the side of giving more details. A direct request for brevity will override this tendency in most cases. For O3-mini, OpenAI provides an additional tool to manage depth: the “reasoning effort” parameter (low, medium, high). This setting lets the model know how hard to “think.” In prompt terms, if using the API or a system that exposes this feature, you can dial it up for very complex tasks (ensuring maximum reasoning, at the cost of longer answers and latency) or dial it down for simpler tasks (faster, more streamlined answers). This is essentially another way to control verbosity and thoroughness. 
If you don't have direct access to that parameter, you can mimic a low effort mode by explicitly saying "Give a quick answer without deep analysis" for cases where speed matters more than perfect accuracy. Conversely, to mimic high effort, you might say "Take all necessary steps to arrive at a correct answer, even if the explanation is long." These cues align with how the model's internal setting would operate. Ensure Accuracy in Complex Tasks To get the most accurate responses on difficult problems, take advantage of the reasoning model's strengths in your prompt. Since O1 can self-check and even catch contradictions, you can ask it to utilize that: e.g. "Analyze all the facts and double-check your conclusion for consistency." Often it will do so unprompted, but reinforcing that instruction can signal the model to be extra careful. Interestingly, because O1 already self-fact-checks, you rarely need to prompt it with something like "verify each step" (that's more helpful for GPT-4o). Instead, focus on providing complete and unambiguous information. If the question or task has potential ambiguities, clarify them in the prompt or instruct the model to list any assumptions. This prevents the model from guessing wrongly. Handling sources and data: If your task involves analyzing given data (like summarizing a document or computing an answer from provided numbers), make sure that data is clearly presented. O1/O3 will diligently use it. You can even break data into bullet points or a table for clarity. If the model must not hallucinate (say, in a legal context it shouldn't make up laws), explicitly state "base your answer only on the information provided and common knowledge; do not fabricate any details." The reasoning models are generally good at sticking to known facts, and such an instruction further reduces the chance of hallucination. Iterate and verify: If the task is critical (for example, complex legal reasoning or a high-stakes engineering calculation), a prompt engineering technique is to ensemble the model's responses. This isn't a single prompt, but a strategy: you could run the query multiple times (or ask the model to consider alternative solutions) and then compare answers. O1's stochastic nature means it might explore different reasoning paths each time. By comparing outputs or asking the model to "reflect if there are alternative interpretations" in a follow-up prompt, you can increase confidence in the result. While GPT-4o also benefits from this approach, it's especially useful for O1 when absolute accuracy is paramount – essentially leveraging the model's own depth by cross-verifying. Finally, remember that model selection is part of prompt engineering: If a question doesn't actually require O1-level reasoning, using GPT-4o might be more efficient and just as accurate. OpenAI recommends saving O1 for the hard cases and using GPT-4o for the rest. So a meta-tip: assess task complexity first. If it's simple, either prompt O1 very straightforwardly to avoid overthinking, or switch to GPT-4o. If it's complex, lean into O1's abilities with the techniques above. How O1/O3 Handle Logical Deduction vs. GPT-4o The way these reasoning models approach logical problems differs fundamentally from GPT-4o, and your prompt strategy should adapt accordingly: Handling Ambiguities: In logical deduction tasks, if there's missing info or ambiguity, GPT-4o might make an assumption on the fly.
O1 is more likely to flag the ambiguity or consider multiple possibilities because of its reflective approach. To leverage this, your prompt to O1 can directly ask: “If there are any uncertainties, state your assumptions before solving.” GPT-4 might need that nudge more. O1 might do it naturally or at least is less prone to assuming facts not given. So in comparing the two, O1’s deduction is cautious and thorough, whereas GPT-4o’s is swift and broad. Tailor your prompt accordingly – with GPT-4o, guide it to be careful; with O1, you mainly need to supply the information and let it do its thing. Step-by-Step Outputs: Sometimes you actually want the logical steps in the output (for teaching or transparency). With GPT-4o, you must explicitly request this (“please show your work”). O1 might include a structured rationale by default if the question is complex enough, but often it will present a well-reasoned answer without explicitly enumerating every step unless asked. If you want O1 to output the chain of logic, simply instruct it to — it will have no trouble doing so. In fact, O1-mini was noted to be capable of providing stepwise breakdowns (e.g., in coding problems) when prompted. Meanwhile, if you don’t want a long logical exposition from O1 (maybe you just want the final answer), you should say “Give the final answer directly” to skip the verbose explanation. Logical Rigor vs. Creativity: One more difference: GPT-4 (and 4o) has a streak of creativity and generative strength. Sometimes in logic problems, this can lead it to “imagine” scenarios or analogies, which isn’t always desired. O1 is more rigor-focused and will stick to logical analysis. If your prompt involves a scenario requiring both deduction and a bit of creativity (say, solving a mystery by piecing clues and adding a narrative), GPT-4 might handle the narrative better, while O1 will strictly focus on deduction. In prompt engineering, you might combine their strengths: use O1 to get the logical solution, then use GPT-4 to polish the presentation. If sticking to O1/O3 only, be aware that you might need to explicitly ask it for creative flourishes or more imaginative responses – they will prioritize logic and correctness by design. Key adjustment: In summary, to leverage O1/O3’s logical strengths, give them the toughest reasoning tasks as a single well-defined prompt. Let them internally grind through the logic (they’re built for it) without micromanaging their thought process. For GPT-4o, continue using classic prompt engineering (decompose the problem, ask for step-by-step reasoning, etc.) to coax out the same level of deduction. And always match the prompt style to the model – what confuses GPT-4o might be just right for O1, and vice versa, due to their different reasoning approaches. Crafting Effective Prompts: Best Practices Summary To consolidate the above into actionable guidelines, here’s a checklist of best practices when prompting O1 or O3-mini: Use Clear, Specific Instructions: Clearly state what you want the model to do or answer. Avoid irrelevant details. For complex questions, a straightforward ask often suffices (no need for elaborate role-play or multi-question prompts). Provide Necessary Context, Omit the Rest: Include any domain information the model will need (facts of a case, data for a math problem, etc.), since the model might not have up-to-date or niche knowledge. 
But don't overload the prompt with unrelated text or too many examples – extra fluff can dilute the model's focus. Minimal or No Few-Shot Examples: By default, start with zero-shot prompts. If the model misinterprets the task or format, you can add one simple example as guidance, but never add long chains of examples for O1/O3. They don't need it, and it can even degrade performance. Set the Role or Tone if Needed: Use a system message or a brief prefix to put the model in the right mindset (e.g. "You are a senior law clerk analyzing a case."). This helps especially with tone (formal vs. casual) and ensures domain-appropriate language. Specify Output Format: If you expect the answer in a particular structure (list, outline, JSON, etc.), tell the model explicitly. The reasoning models will follow format instructions reliably. For instance: "Give your answer as an ordered list of steps." Control Length and Detail via Instructions: If you want a brief answer, say so ("answer in one paragraph" or "just give a yes/no with one sentence explanation"). If you want an in-depth analysis, encourage it ("provide a detailed explanation"). Don't assume the model knows your desired level of detail by default – instruct it. Leverage O3-mini's Reasoning Effort Setting: When using O3-mini via API, choose the appropriate reasoning effort (low/medium/high) for the task. High gives more thorough answers (good for complex legal reasoning or tough math), low gives faster, shorter answers (good for quick checks or simpler queries). This is a unique way to tune the prompt behavior for O3-mini. Avoid Redundant "Think Step-by-Step" Prompts: Do not add phrases like "let's think this through" or chain-of-thought directives for O1/O3; the model already does this internally. Save those tokens and only use such prompts on GPT-4o, where they have impact. Test and Iterate: Because these models can be sensitive to phrasing, if you don't get a good answer, try rephrasing the question or tightening the instructions. You might find that a slight change (e.g. asking a direct question vs. an open-ended prompt) yields a significantly better response. Fortunately, O1/O3's need for iteration is less than older models (they usually get complex tasks right in one go), but prompt tweaking can still help optimize clarity or format. Validate Important Outputs: For critical use-cases, don't rely on a single prompt-answer cycle. Use follow-up prompts to ask the model to verify or justify its answer ("Are you confident in that conclusion? Explain why."), or run the prompt again to see if you get consistent results. Consistency and well-justified answers indicate the model's reasoning is solid. By following these techniques, you can harness O1 and O3-mini's full capabilities and get highly optimized responses that play to their strengths. Applying Best Practices to a Legal Case Analysis Finally, let's consider how these prompt engineering guidelines translate to a legal case analysis scenario (as mentioned earlier). Legal analysis is a perfect example of a complex reasoning task where O1 can be very effective, provided we craft the prompt well: Structure the Input: Start by clearly outlining the key facts of the case and the legal questions to be answered. For example, list the background facts as bullet points or a brief paragraph, then explicitly ask the legal question: "Given the above facts, determine whether Party A is liable for breach of contract under U.S.
law." Structuring the prompt this way makes it easier for the model to parse the scenario. It also ensures no crucial detail is buried or overlooked. Provide Relevant Context or Law: If specific statutes, case precedents, or definitions are relevant, include them (or summaries of them) in the prompt. O1 doesn't have browsing and might not recall a niche law from memory, so if your analysis hinges on, say, the text of a particular law, give it to the model. For instance: "According to [Statute X excerpt], [provide text]… Apply this statute to the case." This way, the model has the necessary tools to reason accurately. Set the Role in the System Message: A system instruction like "You are a legal analyst who explains the application of law to facts in a clear, step-by-step manner." will cue the model to produce a formal, reasoned analysis. While O1 will already attempt careful reasoning, this instruction aligns its tone and structure with what we expect in legal discourse (e.g. citing facts, applying law, drawing conclusions). No Need for Multiple Examples: Don't supply a full example case analysis as a prompt (which you might consider doing with GPT-4o). O1 doesn't need an example to follow – it can perform the analysis from scratch. You might, however, briefly mention the desired format: "Provide your answer in an IRAC format (Issue, Rule, Analysis, Conclusion)." This format instruction gives a template without having to show a lengthy sample, and O1 will organize the output accordingly. Control Verbosity as Needed: If you want a thorough analysis of the case, let O1 output its comprehensive reasoning. The result may be several paragraphs covering each issue in depth. If you find the output too verbose or if you specifically need a succinct brief (for example, a quick advisory opinion), instruct the model: "Keep the analysis to a few key paragraphs focusing on the core issue." This ensures you get just the main points. On the other hand, if the initial answer seems too brief or superficial, you can prompt again: "Explain in more detail, especially how you applied the law to the facts." O1 will gladly elaborate because it has already done the heavy reasoning internally. Accuracy and Logical Consistency: Legal analysis demands accuracy in applying rules to facts. With O1, you can trust it to logically work through the problem, but it's wise to double-check any legal citations or specific claims it makes (since its training data might not have every detail). You can even add a prompt at the end like, "Double-check that all facts have been addressed and that the conclusion follows the law." Given O1's self-checking tendency, it may itself point out if something doesn't add up or if additional assumptions were needed. This is a useful safety net in a domain where subtle distinctions matter. Use Follow-Up Queries: In a legal scenario, it's common to have follow-up questions. For instance, if O1 gives an analysis, you might ask, "What if the contract had a different clause about termination? How would that change the analysis?" O1 can handle these iterative questions well, carrying over its reasoning. Just remember that if, in the project you are working on, the interface doesn't have long-term memory beyond the current conversation context (and no browsing), each follow-up should either rely on the context provided or include any new information needed. Keep the conversation focused on the case facts at hand to prevent confusion. A minimal example request that puts these practices together is sketched below.
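Putting the checklist above together, a minimal request sketch might look like the following; the deployment name, API version, system-message wording, and case details are assumptions for illustration, not a definitive implementation:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-12-01-preview",  # assumed: a version that supports o3-mini and reasoning_effort
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

case_facts = "- Party A agreed to deliver goods by March 1...\n- ..."  # placeholder facts
statute_excerpt = "..."                                                # placeholder statute text

response = client.chat.completions.create(
    model="o3-mini",          # assumed deployment name
    reasoning_effort="high",  # deeper internal reasoning for a complex legal analysis
    messages=[
        {"role": "system", "content": (
            "You are a legal analyst who explains the application of law to facts "
            "in a clear, step-by-step manner. Answer in IRAC format and base your "
            "analysis only on the information provided.")},
        {"role": "user", "content": (
            f"Facts:\n{case_facts}\n\nRelevant law:\n{statute_excerpt}\n\n"
            "Given the above facts, determine whether Party A is liable for "
            "breach of contract.")},
    ],
)
print(response.choices[0].message.content)
```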
By applying these best practices, your prompts will guide O1 or O3-mini to deliver high-quality legal analysis. In summary, clearly present the case, specify the task, and let the reasoning model do the heavy lifting. The result should be a well-reasoned, step-by-step legal discussion that leverages O1's logical prowess, all optimized through effective prompt construction. Using OpenAI's reasoning models in this way allows you to tap into their strength in complex problem-solving while maintaining control over the style and clarity of the output. As OpenAI's own documentation notes, the O1 series excels at deep reasoning tasks in domains like research and strategy – legal analysis similarly benefits from this capability. By understanding the differences from GPT-4o and adjusting your prompt approach accordingly, you can maximize the performance of O1 and O3-mini and obtain accurate, well-structured answers even for the most challenging reasoning tasks.
Implementing Event Hub Logging for Azure OpenAI Streaming APIs
Azure OpenAI's streaming responses use Server-Sent Events (SSE), which support only one subscriber. This creates a challenge when using APIM's Event Hub Logger, as it would consume the stream, preventing the actual client from receiving the response. This solution introduces a lightweight Azure Function proxy that enables Event Hub logging while preserving the streaming response for clients. With token usage data available in both the streaming and non-streaming AOAI APIs, we can monitor consumption the right way! Architecture Client → APIM → Azure Function Proxy → Azure OpenAI ↓ Event Hub Technical Implementation Streaming Response Handling The core implementation uses FastAPI's StreamingResponse to handle Server-Sent Events (SSE) streams with three key components: 1. Content Aggregation async def process_openai_stream(response, messages, http_client, start_time): content_buffer = [] async def generate(): for chunk in response: if chunk.choices[0].delta.content: content_buffer.append(chunk.choices[0].delta.content) yield f"data: {json.dumps(chunk.model_dump())}\n\n" This enables real-time streaming to clients while collecting the complete response for logging. The content buffer maintains minimal memory overhead by storing only text content. 2. Token Usage Collection if hasattr(chunk, 'usage') and chunk.usage: log_data = { "type": "stream_completion", "content": "".join(content_buffer), "usage": chunk.usage.model_dump(), "model": model_name, "region": headers.get("x-ms-region", "unknown") } log_to_eventhub(log_data) Token usage metrics are captured from the final chunk, providing accurate consumption data for cost analysis and monitoring. 3. Performance Tracking @app.route(route="openai/deployments/{deployment_name}/chat/completions") async def aoaifn(req: Request): start_time = time.time() response = await process_request() latency_ms = int((time.time() - start_time) * 1000) log_data["latency_ms"] = latency_ms End-to-end latency measurement includes request processing, OpenAI API call, and response handling, enabling performance monitoring and optimization. Demo The demo walks through the Function start, the API call, and the resulting Event Hub events. Setup Deploy the Azure Function Configure environment variables: AZURE_OPENAI_KEY= AZURE_OPENAI_API_VERSION=2024-08-01-preview AZURE_OPENAI_BASE_URL=https://.openai.azure.com/ AZURE_EVENTHUB_CONN_STR= Update APIM routing to point to the Function App Extension scenarios: APIM Managed Identity Auth token passthrough PII Filtering: Integration with Azure Presidio for real-time PII detection and masking in logs Cost Analysis: Token usage mapping to Azure billing metrics Latency-based routing: AOAI endpoint ranking could be built based on latency metrics Monitoring Dashboard: Real-time visualisation of: Token usage per model/deployment Response latencies Error rates Regional distribution Implementation available on GitHub.
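The snippets above call a log_to_eventhub helper that is defined in the full implementation. A minimal sketch of such a helper using the azure-eventhub SDK might look like this (synchronous for simplicity, and assuming AZURE_EVENTHUB_CONN_STR includes the EntityPath of the target hub):

```python
import json
import os
from azure.eventhub import EventData, EventHubProducerClient

def log_to_eventhub(log_data: dict) -> None:
    # Connection string is assumed to carry the EntityPath, so no hub name is passed here
    producer = EventHubProducerClient.from_connection_string(
        conn_str=os.environ["AZURE_EVENTHUB_CONN_STR"]
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(log_data)))  # one event per logged request
        producer.send_batch(batch)
```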
Built-in Enterprise Readiness with Azure AI Agent Service
Ensure enterprise-grade security and compliance with Private Network Isolation (BYO VNet) in Azure AI Agent Service. This feature allows AI agents to operate within a private, isolated network, giving organizations full control over data and networking configurations. Learn how Private Network Isolation enhances security, scalability, and compliance for mission-critical AI workloads.
Announcing Provisioned Deployment for Azure OpenAI Service Fine-tuning
You've fine-tuned your models to make your agents behave and speak how you'd like. You've scaled up your RAG application to meet customer demand. You've now got a good problem: users love the service but want it snappier and more responsive. Azure OpenAI Service now offers provisioned deployments for fine-tuned models, giving your applications predictable performance with predictable costs! 💡 What is Provisioned Throughput? If you're unfamiliar with Provisioned Throughput, it allows Azure OpenAI Service customers to purchase capacity in terms of performance needs instead of per-token. With fine-tuned deployments, it replaces both the hosting fee and the token-based billing of Standard and Global Standard (now in Public Preview) with a throughput-based capacity unit called a provisioned throughput unit (PTU). Every PTU corresponds to a commitment of both latency and throughput in Tokens per Minute (TPM). This differs from Standard and Global Standard, which only provide availability guarantees and best-effort performance. 🤔 Is this the same PTU I'm already using? You might already be using Provisioned Throughput Units with base models, and with fine-tuned models they work the same way. In fact, they're completely interchangeable! Already have quota in North Central US for 800 PTU and an annual Azure reservation rate? PTUs are interchangeable and model independent, meaning you can get started with using them for fine-tuning immediately without any additional steps. Just select Provisioned Managed (Public Preview) from the model deployment dialog and set your PTU allotment. 📋 What's available in Public Preview? We're offering provisioned deployment in two regions for both gpt-4o (2024-08-06) and gpt-4o-mini (2024-07-18) to support Azure OpenAI Service customers: North Central US Switzerland West If your workload requires regions other than the above, please make sure to submit a request so we can consider it for General Availability. 🙏 🚀 How do I get started? If you don't already have PTU quota from base models, the easiest way to get started and shift your fine-tuned deployments to provisioned is: Understand your workload needs. Is it spiky but with a baseline demand? Review some of our previous materials on right-sizing PTUs (or have CoPilot summarize it for you 😆). Estimate the PTUs you need for your workload by using the calculator. Increase your regional PTU quota, if required. Deploy your fine-tuned models to secure your Provisioned Throughput capacity. Make sure to purchase an Azure Reservation to cover your PTU usage to save big. Have a spiky workload? Combine PTU and Standard/Global Standard and configure your architecture for spillover. Have feedback as you continue on your PTU journey with Azure OpenAI Service? Let us know how we can make it better!
Enabling SharePoint RAG with LogicApps Workflows
SharePoint Online is quite popular for storing organizational documents. Many organizations use it due to its robust features for document management, collaboration, and integration with other Microsoft 365 services. SharePoint Online provides a secure, centralized location for storing documents, making it easier for everyone in the organization to access and collaborate on files from the device of their choice. Retrieval-Augmented Generation (RAG) is a process used to infuse the large language model with organizational knowledge without explicitly fine-tuning it, which is a laborious process. RAG enhances the capabilities of language models by integrating them with external data sources, such as SharePoint documents. In this approach, documents stored in SharePoint are first converted into smaller text chunks and vector embeddings of the chunks, then saved into an index store such as Azure AI Search. Embeddings are numerical representations capturing the semantic properties of the text. When a user submits a text query, the system retrieves relevant document chunks from the index based on best matching text and text embeddings. These retrieved document chunks are then used to augment the query, providing additional context and information to the large language model. Finally, the augmented query is processed by the language model to generate a more accurate and contextually relevant response. Azure AI Search provides a built-in connector for SharePoint Online, enabling document ingestion via a pull approach, currently in public preview. This blog post outlines a LogicApps workflow-based method to export documents, along with associated ACLs and metadata, from SharePoint to Azure Storage. Once in Azure Storage, these documents can be indexed using the Azure AI Search indexer. At a high level, two workflow groups (historic and ongoing) are created, but only one should be active at a time. The historic flow exports all documents from SharePoint Online to Azure Storage, from which the Azure AI Search index is initially populated. This flow processes documents from a specified start date to the current date, incrementally considering documents created within a configurable time window before moving to the next time slice. The sliding time window approach ensures compliance with SharePoint throttling limits by preventing the export of all documents at once. This method enables a gradual and controlled document export process by targeting documents created in a specific time window. Once the historical document export is complete, the ongoing export workflow should be activated (and the historic flow deactivated). This workflow exports documents from the timestamp when the historical export concluded up to the current date and time. The ongoing export workflow also accounts for documents created or modified since the last load and handles scenarios where documents are renamed at the source. Both workflows save the last exported timestamp in Azure Storage and use it as a starting point for every run. Historic document export flow Parent flow Recurs every N hours. This is a configurable value. Usually, the export of historic documents requires many runs, depending upon the total count of documents, which could range from thousands to millions. Sets initial values for the sliding window variables - from_date_time_UTC, to_date_time_UTC from_date_time_UTC is read from the blob-history.txt file The to_date_time_UTC is set to from_date_time_UTC plus the increment days.
If this increment results in a date greater than the current datetime, to_date_time_UTC is set to the current datetime Get the list of all SharePoint lists and Libraries using the built-in action Initialize the additional variables - files_to_process, files_to_process_temp, files_to_process_chunks Later, these variables facilitate the grouping of documents into smaller lists, with each group being passed to the child flow to enable scaling with parallel execution Loop through the list of SharePoint Document libraries and lists Focus only on Document libraries, ignore SharePoint lists (handle SharePoint list processing only if your specific use case requires it) Get the files within the document library and their file properties where the file creation timestamp falls between from_date_time_UTC and to_date_time_UTC Create JSON to capture the Document library name and id (this will be required in the child flow to export a document) Use Javascript to only retain the documents and ignore folders. The files and their properties also have folders as a separate item, which we do not require. Append the list of files to the variable Use the built-in chunk function to create a list of lists, each containing the document as an item Invoke the child workflow and pass each sub-list of files Wait for all child flows to finish successfully and then write the to_date_time_UTC to the blob-history.txt file Child flow Loop through each item, which is document metadata received from the parent flow Get the content of the file and save it into Azure Storage Run the SharePoint /roleassignments API to get the ACL (Access Control List) information, basically the users and groups that have access to the document Run Javascript to keep roles of interest Save the filtered ACL into Azure Storage Save the document metadata (document title, created/modified timestamps, creator, etc.) into Azure Storage All the information is saved into Azure Storage, which offers flexibility to leverage the parts based on use case requirements All document metadata is also saved into an Azure SQL Database table for the purpose of determining if the file being processed was modified (exists in the database table) or renamed (file names do not match) Return Status 200 indicating the child flow has successfully completed Ongoing data export flow Parent flow The ongoing parent flow is very similar to the historic flow; the difference is that the Get the files within the document library action gets the files whose creation or modified timestamp falls between from_date_time_UTC and to_date_time_UTC. This change allows the workflow to handle files that get created or modified in SharePoint after the last run of the ongoing workflow. Note: Remember, you need to disable the historic flow after all history load has been completed. The ongoing flow can be enabled after the historic flow is disabled. Child flow The ongoing child flow also follows a similar pattern to the historic child flow. Notable differences are – Handling of document rename at source, which deletes the previously exported file / metadata / ACL from Azure Storage and recreates these artefacts with the new file name. Return Status 200 indicating the child flow has successfully completed Both flows have been divided into parent-child flows, enabling the export process to scale by running multiple document exports simultaneously. To manage or scale this process, adjust the concurrency settings within LogicApps actions and the App scale-out settings under the LogicApps service.
These adjustments help ensure compliance with SharePoint throttling limits. The presented solution works with a single site out of the box and can be updated to work with a list of sites. Workflow parameters sharepoint_site_address (String) — example: https://XXXXX.sharepoint.com/teams/test-sp-site blob_container_name (String) — example: sharepoint-export blob_container_name_acl (String) — example: sharepoint-acl blob_container_name_metadata (String) — example: sharepoint-metadata blob_load_history_container_name (String) — example: load-history blob_load_history_file_name (String) — example: blob-history.txt file_group_count (Int) — example: 40 increment_by_days (Int) — example: 7 The workflows can be imported from the GitHub repository below. Github repo: SharePoint-to-Azure-Storage-for-AI-Search LogicApps workflows
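Once the exported documents have been indexed by the Azure AI Search indexer, the RAG retrieval step described at the start of this post can query that index. A minimal sketch is shown below; the index name, environment variables, and the "content" field are assumptions for illustration:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="sharepoint-export-index",  # assumed index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

# Retrieve the top matching chunks for a user question
results = search_client.search(search_text="What is our travel reimbursement policy?", top=5)
context_chunks = [doc["content"] for doc in results]  # "content" field name is an assumption

# These chunks would then be passed to the language model as grounding context
print("\n---\n".join(context_chunks))
```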