artificial intelligence
The Future of AI: Customizing AI agents with the Semantic Kernel agent framework
The blog post Customizing AI agents with the Semantic Kernel agent framework discusses the capabilities of the Semantic Kernel SDK, an open-source tool developed by Microsoft for creating AI agents and multi-agent systems. It highlights the benefits of using single-purpose agents within a multi-agent system to achieve more complex workflows with improved efficiency. The Semantic Kernel SDK offers features like telemetry, hooks, and filters to ensure secure and responsible AI solutions, making it a versatile tool for both simple and complex AI projects.

Announcing DeepSeek-V3 on Azure AI Foundry and GitHub
We are pleased to announce the availability of DeepSeek-V3 in the Azure AI Foundry model catalog with token-based billing. This latest iteration is part of our commitment to enable powerful, efficient, and accessible AI solutions through the breadth and diversity of choice in the model catalog.

The Future of AI: Reduce AI Provisioning Effort - Jumpstart your solutions with AI App Templates
In the previous post, we introduced Contoso Chat – an open-source RAG-based retail chat sample for Azure AI Foundry that serves as both an AI App template (for builders) and the basis for a hands-on workshop (for learners). We also briefly covered the five stages of the developer workflow (provision, setup, ideate, evaluate, deploy) that take developers from the initial prompt to a deployed product. But how can that sample help you build your app? The answer lies in developer tools and AI App templates that jumpstart productivity by giving you a fast start and a solid foundation to build on. In this post, we answer that question with a closer look at Azure AI App templates - what they are, and how we can jumpstart our productivity with a reuse-and-extend approach that builds on open-source samples for core application architectures.

The Future of AI: Harnessing AI for E-commerce - personalized shopping agents
Explore the development of personalized shopping agents that enhance user experience by providing tailored product recommendations based on uploaded images. Leveraging Azure AI Foundry, these agents analyze images for apparel recognition and generate intelligent product recommendations, creating a seamless and intuitive shopping experience for retail customers.

Prepare and get ready for AI-900 Certification
The Azure AI Fundamentals Training program is designed to provide a foundational understanding of artificial intelligence (AI) concepts and how AI services can be utilized on Azure. This training is ideal for individuals new to AI, aiming to build a solid understanding of AI concepts, practical applications, and the Responsible AI considerations involved. Throughout the program, participants will explore various AI services available on Azure. By the end of the program, attendees will be equipped with the knowledge to implement and manage AI solutions using Azure's tools and services.

The training program will run from 7th January, and we will have live sessions on YouTube and Discord from 8 - 9pm GMT+3.

Earn a Certification Voucher

Upon successful completion of the program, Kenyan participants will receive a certification voucher. By earning this certification, you'll be able to showcase your knowledge and skills to potential employers and colleagues, giving you a competitive edge in the job market.

Key Program Takeaways

- Gain a foundational understanding of AI concepts and their applications.
- Explore Azure's AI services and tools.
- Learn how to implement practical AI solutions using Azure.
- Understand the responsible AI considerations and best practices for developing and deploying AI projects.

Learner Checklist

As a learner/participant, this is how you can participate in the program:

- Catch up and rewatch our previous AI sessions on Microsoft Reactor. Link: https://aka.ms/aifundamentalstraining-reactors
- Sign up and participate in the Microsoft Learn Challenge: https://aka.ms/aifundamentalstraining-csc - closing soon! Certification vouchers are up for grabs for all Kenyan participants who complete the challenge.
- Once you sign up or complete the challenge, fill in the form at https://aka.ms/aifundamentalstraining-voucher to be eligible for a certification voucher.
- Join the Discord Community to interact with other learners: https://aka.ms/aifundamentalstraining-discord
- Sign up to the AINSI Skills Navigator to customize your learning journey at https://aka.ms/aifundamentalstraining-navigator
- Continue learning and exploring: https://aka.ms/aifundamentalstraining-collection

Online Sessions Calendar

| Week | Topic | Live Sessions Link | Description |
| --- | --- | --- | --- |
| 7 Jan | Introduction to Artificial Intelligence and Azure AI Services | YouTube | Embark on a journey to explore the fundamentals of artificial intelligence (AI) with Azure. |
| 9 Jan | Microsoft Azure AI Fundamentals: Computer Vision | YouTube | Dive into the world of computer vision with Azure and discover how to harness the power of AI to analyze and interpret visual data. |
| 14 Jan | Microsoft Azure AI Fundamentals: Natural Language Processing | YouTube | Dive into the fascinating world of natural language processing (NLP) with Azure and learn how to build intelligent applications that can understand and interpret human language. |
| 16 Jan | Generative AI pt 1 - Fundamentals of Generative AI | YouTube | Step into the world of generative AI with Azure and discover how to create new content such as text, images, music, and code using advanced AI models. |
| 23 Jan | Responsible generative AI | YouTube | Explore the principles and practices of responsible AI with Azure. |
| 28 Jan | Document Intelligence and Knowledge Mining | YouTube | Discover the power of Azure AI Search and learn how to build intelligent search solutions that can transform your data into actionable insights. |
| 29 Jan | Generative AI pt 2 - Introduction to Azure AI Foundry | YouTube | Step into the world of generative AI with Azure and discover how to create new content such as text, images, music, and code using advanced AI models in Azure AI Foundry. |
| 30 Jan | Certification Readiness Session | Discord | Prepare to ace your Microsoft certification exams with this comprehensive walkthrough session. |

What are you waiting for? Rewatch and engage with live sessions at https://aka.ms/aifundamentalstraining-reactors. Join in and learn together with us! Remember, certification vouchers are up for grabs!

Preparing for the AI-900 Certification can be a rewarding experience that opens up new opportunities in the field of AI and ML. By following the tips and utilizing the resources provided, you'll be well on your way to achieving your certification. Stay motivated, keep learning, and good luck on your journey to becoming AI-900 certified!

All learning resources can be found in the MS Learn Collection: https://aka.ms/aifundamentalstraining-collection

Enjoyed the session? Send us your feedback, the good, the bad and the ugly at: https://aka.ms/aifundamentalstraining-feedback

The Future of AI: Power Your Agents with Azure Logic Apps
Building intelligent applications no longer requires complex coding. With advancements in technology, you can now create agents using cloud-based tools to automate workflows, connect to various services, and integrate business processes across hybrid environments without writing any code.

Distillation: Turning Smaller Models into High-Performance, Cost-Effective Solutions
by Vishal Yadav, Nikhil Pandey

Introduction

Large Language Models (LLMs) have transformed the landscape of natural language processing (NLP) with their ability to understand and generate human-like text. However, their size and complexity often pose challenges in terms of deployment, speed, and cost. For specialized niche tasks, we usually end up deploying the best available model even though we don't utilize all of its capabilities. This is where distillation comes in, offering a method to create (fine-tune) smaller, customized, more efficient models while retaining much of the performance of a significantly larger state-of-the-art model.

What is distillation?

Distillation is a technique designed to transfer the knowledge of a large pre-trained model (the "teacher") into a smaller model (the "student"), enabling the student model to achieve comparable performance to the teacher model. This technique allows users to leverage the high quality of larger LLMs while reducing inference costs in a production environment, thanks to the smaller student model.

How does distillation work?

In distillation, knowledge can be transferred from the teacher to the student model in several ways. Here, we specifically discuss response-based, offline distillation, where the student model learns to mimic the output (predictions only) of the teacher model, and the teacher model is not trained during distillation.

- Teacher Model: A large, high-capacity teacher model that is already pre-trained on massive datasets. This model has learnt rich representations and complex patterns from the data, which allows it to generalize well even on unseen tasks.
- Knowledge Extraction: The teacher model generates outputs based on given inputs, which are then used as training data for the student model. This involves not just mimicking outputs but also understanding the underlying reasoning processes.
- Student Model Training: A smaller student model is trained using the extracted knowledge as a guide. The student model learns to mimic the teacher model's behavior and predictions on specific tasks.

Advantages

- Reduced Size: The resulting student model is significantly smaller, making it easier to deploy in resource-constrained environments.
- Lower Cost: Running smaller models incurs lower operational costs while maintaining competitive performance levels.
- Task-Specific Optimization: Distillation can be tailored for specific applications, enhancing efficiency and accuracy.
- Performance: Smaller models exhibit significantly lower latency compared to larger models, which in turn boosts the throughput of the deployment.
- Customization: Distillation allows users to select desirable traits from multiple larger models and transfer them to smaller models.
- Personalization: Personality traits can be incorporated into the model, enabling it to respond with relevant answers when queried about its personality.
- Synthetic Data Generation: Data generation at scale can be done either for labels only or from scratch using just seed/meta data.
- Generalization: Distillation can help student models generalize better by learning from the teacher model's knowledge and avoiding overfitting.
- Improved Multilingual Capabilities: The multilingual performance of smaller models can be significantly enhanced with the help of teacher models, making them suitable for global applications.
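Before turning to the managed experience in Azure AI Foundry, the sketch below shows the two steps described above (teacher labeling, then student fine-tuning) in plain Python. It is an illustration only: the models ("gpt2" standing in for a teacher, "distilgpt2" for a student) and the toy prompts are placeholders, not the models or prompts used by the Azure service.

# Minimal, framework-agnostic sketch of response-based, offline distillation.
# Model names and prompts are placeholders chosen only for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = [
    "Premise: The cat sleeps. Hypothesis: The cat is awake. Answer:",
    "Premise: It is raining. Hypothesis: The ground is wet. Answer:",
]

# 1) Knowledge extraction: the frozen teacher generates a label for each prompt.
teacher_tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
labels = []
for p in prompts:
    ids = teacher_tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        out = teacher.generate(ids, max_new_tokens=8, pad_token_id=teacher_tok.eos_token_id)
    labels.append(teacher_tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

# 2) Student training: fine-tune the smaller model on (prompt, teacher label) pairs.
student_tok = AutoTokenizer.from_pretrained("distilgpt2")
student = AutoModelForCausalLM.from_pretrained("distilgpt2").train()
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
for prompt, label in zip(prompts, labels):
    batch = student_tok(prompt + label, return_tensors="pt")
    loss = student(**batch, labels=batch.input_ids).loss  # causal LM loss on the teacher's output
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In practice you would generate far more labels, filter them for quality, and fine-tune with a proper trainer; the managed service described next handles those details for you.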
Distillation in Azure AI Foundry

Distillation as a service is now supported on Azure for a variety of task types, with more to be added soon. The following tasks are supported:

- Summarization: Given a document (article) to summarize, generate an entity-dense summary of the document.
- Conversational Assistant: Generate AI assistant responses on single-turn and multi-turn conversational datasets. To generate each response, the available chat history and the current user prompt are utilized.
- Natural Language Understanding (NLU)
  - MATH: Generate numeric answers to math problems.
  - Natural Language Inference (NLI): Given a premise and a hypothesis, determine whether the premise entails the hypothesis, contradicts the hypothesis, or is neutral, i.e. neither entails nor contradicts the hypothesis.
  - Multiple-Choice Question Answering: Given a question and answer choices, determine the correct answer choice.

Distillation Process

Overview of the two-step distillation process: (1) generate synthetic data using a task-specific, elaborate prompt; (2) train (and infer from) the student model using a shorter prompt. (Figure source: https://arxiv.org/pdf/2410.18588)

The distillation process involves two main steps: generate high-quality synthetic data (labels) using the teacher model, followed by instruction-based fine-tuning of the student model.

Data Generation

High-quality data generation is crucial for the student model's performance. Azure provides a proprietary library of advanced prompts to generate high-quality synthetic data for all supported tasks, utilizing techniques such as Chain of Thought (CoT) or Chain of Density (CoD), and other best practices. This option can be enabled by passing the `enable_chain_of_thought` parameter while invoking the distillation pipeline, ensuring reasoning-based answers and consequently high-quality data for distillation.

Instruction Fine-Tuning

The next step is to fine-tune the smaller model using the task-specific generated data. This involves using a concise, task-specific prompt and training with the input and generated output (excluding reasoning steps). These innovations ensure significant performance gains for a given task while minimizing the cost (number of tokens) for the user. When using user-provided prompts, the same prompt is applied in both data generation and fine-tuning.

Distillation Code Snippet

Distillation is supported by the Azure SDK and CLI. Support for this was added in version 1.22.0 of azure-ai-ml. Ensure that the azure-ai-ml package is >= 1.22.0 before using the code snippet below.

Model Offerings

Teacher Models: Currently Meta Llama 3.1 405B Instruct is supported as the teacher model for distillation.

Student Models: Currently Meta Llama 3.1 8B Instruct is supported as the student model for distillation. Soon all of Microsoft's Phi 3 and 3.5 Instruct series models will also be available for distillation. The following table shows our current and upcoming student model offerings.

| Student Model | Region | Availability |
| --- | --- | --- |
| Meta Llama 3.1 8B Instruct | West US 3 | Available |
| Phi 3/3.5 Instruct | East US 2 | Coming Soon |

At the time of this writing, fine-tuning of the Meta Llama 3.1 Instruct series of models, and deployment of such fine-tuned models, is only available in the West US 3 region, whereas fine-tuning of Microsoft's Phi 3 Instruct series of models, and deployment of such fine-tuned models, is only available in the East US 2 region. Ensure your AI Foundry project is set up in the appropriate region for your selected student model.
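The snippet referenced under "Distillation Code Snippet" above did not survive formatting in this post, so the following is only a rough sketch of what submitting a distillation job with azure-ai-ml (>= 1.22.0) can look like. The `azure.ai.ml.model_customization` module, the `distillation()` builder, and its argument names are assumptions based on preview samples and may differ from the released API; refer to the official notebook linked in the next section for the authoritative version.

# Hypothetical sketch of submitting a distillation job with the azure-ai-ml SDK.
# The model_customization import, the distillation() builder, and its arguments are
# assumptions and may not match the released API; consult the official samples before use.
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential
from azure.ai.ml.model_customization import distillation, PromptSettings  # assumed module

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<ai-foundry-project>",
)

distillation_job = distillation(
    experiment_name="llama-nli-distillation",
    data_generation_task_type="NLI",                                   # one of the supported task types
    teacher_model_endpoint_connection="<Llama-3.1-405B-Instruct connection>",  # assumed shape
    student_model="<Meta-Llama-3.1-8B-Instruct asset id>",
    training_data=Input(type="uri_file", path="train.jsonl"),
    validation_data=Input(type="uri_file", path="valid.jsonl"),
)
# enable_chain_of_thought is the parameter called out above for reasoning-based label generation.
distillation_job.set_prompt_settings(PromptSettings(enable_chain_of_thought=True))

returned_job = ml_client.jobs.create_or_update(distillation_job)
ml_client.jobs.stream(returned_job.name)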
Notebooks

Distilling Large Language Models for NLI Tasks: A Practical Guide

Notebook - Distillation with Large Language Models

This notebook provides a comprehensive guide on how to distil a large teacher model into a smaller student model, specifically for Natural Language Inference (NLI) tasks. It uses Meta Llama 3.1 405B Instruct as the teacher and Meta Llama 3.1 8B Instruct as the student model.

Key Highlights

- Teacher and Student Models: The process uses Meta Llama 3.1 405B Instruct as the teacher model and Meta Llama 3.1 8B Instruct as the student model.
- Prerequisites: Ensure you have subscribed to the required models and set up an AI Foundry project in the West US 3 region for distillation of a Meta Llama 3.1 8B Instruct student model.
- SDK Installation: Install necessary SDKs such as azure-ai-ml, azure-identity, and mlflow.
- Dataset Preparation: Use the ConjNLI dataset from Hugging Face for training and validation.
- Distillation Job: Configure and run the distillation job to transfer knowledge from the teacher to the student model.
- Deployment: Optionally, deploy the distilled model to a serverless endpoint and perform sample inferences.

This notebook simplifies the complex task of model distillation, making it accessible even to those new to NLP and model training.

Results

Using the ConjNLI dataset and Chain-of-Thought (CoT) distillation, we obtain the following accuracy (%) metrics.

| Dataset | Student Model | Teacher (405B) with CoT Prompting | Student with CoT Prompting | Student Distilled on CoT-prompted Teacher Output |
| --- | --- | --- | --- | --- |
| ConjNLI (dev) | Meta Llama 3.1 8B Instruct | 69.98 | 52.81 | 63.88 |
| ConjNLI (dev) | Phi 3 Mini 128k Instruct | 69.98 | 43.98 | 57.78 |

Distillation with the Meta Llama 3.1 8B Instruct and Phi 3 Mini 128k Instruct student models provides approximately 21% and 31% improvement respectively over directly prompting the student model with a CoT prompt. For detailed results on other datasets and tasks, we refer the user to the published results in our knowledge distillation paper.

Conclusion

Distillation represents a significant step forward in the development and deployment of LLMs/SLMs at scale. By transferring the knowledge from a large pre-trained model (teacher) to a smaller, more efficient model (student), distillation offers a practical solution to the challenges of deploying large models, such as high costs and complexity. This technique not only reduces model size and operational costs but also enhances the performance of student models for specific tasks. The support for distillation in Azure AI Foundry further simplifies the process, making it accessible for various applications, such as summarization, conversational assistance, and natural language understanding tasks. Furthermore, the detailed, hands-on example notebooks provided in the Azure GitHub repository can help facilitate easier adoption. In summary, distillation not only bridges the gap between generalist understanding and specialized application but also paves the way for a more sustainable and practical approach to leveraging LLMs in real-world scenarios.

A Framework for Calculating ROI for Agentic AI Apps
Contributors and Reviewers: Anurag Karuparti (C), Aishwarya Umachandran (C), Tara Webb (R), Bart Czernicki (R), Simon Lacasse (R), Vishnu Pamula (R)

ROI serves as a critical metric for assessing the financial benefits of any investment, including AI projects. It helps determine whether the investment generates more value than it costs. The fundamental formula for calculating ROI is:

ROI = (Net Return from Investment - Cost of Investment) / Cost of Investment * 100

Studies indicate that companies investing in AI are realizing significant returns, with an average ROI of $3.7 for every $1 invested. Notably, 5% of organizations worldwide are achieving an even higher average ROI of $10 for every $1 invested. (IDC Study 2024)

1. Key Metrics for Measuring ROI in Agentic AI Apps

Measuring the ROI of agentic AI apps necessitates a comprehensive approach that considers both tangible and intangible benefits. Intangible benefits may be difficult to quantify but significantly contribute to ROI. Here are some key metrics to consider:

a. Tangible Benefits

- Cost Savings: Agentic apps can automate tasks, leading to significant cost reductions in areas like customer service, data entry, and many business operations. By handling complex workflows autonomously, agentic AI minimizes the need for human intervention, resulting in lower labor costs and increased efficiency.
- Revenue Increase: Agentic apps can help businesses identify new revenue streams, optimize pricing strategies, and improve sales and marketing effectiveness, ultimately driving revenue growth.
- Productivity Gains: By automating tasks and providing employees with enhanced tools and information, agentic apps can boost productivity and efficiency.
- Data Quality Improvements: Agentic apps can minimize errors in tasks such as data entry and analysis, leading to improved accuracy and reduced costs associated with correcting mistakes.
- Improved Customer Satisfaction: Agentic apps can enhance customer satisfaction by providing personalized experiences, faster service, and proactive problem-solving.
- Faster Time-to-Market: Agentic AI can accelerate product development and deployment, enabling businesses to bring new products and services to market faster.

b. Intangible Benefits

- Improved Decision-Making: Agentic AI can analyze vast amounts of data and provide valuable insights that can help businesses make more informed decisions.
- Enhanced Brand Reputation: By providing innovative and efficient services, agentic AI can enhance a company's brand reputation and foster customer loyalty.
- Increased Employee Satisfaction: By automating mundane tasks and empowering employees with better tools, agentic AI can improve employee satisfaction and retention.
- Improved Compliance: Agentic AI can help businesses comply with regulations and reduce the risk of penalties.
- Increased Innovation: By freeing up employees from routine tasks, agentic AI can foster a culture of innovation and creativity.

2. Cost Components of Developing and Deploying Agentic Apps

Developing and deploying agentic AI apps involves various cost components, which can be categorized as follows:

| Cost Component | Description | Example |
| --- | --- | --- |
| Development Costs | This includes the cost of software and development tools, salaries of developers, data scientists, and machine learning engineers, and cloud computing resources. | Salaries for a team comprising a data scientist ($120,000 - $180,000 per year), a machine learning engineer ($130,000 - $200,000 per year), and an AI software developer ($110,000 - $170,000 per year), plus development costs on cloud platforms like Azure. (The above salaries are estimates based on public information and can vary.) |
| Data Acquisition and Preparation | Agentic AI apps may require large amounts of data for training and operation. This includes the cost of acquiring data, cleaning it, and preparing it for use in AI models. | Purchasing datasets from third-party providers or investing in data annotation services. |
| Testing and Deployment | This includes the cost of testing the AI app, deploying it to the cloud or on-premises, and integrating it with existing systems. | Cloud computing costs for deploying the app on platforms such as Azure, AWS, and Google Cloud. |
| Maintenance and Updates | Agentic AI apps require ongoing maintenance and updates to ensure they remain effective and secure. This includes the cost of monitoring the app, fixing bugs, and adding new features. | Costs associated with software updates, security patches, and ongoing monitoring of the app's performance. |

3. New Revenue Streams from Agentic Apps

Agentic AI apps can generate revenue through various business models by enhancing business operations in several ways.

| Revenue Stream/Value Proposition | Description | Example |
| --- | --- | --- |
| Subscription Fees | Businesses can charge users a recurring fee for access to the agentic AI app. | Offering different subscription tiers with varying levels of access and features. |
| Usage-Based Pricing | Businesses can charge users based on their usage of the app, such as the number of tasks performed or the amount of data processed. | Charging users per API call or per transaction processed by the agentic AI app. |
| Licensing Fees | Businesses can license their agentic AI technology to other companies. | Granting other businesses the right to use the agentic AI technology in their own products or services. |

It's important to note that agentic AI is poised to disrupt traditional SaaS business models, particularly the prevalent per-seat pricing model. As agentic AI becomes more sophisticated, businesses may shift towards alternative pricing models, such as usage-based pricing or outcome-based pricing, where the cost is directly tied to the AI's contribution to measurable business goals.

4. Framework for Calculating ROI for Agentic Apps

Based on the analysis presented above, the following framework can be used to calculate the ROI of agentic AI apps:

1. Define Objectives and KPIs: Clearly define the objectives of implementing the agentic AI app and the key performance indicators (KPIs) that will be used to measure its success. This could include metrics such as cost savings, revenue increase, productivity gains, customer satisfaction, and error reduction.
2. Establish a Baseline: Establish a baseline for the KPIs before implementing the agentic AI app. This will help measure the impact of the app on the business.
3. Estimate Revenue Gains and Cost Savings: Estimate the potential revenue gains and cost savings that can be achieved by implementing the agentic AI app. This may involve analyzing historical data, conducting surveys, and consulting with industry experts.
4. Identify and Assess Costs: Identify all costs associated with developing, deploying, and maintaining the agentic AI app. This includes development costs, data acquisition costs, infrastructure costs, and ongoing maintenance costs.
5. Determine Intangible Benefits: Identify and assess the intangible benefits of the agentic AI app, such as improved decision-making, enhanced brand reputation, and increased employee satisfaction. While these benefits may be difficult to quantify, they can significantly contribute to the overall ROI.
6. Set a Realistic Timeframe: Establish a realistic timeframe for measuring the ROI of the agentic AI app. This should consider the time it takes to develop, deploy, and fully integrate the app into the business.
7. Develop a Current State Scenario: Develop a scenario that represents the current state of the business without the agentic AI app. This will help compare the performance of the business with and without the app.
8. Calculate the ROI: Using the data gathered in the previous steps, calculate the ROI of the agentic AI app using the ROI formula.
9. Monitor and Adjust: Continuously monitor the performance of the agentic AI app and track the KPIs. Adjust the app and its implementation as needed to optimize its effectiveness and maximize ROI.

When calculating the ROI of AI initiatives, it's crucial to avoid common pitfalls such as:

- Uncertainty of Benefits: Accurately estimating the benefits of AI can be challenging due to the evolving nature of the technology and the potential for unforeseen outcomes.
- Computing ROI Based on a Single Point in Time: AI projects often have long-term benefits that may not be fully realized in the short term. As per a recent IDC study (Nov 2024), organizations realize value in 14 months.
- Treating Each AI Project Individually: AI projects can have synergistic effects, and evaluating them in isolation may underestimate their overall impact on the business.
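Before walking through the example scenarios, it may help to see the framework's final calculation step written out in code. The sketch below is a minimal illustration only; the `roi` helper is hypothetical, and the sample figures are taken from Option 2 later in this section.

# Minimal sketch of the ROI arithmetic used by the framework above.
# roi() is a hypothetical helper; the sample figures come from Option 2 below.
def roi(total_annual_benefit: float, total_annual_cost: float) -> float:
    """ROI (%) = (net benefit / cost) * 100."""
    net_benefit = total_annual_benefit - total_annual_cost
    return net_benefit / total_annual_cost * 100

# Option 2 (AI-powered support chatbot): $90,000 cost savings + $300,000 revenue lift,
# against $30,000 of annual chatbot costs ($25,000 development + $5,000 maintenance).
benefits = 90_000 + 300_000
costs = 25_000 + 5_000
print(f"ROI: {roi(benefits, costs):.0f}%")   # -> ROI: 1200%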
5. Example Scenarios

Option 1

A financial services call center handles 100,000 customer inquiries per year, each currently taking an average of 5 minutes. Of these calls, 10% (10,000 calls) are simple, routine requests (e.g., checking balances) and can be easily automated. Additionally, misrouting and inefficient handling cause each call to run 1 extra minute on average.

Current Situation (Before Multi-Agent AI):
- Total calls: 100,000
- Simple, routine calls: 10,000
- Agent cost per minute: $0.50

Routine Calls Cost (Before AI): Routine calls each take 3 minutes.
- Total routine call time: 10,000 calls × 3 min = 30,000 min
- Cost: 30,000 min × $0.50 = $15,000 per year

Misrouting Cost (Before AI): Extra 1 minute per call due to misrouting.
- Total extra time: 100,000 calls × 1 min = 100,000 min
- Cost: 100,000 min × $0.50 = $50,000 per year

Total Extra Costs (Before AI):
- Routine tasks: $15,000
- Misrouting: $50,000
- Combined inefficiencies: $65,000 per year

After Implementing Multi-Agent Collaboration AI: The AI system handles routine inquiries automatically and optimizes call routing.
- Routine Calls Automated: 10,000 routine calls no longer require agent time. Saves $15,000 per year on routine tasks.
- Correct Routing: Removes the extra 1 minute per call. Saves $50,000 per year by avoiding misrouting costs.
- Efficiency Gains: With misrouting fixed and agents freed from routine tasks, staff can handle a slight increase in call volume and also reduce overtime. Staff can handle an additional 4,000 calls annually, each call at 5 minutes on average (4,000 × 5 × $0.50 = $10,000).

Total Annual Savings After AI (Tangible Benefit):
- Routine tasks saved: $15,000
- Misrouting eliminated: $50,000
- Efficiency gains: $10,000
- Total: $75,000

System Costs:
- Implementation and integration: $40,000
- Annual maintenance: $5,000
- Total Annual Cost: $45,000

ROI Calculation:
- Net Benefit: $75,000 (savings) – $45,000 (cost) = $30,000
- ROI = (Net Benefit / Cost) × 100% = (30,000 / 45,000) × 100% ≈ 67%

A 67% ROI means that for every dollar invested in the multi-agent collaboration AI system, the call center gains an additional 67 cents in profit each year.

Option 2

Scenario: A company wants to semi-automate customer support for their e-commerce platform using an AI-powered chatbot on Azure. The AI-powered customer service chatbot provides support for very frequently asked questions. It automates responses, provides real-time order tracking, and offers personalized product recommendations while proactively engaging customers with tailored offers and anticipating their needs. It autonomously handles tasks like follow-ups and issue resolution, integrates seamlessly with existing systems, supports multiple languages, and operates 24/7 to enhance customer satisfaction and drive sales. Additionally, it escalates complex issues to human agents and continuously improves through self-feedback.

Cost Estimation:
- Development and Deployment: $25,000 (including Azure App Service, Azure Agent Service, and other development costs)
- Maintenance and Support: $5,000 per year

Benefit Estimation:
- Reduced Customer Service Costs: The chatbot handles 2,000 customer inquiries per month, which previously required 3 full-time employees with an average salary of $40,000 per year.
- Increased Sales: The chatbot's personalized recommendations and efficient support lead to a 5% increase in monthly sales.

Calculating ROI:
- Annual Cost Savings: 3 employees × $40,000 = $120,000. Chatbot cost = $25,000 (development) + $5,000 (maintenance) = $30,000. Cost savings = $120,000 - $30,000 = $90,000.
- Annual Revenue Increase: Monthly sales are $500,000; a 5% increase is $25,000 per month, or $25,000 × 12 = $300,000 per year.
- Total Annual Benefits: $90,000 (cost savings) + $300,000 (revenue) = $390,000
- ROI = ((Total Benefits − Annual Cost) / Annual Cost) × 100% = ((390,000 − 30,000) / 30,000) × 100% = 1200%

This example demonstrates a significant ROI for the customer service chatbot. However, it's important to remember that this is a simplified calculation. Actual ROI may vary depending on various factors specific to the business and its implementation.

Note: Calculating Azure Costs - Azure costs vary by use case and depend on the architecture components. We'll discuss example scenarios for calculating these costs in a future blog.

6. Risks and Considerations

Since the core of these agents relies on LLMs, there is a potential for hallucination. Rigorous testing and evaluation are therefore critical before deploying them to production. Additionally, in the initial stages, agents may exhibit inefficiencies due to the complexity of orchestration, potentially introducing a 10-20% overhead. It is wise to set an ROI range that accounts for differences in response confidence, as sketched below. However, over time, these agents are expected to improve and optimize through iterative learning and feedback.
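As a minimal illustration of such an ROI range, the sketch below applies an assumed 10-20% overhead haircut to the Option 1 savings; treating the overhead as a proportional reduction of gross savings is an assumption made purely for illustration.

# Sketch of an ROI range under an assumed 10-20% orchestration overhead,
# using the Option 1 figures above ($75,000 gross savings, $45,000 annual cost).
savings, annual_cost = 75_000.0, 45_000.0
for overhead in (0.0, 0.10, 0.20):
    adjusted_savings = savings * (1 - overhead)   # assumption: overhead erodes gross savings
    roi_pct = (adjusted_savings - annual_cost) / annual_cost * 100
    print(f"overhead {overhead:.0%}: ROI ≈ {roi_pct:.0f}%")
# -> approximately 67%, 50%, and 33%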
7. ROI Will Differ from Use Case to Use Case

For example, in one call center, routine inquiries might be the primary source of inefficiency, while in another, the biggest gains might come from reducing customer wait times. Similarly, different industries may have different labor costs, different complexity levels for tasks, or varying levels of baseline performance. Cloud workload costs on Azure may also change based on usage patterns, the AI services you choose, data storage needs, and the extent of system integration required.

In short, while the overall method for calculating ROI remains the same (measure gains, subtract costs, then divide by costs), the types of gains (e.g., labor reduction, error reduction, increased throughput, improved customer satisfaction) and the kinds of costs (e.g., Azure compute, integration services, licensing fees, training expenses) will be different for each scenario. As a result, you need to carefully identify the relevant metrics and expenses for every individual use case.

Conclusion

Agentic AI apps hold immense potential for businesses seeking to automate tasks, enhance efficiency, and improve decision-making. By implementing a comprehensive framework for calculating ROI, businesses can effectively justify their investment in agentic AI and ensure that these apps deliver both tangible and intangible benefits. This framework should encompass both quantitative and qualitative metrics, including cost savings, revenue increases, productivity gains, customer satisfaction, and intangible benefits such as improved decision-making and enhanced brand reputation.

While the framework presented in this report provides a structured approach to evaluating the ROI of agentic AI apps, it's important to acknowledge the potential challenges and limitations. Quantifying some intangible benefits, such as enhanced brand reputation or increased employee satisfaction, can be subjective and may require alternative measurement approaches. Furthermore, the rapidly evolving nature of agentic AI technology may necessitate ongoing adjustments to the ROI framework to accurately capture its impact on businesses. Despite these challenges, a well-defined ROI framework remains crucial for making informed decisions about agentic AI investments and maximizing their potential. By carefully evaluating the ROI of agentic AI apps, businesses can strategically leverage this transformative technology to achieve their objectives and gain a competitive edge in the evolving digital landscape.

References: IDC's 2024 AI opportunity study: Top five AI trends to watch - The Official Microsoft Blog

Fine-Tuning Small Language Models for Function-Calling: A Comprehensive Guide
In the rapidly evolving landscape of artificial intelligence, fine-tuning small language models (SLMs) for use-case-specific workloads has become increasingly essential. The motivation behind this lies in the need for lower latency, reduced memory footprint, and improved accuracy, all while maintaining cost-effectiveness. This blog delves into the reasons for fine-tuning SLMs for function-calling, the key considerations, and a practical guide to implementing fine-tuning on Azure.

Why Fine-Tune Small Language Models?

1. Lower Latency and Reduced Memory Footprint: Smaller models with fewer weights inherently offer faster processing times due to reduced matrix multiplication operations. This lower latency is crucial for real-time applications where speed is paramount. Additionally, these models reduce the memory footprint, making them ideal for deployment in resource-constrained environments.
2. Cost Efficiency: Fine-tuning smaller models is more cost-effective than training large models from scratch. It reduces the computational resources required, thereby lowering operational costs. This makes it a viable option for startups and enterprises looking to optimize their AI expenditure.
3. Improved Accuracy: By tailoring a model to a specific function-calling use case, you can achieve higher accuracy. Fine-tuning allows the model to learn the intricacies of function-calling, thereby providing more relevant and precise outputs.
4. Smaller Token Size: Smaller models and efficient token handling lead to a reduction in token size, which further optimizes processing speed and resource usage.

Key Considerations for Fine-Tuning

a. Selection of the Right Base Model: Choosing the appropriate base model is crucial. Evaluate industry benchmarks and leaderboards, such as the Berkeley Function Call Leaderboard, to guide your selection. Consider factors like model size (which affects GPU VRAM requirements), accuracy, and context length. For this blog post, we will use the Llama-3.2-3B-Instruct model as our base model for fine-tuning.

b. Dataset Preparation: Proper dataset preparation is a cornerstone of successful fine-tuning of SLMs for function-calling tasks. The dataset must be representative of real-world scenarios and cover the full spectrum of use cases you anticipate. For this blog, we will utilize the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, renowned for its comprehensive coverage of simple, multiple, and multi-turn function-calling scenarios across diverse domains.

Key Steps in Dataset Preparation

Understanding the Complexity of the Use Case: Before diving into the technicalities of dataset preparation, it's essential to understand the complexity of the use case at hand. Is the task limited to function-calling, or does it involve a broader, more generic conversation? If the latter is true, it becomes imperative to ensure that the existing knowledge and capabilities of the small language model (SLM) are preserved. The dataset should seamlessly integrate both function-call and non-function-call scenarios to provide a holistic conversational experience.
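Before differentiating the scenario types below, it helps to see what a single training example can look like once parsed into the OpenAI-style chat format used later in this post. The weather function, its arguments, and the tool response here are made up purely for illustration.

# Illustration only: one single-turn function-calling record in the OpenAI-style chat
# format that the dataset-preparation code later in this post produces. The weather
# function, its arguments, and the tool response are hypothetical.
example_record = [
    {"role": "system", "content": "You are a helpful assistant with access to a weather function."},
    {"role": "user", "content": "What's the weather in Paris right now?"},
    {"role": "assistant", "content": {"tool_uses": [{
        "recipient_name": "functions.get_current_weather",
        "parameters": {"location": "Paris", "unit": "celsius"},
    }]}},
    {"role": "tool", "content": '{"temperature": 18, "condition": "cloudy"}'},
    {"role": "assistant", "content": "It is currently 18°C and cloudy in Paris."},
]

Records like this, covering each of the scenario types below, are what the fine-tuned model ultimately learns from.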
Differentiating Function-Calling Scenarios

Let's explore the different scenarios that might arise in function-calling applications:

- Single Function-Calling: This scenario involves invoking a single function based on user input. For instance, in the travel industry, a user might ask, "What are the available flights from New York to London on December 10th?" The dataset should include examples that allow the model to extract relevant information and call the flight search function accurately.
- Multiple Function-Calling: Here, the language model must choose one function from a set of possible tools. For example, if a user asks, "Can you book me a hotel or a flight to Paris?" the dataset should provide instances where the model decides between booking a hotel or a flight based on user preferences or additional input.
- Multi-Turn Conversations: This scenario requires tools to be invoked in a sequence based on the conversation's state. Consider a user planning a vacation: "I want to visit Italy. What are my options?" followed by "Book me a flight," and then "Find a hotel in Rome." The dataset should capture the flow of conversation, enabling the model to handle each request in context.
- Parallel Function-Calling: In situations where multiple tools need to be invoked simultaneously, such as booking flights and hotels at the same time, the dataset should include examples that allow the model to manage these parallel tasks effectively. For instance, "Book a flight to Tokyo and reserve a hotel in Shinjuku for the same dates."
- Handling Missing Information: A robust dataset should also include scenarios where the language model needs to ask the user for missing information. For example, if a user simply says, "Book me a flight," the model should prompt, "Could you please specify the destination and dates?"

c. Compute Selection: Ensure your compute setup has adequate VRAM to accommodate model weights, gradients, and activations. The compute should be tailored to your model size and batch size requirements.

d. Hyperparameter Selection: The selection of hyperparameters is a critical step that can significantly influence the performance of a model. Hyperparameters, unlike the model's parameters, are not learned from the data but are set before the training process begins. Choosing the right hyperparameters can lead to faster convergence and higher accuracy, making this an area that demands careful attention. Hyperparameters can be thought of as the settings or knobs that you, as the model trainer, can adjust to tailor the training process. These include the learning rate, batch size, the architecture of layers, and more.

One of the leading methodologies for fine-tuning models is LoRA (Low-Rank Adaptation), which has gained popularity due to its efficiency and effectiveness. LoRA is a technique that allows for the efficient adaptation of large language models by introducing low-rank matrices during the training process. This approach reduces the number of trainable parameters, leading to faster convergence and reduced computational costs. When using LoRA, two primary hyperparameters to consider are:

- Rank: This represents the dimensionality of the low-rank matrices. It is a critical factor influencing the model's capacity to learn nuanced patterns.
- Alpha: This is a scaling factor applied to the low-rank updates, typically set to 2-4 times the rank value.

A good starting point for these parameters might be a rank of 8 and an alpha of 16, but these values should be tailored based on the model's complexity and the specific task at hand (see the sketch below).
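The training script later in this post passes these values through Unsloth's FastLanguageModel.get_peft_model. As a standalone illustration, here is roughly what the same starting point looks like with the generic Hugging Face peft API; the dropout value is an assumption, the base model requires gated access on Hugging Face, and the target modules mirror those used later in this post.

# Minimal sketch of the suggested LoRA starting point (rank 8, alpha 16) using the
# generic Hugging Face peft API. The training script below passes the same values
# through Unsloth's FastLanguageModel.get_peft_model instead.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor, ~2x the rank
    lora_dropout=0.05,        # assumed value; tune for your task
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # confirms only a small fraction of weights will train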
e. Optimize Context Length: Another significant aspect of model fine-tuning, especially in function-calling scenarios, is the management of context length. In these prompts, we often provide detailed information such as function names, descriptions, and argument types, which consume a substantial number of tokens. Efficiently managing this context can lead to performance gains without sacrificing accuracy.

Iterative Experimentation with Context Details: To optimize context length, an iterative experimentation approach is recommended:

- Baseline Experiment: Start by including all possible details (function descriptions, argument types, and more). This serves as your baseline for comparison.
- Simplified Contexts: Gradually remove elements from the context:
  - First Iteration: Retain only the function names and arguments, omitting descriptions.
  - Second Iteration: Remove the arguments, keeping just the function names.
  - Final Iteration: Test the model's performance without any function names or arguments.

By incrementally simplifying the context, you can identify the minimal necessary context for your task. While conducting these experiments, it is advantageous to utilize previous checkpoints. Instead of starting from the base model for each iteration, use the trained model from the previous step as a starting point. This approach can save time and computational resources, allowing for more efficient experimentation.

Fine-Tuning on Azure: Step-by-Step

Now let's run the fine-tuning job while adhering to all the guidelines and instructions shared above:

1. Create an Azure Machine Learning Workspace: An Azure Machine Learning workspace is your control center for managing all the resources you need to train, deploy, automate, and manage machine learning models. It serves as a central repository for your datasets, compute resources, and models. To get started, you can create a workspace through the Azure portal by navigating to the Azure Machine Learning service and selecting "Create new workspace." Ensure you configure the resource group, workspace name, region, and other necessary settings.

2. Create a Compute Instance: To run your Python notebook and execute scripts, you need a compute instance. This virtual machine in Azure Machine Learning allows you to perform data preparation, training, and experimentation. Go to the "Compute" section in your workspace, select "Create," and choose a compute instance that fits your needs, ensuring it has the necessary specifications for your workload.

3. Dataset Preparation: For this blog, we'll use the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, which includes simple, multi-turn function calling and generic conversations across various domains. The dataset needs to be formatted to be compatible with the OpenAI format:

- Convert each conversation into a chat_template format.
- Assign roles as 'system', 'user', or 'assistant'.
Remove "<|endoftext|>” string and if the response is a function-call, replace the “<functioncall>” string and add role as tool so that LLM knows when to stop responding and wait for function execution results def parse_conversation(input_string): ROLE_MAPPING = {"USER" : "user", "ASSISTANT" : "assistant", "SYSTEM" : "system", "FUNCTION RESPONSE" : "tool"} # Regular expression to split the conversation based on SYSTEM, USER, and ASSISTANT pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):" # Split the input string and keep the delimiters parts = re.split(pattern, input_string) # Initialize the list to store conversation entries conversation = [] # Iterate over the parts, skipping the first empty string for i in range(1, len(parts), 2): role = parts[i].strip() content = parts[i + 1].strip() content = content.replace("<|endoftext|>", "").strip() if content.startswith('<functioncall>'): # build structured data for function call # try to turn function call from raw text to structured data content = content.replace('<functioncall>', '').strip() # replace single quotes with double quotes for valid JSON clean_content = content.replace("'{", '{').replace("'}", '}') data_json = json.loads(clean_content) # Make it compatible with openAI prompt format func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']} content = {'tool_uses': [func_call]} # Append a dictionary with the role and content to the conversation list conversation.append({"role": ROLE_MAPPING[role], "content": content}) return conversation def prepare_dataset(tokenizer, args): # Create the cache_dir cache_dir = "./outputs/dataset" os.makedirs(cache_dir, exist_ok = True) # Load the dataset from disk train_dataset = load_from_disk(args.train_dir) eval_dataset = load_from_disk(args.val_dir) column_names = list(train_dataset.features) def apply_chat_template(examples): conversations = [] for system, chat in zip(examples["system"], examples["chat"]): try: system_message = parse_conversation(system) chat_message = parse_conversation(chat) message = system_message + chat_message conversations.append(message) except Exception as e: print(e) text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in conversations] return {"text": text} # process the dataseta and drop unused columns processed_train_dataset = train_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) processed_eval_dataset = eval_dataset.map(apply_chat_template, cache_file_name = f"{cache_dir}/cache.arrow", batched = True, remove_columns=column_names) return processed_train_dataset, processed_eval_dataset 4: Create a Data Asset: Azure Machine Learning allows you to register datasets as data assets, making them easily manageable and reusable: def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False): try: latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)]) if update: raise ResourceExistsError('Found Data asset, but will update the Data.') else: data_asset = ml_client.data.get(name=data_name, version=latest_data_version) logger.info(f"Found Data asset: {data_name}. 
Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: data = Data( path=data_local_dir, type=AssetTypes.URI_FOLDER, description=f"{data_name} for fine tuning", tags={"FineTuningType": "Instruction", "Language": "En"}, name=data_name ) data_asset = ml_client.data.create_or_update(data) logger.info(f"Created/Updated Data asset: {data_name}") return data_asset train_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_train", data_local_dir=f"{DATA_DIR}/train", update=True) val_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_val", data_local_dir=f"{DATA_DIR}/val", update=True) test_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_test", data_local_dir=f"{DATA_DIR}/test", update=True) 5: Create an Environment: While Azure provides built-in environments for common use cases, creating a custom environment tailored to your specific needs can be beneficial. An environment in Azure ML is essentially a containerized setup that defines the software, libraries, and other dependencies required to run your machine learning workload. Why Use Environments? Reproducibility: By defining an environment, you ensure that your training and inference processes are reproducible, with the same configuration used every time. Consistency: Environments help maintain consistency across different runs and teams, reducing "it works on my machine" problems. Portability: They encapsulate your dependencies, making it easier to move and share your ML projects across different Azure services or even with external collaborators. %%writefile {CLOUD_DIR}/train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False): try: latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)]) if update: raise ResourceExistsError('Found Environment asset, but will update the Environment.') else: env_asset = ml_client.environments.get(name=env_name, version=latest_env_version) print(f"Found Environment asset: {env_name}. Will not create again") except (ResourceNotFoundError, ResourceExistsError) as e: print(f"Exception: {e}") env_docker_image = Environment( build=BuildContext(path=docker_dir), name=env_name, description="Environment created from a Docker context.", ) env_asset = ml_client.environments.create_or_update(env_docker_image) print(f"Created Environment asset: {env_name}") return env_asset env = get_or_create_docker_environment_asset(ml_client, azure_env_name, docker_dir=f"{CLOUD_DIR}/train", update=False) Reference : training.ipynb 6: Create a Training Script: Your training script will handle the fine-tuning process and log metrics using MLflow, which is tightly integrated with Azure Machine Learning. This involves - Loading the dataset, defining the model architecture, writing functions to track and log metrics such as training and evaluation loss. 
def main(args): ################### # Hyper-parameters ################### # Only overwrite environ if wandb param passed if len(args.wandb_project) > 0: os.environ['WANDB_API_KEY'] = args.wandb_api_key os.environ["WANDB_PROJECT"] = args.wandb_project if len(args.wandb_watch) > 0: os.environ["WANDB_WATCH"] = args.wandb_watch if len(args.wandb_log_model) > 0: os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0) training_config = {"per_device_train_batch_size" : args.train_batch_size, # Controls the batch size per device "per_device_eval_batch_size" : args.eval_batch_size, # Controls the batch size for evaluation "gradient_accumulation_steps" : args.grad_accum_steps, "warmup_ratio" : args.warmup_ratio, # Controls the ratio of warmup steps "learning_rate" : args.learning_rate, "fp16" : not torch.cuda.is_bf16_supported(), "bf16" : torch.cuda.is_bf16_supported(), "optim" : "adamw_8bit", "lr_scheduler_type" : args.lr_scheduler_type, "output_dir" : args.output_dir, "logging_steps": args.logging_steps, "logging_strategy": "epoch", "save_steps": args.save_steps, "eval_strategy": "epoch", "num_train_epochs": args.epochs, # "load_best_model_at_end": True, "save_only_model": False, "seed" : 0 } peft_config = { "r": args.lora_r, "lora_alpha": args.lora_alpha, "lora_dropout": args.lora_dropout, "bias": "none", #"target_modules": "all-linear", "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], "modules_to_save": None, "use_gradient_checkpointing": "unsloth", "use_rslora": False, "loftq_config": None, } checkpoint_dir = os.path.join(args.output_dir, "checkpoints") train_conf = TrainingArguments( **training_config, report_to="wandb" if use_wandb else "azure_ml", run_name=args.wandb_run_name if use_wandb else None, ) model, tokenizer = load_model(args) model = FastLanguageModel.get_peft_model(model, **peft_config) ############### # Setup logging ############### logging.basicConfig( format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) log_level = train_conf.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) transformers.utils.logging.set_verbosity(log_level) transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() # Log on each process a small summary logger.warning( f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}" + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}" ) logger.info(f"Training/evaluation parameters {train_conf}") logger.info(f"PEFT parameters {peft_config}") # Load the dataset train_dataset, eval_dataset = prepare_dataset(tokenizer, args) ########### # Training ########### trainer = SFTTrainer( model=model, args=train_conf, tokenizer = tokenizer, train_dataset=train_dataset, eval_dataset=eval_dataset, dataset_text_field="text", packing = False # Can make training 5x faster for shorter responses ) # Show current memory stats gpu_stats = torch.cuda.get_device_properties(0) start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3) max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3) logger.info(f"GPU = {gpu_stats.name}. 
Max memory = {max_memory} GB.") logger.info(f"{start_gpu_memory} GB of memory reserved.") last_checkpoint = None if os.path.isdir(checkpoint_dir): checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)] if len(checkpoints) > 0: checkpoints.sort(key=os.path.getmtime, reverse=True) last_checkpoint = checkpoints[0] trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint) ############# # Evaluation ############# tokenizer.padding_side = "left" metrics = trainer.evaluate() metrics["eval_samples"] = len(eval_dataset) trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) # ############ # # Save model # ############ os.makedirs(args.model_dir, exist_ok=True) if args.save_merged_model: print("Save PEFT model with merged 16-bit weights") model.save_pretrained_merged("outputs", tokenizer, save_method="merged_16bit") else: print(f"Save PEFT model: {args.model_dir}/model") model.save_pretrained(f"{args.model_dir}/model") tokenizer.save_pretrained(args.model_dir) Reference : train.py 7: Create the Compute Cluster: . For this experiment, we are using Standard_NC24ads_A100_v4 which has 1 GPU and 80 GB of VRAM. Select the compute based on the model size and batch size. from azure.ai.ml.entities import AmlCompute ### Create the compute cluster try: compute = ml_client.compute.get(azure_compute_cluster_name) print("The compute cluster already exists! Reusing it for the current run") except Exception as ex: print( f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!" ) try: print("Attempt #1 - Trying to create a dedicated compute") tier = 'LowPriority' if USE_LOWPRIORITY_VM else 'Dedicated' compute = AmlCompute( name=azure_compute_cluster_name, size=azure_compute_cluster_size, tier=tier, max_instances=1, # For multi node training set this to an integer value more than 1 ) ml_client.compute.begin_create_or_update(compute).wait() except Exception as e: print("Error") 8: Submit the Fine-Tuning Job With everything set up, you can now submit your fine-tuning job: from azure.ai.ml import command from azure.ai.ml import Input from azure.ai.ml.entities import ResourceConfiguration job = command( inputs=dict( #train_dir=Input(type="uri_folder", path=DATA_DIR), # Get data from local path train_dir=Input(path=f"{AZURE_DATA_NAME}_train@latest"), # Get data from Data asset val_dir = Input(path=f"{AZURE_DATA_NAME}_val@latest"), epoch=d['train']['epoch'], train_batch_size=d['train']['train_batch_size'], eval_batch_size=d['train']['eval_batch_size'], ), code=f"{CLOUD_DIR}/train", # local path where the code is stored compute=azure_compute_cluster_name, command="python train_v3.py --train_dir ${{inputs.train_dir}} --val_dir ${{inputs.val_dir}} --train_batch_size ${{inputs.train_batch_size}} --eval_batch_size ${{inputs.eval_batch_size}}", #environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/77", # Use built-in Environment asset environment=f"{azure_env_name}@latest", distribution={ "type": "PyTorch", "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1 }, ) returned_job = ml_client.jobs.create_or_update(job) ml_client.jobs.stream(returned_job.name) 9: Monitor Training Metrics: After initiating the job, keep an eye on the output for key metrics like training loss and evaluation loss. 
Since we've logged the results to MLflow, which is seamlessly integrated with Azure Machine Learning, we can easily review the loss function by navigating to the metrics tab within the jobs section. Key Takeways: Both the training and evaluation loss decrease significantly in the initial steps, suggesting effective learning. The gradual reduction in loss in subsequent steps indicates that the model continues to refine its parameters, but at a slower rate. The consistency in the downward trend for both training and evaluation loss implies that the model is not overfitting and is generalizing well to new data. However, the slight uptick towards the end in the evaluation loss might need monitoring to ensure it doesn't indicate overfitting at later stages. Overall, it looks promising, so lets go ahead and register the model. 10: Register the Model: After fine-tuning, register the model to make it available for deployment: from azureml.core import Workspace, Run import os # Connect to your workspace ws = Workspace.from_config() experiment_name = 'experiment_name' run_id = 'job_name' run = Run(ws.experiments[experiment_name], run_id) # Register the model model = run.register_model( model_name=d["serve"]["azure_model_name"], # this is the name the model will be registered under model_path="outputs" # this is the path to the model file in the run's outputs ) # Create a local directory to save the outputs local_folder = './model_v2' os.makedirs(local_folder, exist_ok=True) # Download the entire outputs folder run.download_files(prefix='outputs', output_directory=local_folder) Step 11: Deploy the Model to a Managed Online Endpoint: Managed online endpoints provide a seamless way to deploy models without managing underlying infrastructure. They offer scalability, versioning, and easy rollback compared to deploying on an Azure Kubernetes Service (AKS) cluster. 11 a. Build the enviornment: For deploying the model to managed online endpoint, first create the environment with required dependencies and webserver for inference. %%writefile {CLOUD_DIR}/serve/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2 # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir # Inference requirements COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/ RUN /var/requirements/install_system_requirements.sh && \ cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \ cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \ ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \ rm -f /etc/nginx/sites-enabled/default ENV SVDIR=/var/runit ENV WORKER_TIMEOUT=400 EXPOSE 5001 8883 8888 # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update RUN apt-get install -y openssh-server openssh-client RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation Reference : serving.ipynb 11b. Create a serving script: Creating a serve script for inference is a crucial step in deploying your machine learning model to a production environment. This script handles incoming requests, processes input data, runs the model inference, and returns the results. In Azure Machine Learning, the serve script is part of the deployment package for your model, typically used in conjunction with a managed endpoint or a Kubernetes service. 
A serving script in Azure ML typically consists of two main functions:
- init(): Initializes the model and any other necessary resources. It is called once, when the deployment is first loaded.
- run(data): Called every time a request is made to the deployed model. It processes the incoming data, performs inference using the model, and returns the results.

import os
import re
import json
import torch
import base64
import logging
from io import BytesIO
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    Write the logic here to perform init operations like caching the model in memory.
    """
    global model
    global tokenizer

    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION).
    # Provide your model's folder name if there is one.
    model_name_or_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "outputs"
    )

    model_kwargs = dict(
        trust_remote_code=True,
        device_map={"": 0},
        torch_dtype="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    logging.info("Loaded model.")

def run(json_data: str):
    logging.info("Request received")
    data = json.loads(json_data)
    input_data = data["input_data"]
    params = data['params']

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    output = pipe(input_data, **params)
    result = output[0]["generated_text"]

    logging.info(f"Generated text: {result}")
    json_result = {"result": str(result)}
    return json_result

Reference: score.py

Step 11c: Create a managed online endpoint and deploy the model to it. Creating an endpoint and deploying your model on Azure Machine Learning is the final step to make your model accessible for real-time inference. This process involves setting up a service that can handle incoming requests, execute the model, and return the results.

Why Create an Endpoint? An endpoint is a network-accessible interface that allows external applications or users to interact with your deployed machine learning model. Creating an endpoint is crucial for the following reasons:
- Accessibility: Endpoints make your model accessible over the internet or within a secured network, enabling other applications, services, or users to send requests and receive responses.
- API Integration: By exposing your model as a RESTful API, endpoints facilitate integration with various applications, allowing seamless communication and data exchange.
- Load Management: An endpoint can manage requests from multiple clients, handling concurrent requests and distributing the load appropriately.
- Security: Endpoints provide mechanisms for authentication and authorization, ensuring that only authorized users can access the model.
- Scalability: Azure-managed endpoints can automatically scale based on demand, ensuring that your model can handle varying workloads without manual intervention.
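Before creating the endpoint and deployment below, note that the deployment references an environment object, env. One way to produce it is to build the Dockerfile from Step 11a into an Azure ML environment asset. The following is a minimal sketch: it reuses the CLOUD_DIR folder layout and the azure_env_name variable from the earlier steps, and the accompanying notebook may construct the environment differently.

from azure.ai.ml.entities import Environment, BuildContext

# Build an environment asset from the Dockerfile written to {CLOUD_DIR}/serve in Step 11a.
# The name reuses azure_env_name from earlier; adjust it if your serving environment
# is registered under a different name.
env_docker = Environment(
    name=azure_env_name,
    build=BuildContext(path=f"{CLOUD_DIR}/serve"),
    description="Serving environment for the fine-tuned function-calling model",
)
env = ml_client.environments.create_or_update(env_docker)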
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

azure_endpoint_name = d['serve']['azure_endpoint_name']

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(azure_endpoint_name)
    print("---Endpoint already exists---")
except:
    # Create an online endpoint if it doesn't exist
    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=azure_endpoint_name,
        description=f"Test endpoint for {model.name}",
    )

# Trigger the endpoint creation
try:
    ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err

Why Deploy a Model? Deployment is the process of moving your trained machine learning model from a development environment to a production environment where it can serve real-time predictions. Deployment is critical because:
- Operationalization: Deployment operationalizes your model, moving it from an experimental or development phase to a live environment where it can deliver value to end users or systems.
- Resource Allocation: Deploying a model involves configuring the necessary compute resources (such as CPU, memory, and GPUs) to ensure optimal performance during inference.
- Environment Consistency: During deployment, the model is packaged with its dependencies in a consistent environment, ensuring reproducibility and minimizing discrepancies between development and production.
- Monitoring and Maintenance: Deployment sets up the infrastructure to monitor the model's performance, usage, and health, allowing for ongoing maintenance and updates.
- Version Control: Deployment allows you to manage and update different versions of your model, providing flexibility to roll back or switch to newer versions as needed.

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

azure_deployment_name = f"{d['serve']['azure_deployment_name']}-v1"

deployment = ManagedOnlineDeployment(
    name=azure_deployment_name,
    endpoint_name=azure_endpoint_name,
    model=model,
    instance_type=azure_compute_cluster_size,
    instance_count=1,
    #code_configuration=code_configuration,
    environment=env,
    scoring_script="score.py",
    code_path=f"./{CLOUD_DIR}/inference",
    #environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=20,
        request_timeout_ms=90000,
        max_queue_wait_ms=60000,
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err

endpoint.traffic = {azure_deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)

Step 12: Run Inference on Sample Data. Test the deployed model using sample data that expects function calls:

import json
import os

sample = {
    "input_data": [
        {'role': 'system', 'content': 'You are an helpful assistant who has access to the following functions to help the user, you can use the functions if needed- { "name": "calculate_shipping_cost", "description": "Calculate the cost of shipping a package", "parameters": { "type": "object", "properties": { "weight": { "type": "number", "description": "The weight of the package in pounds" }, "destination": { "type": "string", "description": "The destination of the package" } }, "required": [ "weight", "destination" ] }}}"'},
        {'role': 'user', 'content': 'Can you help me with shipping cost for a package?'},
        {'role': 'assistant', 'content': 'Sure! I can help you with that. Please provide me with the weight and destination of the package.'},
        {'role': 'user', 'content': 'The weight of the package is 10 pounds and the destination is New York.'}
    ],
    "params": {
        "temperature": 0.1,
        "max_new_tokens": 512,
        "do_sample": True,
        "return_full_text": False
    }
}

# Dump the sample data into a JSON file
# (request_file is a local JSON file path defined earlier in the notebook, e.g. "request.json")
with open(request_file, "w") as f:
    json.dump(sample, f)

result = ml_client.online_endpoints.invoke(
    endpoint_name=azure_endpoint_name,
    deployment_name=azure_deployment_name,
    request_file=request_file
)

result_json = json.loads(result)
result = result_json['result']
print(result)

Step 13: Compare with the Base Model. Now, let's run the same sample through the base model to observe the difference in performance. While the fine-tuned model does a perfect job of generating a response with the right function and arguments, the base model struggles to produce the desired output.

Step 14: Rerun the fine-tuning job with function descriptions removed from the system message. Now, let's rerun the experiment, but this time we will drop the function descriptions from the dataset to optimize context length.

def remove_desc_from_prompts(data):
    system_message = data['system']
    pattern = r'"description":\s*"[^"]*",?\n?'

    # Remove the "description" fields
    cleaned_string = re.sub(pattern, '"description":"",', system_message)
    return cleaned_string

## Update the system message by removing function descriptions and argument descriptions
train_dataset = train_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
test_dataset = test_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
val_dataset = val_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])

train_dataset.save_to_disk(f"{DATA_DIR}/train")
test_dataset.save_to_disk(f"{DATA_DIR}/test")
val_dataset.save_to_disk(f"{DATA_DIR}/val")

Reference: preprocess.py

As the results show, removing the function descriptions doesn't degrade model performance; instead, this fine-tuned version requires fewer input tokens, resulting in a significant reduction in token consumption and improved latency.

Step 15: Further Exploration. Consider removing the arguments, or even the function definitions themselves, in subsequent experiments to evaluate the impact on performance.
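One quick way to quantify the impact of each such ablation before committing to a full fine-tuning run is to measure prompt length directly with the tokenizer. The following is a minimal sketch: it assumes the tokenizer from the training run is in scope and that sample is a single record taken from the dataset before the Step 14 map removed the system column.

def count_tokens(text: str) -> int:
    # Number of input tokens the model would consume for this text
    return len(tokenizer(text)["input_ids"])

original = sample["system"]                 # system message with full function descriptions
trimmed = remove_desc_from_prompts(sample)  # same message with the descriptions stripped

saved = count_tokens(original) - count_tokens(trimmed)
print(f"Original: {count_tokens(original)} tokens, trimmed: {count_tokens(trimmed)} tokens, saved: {saved}")

The same comparison can be repeated after removing arguments or entire function definitions to decide which variant offers the best trade-off between token savings and function-calling accuracy.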
Conclusion
This blog post has walked through the process of fine-tuning an SLM for function calling on Azure Machine Learning. By following these steps, you can effectively tailor a model to meet specific functional requirements. You can access the full code here. For a deeper dive into evaluating fine-tuned models, including metrics and code samples, check out the next blog post. By leveraging Azure's powerful tools, you can streamline the development and deployment of machine learning models, making them more efficient and effective for your specific tasks.

References:
Fine tuning for function calling | OpenAI Cookbook
Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn
michaelnny/Llama3-FunctionCalling: Fine-tune Llama3 model to support function calling
Fine Tuning LLMs for Function Calling w/Pawel Garbacki - YouTube
slm-innovator-lab/2_slm-fine-tuning-mlstudio at main · Azure/slm-innovator-lab
1.6KViews1like1Comment
Automate Quota Discovery in Azure AI Foundry: A Tale of 3 APIs
Automate the discovery of Azure regions that meet your AI deployment needs using three essential APIs: the Models API, the Usages API, and the Locations API. This process helps reduce decision fatigue and ensures compliance with enterprise-wide model deployment standards.
Key learnings:
- Model Deployment Requirements: Understand the needs of a standard Retrieval-Augmented Generation (RAG) application, which involves deploying multiple models.
- Automation Benefits: Streamline your deployment process and ensure compliance with enterprise standards.
- Three Essential APIs:
  - Models API: Query available models for a specific subscription within a chosen location.
  - Usages API: Assess current usages and limits to infer available quotas.
  - Locations API: Obtain a list of all available regions.
A comprehensive Jupyter notebook with the implementation steps is available in the accompanying GitHub repository. This resource is invaluable for AI developers looking to streamline their deployment processes and ensure their applications meet all necessary requirements.
305Views3likes0Comments