# Using Advanced Reasoning Model on EdgeAI Part 2 - Evaluate local models using AITK for VSCode
Quantizing a generative AI model makes it much easier to deploy to edge devices. For development purposes we would like to have 1) quantization, 2) optimization, and 3) evaluation of models all in one tool. We recommend AI Toolkit for Visual Studio Code (AITK) as an all-in-one tool for developers experimenting with GenAI models. It is a lightweight, open-source tool that covers model selection, fine-tuning, deployment, application development, batch data for testing, and evaluation. In Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance, we performed quantization and format conversion for Phi-4 and DeepSeek-R1-Distill-Qwen-1.5B. This blog will evaluate Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU using AITK.

## About model quantization

Quantization is the process of approximating the continuous values of a signal with a finite set of discrete values; it can be understood as a form of information compression. On a computer system this is generally realized as a "low-bit" representation. Quantization is sometimes also called "fixed-point", but strictly speaking fixed-point refers to linear quantization with a power-of-two scale: its representable range is reduced, but it is a more hardware-practical quantization method.

Trained models generally have a large number of parameters, a large amount of computation, high memory usage, and high precision. After quantization we can compress the parameters, increase the model's computation speed, and reduce memory usage, at the cost of the precision lost in the quantization approximation. When using generative AI, this loss shows up as occasional errors and omissions in the results, which is why a quantized model needs to be evaluated.
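To make the precision trade-off concrete, here is a minimal sketch of symmetric linear quantization, the basic idea behind an INT4 conversion. The tensor values and bit width are illustrative only and are not taken from the actual Phi-4 or DeepSeek-R1 conversion pipeline.

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int = 4):
    """Symmetric linear quantization: map floats onto signed integers
    in [-(2**(bits-1) - 1), 2**(bits-1) - 1], then map them back."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4
    scale = np.abs(x).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, q.astype(np.float32) * scale

weights = np.array([0.82, -1.37, 0.05, 2.96, -0.44], dtype=np.float32)
codes, restored = quantize_dequantize(weights)

print("original:", weights)
print("INT4 codes:", codes)
print("restored:", restored)
print("max abs error:", np.abs(weights - restored).max())
```

The restored values are close to the originals but not identical; accumulated across billions of parameters, this rounding error is exactly why a quantized model's outputs need to be re-evaluated.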
## Use AITK to evaluate quantized models

### Model deployment

AITK is an open-source GenAIOps tool based on Visual Studio Code. To evaluate a model, you first need to deploy it, either locally or remotely. Currently, AITK only supports deploying the ONNX models in its Model Catalog (the Phi-3/Phi-3.5 family and Mistral 7B). To test models that are not yet in the Model Catalog, we can expose them through an API that is compatible with AITK's remote-deployment method, which expects an OpenAI chat completion endpoint. We use Python + Flask to create the OpenAI chat completion service; the code is as follows:

```python
from flask import Flask, request, jsonify, stream_with_context, Response
import json
import onnxruntime_genai as og
import uuid
import time

app = Flask(__name__)

# Path to the quantized ONNX model produced in Part 1
model_path = "Your Phi-4-14B-ONNX-INT4-GPU or DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU path"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Your Phi-4 chat template
chat_template = "<|user|>{input}<|assistant|>"

# Your DeepSeek-R1-Distill-Qwen chat template
# chat_template = "<|im_start|> user {input}<|im_end|><|im_start|> assistant"


@app.route('/v1/chat/completions', methods=['POST'])
def ort_chat():
    data = request.get_json()

    messages = data.get("messages")
    if not messages or not isinstance(messages, list):
        return jsonify({
            "error": {
                "message": "Invalid request. 'messages' is required and must be a list of strings.",
                "type": "invalid_request_error",
                "param": "messages",
                "code": None
            }
        }), 400

    # Take the latest user turn as the prompt
    prompt = ''
    last_message = messages[-1]
    if last_message.get("role") == "user" and last_message.get("content"):
        prompt = last_message['content']

    search_options = {
        'max_length': data.get("max_tokens"),
        'temperature': data.get("temperature"),
        'top_p': data.get("top_p"),
        'past_present_share_buffer': False,
    }

    # Format the extracted user prompt with the chat template
    prompt = chat_template.format(input=prompt)
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    # Generate the full reply token by token
    reply = ""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        reply += tokenizer_stream.decode(new_token)

    if data.get("stream"):
        # Simulated streaming: the reply is already complete and is
        # re-chunked word by word as server-sent events
        def generate():
            for token in reply.split():
                chunk = {
                    "id": "chatcmpl-" + str(uuid.uuid4()),
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": data.get("model"),
                    "choices": [{
                        "delta": {"content": token + " "},
                        "index": 0,
                        "finish_reason": None
                    }]
                }
                yield "data: " + json.dumps(chunk) + "\n\n"
                time.sleep(0.5)
            yield "data: [DONE]\n\n"

        return Response(stream_with_context(generate()), mimetype="text/event-stream")

    # Rough word-based counts, not true token counts
    completion_tokens = len(reply.split())
    response_data = {
        "id": "chatcmpl-" + str(uuid.uuid4()),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": len(messages),
            "completion_tokens": completion_tokens,
            "total_tokens": len(messages) + completion_tokens
        }
    }
    return jsonify(response_data)


if __name__ == '__main__':
    app.run(debug=True, port=5000)
```

After creating the service, go to AITK's MY MODELS > Remote models, select "Add a custom model", enter the API endpoint (http://127.0.0.1:5000/v1/chat/completions), and set a name such as DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU or Phi-4-14B-ONNX-INT4-GPU. We can then test the model in the Playground.

*DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU in the Playground*

*Phi-4-14B-ONNX-INT4-GPU in the Playground*
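Besides the Playground, it can be useful to smoke-test the endpoint directly and confirm it behaves like an OpenAI chat completion service. The following is a minimal client sketch assuming the Flask service above is running on port 5000; the model name is only a label that the service echoes back.

```python
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"

payload = {
    "model": "Phi-4-14B-ONNX-INT4-GPU",  # illustrative label, echoed back by the service
    "messages": [
        {"role": "user", "content": "Explain model quantization in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False,
}

response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```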
### Batch data

We can use AITK's Bulk Data feature to set up an execution environment for batch data; here I simulate a run over 10 items. Running the items in batch lets us inspect the result for each one. This is an important step in the evaluation, because we need the model's answers to a specific set of questions before we can score them. After the run, we can export the results.

### Evaluation

The exported batch results can now be evaluated. AITK supports different evaluation methods: create an evaluation by selecting Evaluation under Tools and importing the data exported from Bulk Data, then choose an evaluator. We select F1 to complete the evaluation setup. Let's take a look at the evaluation results for DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU.
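AITK computes the F1 score for you, but it helps to know what the metric measures. A common formulation for generated text is token-overlap F1, as popularized by SQuAD-style QA evaluation; the sketch below is a simplified illustration of that idea, not AITK's actual implementation.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Identical token multisets in a different order still score 1.0
print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))
```

A score near 1.0 means the quantized model's answer closely matches the reference; consistently low scores across the batch run are a signal that quantization has degraded the model too far.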
## Summary

The AI Toolkit (AITK) enables us to assess various models, which is instrumental for our edge deployments. As Responsible AI gains importance, continuous testing and evaluation after deployment become essential. In this context, AITK serves as a crucial tool for our daily operations.

## Resources

- Learn about AI Toolkit for Visual Studio Code: https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio
- DeepSeek-R1 in GitHub Models: https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry: https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B on Hugging Face: https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive: https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai

# Evaluate Fine-tuned Phi-3 / 3.5 Models in Azure AI Studio Focusing on Microsoft's Responsible AI

Fine-tuning a model can sometimes lead to unintended or undesired responses. To ensure that the model remains safe and effective, it is important to evaluate its potential to generate harmful content as well as its ability to produce accurate, relevant, and coherent responses. In this tutorial, you will learn how to evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model integrated with Prompt flow in Azure AI Studio. Before beginning the technical steps, it is essential to understand Microsoft's Responsible AI Principles, an ethical framework designed to guide the responsible development, deployment, and operation of AI systems, ensuring that AI technologies are built in a way that is fair, transparent, and inclusive. These principles are the foundation for evaluating the safety of AI models.