Quantizing a generative AI model eases deployment to edge devices. For development, we would like a single tool that covers 1) quantization, 2) optimization, and 3) evaluation of models. We recommend the AI Toolkit for Visual Studio Code (AITK) as an all-in-one tool for developers experimenting with GenAI models. It is a lightweight open-source tool that covers model selection, fine-tuning, deployment, application development, batch data for testing, and evaluation. In Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance, we performed quantization and format conversion for Phi-4 and DeepSeek-R1-Distill-Qwen-1.5B. This blog will evaluate Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU using AITK.
About model quantization
Quantization is the process of approximating the continuous values of a signal with a finite set of discrete values; it can be understood as a form of information compression. On a computer system, it usually means representing values with low-bit integers. Quantization is sometimes called "fixed-point", but strictly speaking fixed-point is a special case: linear quantization with a scale that is a power of 2, which happens to be a very practical scheme. Trained models typically have a large number of parameters, heavy computation, high memory usage, and high precision. After quantization, we can compress the parameters, speed up inference, and reduce memory usage, but the approximation reduces precision. In generative AI, this shows up as occasional errors and omissions in the results, so we need to evaluate the quantized model.
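To make the idea concrete, here is a minimal sketch of symmetric linear quantization to INT4 using NumPy. This illustrates the general technique only; it is not the exact algorithm that Olive or ONNX Runtime applies:

import numpy as np

def quantize_int4_symmetric(weights: np.ndarray):
    """Symmetric linear quantization of float weights to signed 4-bit integers."""
    qmax = 7  # signed INT4 range is [-8, 7]
    scale = np.abs(weights).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integers back to floats; the gap to the original is the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)
print("max quantization error:", np.abs(w - w_hat).max())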
Use AITK to evaluate quantized models
- Model deployment
AITK is an open-source GenAIOps tool built on Visual Studio Code. To evaluate a model, you first need to deploy it, either locally or remotely. Currently, AITK only supports deploying the ONNX models in its Model Catalog (the Phi-3/Phi-3.5 family and Mistral 7B).
To test models that are not yet in the Model Catalog, we can expose them through an API that AITK can consume: the remote deployment option expects an OpenAI Chat Completions-compatible endpoint.
We use Python + Flask to create the OpenAI Chat Completions service; the code is as follows:
from flask import Flask, request, jsonify, stream_with_context, Response
import json
import onnxruntime_genai as og
import uuid, time

app = Flask(__name__)

model_path = "Your Phi-4-14B-ONNX-INT4-GPU or DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU path"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Your Phi-4 chat template
chat_template = "<|user|>{input}<|assistant|>"

# Your DeepSeek-R1-Distill-Qwen chat template
# chat_template = "<|im_start|> user {input}<|im_end|><|im_start|> assistant"

@app.route('/v1/chat/completions', methods=['POST'])
def ort_chat():
    data = request.get_json()
    messages = data.get("messages")
    if not messages or not isinstance(messages, list):
        return jsonify({
            "error": {
                "message": "Invalid request. 'messages' is required and must be a list.",
                "type": "invalid_request_error",
                "param": "messages",
                "code": None
            }
        }), 400

    # Use the content of the last user message as the prompt
    prompt = ''
    last_message = messages[-1]
    if last_message.get("role") == "user" and last_message.get("content"):
        prompt = last_message['content']

    # Fall back to defaults when the client omits a sampling parameter
    search_options = {
        'max_length': data.get("max_tokens") or 4096,  # og max_length covers prompt + completion
        'temperature': data.get("temperature") or 1.0,
        'top_p': data.get("top_p") or 1.0,
        'past_present_share_buffer': False
    }

    # Wrap the user prompt in the model's chat template (not the raw messages list)
    prompt = chat_template.format(input=prompt)
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    # Generate the full reply token by token
    reply = ""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        reply += tokenizer_stream.decode(new_token)

    if data.get("stream"):
        # Simulated streaming: the reply is already complete and is
        # re-sent word by word as OpenAI-style server-sent events
        def generate():
            for token in reply.split():
                chunk = {
                    "id": "chatcmpl-" + str(uuid.uuid4()),
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": data.get("model"),
                    "choices": [
                        {
                            "delta": {"content": token + " "},
                            "index": 0,
                            "finish_reason": None
                        }
                    ]
                }
                yield "data: " + json.dumps(chunk) + "\n\n"
                time.sleep(0.5)
            yield "data: [DONE]\n\n"
        return Response(stream_with_context(generate()), mimetype="text/event-stream")

    response_data = {
        "id": "chatcmpl-" + str(uuid.uuid4()),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model"),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": reply
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": len(input_tokens),
            "completion_tokens": len(reply.split()),
            "total_tokens": len(input_tokens) + len(reply.split())
        }
    }
    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=True, port=5000)
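Once the service is running, it can be verified with a small client before wiring it into AITK. Here is a minimal check using the requests library (the model name is only a label, since the service always serves the locally loaded model):

import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "model": "Phi-4-14B-ONNX-INT4-GPU",   # label only
        "messages": [{"role": "user", "content": "Explain model quantization in one sentence."}],
        "max_tokens": 256,
        "temperature": 0.6,
        "top_p": 0.9
    }
)
print(resp.json()["choices"][0]["message"]["content"])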
After the service is created, open AITK's MY MODELS > Remote models:
- select "Add a custom model"
- enter the API endpoint (http://127.0.0.1:5000/v1/chat/completions)
- set a name, such as DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU or Phi-4-14B-ONNX-INT4-GPU

We can then test the model in the Playground.
DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU Playground
Phi-4-14B-ONNX-INT4-GPU Playground
- Bulk Data
We can use AITK's Bulk Data feature to set up an execution environment for batch data. Here we simulate a run over 10 items; executing them as a batch lets us inspect all the results in one place. This is an important step in the evaluation, because we need the model's responses to a specific set of questions before we can evaluate them. After the run, we can export the results.
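AITK drives the batch run for you, but if you want to collect raw outputs outside the tool, the same loop is easy to reproduce against the service above. A minimal sketch, assuming the ten placeholder questions are replaced with your own test prompts:

import json
import requests

questions = [f"Question {i}: ..." for i in range(10)]  # replace with your test prompts

with open("batch_results.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        resp = requests.post(
            "http://127.0.0.1:5000/v1/chat/completions",
            json={"messages": [{"role": "user", "content": q}], "max_tokens": 512},
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        f.write(json.dumps({"query": q, "response": answer}) + "\n")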
- Evaluation
You can evaluate the exported batch data results. AITK supports different evaluation methods.
Create an evaluation by selecting Evaluation from Tools and importing the data exported from Bulk Data. You can choose different evaluation methods.
We select F1 to complete the evaluation setup. Let's take a look at the evaluation results of DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU.
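For reference, the F1 metric here is the token-overlap score commonly used in question-answering evaluation: precision is the fraction of generated tokens that appear in the reference answer, recall is the fraction of reference tokens that are covered, and F1 is their harmonic mean. A minimal sketch of the computation (an illustration; AITK's built-in evaluator may normalize text differently):

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens that occur in both, respecting multiplicity
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))  # ≈ 0.67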
Summary
The AI Toolkit (AITK) enables us to assess various models, which is instrumental for our edge deployments. As Responsible AI gains importance, continuous testing and evaluation post-deployment become essential. In this context, AITK serves as a crucial tool for our daily operations.
Resources
- Learn about AI Toolkit for Visual Studio Code https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio
- DeepSeek-R1 in GitHub Models https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B in Hugging Face https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI https://github.com/microsoft/onnxruntime-genai