Quantizing a generative AI model eases deployment to edge devices. For development, we would like a single tool that covers 1) quantization, 2) optimization, and 3) evaluation of models. We recommend the AI Toolkit for Visual Studio Code (AITK) as an all-in-one tool for developers experimenting with GenAI models. It is a lightweight open-source tool that covers model selection, fine-tuning, deployment, application development, batch data for testing, and evaluation. In Using Advanced Reasoning Model on EdgeAI Part 1 - Quantization, Conversion, Performance, we performed quantization and format conversion for Phi-4 and DeepSeek-R1-Distill-Qwen-1.5B. This blog will evaluate Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU using AITK.
About model quantization
Quantization is the process of approximating the continuous values of a signal with a finite set of discrete values; it can be understood as a form of information compression. On a computer system, it usually means representing values with low-bit integers. Quantization is sometimes called "fixed-point", but strictly speaking fixed-point is a special case: linear quantization with a scale that is a power of 2, which happens to be a very practical scheme. Trained models typically have a large number of parameters, heavy computation, high memory usage, and high precision. After quantization, we can compress the parameters, speed up inference, and reduce memory usage, but the approximation reduces precision. In generative AI, this shows up as occasional errors and omissions in the results, so we need to evaluate the quantized model.
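To make the idea concrete, here is a minimal sketch of symmetric linear quantization to INT4 using NumPy. This illustrates the general technique only; it is not the exact algorithm that Olive or ONNX Runtime applies:

import numpy as np

def quantize_int4_symmetric(weights: np.ndarray):
    """Symmetric linear quantization of float weights to signed 4-bit integers."""
    qmax = 7  # signed INT4 range is [-8, 7]
    scale = np.abs(weights).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integers back to floats; the gap to the original is the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)
print("max quantization error:", np.abs(w - w_hat).max())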
Use AITK to evaluate quantized models
- Model deployment
AITK is an open-source GenAIOps tool built on Visual Studio Code. To evaluate a model, you first need to deploy it, either locally or remotely. Currently, AITK only supports deploying the ONNX models in its Model Catalog (the Phi-3/Phi-3.5 family and Mistral 7B).
To test models that are not yet in the Model Catalog, we can expose them through an API that AITK can consume: the remote deployment option expects an OpenAI Chat Completions-compatible endpoint.
We use Python + Flask to create the OpenAI Chat Completions service; the code is as follows:
from flask import Flask, request, jsonify, stream_with_context, Response
import json
import onnxruntime_genai as og
import uuid, time

app = Flask(__name__)

model_path = "Your Phi-4-14B-ONNX-INT4-GPU or DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU path"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Your Phi-4 chat template
chat_template = "<|user|>{input}<|assistant|>"

# Your DeepSeek-R1-Distill-Qwen chat template
# chat_template = "<|im_start|> user {input}<|im_end|><|im_start|> assistant"

@app.route('/v1/chat/completions', methods=['POST'])
def ort_chat():
    data = request.get_json()
    messages = data.get("messages")
    if not messages or not isinstance(messages, list):
        return jsonify({
            "error": {
                "message": "Invalid request. 'messages' is required and must be a list.",
                "type": "invalid_request_error",
                "param": "messages",
                "code": None
            }
        }), 400

    # Use the content of the last user message as the prompt
    prompt = ''
    last_message = messages[-1]
    if last_message.get("role") == "user" and last_message.get("content"):
        prompt = last_message['content']

    # Fall back to defaults when the client omits a sampling parameter
    search_options = {
        'max_length': data.get("max_tokens") or 4096,  # og max_length covers prompt + completion
        'temperature': data.get("temperature") or 1.0,
        'top_p': data.get("top_p") or 1.0,
        'past_present_share_buffer': False
    }

    # Wrap the user prompt in the model's chat template (not the raw messages list)
    prompt = chat_template.format(input=prompt)
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    # Generate the full reply token by token
    reply = ""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        reply += tokenizer_stream.decode(new_token)

    if data.get("stream"):
        # Simulated streaming: the reply is already complete and is
        # re-sent word by word as OpenAI-style server-sent events
        def generate():
            for token in reply.split():
                chunk = {
                    "id": "chatcmpl-" + str(uuid.uuid4()),
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": data.get("model"),
                    "choices": [
                        {
                            "delta": {"content": token + " "},
                            "index": 0,
                            "finish_reason": None
                        }
                    ]
                }
                yield "data: " + json.dumps(chunk) + "\n\n"
                time.sleep(0.5)
            yield "data: [DONE]\n\n"
        return Response(stream_with_context(generate()), mimetype="text/event-stream")

    response_data = {
        "id": "chatcmpl-" + str(uuid.uuid4()),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model"),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": reply
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": len(input_tokens),
            "completion_tokens": len(reply.split()),
            "total_tokens": len(input_tokens) + len(reply.split())
        }
    }
    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=True, port=5000)
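Once the service is running, it can be verified with a small client before wiring it into AITK. Here is a minimal check using the requests library (the model name is only a label, since the service always serves the locally loaded model):

import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "model": "Phi-4-14B-ONNX-INT4-GPU",   # label only
        "messages": [{"role": "user", "content": "Explain model quantization in one sentence."}],
        "max_tokens": 256,
        "temperature": 0.6,
        "top_p": 0.9
    }
)
print(resp.json()["choices"][0]["message"]["content"])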
After the service is created, open AITK's MY MODELS > Remote models:
- select "Add a custom model"
- enter the API endpoint (http://127.0.0.1:5000/v1/chat/completions)
- set a name, such as DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU or Phi-4-14B-ONNX-INT4-GPU

We can then test the model in the Playground.
DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU Playground
Phi-4-14B-ONNX-INT4-GPU Playground
- Bulk Data
We can use AITK's Bulk Data feature to set up an execution environment for batch data. Here we simulate a run over 10 items; executing them as a batch lets us inspect all the results in one place. This is an important step in the evaluation, because we need the model's responses to a specific set of questions before we can evaluate them. After the run, we can export the results.
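AITK drives the batch run for you, but if you want to collect raw outputs outside the tool, the same loop is easy to reproduce against the service above. A minimal sketch, assuming the ten placeholder questions are replaced with your own test prompts:

import json
import requests

questions = [f"Question {i}: ..." for i in range(10)]  # replace with your test prompts

with open("batch_results.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        resp = requests.post(
            "http://127.0.0.1:5000/v1/chat/completions",
            json={"messages": [{"role": "user", "content": q}], "max_tokens": 512},
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        f.write(json.dumps({"query": q, "response": answer}) + "\n")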
- Evaluation
You can evaluate the exported batch data results. AITK supports different evaluation methods.
Create an evaluation by selecting Evaluation from Tools and importing the data exported from Bulk Data. You can choose different evaluation methods.
We select F1 to complete the evaluation setup. Let's take a look at the evaluation results of DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU.
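For reference, the F1 metric here is the token-overlap score commonly used in question-answering evaluation: precision is the fraction of generated tokens that appear in the reference answer, recall is the fraction of reference tokens that are covered, and F1 is their harmonic mean. A minimal sketch of the computation (an illustration; AITK's built-in evaluator may normalize text differently):

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens that occur in both, respecting multiplicity
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))  # ≈ 0.67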
Summary
The AI Toolkit (AITK) enables us to assess various models, which is instrumental for our edge deployments. As Responsible AI gains importance, continuous testing and evaluation post-deployment become essential. In this context, AITK serves as a crucial tool for our daily operations.
Resources
- Learn about AI Toolkit for Visual Studio Code https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio
- DeepSeek-R1 in GitHub Models https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B in Hugging Face https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI https://github.com/microsoft/onnxruntime-genai