DeepSeek-R1 has become very popular, offering advanced reasoning capabilities comparable to OpenAI o1. Microsoft has added the DeepSeek-R1 models to Azure AI Foundry and GitHub Models, and we can compare DeepSeek-R1 with other available models through the GitHub Models Playground.
Note: This series revolves around the deployment of SLMs to edge devices ("Edge AI"). In it we will focus on deploying advanced reasoning models across different application scenarios. You can learn more in the AI Tour session BRK453.
In this experiment we want to deploy advanced reasoning models to the edge, so that they can run on edge devices with limited computing power and in offline environments. At this time, the recommendation is to use the traditional ONNX model format, and we can use Microsoft Olive to convert the DeepSeek-R1 Distill models. Getting started with Microsoft Olive is very straightforward: install the Microsoft Olive library from the command line (Python 3.10+ is recommended).
pip install olive-ai
The DeepSeek-R1 Distill model series comes in different parameter sizes, such as 1.5B, 7B, 8B, 14B, 32B, and 70B. This article mainly uses the 1.5B, 7B, and 14B models (that is, Small Language Models).
CPU Inference
Let's start with the 1.5B and 7B models, which have lower parameter counts. We can run inference directly on the CPU to test the results (hardware environment: Azure DevBox, AMD EPYC 7763 64-Core + 64 GB memory + 2 TB SSD).
Quantization conversion
olive auto-opt --model_name_or_path <Your DeepSeek-R1-Distill-Qwen-1.5B/7B local location> --output_path <Your Convert ONNX INT4 Model local location> --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
You can download the converted model directly from my Hugging Face repo. (Note: this model is provided for testing only; it has not been fully evaluated with AI Content Safety and is not an official model.)
Running with ONNX Runtime GenAI
Install ONNX Runtime GenAI and ONNX Runtime CPU support libraries
pip install onnxruntime-genai
pip install onnxruntime
Sample Code
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-1.5b.ipynb
https://github.com/kinfey/EdgeAIForAdvancedReasoning/blob/main/notebook/demo-7b.ipynb
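If you just want to see the flow without opening the notebooks, the following is a minimal sketch of running one of the converted models with the ONNX Runtime GenAI Python API. The model folder path and the prompt markers are assumptions for illustration, and the token-feeding call targets onnxruntime-genai 0.5+ (generator.append_tokens); older versions use a slightly different API.

import onnxruntime_genai as og

# Hypothetical path to the folder produced by the olive auto-opt conversion above
model_path = "./deepseek-r1-distill-qwen-1.5b-onnx-int4-cpu"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# The chat markers below follow the DeepSeek-R1 Distill chat template;
# check your checkpoint's tokenizer config if the output looks off.
prompt = "<｜User｜>explain 1+1=2<｜Assistant｜>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Generate token by token and print the decoded text as it arrives
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)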
Performance comparison 1.5B vs 7B
We compare the two models on two different inference scenarios, recording the memory occupied, the time consumed, and the number of tokens generated for each (a sketch of how these metrics can be collected follows the list):

1. Explain 1+1=2
- 1.5B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
2. Find all pairwise non-isomorphic groups of order 147 that have no elements of order 49
- 1.5B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
- 7B quantized ONNX model: memory occupied, time consumed, and number of tokens generated
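The sketch below shows one way to collect these metrics (not necessarily the exact script behind the blog's numbers): process resident memory from psutil, wall-clock time from time.perf_counter, and the token count from the generation loop itself. The model folders and prompt are hypothetical placeholders, and psutil must be installed separately (pip install psutil).

import time
import psutil
import onnxruntime_genai as og

def benchmark(model_path: str, prompt: str, max_length: int = 4096) -> dict:
    # Measure process memory, wall-clock generation time, and generated token count
    process = psutil.Process()
    model = og.Model(model_path)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    start = time.perf_counter()
    n_tokens = 0
    while not generator.is_done():
        generator.generate_next_token()
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return {
        "memory_gb": process.memory_info().rss / 1024**3,  # resident set size after generation
        "seconds": round(elapsed, 2),
        "tokens": n_tokens,
        "tokens_per_second": round(n_tokens / elapsed, 2),
    }

# Hypothetical local model folders produced by the olive auto-opt step above
print(benchmark("./deepseek-r1-1.5b-onnx-int4-cpu", "<｜User｜>explain 1+1=2<｜Assistant｜>"))
print(benchmark("./deepseek-r1-7b-onnx-int4-cpu", "<｜User｜>explain 1+1=2<｜Assistant｜>"))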
From this test we can see that the DeepSeek 1.5B model is better suited to CPU inference and can be deployed on traditional PCs or IoT devices. The 7B model, although it produces better reasoning results, does not perform well when run on the CPU.
GPU Inference
It is ideal if the edge device has a GPU. With Microsoft Olive we can quantize and convert a model to ONNX not only for CPU inference but also for GPU inference. Here I take the 14B DeepSeek-R1-Distill-Qwen-14B as an example and compare its inference with Microsoft's Phi-4-14B.
Quantization conversion
olive auto-opt --model_name_or_path <Your Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path > --output_path <Your converted Phi-4-14B or DeepSeek-R1-Distill-Qwen-14B local path > --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
You can download the converted models directly from my Hugging Face repo. (Note: these models are provided for testing only; they have not been fully evaluated with AI Content Safety and are not official models.)
Running with ONNX Runtime GenAI CUDA
Install ONNX Runtime GenAI and ONNX Runtime GPU support libraries
pip install onnxruntime-genai-cuda
pip install onnxruntime-gpu
Compare the results in the GPU environment with Gradio
It is recommended to use a GPU with more than 8 GB of memory.
To broaden the comparison, we run the same prompts against Phi-4-14B-ONNX-INT4-GPU and DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU to see how the results differ. We also include OpenAI o1-mini (it is recommended to access o1-mini through GitHub Models).
You can enter any prompt in Gradio to compare the results of Phi-4-14B-ONNX-INT4-GPU, DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU, and OpenAI o1-mini.
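As a rough illustration, a side-by-side comparison UI can be built in a few lines of Gradio. The sketch below loads the two local ONNX INT4 models and shows both answers for the same prompt; the folder paths are hypothetical, loading both 14B models at once needs considerably more GPU memory than the 8 GB minimum mentioned above, and o1-mini is left out since calling it requires a GitHub Models endpoint and token.

import gradio as gr
import onnxruntime_genai as og

# Hypothetical local folders produced by the olive auto-opt GPU conversion
MODEL_PATHS = {
    "Phi-4-14B-ONNX-INT4-GPU": "./phi-4-14b-onnx-int4-gpu",
    "DeepSeek-R1-Distill-Qwen-14B-ONNX-INT4-GPU": "./deepseek-r1-14b-onnx-int4-gpu",
}

# Load both models once at startup (each INT4 14B model needs several GB of GPU memory)
models = {name: og.Model(path) for name, path in MODEL_PATHS.items()}
tokenizers = {name: og.Tokenizer(model) for name, model in models.items()}

def generate(name, prompt, max_length=2048):
    # For brevity the model-specific chat templates are not applied here;
    # add them (DeepSeek-R1 and Phi-4 use different formats) for best results.
    model, tokenizer = models[name], tokenizers[name]
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    stream = tokenizer.create_stream()
    text = ""
    while not generator.is_done():
        generator.generate_next_token()
        text += stream.decode(generator.get_next_tokens()[0])
    return text

def compare(prompt):
    # Run the same prompt through both models sequentially
    return tuple(generate(name, prompt) for name in MODEL_PATHS)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Prompt"),
    outputs=[gr.Textbox(label=name) for name in MODEL_PATHS],
    title="Phi-4-14B vs DeepSeek-R1-Distill-Qwen-14B (ONNX INT4, GPU)",
)
demo.launch()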
DeepSeek-R1 reduces the cost of reasoning models and produces more instructive results on specialized problems, while Phi-4-14B also has strong reasoning capabilities and completes inference with lower computing power. As for OpenAI o1-mini, it is more comprehensive and can handle a wider range of problems. If you want to deploy to edge devices, Phi-4-14B and the quantized DeepSeek-R1 are good choices.
This blog is just a simple test and the first in this series. Please share your feedback and continue the discussion in the Microsoft AI Discord Channel. Feel free to send me a message or leave a comment. We look forward to sharing more about the opportunities of Edge AI and more content in this series.
Resources
- DeepSeek-R1 in GitHub Models https://github.com/marketplace/models/azureml-deepseek/DeepSeek-R1
- DeepSeek-R1 in Azure AI Foundry https://ai.azure.com/explore/models/DeepSeek-R1/version/1/registry/azureml-deepseek
- Phi-4-14B in Hugging Face https://huggingface.co/microsoft/phi-4
- Learn about Microsoft Olive https://github.com/microsoft/olive
- Learn about ONNX Runtime GenAI https://github.com/microsoft/onnxruntime-genai
- Microsoft AI Discord Channel
- BRK453 Exploring cutting-edge models: LLMs, SLMs, local development and more https://aka.ms/aitour/brk453