We are excited to announce that the Falcon family of state-of-the-art language models, created by the Technology Innovation Institute (TII) and hosted on the Hugging Face Hub, is now available in the Model Catalog on the Azure Machine Learning platform. This exciting addition is a result of the partnership between Microsoft and Hugging Face. With its flagship Transformers library and a Model Hub of over 200,000 open-source models, Hugging Face has become the go-to source for state-of-the-art pre-trained models and tools in the NLP space.
Falcon is a new family of language models comprising two base models: Falcon-40B and Falcon-7B. Falcon-40B tops the charts of the Open LLM Leaderboard, while Falcon-7B is the best in its weight class. The Falcon family also includes instruct versions of the models, Falcon-7B-Instruct and Falcon-40B-Instruct, which are fine-tuned on instruction and conversational data, making them better suited for assistant-style tasks.
To find and deploy Falcon models in Azure Machine Learning, follow these simple steps:
- Log in to your workspace in AzureML Studio.
- Go to the Model Catalog.
- Click ‘View models’ on the Falcon announcement card or search for tiiuae-falcon.
- Open a model. You will find a link to the original model card, where you can review detailed information about the model such as the training approach, limitations, and bias.
- Click Deploy, select the deployment template and instance type, and click Deploy. The deployment will take 10-15 minutes. Once done, you can test the model or find the REST API to send scoring requests from your application, as sketched below.
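Below is a minimal sketch of a scoring request against a deployed Falcon endpoint. The scoring URI, API key, and request payload shown here are placeholders and assumptions; copy the real values and the exact request format from the endpoint's Consume and Test tabs in AzureML Studio.

```python
import json
import urllib.request

# Placeholder values -- copy the real scoring URI and key from the endpoint's
# "Consume" tab in AzureML Studio after the deployment succeeds.
SCORING_URI = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
API_KEY = "<your-endpoint-key>"

# Payload shape assumed from the Hugging Face text-generation schema;
# check the endpoint's "Test" tab for the exact format it expects.
payload = {
    "inputs": "Write a short poem about the sea.",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7},
}

request = urllib.request.Request(
    SCORING_URI,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```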
Follow this notebook sample to deploy using the Python SDK: https://aka.ms/hugging-face-text-generation-streaming-online-endpoint. These models need NVIDIA A100 GPUs to run, so you will need quota for one of the following Azure VM instance types that have the A100 GPU: "Standard_NC48ads_A100_v4", "Standard_NC96ads_A100_v4", "Standard_ND96asr_v4" or "Standard_ND96amsr_A100_v4".
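For reference, here is a minimal sketch of such a deployment using the azure-ai-ml Python SDK. The workspace identifiers, endpoint name, and registry model ID below are illustrative placeholders; the linked notebook sample is the authoritative walkthrough.

```python
# pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

# Connect to your workspace (placeholder identifiers).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Illustrative registry model ID -- copy the exact ID and version from the
# model's page in the Model Catalog.
model_id = "azureml://registries/HuggingFace/models/tiiuae-falcon-7b/labels/latest"

# Create a managed online endpoint (the name must be unique in the region).
endpoint = ManagedOnlineEndpoint(name="falcon-7b-demo", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the model on an A100 instance type you have quota for.
deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC48ads_A100_v4",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```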
The Falcon models from the Hugging Face collection are deployed with the Text Generation Inference (TGI) container developed by Hugging Face. This production-ready container offers continuous batching, token streaming using Server-Sent Events (SSE), tensor parallelism for faster inference on multiple GPUs, and optimized transformers code using custom CUDA kernels.
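To illustrate token streaming, the following sketch consumes the Server-Sent Events stream from a deployed endpoint using the requests library. The payload fields and the per-token event schema are assumptions based on Text Generation Inference's streaming format; confirm the exact schema against your endpoint's documentation.

```python
import json
import requests

# Placeholder scoring URI and key -- take the real values from the endpoint's "Consume" tab.
SCORING_URI = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
API_KEY = "<your-endpoint-key>"

payload = {
    "inputs": "Explain continuous batching in one paragraph.",
    "parameters": {"max_new_tokens": 200},
    "stream": True,  # assumed flag for requesting token streaming; check the endpoint schema
}

with requests.post(
    SCORING_URI,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    stream=True,
) as response:
    response.raise_for_status()
    # SSE events arrive as lines prefixed with "data:", one JSON event per generated token.
    for line in response.iter_lines():
        if line and line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event.get("token", {}).get("text", ""), end="", flush=True)
```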
It's important to use state-of-the-art language models such as Falcon responsibly to mitigate the potential harm that such models can cause. Azure AI Content Safety is an Azure service that helps you detect harmful user-generated and AI-generated content in applications and services. Using the Text Moderation API through REST or the client SDKs, you can scan user input and model-generated responses for sexual content, violence, hate, and self-harm, with multiple severity levels. Learn more: https://learn.microsoft.com/en-us/azure/cognitive-services/content-safety/overview.
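As an illustration, the following sketch uses the Azure AI Content Safety Python client (azure-ai-contentsafety) to screen text generated by a model. The resource endpoint and key are placeholders, and the response attribute names may differ slightly between SDK versions.

```python
# pip install azure-ai-contentsafety
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# Placeholder Content Safety resource endpoint and key.
client = ContentSafetyClient(
    endpoint="https://<your-content-safety-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-content-safety-key>"),
)

# Screen a model response before returning it to the user.
model_output = "..."  # text generated by the Falcon endpoint
result = client.analyze_text(AnalyzeTextOptions(text=model_output))

# Each category (hate, sexual, violence, self-harm) comes back with a severity level.
for category in result.categories_analysis:
    print(category.category, category.severity)
```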
The addition of Falcon models to the Model Catalog in Azure Machine Learning showcases the power of collaboration between Microsoft and Hugging Face, enabling users to harness the latest advancements in generative AI responsibly, leveraging Azure AI functionality such as Prompt flow together with Azure AI Content Safety. We are excited to see what incredible applications and solutions our users will create using these state-of-the-art language models.
Resources to get started
- Get an Azure free account and set up your AzureML workspace.
- Explore the model catalog in AzureML studio and deploy models.
- Review the documentation to learn how to programmatically deploy using the AzureML Python SDK or CLI, find options for support and understand how the model catalog is populated.
Thank you,
Manoj Bableshwar
Principal Product Manager - Azure Machine Learning