The Microsoft Phi-3 SLM (small language model) is an AI model that you can run locally or in applications hosted almost anywhere. I've been exploring the world of hosting AI models in code lately, and this stuff is getting really interesting. It turns out that among the tools to run AI models locally, there are developer tools you can use to build your own generative AI apps. I previously wrote about running Phi-3 locally using Ollama, so I thought I'd explore how to build your own generative AI apps using Phi-3 too. This article goes through the steps to use C# and the ONNX Runtime to build your own generative AI app using Microsoft's Phi-3 generative AI model.

Let’s build our own local, generative AI app with C# and Phi-3!

What is Microsoft Phi-3 SLM?

Phi-3 is the latest version of the “Phi” SLM (small language model) from Microsoft. This new version represents a significant leap forward in the quest for smaller yet powerful language models. This SLM offers a good balance between capabilities and performance, and it’s cost-effective to use in scenarios where there aren’t a lot of computing resources to run a larger model.

The Phi-3 model can be run just about anywhere, from your local computer to the cloud with Microsoft Azure. While running in the cloud with fewer resources is good, it's the ability to run the SLM locally that's really the intriguing part.

Generative AI models require compute and memory resources on the machine they are run on. Large language models like OpenAI’s GPT-4 are too big to run on your local computer. This is the reason Microsoft developed the Phi-3 SLM. With small language models like Phi-3, you don’t need such large computers to run Generative AI models on local devices.

Additionally, Microsoft has released different sizes of Phi-3 to target different use cases, balancing the scale between capability and cost-effectiveness. Microsoft even reports that the Phi-3 models outperform other models of the same size, and the next size up, across a variety of language, reasoning, coding, and math benchmarks.

The Phi-3-mini model, a 3.8B parameter model, significantly outperforms LLMs twice its size. Also, the Phi-3-small (7B parameters) and Phi-3-medium (14B parameters) SLM models outperform much larger models, including OpenAI's GPT-3.5T.

Source: Microsoft Build 2024 Keynote with Satya Nadella announcing the Phi-3-small and Phi-3-medium models

What is the ONNX Runtime?

The ONNX Runtime is an inference engine for deploying machine learning (ML) models. It is designed to be platform-agnostic and can run on a variety of hardware, providing optimized performance across different environments. ONNX (Open Neural Network Exchange) models are used with ONNX Runtime, enabling developers to take advantage of various ML frameworks and deploy them efficiently.

In the context of C#, ONNX Runtime allows developers to integrate ML models into C# apps. It also provides a .NET API that makes it easier to load and run ONNX models within .NET apps.
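As a quick illustration, here's a minimal sketch of loading an ONNX model with the ONNX Runtime .NET API and listing its declared inputs. Note that the "model.onnx" path is just a placeholder for illustration, not a file from this article:

using System;
using Microsoft.ML.OnnxRuntime;

// Minimal sketch: load an ONNX model from disk and inspect its declared inputs.
// "model.onnx" is a placeholder path for illustration.
using var session = new InferenceSession("model.onnx");
foreach (var input in session.InputMetadata)
{
    Console.WriteLine($"Input: {input.Key}");
}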

Where to Download the Microsoft Phi-3 SLM?

The Microsoft ONNX Runtime team has put in the work to optimize the Phi-3-mini model, as well as the later released Phi-3-small and Phi-3-medium models, for the ONNX Runtime. This means they did all the work necessary for you to pull the model into your app without needing to reconfigure it or do any other conversion work when using ONNX Runtime in your application.

The Phi-3 ONNX optimized models are hosted over at Hugging Face, and are available in the following two flavors:

- Phi-3-mini-4k-instruct-onnx
- Phi-3-mini-128k-instruct-onnx

The 4k and 128k in these names refer to the number of tokens that make up the context length. This means the 4k flavor requires fewer resources to run, but the 128k flavor supports a larger context length.

The context length determines how much text can be passed to the generative AI model in the prompt. It limits the length of the full conversation (system prompt, user prompts, and previous responses) that the app can pass to the AI model each time a new user prompt is submitted. When building an app that needs to pass more context to the model when generating responses, you'll want to use the flavor that supports the larger number of tokens.
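As a rough illustration, once the tokenizer from the code later in this article is created, you could count the tokens in a full prompt to sanity check that it fits within the context window. This is just a sketch, and it assumes the Sequences indexer in the Microsoft.ML.OnnxRuntimeGenAI .NET API exposes the encoded token span:

// Sketch: count the tokens in a prompt to check it fits the context length.
// Assumes "tokenizer" is the Microsoft.ML.OnnxRuntimeGenAI Tokenizer created
// later in this article, and that the Sequences indexer returns the token span.
var sequences = tokenizer.Encode(fullPrompt);
var tokenCount = sequences[0].Length;
Console.WriteLine($"Prompt uses {tokenCount} of the ~4,096 tokens available in the 4k flavor.");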

To download the model you want to use from Hugging Face, you can use the Git command-line. Just pass the URL of the model's page on the Hugging Face website as the git URL.

Here's an example command using Git to download the Phi-3-mini-4k-instruct-onnx model to your local machine:

git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx C:/onnx

In this example, the command is putting the model into the C:/onnx folder. The rest of this article will assume that’s the location where the Phi-3 model is located.

Git Large File Storage (LFS) Required: It's worth noting that the Git repos for these models contain some large files, so you will need to install Git LFS before running git clone. You can find information on installing Git LFS here: https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage
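For example, after installing Git LFS, you would typically enable it once, then clone the model repo:

git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx C:/onnx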

Once you have the Phi-3 model downloaded, you can browse the folder to see the files. The repo contains variants of the model optimized for different hardware, such as the \cpu_and_mobile\cpu-int4-rtn-block-32 folder (for CPU) and the \cuda\cuda-int4-rtn-block-32 folder (for CUDA-capable GPUs). Take a minute to look specifically at the \cpu_and_mobile\cpu-int4-rtn-block-32 folder; this folder contains all the necessary files for the Phi-3-mini-4k-instruct-onnx model as referenced by the code through the rest of this article.

Build a Generative AI App using C#, ONNX and Phi-3

Let's walk through the steps to build your own simple generative AI application: a C# console application that uses the ONNX Runtime to host the Phi-3 SLM.

Step 1: Create new C# console app

Use the dotnet CLI to create a new C# console application. Create a new folder, navigate to it in the command-line, then run the following command:

dotnet new console

This will create a new C# console application in the folder with a couple of templated files to get you started.
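With recent .NET SDKs (6.0 and later), the templated Program.cs uses top-level statements and looks something like this:

// See https://aka.ms/new-console-template for more information
Console.WriteLine("Hello, World!");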

Step 2: Install ONNX Runtime NuGet Packages

To use the ONNX Runtime within the C# application, run the following commands to install the necessary NuGet packages:

dotnet add package Microsoft.ML.OnnxRuntime --version 1.17.3
dotnet add package Microsoft.ML.OnnxRuntimeGenAI --version 0.2.0-rc7
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda --version 0.2.0-rc7

The ONNX Runtime is an open source project that is free to use, so you do not need any licensing.
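After running these commands, the package references in your project's .csproj file should look something like this:

<ItemGroup>
  <PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.17.3" />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI" Version="0.2.0-rc7" />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" Version="0.2.0-rc7" />
</ItemGroup>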

Step 3: Write the Generative AI Code

Now that the necessary NuGet packages for the ONNX Runtime are added to the project, the next step is to write the code to turn the console app into a simple generative AI application.

Open the Program.cs file for the C# app and paste in the following code:

using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class Program
{
    static void Main(string[] args)
    {
        // The absolute path to the folder where the Phi-3 model is stored (the folder containing the ".onnx" file)
        var modelPath = "C:\\onnx\\cpu_and_mobile\\cpu-int4-rtn-block-32";
        var model = new Model(modelPath);
        var tokenizer = new Tokenizer(model);

        // System prompt will be used to instruct the AI how to respond to the user prompt
        var systemPrompt = "You are a knowledgeable and friendly assistant made by Build5Nines named Jarvis. Answer the following question as clearly and concisely as possible, providing any relevant information and examples.";

        // Create a loop for taking input from the user
        while (true) {
            // Get user prompt
            Console.Write("Type Prompt then Press [Enter] or CTRL-C to Exit: ");
            var userPrompt = Console.ReadLine();
            
            // show in console that the assistant is responding
            Console.WriteLine("");
            Console.Write("Assistant: ");

            
            // Build the Prompt
            // Single User Prompt with System Prompt
            var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userPrompt}<|end|><|assistant|>";
            
            // Tokenize the prompt
            var tokens = tokenizer.Encode(fullPrompt);
            
            // Set generator params
            var generatorParams = new GeneratorParams(model);
            generatorParams.SetSearchOption("max_length", 2048);
            generatorParams.SetSearchOption("past_present_share_buffer", false);
            generatorParams.SetInputSequences(tokens);

            // Generate the response
            var generator = new Generator(model, generatorParams);
            // Output the response as each token is generated
            while (!generator.IsDone()) {
                generator.ComputeLogits();
                generator.GenerateNextToken();
                var outputTokens = generator.GetSequence(0);
                var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
                var output = tokenizer.Decode(newToken);
                Console.Write(output);
            }
            Console.WriteLine();
        }
    }
}

Be sure to update the code by setting the modelPath variable’s value to the full folder path to where you downloaded (or git cloned) the ONNX optimized Phi-3 model files.
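If you'd like the app to fail fast with a clear error when that path is wrong, you could add a guard like this sketch (not part of the original example) right after the modelPath assignment:

// Hypothetical guard (not in the original example): fail fast if the model folder is missing.
if (!System.IO.Directory.Exists(modelPath))
{
    Console.Error.WriteLine($"Model folder not found: {modelPath}");
    return;
}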

Step 4: Run the Generative AI App

Run the app with the following command. You will then be able to type in prompts and see the generated AI responses output to the console:

dotnet run

Here’s a screenshot of what the app looks like running:

Screenshot: example of C# generative AI app running

Generative AI Concepts Explained

Assuming you are familiar with C# and the while loop used to continually capture user input, let’s look at a few of the key items in the code. These are the specific pieces you will need to understand when taking this code example and adapting it for your own needs.

System Prompt

The System Prompt is used to tell the AI model the initial context to use when generating responses to the prompt. The system prompt is given to the model before the main user prompt.

It is defined in this code as the systemPrompt variable:

var systemPrompt = "You are a knowledgeable and friendly " +
"assistant made by Build5Nines named Jarvis. Answer the " +
"following question as clearly and concisely as possible, " +
"providing any relevant information and examples.";

Full Prompt Format

When using the Phi-3 model to build a generative AI application, the tokens passed in for the prompt are the entire prompt the AI should evaluate to generate a response. This means you don't want to pass in only the user prompt (the prompt the user types in). You will also want to pass in the system prompt, and format everything in a special way so the model understands which parts are the system prompt and the user prompt, so it can generate an appropriate response.

When building the full prompt to pass to the model, the example code uses the following format:

var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userPrompt}<|end|><|assistant|>";

This full prompt format uses the <|system|> and <|user|> tags to tell the AI that the text following each tag is the "system prompt" or the "user prompt" respectively. Each one is followed by an <|end|> tag that tells the AI where the system or user prompt ends. The full prompt then ends with the <|assistant|> tag, which tells the model to generate a response.

The code uses this full prompt format to build the full string that contains both the system prompt and the user prompt before it’s tokenized and then passed to the model for a response to be generated.
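For example, if the user types "What is an SLM?", the full prompt string that gets tokenized would look like this (with the system prompt shortened here for readability):

<|system|>You are a knowledgeable and friendly assistant made by Build5Nines named Jarvis. ...<|end|><|user|>What is an SLM?<|end|><|assistant|>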

In my searching, the only documentation I found shows the chat format for the prompt having only the <|user|> and <|assistant|> elements. The documentation didn't say how to specify the system prompt, but I made a guess using a similar format to the user prompt, and it seems to work.

Remember User Conversation with the C# + Phi-3 Generative AI App

With the previous C# code example, you can extend the generative AI app to remember the full user conversation context. This is done by storing all the user prompts and assistant responses in memory, then passing them into the model each time a new user prompt is processed. This enables the AI model to know the context of the full conversation with each prompt, giving the generative AI app the behavior users have come to expect when they open new conversations with other generative AIs like OpenAI's ChatGPT.

Here's the full code example from above, modified to use a List<string> variable to store all the prompts and responses, then pass the full conversation context in with each prompt:

using System;
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntimeGenAI;

class Program
{
    static void Main(string[] args)
    {
        // The absolute path to the folder where the Phi-3 model is stored (folder to the ".onnx" file)
        var modelPath = $"C:\\onnx\\cpu_and_mobile\\cpu-int4-rtn-block-32";
        var model = new Model(modelPath);
        var tokenizer = new Tokenizer(model);

        // System prompt will be used to instruct the AI how to respond to the user prompt
        var systemPrompt = "You are a knowledgeable and friendly assistant made by Build5Nines named Jarvis. Answer the following question as clearly and concisely as possible, providing any relevant information and examples.";

        var allPrompts = new List<string>();
        allPrompts.Add($"<|system|>{systemPrompt}<|end|>");

        // Create a loop for taking input from the user
        while (true) {
            // Get user prompt
            Console.Write("Type Prompt then Press [Enter] or CTRL-C to Exit: ");
            var userPrompt = Console.ReadLine();

            // show in console that the assistant is responding
            Console.WriteLine("");
            Console.Write("Assistant:");

            // Build the Prompt
            // Build prompt conversation
            // add user prompt to conversation
            allPrompts.Add($"<|user|>{userPrompt}<|end|>");
            // add format to tell AI to generate assistant prompt
            allPrompts.Add("<|assistant|>");
            // put the full conversation together into the full prompt to use
            var fullPrompt = string.Join(string.Empty, allPrompts);

            // Tokenize the prompt
            var tokens = tokenizer.Encode(fullPrompt);

            // Set generator params
            var generatorParams = new GeneratorParams(model);
            generatorParams.SetSearchOption("max_length", 2048);
            generatorParams.SetSearchOption("past_present_share_buffer", false);
            generatorParams.SetInputSequences(tokens);


            // Define variable to hold full response to add to conversation after
            var fullResponse = new System.Text.StringBuilder();

            // Generate the response
            var generator = new Generator(model, generatorParams);
            // Output the response as each token is generated
            while (!generator.IsDone()) {
                generator.ComputeLogits();
                generator.GenerateNextToken();
                var outputTokens = generator.GetSequence(0);
                var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
                var output = tokenizer.Decode(newToken);

                // build full response string as it's generated
                fullResponse.Append(output);

                Console.Write(output);
            }

            // add full response to conversation now that generation is complete
            allPrompts[allPrompts.Count - 1] = $"<|assistant|>{fullResponse.ToString()}<|end|>";

            Console.WriteLine();
        }
    }
}

Conclusion

It's really exciting to see how C# can be used with the ONNX Runtime and Microsoft's Phi-3 SLM to build your own custom generative AI application. This article shows the code for a simple console application; however, this code could run in any .NET application. The ONNX Runtime will run the AI model on the GPU if one is available, and will fall back to the CPU otherwise. This means the ONNX Runtime can be used to host the Phi-3 SLM, or any other ONNX supported model, in any .NET application running on just about any device. The speed of the model will depend on the amount of compute power of the device it's running on.

Happy building your own generative AI applications!

Chris Pietschmann is a Microsoft MVP, HashiCorp Ambassador, and Microsoft Certified Trainer (MCT) with 20+ years of experience designing and building Cloud & Enterprise systems. He has worked with companies of all sizes from startups to large enterprises. He has a passion for technology and sharing what he learns with others to help enable them to learn faster and be more productive.
