Getting Started on Nvidia NIM

Goh Soon Heng
5 min read · Aug 2, 2024


NVIDIA NIM, which is part of NVIDIA AI Enterprise, is a set of easy-to-use inference microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations. NIM supports AI use cases across multiple domains, including large language models (LLMs), vision language models (VLMs), and models for speech, images, video, 3D, drug discovery, medical imaging, and more. For this article, we focus on LLMs.

At their core, NIMs are containers that provide interactive APIs for executing inference on AI models. In general, a NIM has a server layer, a runtime layer, and a model engine. The model engine is the part of the NIM that holds the information about the model itself: the parameters or weights of the model, as well as the execution graph needed to pass an inference request through. The runtime layer is responsible for executing the desired engine, producing results based on the engine weights and the incoming requests. The server layer of a NIM is the part that is exposed to external programs, which can interact with the NIM through a REST API. For large language models, NIM provides an OpenAI API compatible server, making it easy to integrate NIM into existing LLM applications.
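To make the OpenAI-compatible REST interface concrete, here is a minimal sketch of a chat completion request. It assumes a NIM serving meta/llama3-8b-instruct is already running and its server port is reachable on localhost:8000, which is exactly the setup we will build later in this article:

```bash
# Hypothetical example: query a locally running NIM through its OpenAI-compatible REST endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```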

NIM makes it easy for IT and DevOps teams to self-host LLMs in their own managed environments while still providing developers with industry-standard APIs that allow them to build powerful copilots, chatbots, and AI assistants that can transform their business. Let's explore how to use NVIDIA LaunchPad to test our NIM. The first step is to go to the LaunchPad site, choose a lab (I am showing the simplest one, the 5-minute GenAI Inferencing lab), fill out the form, send the request, and wait. If it is approved, you get an email and you can sign in to get to this page.

Click “Get Started” and choose the foundation model you want. In this case, I am using Llama3-8b-instruct.

Click on the model link and you will be prompted to log in to get to NIM. Provide your email address and password; you should be able to sign in easily if you have registered for an NVIDIA account previously. If you haven’t, it is a good time to register for one and revisit the model link after that.

Once in, choose “Build with this NIM”.

You should see two options: ”Hosted API” or “Self-Hosted API”. For our case, we will use the self-hosted API, since we are going to spin up a NIM and experiment with it. If you prefer not to do anything and just want to consume the services, then choose “Hosted API”. Either option will use the same API key. Make sure to make a copy of this API key, as you will need it later.

Launch VS Code (Code IDE) within LaunchPad and start a terminal session. Let’s look at what GPU is available by running “nvidia-smi”. From the output, we are given an H100 NVL GPU. That’s awesome!
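A quick way to check, shown here as a simple sketch (running nvidia-smi with no arguments also works; the query flags just trim the output to the essentials):

```bash
# List the GPU(s) visible in this LaunchPad instance, with name and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```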

The next step is to log in to the NVIDIA Docker registry “nvcr.io”. As we are not using a username and password, we furnish “$oauthtoken” as the username and the API key as the password. You should be able to log in successfully.
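A minimal sketch of the login, assuming the API key has been stored in an environment variable named NGC_API_KEY (the variable name is my own choice):

```bash
# Log in to nvcr.io; the username is the literal string "$oauthtoken" and the password is the NGC API key
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```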

Once logged in, we export the API key and configure a local cache directory. This caching is needed to expedite the model loading process for subsequent loads, as illustrated in the “NVIDIA Deployment Lifecycle”.
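Something along these lines, where the variable names and cache path are assumptions on my part and can be adapted to your environment:

```bash
# Export the NGC API key and create a local cache directory for downloaded model artifacts
export NGC_API_KEY=<paste-your-api-key-here>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
```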

Next, we use “docker run” to create the container that hosts the GenAI model from “meta/llama3-8b-instruct”. In the docker command, we run it in interactive mode and delete the container when it stops. In addition, we use all the available GPUs and allocate 16 GB of shared memory (“shm-size”) for inter-GPU communication in case we have two GPUs. Since I have one H100 NVL, this “shm-size” is not needed. We use the API key to log in to NGC and download the model, and we run the container as the same user as the system user to avoid permission mismatches. Finally, we forward port “8000”, where the NIM server is published inside the container, so it can be accessed from the host system. A sketch of the command is shown below.

Once the container is deployed, you can start to access the model via the OpenAI API. To do that, I launch a Jupyter notebook, put in the API key to access the model, and send in the prompt “Who is Tom Cruise?”. It responded well. This indicates we have successfully deployed GenAI using NIM and are able to communicate with it using the OpenAI API via Python.
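Here is a sketch of the “docker run” invocation described above; the image tag and the environment variable names follow the assumptions made earlier and may differ in your environment:

```bash
# Run the llama3-8b-instruct NIM: interactive, auto-removed on exit, all GPUs,
# 16 GB shared memory, NGC API key passed in, local model cache mounted,
# running as the current user, with the NIM server port 8000 forwarded to the host
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
```

And a minimal sketch of the notebook cell that talks to the NIM through the OpenAI-compatible API. The base_url assumes the port forwarding above; for a self-hosted NIM the key is typically not validated by the server, but the client library still expects a value:

```python
import os
from openai import OpenAI

# Point the OpenAI client at the self-hosted NIM server exposed on the forwarded port
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("NGC_API_KEY", "not-used"),
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Who is Tom Cruise?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```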

More interestingly, let’s look at which backend engine it is running. Notice that TensorRT-LLM was originally initiated, but the system ultimately uses vLLM, since that is the most optimized option available at runtime.
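One simple way to confirm which backend was selected, sketched here with a placeholder container ID, is to search the container’s startup logs:

```bash
# Search the NIM container logs for mentions of the selected inference backend
docker logs <container-id> 2>&1 | grep -i -E "tensorrt|vllm"
```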

As you can see, NVIDIA NIM makes it super easy for us to deploy a GenAI model. With the OpenAI-compatible API, it is even easier to consume from any application written in Python.
