What is InstructLab?

Goh Soon Heng
6 min read · Nov 19, 2024

--

InstructLab is an open-source initiative designed to enhance large language models (LLMs) for generative AI applications. Developed collaboratively by IBM and Red Hat, this community-driven project offers a cost-effective way to improve LLM alignment while empowering individuals with minimal machine learning expertise to actively participate and contribute.

Why do you need InstructLab? The rapid growth of open-source AI projects is driving innovation, but adapting and extending permissively licensed LLMs comes with significant challenges. Fine-tuning LLMs often requires forking existing models, a resource-intensive process that limits accessibility and forces users to settle for “best-fit” forks that are difficult to maintain or improve collaboratively. For AI practitioners and enthusiasts alike, the lack of direct community governance, best practices, and accessible pathways to contribute new ideas poses a high barrier to entry, particularly for those without extensive AI/ML expertise. Moreover, the absence of a mechanism to merge improvements back into upstream models prevents continuous refinement and community-driven evolution. Together, these challenges underscore the need for more inclusive, scalable, and collaborative approaches to LLM development.

InstructLab is designed to address these challenges head-on.

The project empowers community contributors to seamlessly integrate new “skills” or “knowledge” into existing models, making enhancements more accessible to all.

With its model-agnostic technology, InstructLab enables model maintainers with adequate infrastructure to create regular builds of their open-source models. Instead of rebuilding and retraining entire models, this approach allows for the efficient composition of new skills, streamlining the enhancement process and fostering continuous innovation.

How Does InstructLab Work?

Let’s give it a shot! We shall use the open-source quantized Granite model. First, we need to serve the model using the “ilab serve” command with the model name (as shown below).
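As a sketch, serving the model from the command line looks like the following. The model file name is an example; substitute the path of the .gguf file you downloaded, and note that flag names can differ between ilab versions (check `ilab serve --help`):

```shell
# Serve the quantized Granite model locally so we can chat with it.
# The model path below is illustrative -- use the file that
# `ilab download` placed on your machine.
ilab serve --model-path models/granite-7b-lab-Q4_K_M.gguf
```

The command starts a local inference server and keeps running; you interact with the model from a second terminal.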

Next, we run a chat with the model and ask it the question “What is the InstructLab project?”. Note that this model was trained before the InstructLab project was released, so it has no knowledge of InstructLab. Yet notice that it provides an answer as if it knew InstructLab, and the answer is completely wrong. In the world of LLMs, this is termed hallucination. It also highlights that human oversight is always valuable, especially when starting on a new project.
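A minimal sketch of the chat step, assuming the server from the previous step is still running (the `-m` flag selects the model; exact flags may vary by ilab version):

```shell
# In a second terminal, open an interactive chat session against
# the model we are serving, then type the question at the prompt.
ilab chat -m models/granite-7b-lab-Q4_K_M.gguf
# At the chat prompt, ask: "What is the InstructLab project?"
```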

If we try to fine-tune it directly, the process can be complex, especially if you are not a data scientist. Let’s use InstructLab to fine-tune the model instead, through its three key components:

Taxonomy-Driven Data Curation: A diverse set of training data, curated by humans, serves as the foundation. These examples represent the new knowledge and skills to be added to the model.

The taxonomy is how you actually train the model on your specific knowledge and skills; you can think of it simply as a directory structure of files that house your knowledge. In the example below, we open a file called qna.yaml in the taxonomy directory.

If you look at the content of the file (below), it is just simple, plain English. You don’t need a team of data scientists to add your specific knowledge to the model; you just format your knowledge as question/answer pairs.
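As a rough sketch, creating such a qna.yaml might look like the following. The directory path, schema version, and field names are illustrative assumptions; the exact schema depends on the taxonomy version your ilab release expects:

```shell
# Create a hypothetical knowledge leaf in the taxonomy tree and write
# a minimal qna.yaml with plain-English question/answer pairs.
mkdir -p taxonomy/knowledge/instructlab/overview
cat > taxonomy/knowledge/instructlab/overview/qna.yaml <<'EOF'
version: 2
created_by: your-github-username
seed_examples:
  - question: What is the InstructLab project?
    answer: |
      InstructLab is an open-source project by IBM and Red Hat for
      improving large language models through community contributions.
  - question: What is the InstructLab taxonomy?
    answer: |
      A directory tree of qna.yaml files holding question/answer pairs
      that represent new knowledge and skills for the model.
EOF
```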

Large-Scale Synthetic Data Generation: The model generates additional examples based on the curated seed data. To ensure quality, an automated refinement process evaluates and adjusts the synthetic data, ensuring outputs are accurate, grounded, and safe. To understand this better, let’s reference the README file of the actual InstructLab GitHub repository inside the same qna.yaml file (see the last line in the diagram). This is known as backing data.

We are going to use the Granite model itself to take the question-and-answer examples, plus the backing data, and create additional examples. This is called synthetic data generation. To do that, we run “ilab generate” with the number of instructions; in this case, we generate 10 instruction examples for simplicity. Generally, you might want more instructions, but it depends on the size of the backing data. As you can see, it starts to generate more data (it shows 10% in the screenshot below but will gradually reach 100%). The outcome is a more comprehensive dataset that can be used to train your model. The same workflow can be used to incorporate your company’s internal data.
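The generation step can be sketched as follows, using the flag name from the 2024-era CLI (check `ilab generate --help` on your version, as flags have changed over time):

```shell
# Generate synthetic training data from the taxonomy seed examples.
# We keep it to 10 instructions for a quick demo; real runs usually
# want many more, depending on the size of the backing data.
ilab generate --num-instructions 10
```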

The output is actually two JSON files: a question-and-answer file that will be used to train the model, and a critic file that is excluded from training because it contains content flagged as possible hallucination. Both files can be vetted by a human before training begins.
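One way to vet the generated files by hand is sketched below. The output directory and file names are assumptions that vary by ilab version (the generated files typically carry a model name and timestamp), so list the directory first to find yours:

```shell
# Inspect the generation output before training. "generated" is the
# output directory used by the ilab version in this walkthrough;
# the file name below is a placeholder -- substitute your own.
ls generated/
# Pretty-print one generated file to review the Q&A pairs manually:
python3 -m json.tool generated/<your-generated-file>.json
```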

Iterative Large-Scale Alignment Tuning: Finally, the model is retrained on the set of synthetic data. The model undergoes retraining in two phases: knowledge tuning, to incorporate the new information, followed by skill tuning, to refine its application. This iterative approach ensures a precise and robust alignment of the model with its intended enhancements. To demonstrate this, let us use the command “ilab train”.
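The training step itself is a single command. Be aware that on a laptop this can take a long time; some ilab versions accept hardware-selection flags, so check `ilab train --help` for what your release supports:

```shell
# Retrain the served model on the vetted synthetic data.
ilab train
```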

Once the model is trained, we want to validate the newly trained model by serving it. Notice that the new model has a different name, “ggml-model”.
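Serving and chatting with the retrained model mirrors the earlier steps. The exact file path below is an example of what a typical training run produces and may differ on your machine:

```shell
# Serve the newly trained model -- note the new "ggml-model" file name.
ilab serve --model-path models/ggml-model-f16.gguf
# Then, in another terminal, chat with it as before:
ilab chat -m models/ggml-model-f16.gguf
```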

Let’s run a chat against this new model. Notice that the model now provides the correct answer!

How is InstructLab different from RAG (Retrieval Augmented Generation)? InstructLab and RAG address distinct challenges in enhancing LLMs.

RAG is a cost-effective solution for augmenting an LLM with domain-specific knowledge that wasn’t included in its pretraining. It allows chatbots to provide accurate answers on specialized topics or business domains without requiring model retraining. In this approach, knowledge documents are stored in a vector database, retrieved in chunks, and incorporated into user queries sent to the model. RAG is particularly beneficial for organizations seeking to integrate proprietary data into an LLM while retaining control over their information or enabling the model to access up-to-date content.

On the other hand, InstructLab focuses on leveraging end-user contributions to iteratively build enhanced versions of an LLM. This method enables the model to acquire new knowledge and skills through continuous refinement and updates, expanding its capabilities over time.

Hopefully, the above gives you a good idea of how InstructLab can help you quickly fine-tune a model without all the hassle and complexity. Best of all, it is an open-source community project, and there are lots of resources available to get you onboarded.

Written by Goh Soon Heng

I aim to simplify GenAI and DS, making it easy for everyone to read and understand. Alternate site: https://soonhengblog.wordpress.com/author/soonhenghpe/
