Ayoola’s Substack

Single Agent vs Multi-Agent: When to Build a Multi-Agent System

Ayoola Olafenwa — Wed, 29 Apr 2026 08:50:53 GMT

AI Agents

AI agents have become popular because modern LLMs are now highly capable at tasks like coding, writing, reasoning, and solving problems across different fields. This has reduced the need to train custom models and shifted more attention toward building practical applications around existing LLMs. Tools like Codex, Claude Code, Cursor and Windsurf are already helping software engineers work faster, while businesses use agents for customer support, automation and other real-world tasks.

An AI agent is an application that uses an LLM to reason, plan and use tools to perform tasks, allowing the model to interact with its environment in a practical and useful way.

Components of an AI Agent

Some of the major components of most AI agents are the LLM, tools and memory.

LLM: This is the brain of the AI agent. It is the large language model that enables the agent to reason, plan, and decide how to solve a given task.
Tools: These are helpers, usually in the form of code functions, that allow the LLM to interact with its environment. Tools help the agent connect to external data sources, search the internet, retrieve information from databases, access files, and carry out specific actions. For example, coding agents can use tools to write, debug, and save files, research agents can use web search or vector databases to gather information and customer support agents can use internal company documents to answer questions based on trusted business knowledge.
Memory: This allows the agent to store relevant information from interactions and use it later to provide better and more consistent assistance. It helps the agent maintain context across tasks and improve the overall user experience.
Memory may be optional during early development, but it becomes an important part of many real-world AI agent systems, especially when the agent needs to handle follow-up questions, multi-step workflows or personalised interactions.
There are two major types of memory commonly used in AI agents: short-term memory and long-term memory. Short-term memory keeps track of information within the current session or task, while long-term memory stores useful information across multiple sessions or chats so the agent can use it later.

ReAct (Reasoning + Acting) in Agents

An AI agent differs from a basic chatbot because a chatbot usually follows a more direct workflow: user query → LLM → response. The LLM receives the user’s message and generates a reply based mainly on the prompt and its existing context.

An AI agent goes beyond this by using the LLM to reason about the task, decide what needs to be done, choose whether tools are needed, call those tools, observe the results and continue until it can produce a useful answer.

This is where the ReAct approach comes in. ReAct means Reasoning + Acting. It is an agent pattern where the LLM reasons about a task and takes actions, usually through tools, based on that reasoning. It involves designing a core logic loop around an LLM.

A basic ReAct workflow in an AI agent usually looks like this:

Step 1: The agent receives a user query
The LLM reasons over the task and decides whether it can answer directly or needs to use tools. It checks what tools are available and decides which ones are needed to solve the task.
Step 2: The agent calls the required tools
Based on its reasoning, the agent takes action by calling the necessary tools. These tools may search the web, retrieve documents from a vector database, access files, run code or connect to an external API. The results returned from these tools are known as tool outputs.
Step 3: The tool outputs are sent back to the LLM
The tool outputs are passed back to the LLM as additional context. This gives the agent more relevant information to work with instead of relying only on the original prompt.
Step 4: The LLM checks the evidence and generates a response
The LLM reviews the tool outputs and checks whether they are enough to solve the task. If the evidence is sufficient, it generates a grounded response for the user. If not, the agent may repeat the reasoning, tool-calling and observation steps until it has enough information to provide a useful answer.
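
The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions: fake_llm is a hypothetical stand-in for a real LLM call, calculator is a toy tool, and the dictionary format used for decisions is made up for this example.

```python
from typing import Callable, Dict, List

def calculator(expression: str) -> str:
    """Toy tool: adds two integers written as 'a+b'."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

# Registry of available tools (hypothetical, one tool only)
TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}

def fake_llm(query: str, observations: List[str]) -> dict:
    """Hypothetical stand-in for an LLM call.

    It requests a tool while there is no evidence, then answers from it.
    """
    if not observations:
        return {"action": "calculator", "input": query}
    return {"answer": f"The result is {observations[-1]}"}

def react_loop(query: str, max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        decision = fake_llm(query, observations)       # reason over the task
        if "answer" in decision:                       # evidence is sufficient
            return decision["answer"]
        tool = TOOLS[decision["action"]]               # act: pick the tool
        observations.append(tool(decision["input"]))   # observe the tool output
    return "Stopped: step limit reached"

print(react_loop("2+3"))
```

The max_steps limit is a design choice for the sketch: it keeps the loop from running forever if the model never decides it has enough evidence.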

Structure of AI Agents

An AI agent system can be designed as either a single-agent or a multi-agent system.

Single Agent vs Multi-Agent

A single agent is an agent design where one LLM handles the whole task. It reasons, plans and calls the required tools when needed. Most AI agents start as single-agent systems because they are simpler, easier to maintain and usually enough for many tasks.

A multi-agent system uses specialised agents to solve different parts of a task. It often has a central agent, usually called an orchestrator, supervisor or planner, that coordinates the other agents and decides when each one should act. Each specialised agent can have its own role, tools and reasoning logic, making the system more modular and suitable for complex workflows.

When to Build A Multi-Agent System

A single-agent design works well for simple tasks that require limited tool use. Examples include a personal assistant agent that can access your calendar to set reminders, a calculator agent that only uses a calculator tool, or a web search agent that uses a web search API to retrieve up-to-date information.

However, a single agent can become overloaded when the task requires many tools, multi-step reasoning, different responsibilities or verification before the final response is returned to the user. Common issues include overloaded prompting, poor tool routing, unclear agent responsibilities and reduced reliability due to too much complexity in one agent.

A multi-agent system is a better choice when the task may overwhelm a single-agent design and when you need specialised agents with clear roles, their own tools and separate responsibilities.

For example, a software engineering agent may work better as a multi-agent system:

Orchestrator → Coder → Tester → Reviewer

The Orchestrator coordinates the workflow, the Coder agent generates the code, the Tester agent checks whether the code works, and the Reviewer agent reviews the solution to check for missing parts or possible improvements.

Another example is a research agent that researches a topic, retrieves information from different data sources and generates grounded content:

Orchestrator → Retriever → Writer → Verifier

The Retriever agent gathers information from the web and local documents stored in a vector database. The Writer agent writes based on the retrieved content. The Verifier agent checks the written content for errors, citations and factual accuracy before the final response is returned.

Multi-agent systems make the workflow more modular and give each stage a clear role. However, they should be used only when the task genuinely needs that design, because they usually increase latency, cost and maintenance complexity due to more LLM calls and more moving parts.
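
To make the trade-off concrete, here is a minimal sketch of a staged workflow. The Stage and Pipeline classes are hypothetical names for this illustration, and each lambda stands in for one agent's LLM call. The sketch only shows that every stage adds one more call to the chain.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Stage:
    name: str
    run: Callable[[str], str]  # hypothetical stand-in for one agent's LLM call

@dataclass
class Pipeline:
    stages: List[Stage]
    calls: List[str] = field(default_factory=list)

    def execute(self, task: str) -> str:
        result = task
        for stage in self.stages:
            result = stage.run(result)     # output of one stage feeds the next
            self.calls.append(stage.name)  # record each agent call
        return result

pipeline = Pipeline([
    Stage("retriever", lambda q: f"evidence for '{q}'"),
    Stage("writer", lambda e: f"draft based on {e}"),
    Stage("verifier", lambda d: f"verified: {d}"),
])

report = pipeline.execute("What is RAG?")
print(report)
print(len(pipeline.calls), "agent calls for one user query")
```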

A simple rule is:

Use a single agent when the task is simple, has fewer steps and needs only a few tools. Use a multi-agent system when the task requires specialised roles, multi-step reasoning, stronger verification or coordination across different tools and workflows.

Walkthrough of A Multi-Agent Project

To make the idea of multi-agent systems more practical, I built a project called Multi-Agent RAG Researcher.

The goal of the project is to show how a central agent can coordinate multiple specialised agents to research a topic, retrieve evidence from documents and the web, write grounded content and verify that content before returning it to the user. Instead of using one agent to handle everything, the system splits the workflow into different responsibilities.

Check the project on GitHub: https://github.com/ayoolaolafenwa/multi-agent-rag-researcher

Clone Project repo

git clone https://github.com/ayoolaolafenwa/multi-agent-rag-researcher.git

Clone the repo to follow along with the code in this post. Once the repo is cloned, the project structure will look like this:

.
├── docs/                         # Default PDF files
├── memory/                       # SQLite-backed session memory helpers
├── qdrant_vector_database/       # PDF ingestion and similarity search
├── ui/                           # Gradio app and UI handlers
├── utils/
│   └── requirements.txt          # Python dependencies
├── worker_agents/                # Retriever, writer, and verifier
├── orchestrator_agent.py         # Main coordinator
└── run_orchestrator.py           # CLI entry point

Multi-Agent Architecture

Data Sources

There are two major data sources:

Qdrant Vector Database

Information retrieval from PDFs is handled in the following stages:

Multiple PDFs can be loaded from the docs/ folder or uploaded through the UI.
Documents are split into chunks, converted into embeddings, and stored in a local Qdrant collection.
Similarity search is then used to retrieve the most relevant chunks across the indexed documents.
The retrieved chunks include citation metadata such as document name and page number.

The document retrieval logic, which covers the Qdrant vector database setup, PDF ingestion, chunking, embedding and similarity search, is in qdrant_vector_database/vector_store.py.
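
As a rough illustration of the last stage, the sketch below shows how retrieved chunks can carry citation metadata. The RetrievedChunk class, its field names and the citation format are assumptions made for this example and are not taken from vector_store.py.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RetrievedChunk:
    text: str
    document: str   # assumed field: source PDF name
    page: int       # assumed field: page number in that PDF
    score: float    # similarity score returned by the search

def format_citation(chunk: RetrievedChunk) -> str:
    return f"[{chunk.document}, p. {chunk.page}]"

def build_evidence(chunks: List[RetrievedChunk], top_k: int = 2) -> str:
    """Keep the top_k highest-scoring chunks and attach a citation to each."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    return "\n".join(f"{c.text} {format_citation(c)}" for c in ranked)

chunks = [
    RetrievedChunk("first relevant passage", "report_a.pdf", 2, 0.81),
    RetrievedChunk("unrelated passage", "report_b.pdf", 7, 0.12),
    RetrievedChunk("second relevant passage", "report_b.pdf", 3, 0.64),
]
evidence = build_evidence(chunks)
print(evidence)
```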

Tavily Web Search

Tavily is used to retrieve up-to-date or external information from the web. The retriever agent can use it when:

the indexed PDFs do not cover the query
document evidence is weak or incomplete
newer information is needed

Worker Agents

Retriever Agent

The role is:

It uses two tools: PDF document retrieval and web search.
Given a query, it decides whether to use local documents, web search or both.
If local document evidence is missing or weak, it can fall back to web search to gather broader or more up-to-date context.

The code for the retriever agent, including Tavily web search, is available in worker_agents/retriever.py. It uses gpt-5.4-mini with low reasoning effort.
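
In the project, the LLM makes the source decision itself. The rule-based sketch below is only a stand-in that makes the fallback behaviour explicit. The choose_sources function, the min_score threshold and the needs_fresh_info flag are hypothetical and do not appear in the repository.

```python
from typing import List

def choose_sources(
    doc_scores: List[float],
    needs_fresh_info: bool = False,
    min_score: float = 0.5,   # assumed threshold for "strong" evidence
) -> List[str]:
    """Return which sources to use for a query."""
    has_strong_docs = any(score >= min_score for score in doc_scores)
    sources: List[str] = []
    if doc_scores:
        sources.append("documents")   # some local evidence exists
    if not has_strong_docs or needs_fresh_info:
        sources.append("web")         # fall back to or supplement with web search
    return sources

print(choose_sources([0.82, 0.70]))                   # strong local evidence
print(choose_sources([0.30]))                         # weak local evidence
print(choose_sources([]))                             # nothing indexed for this query
print(choose_sources([0.90], needs_fresh_info=True))  # newer information needed
```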

Writer Agent

The role is:

It receives the retrieved information from the Retriever Agent.
It writes a grounded draft based on the available evidence.
It includes supporting citations from PDFs or web sources when they are available.

The code for the writer agent is available in worker_agents/writer.py. It uses gpt-5.4 with low reasoning effort.

Verifier Agent

The role is:

It receives the draft from the Writer Agent together with the evidence.
It checks whether the claims in the draft are supported by the retrieved evidence.
It returns the final verified response.

The code for the verifier agent is available in worker_agents/verifier.py. It uses gpt-5.4 with low reasoning effort.

Memory

SQLite is used to provide short-term memory for the multi-agent workflow. For a given session ID, the system stores:

the latest user query
the latest retrieved evidence for that session

This allows the orchestrator to reuse relevant evidence for follow-up questions instead of retrieving the same information again every time.

The code for the memory is available in memory/memory.py.
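
A minimal sketch of this kind of session memory, using only Python's built-in sqlite3 module, could look like the following. The SessionMemory class, the table name and the column names are assumptions for illustration and may differ from memory/memory.py.

```python
import json
import sqlite3
from typing import List, Optional, Tuple

class SessionMemory:
    """Stores the latest query and evidence per session ID (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS session_memory ("
            "session_id TEXT PRIMARY KEY, last_query TEXT, evidence TEXT)"
        )

    def save(self, session_id: str, query: str, evidence: List[str]) -> None:
        # INSERT OR REPLACE keeps only the latest row for each session
        self.conn.execute(
            "INSERT OR REPLACE INTO session_memory VALUES (?, ?, ?)",
            (session_id, query, json.dumps(evidence)),
        )
        self.conn.commit()

    def load(self, session_id: str) -> Optional[Tuple[str, List[str]]]:
        row = self.conn.execute(
            "SELECT last_query, evidence FROM session_memory WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

memory = SessionMemory()
memory.save("s1", "first question", ["evidence A"])
memory.save("s1", "follow-up question", ["evidence B", "evidence C"])
print(memory.load("s1"))
```

Using the session ID as the primary key means each new query overwrites the previous one, which matches the "latest query, latest evidence" behaviour described above.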

Orchestrator

The orchestrator coordinates the three worker agents: Retriever, Writer and Verifier.

How the Orchestrator coordinates the Multi-Agent Workflow

It receives the user query and, depending on the query, may respond directly or begin the evidence-based workflow.
For a research query, it first checks whether relevant cached evidence from the memory for the current session can be reused.
If cached evidence is not enough, it calls the Retriever Agent to gather evidence from PDFs, the web or both.
If there is document evidence but the evidence is weak, the Retriever Agent can also fetch up-to-date information from the web to supplement the local document information.
The orchestrator then passes the active evidence and the user query to the Writer Agent so it can generate a grounded draft.
Next, it sends the draft and evidence to the Verifier Agent, which checks the claims and returns the final verified report.
During the session, the latest query and retrieved evidence are stored in memory for follow-up questions.
In follow-up questions, the orchestrator may reuse cached evidence instead of calling the Retriever Agent again, then continue with the Writer Agent and Verifier Agent to generate the final response.

The code for the orchestrator is in orchestrator_agent.py. It uses gpt-5.4-mini with low reasoning effort.
The orchestrator has a guardrail that keeps the system focused on research and factual questions. It refuses unrelated general tasks such as coding help or simple math because the goal of the system is to function as a research assistant.
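
The coordination steps can be sketched with plain functions standing in for the three worker agents. Everything in this example is hypothetical: the Orchestrator class, the reuse_cache flag (in the project the orchestrator LLM decides whether cached evidence is relevant) and the lambdas that replace real LLM calls.

```python
from typing import Callable, Dict, List

class Orchestrator:
    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        write: Callable[[str, List[str]], str],
        verify: Callable[[str, List[str]], str],
    ) -> None:
        self.retrieve, self.write, self.verify = retrieve, write, verify
        self.cache: Dict[str, List[str]] = {}   # session_id -> latest evidence
        self.retriever_calls = 0

    def answer(self, session_id: str, query: str, reuse_cache: bool = False) -> str:
        # Reuse cached evidence for follow-ups when it is available
        evidence = self.cache.get(session_id) if reuse_cache else None
        if not evidence:
            evidence = self.retrieve(query)
            self.retriever_calls += 1
            self.cache[session_id] = evidence
        draft = self.write(query, evidence)     # grounded draft
        return self.verify(draft, evidence)     # final verified response

orchestrator = Orchestrator(
    retrieve=lambda q: [f"fact about {q}"],
    write=lambda q, ev: f"{q}: {ev[0]}",
    verify=lambda d, ev: f"[verified] {d}",
)
first = orchestrator.answer("s1", "topic X")
follow_up = orchestrator.answer("s1", "more on it", reuse_cache=True)
print(first)
print(follow_up)
print("retriever calls:", orchestrator.retriever_calls)
```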

Note: You can change the models used in the orchestrator and worker agents from gpt-5.4 to any OpenAI-provided model of your choice.

Setup Project

Prerequisites

Python 3.10 or newer
OpenAI API key: Create an OpenAI Account if you don’t have one and Generate an API Key.
Tavily API key: Tavily is a specialized web-search tool for AI agents. Create an account on Tavily.com. Once your profile is set up, an API key will be generated that you can copy into your environment. New accounts receive 1000 free credits that can be used for up to 1000 web searches.

Installation

Create and activate a virtual environment:

python3 -m venv env
source env/bin/activate

Install the dependencies:

cd multi-agent-rag-researcher
pip3 install -r utils/requirements.txt

Create a utils/var.env file and store your API keys:

OPENAI_API_KEY=your_openai_api_key
TAVILY_API_KEY=your_tavily_api_key
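
How the project reads this file is not shown in this post. As a rough idea of what loading such a file involves, here is a standard-library sketch. The load_env_file helper is hypothetical, and the demo writes a temporary file instead of touching utils/var.env.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict

def load_env_file(path: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines and expose them through os.environ."""
    loaded: Dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # skip blank lines, comments and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in loaded.items():
        os.environ.setdefault(key, value)  # do not override existing variables
    return loaded

# Demo with a temporary file standing in for utils/var.env
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / "var.env"
    env_path.write_text(
        "OPENAI_API_KEY=your_openai_api_key\n"
        "TAVILY_API_KEY=your_tavily_api_key\n"
    )
    keys = load_env_file(str(env_path))

print(sorted(keys))
```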

Place the PDFs you want to index in the docs/ folder, or upload PDFs later through the UI. The project already includes existing PDFs in docs/, currently Gemma 3 Technical Report.pdf and DeepSeek-V3.2.pdf, so you can use those directly or replace them with your own documents.

Run Project

Start the command-line app:

python3 run_orchestrator.py

When the CLI starts, it ingests the PDFs in docs/ into the local Qdrant store. Type q or exit to end the session.

Run UI for Multi-Agent Chat

Start the Gradio UI:

python3 ui/gradio_app.py

The UI automatically loads the default PDFs from docs/ on startup. If you upload new PDFs, they replace the active indexed document set for that UI session.

Demo Video Showing the Multi-Agent RAG Researcher in Action

Notes

Session memory is stored in utils/memory.db.
Local Qdrant data is stored in utils/qdrant_storage/.
The system is designed for research and factual question answering, not for unrelated general-purpose tasks.

Conclusion

In this post, I explained how an AI agent works, how it uses tools to interact with its environment, and how the ReAct approach helps it reason, plan, select tools and execute specific tasks.

I also covered the structural design of AI agents, which can be single-agent or multi-agent systems. I explained how both designs work, when to choose each one based on the workflow, and compared single-agent implementation with multi-agent architecture.

Finally, I did a walkthrough of the multi-agent design behind my Multi-Agent RAG Researcher project, showing how it uses an orchestrator to coordinate three worker agents, retrieve information from the web and local documents, use memory for consistency and write and verify grounded content before returning the final output.

Reach me via:

Email: olafenwaayoola@gmail.com
LinkedIn: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

References

https://developers.openai.com/cookbook

https://developers.openai.com/api/docs/guides/function-calling

Building an Agentic RAG with Function Calling

Ayoola Olafenwa — Fri, 31 Oct 2025 11:53:35 GMT

An AI agent connects a Large Language Model (LLM) to external tools, allowing it to interact with the world and complete tasks like writing code and building projects. Retrieval-Augmented Generation (RAG) improves LLM responses by retrieving relevant information. Agentic RAG systems apply the core ideas of AI agents to RAG, which makes them more accurate, more efficient and more autonomous.

Vanilla RAG vs Agentic RAG

A basic Retrieval-Augmented Generation (RAG) setup includes a Large Language Model (LLM) for text generation, an embedding model for finding similar text, and a vector database for fast storage and retrieval. For each user prompt, the embedding model processes the prompt and searches the vector database for similar context. That context is then added to the prompt and sent to the LLM to generate a more accurate response.

Basic Structure of Vanilla RAG

The Vanilla RAG system follows a static workflow and is not very efficient because the Large Language Model cannot decide when to retrieve information and when not to. It is also limited in retrieving data from multiple sources.

Basic Structure of an Agentic RAG

Agentic Retrieval-Augmented Generation (RAG) builds on standard RAG by adding automation that lets the Large Language Model decide when it should retrieve information and when it doesn’t need to. An Agentic RAG system can use multiple data sources for context, such as PDFs or web search. When a query is asked, the system first checks if it can answer from its own knowledge or if it needs to retrieve something.

For example, if the system has a PDF of financial records and the user asks about those records, it will retrieve the information from the PDF. If the user asks “What is the weather like today?”, it will run a web search. But if the user asks “Write a tutorial on Python,” it will generate the answer directly from what it already knows without calling any retrieval function.

Building an Agentic RAG with Function Calling

Function calling is a technique that connects a Large Language Model (LLM) to an external tool like an API or a database. It is used to create AI agents that let LLMs interact with tools.

In an agent, this usually works like this:

A function is defined to call a specific API, such as a weather API to fetch the latest forecast.
A function can also be defined in a RAG-based agent to extract information from unstructured data like PDFs.
Custom instructions, usually written as a JSON schema, tell the LLM what to do when it sees a certain task and which tool (function) it should call to solve it.

In this post, I will build an Agentic RAG system using function calling, with one function that retrieves information from a local document and another that searches the internet for up-to-date information.

Prerequisite

Create OpenAI account and generate an API key

1: Create an OpenAI Account if you don’t have one

2: Generate an API Key

Set up and Activate Environment

python3 -m venv env
source env/bin/activate

Export OpenAI API Key

export OPENAI_API_KEY="Your OpenAI API Key"

Setup Tavily for Web Search

Tavily is a specialized web-search tool for AI agents. Create an account on Tavily.com. Once your profile is set up, an API key will be generated that you can copy into your environment. New accounts receive 1000 free credits that can be used for up to 1000 web searches.

export TAVILY_API_KEY="Your Tavily API Key"

Install Packages

openai
tavily-python
qdrant-client
spacy
langchain
langchain-community
langchain-text-splitters
pypdf

Paste the packages above in a requirements.txt file and install using:

pip3 install -r requirements.txt

Process Local Document for RAG

The local document used here is the Veo 3 Model Card paper. Veo 3 is a state-of-the-art video generation model from Google. The goal of the agentic RAG project is to have an agent that answers from the local document whenever it is asked a question about the Veo 3 video model and performs a web search whenever a query requires up-to-date information.

Load and Process PDF File

Download Veo 3 Model Card paper. It is a 4-page PDF paper containing information on how Veo 3 was trained, the data used, its architecture, evaluation methods, and its ethics and safety.

from langchain_community.document_loaders import PyPDFLoader

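def load_pdf(pdf_path: str, page1_on_second: bool = True) -> str:
    """Load a PDF and return its text with a [PAGE n] marker before each page."""
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()  # one Document per PDF page
    parts = []

    for p in pages:
        # PyPDFLoader stores a 0-based page index in the metadata
        idx0 = p.metadata.get("page", 0)
        # when page1_on_second is True, the 0-based index is used as the label,
        # so the second PDF page is labelled 1; otherwise labels start at 1
        label = idx0 if page1_on_second else (idx0 + 1)
        parts.append(f"[PAGE {label}]\n{p.page_content.strip()}")

    return "\n\n".join(parts).strip()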

The load_pdf function extracts the content of a PDF file and returns it as a single string, with each page's text preceded by a [PAGE n] marker.

Data Chunking and Overlapping

LLMs can only read a fixed number of tokens per inference (the context window), which varies by model, from a few thousand to over a million. Since PDFs and other unstructured data often exceed that limit, chunking splits a document into smaller pieces that fit the window and support efficient retrieval. The main drawback is boundary loss, where important details can get cut at chunk edges. Overlapping fixes this by repeating a small slice of text between adjacent chunks, preserving continuity and improving accuracy and response quality in RAG.

import spacy
from typing import List
from langchain_text_splitters import TokenTextSplitter
import re

def chunk_data(
    text: str,
    chunk_size: int = 1500,
    chunk_overlap: int = 225
) -> List[str]:
    nlp = spacy.blank("en")
    if "sentencizer" not in nlp.pipe_names:
        nlp.add_pipe("sentencizer")

    splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        encoding_name="cl100k_base",
    )

    chunks: List[str] = []

    parts = re.split(r"\n?\[PAGE\s+(\d+)\]\n", text)
    it = iter(parts[1:])
    for page_label, page_text in zip(it, it):
        sentences = [s.text.strip() for s in nlp(page_text).sents if s.text.strip()]
        page_chunks = splitter.split_text("\n".join(sentences))
        for chunk in page_chunks:
            chunks.append(f"[Page {page_label}] {chunk}")

    return chunks

The chunk_data function takes in raw text, breaks it into page sections, cleans and splits each section into sentences with spaCy, then uses TokenTextSplitter (with the cl100k_base tokenizer) to generate overlapping 1,500-token chunks and tags each chunk with its page number.

Note: The chosen chunk size and overlap are fine for small PDFs (the Veo 3 model card paper is only 4 pages). These values can be adjusted for larger documents that may contain hundreds or thousands of pages.
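
To see what overlap does on a small scale, here is a toy sliding-window chunker. It works on a plain list of words instead of real tokenizer output, and chunk_with_overlap is a hypothetical helper for this illustration, not part of LangChain.

```python
from typing import List

def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap          # how far the window moves each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last window reached the end
    return chunks

words = "a b c d e f g h".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=2):
    print(chunk)
```

Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.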

Setup Qdrant for Vector Database

Qdrant is an open-source vector database used for storing and retrieving vector embeddings. I will use Qdrant as the vector store for the RAG system.

from qdrant_client.models import VectorParams, Distance
from qdrant_client.models import PointStruct
from qdrant_client import QdrantClient

qdrant_client = QdrantClient(":memory:")
collection_name = "vector_store"
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
)

This code initializes an in-memory Qdrant client and creates a collection called “vector_store” configured to store 1536-dimensional embeddings and compare them using cosine similarity to find semantically similar text.
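
For intuition, cosine similarity can be computed by hand on tiny vectors. The sketch below uses made-up 2-dimensional vectors instead of real 1536-dimensional embeddings, and it is not how Qdrant implements the search internally.

```python
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
documents: Dict[str, List[float]] = {
    "chunk_a": [0.9, 0.1],   # points in almost the same direction as the query
    "chunk_b": [0.0, 1.0],   # orthogonal to the query
}

best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)
```

A score close to 1 means the two vectors point in nearly the same direction, which is why the most similar chunk is the one with the highest score.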

Helper Code for uploading document to Vector Database

The next step is to define the code that uploads the processed PDF file to the Qdrant vector database. I will use an embedding model from the OpenAI API to generate vector embeddings for the text, which will then be stored in the vector database.

import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def create_upload_embeddings(chunks):
    model_name = "text-embedding-3-small"
    response = openai_client.embeddings.create(input=chunks, model=model_name)

    embeddings = [record.embedding for record in response.data]

    points = [
        PointStruct(
            id=idx,
            vector=vec,
            payload={"text": text},
        )
        for idx, (vec, text) in enumerate(zip(embeddings, chunks))
    ]

    qdrant_client.upsert(
        collection_name=collection_name,
        wait=True,
        points=points,
    )

The create_upload_embeddings function generates embeddings for the text chunks using the text-embedding-3-small model, then uploads each chunk and its embedding into the Qdrant collection for semantic search.

Upload Local Document (PDF) to Vector Store

text = load_pdf("Veo-3-Model-Card.pdf")
chunks = chunk_data(text)

create_upload_embeddings(chunks)

The PDF “Veo-3-Model-Card.pdf” is chunked using the chunk_data function and uploaded to the vector store with the create_upload_embeddings function.

Setup Tools (Functions) for Function Calling for the Agentic RAG

I will define two functions that provide different data sources for the RAG system: one for retrieving information from the PDF file and another for searching the web.

1. Function for Local Document (PDF)

def retrieve_document(query: str, top_k: int = 5) -> str:
    model_name = "text-embedding-3-small"
    query_embedding = openai_client.embeddings.create(
        input=[query], model=model_name
    ).data[0].embedding

    results = qdrant_client.query_points(collection_name, query_embedding, limit=top_k, with_payload=True)
    retrieved_texts = [output.payload["text"] for output in results.points if output.payload and "text" in output.payload]
    context = "\n\n---\n\n".join(retrieved_texts) if retrieved_texts else "no relevant context found"

    output = f"""Based on the following context:
        
        {context}
        

        Provide a relevant response to:

        
        {query}
        
        """.strip()
    
    return output

The retrieve_document function takes a user query, embeds it, searches Qdrant for the top 5 most similar chunks from the PDF and returns a formatted prompt that includes the retrieved context plus the original query.

2. Function for Web Search

A web search function is implemented with Tavily. It serves as the tool the agentic RAG uses when up-to-date information is required.

from tavily import TavilyClient

tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def web_search(query: str, num_results: int = 10):
    try:
        result = tavily.search(
            query=query,
            search_depth="basic",
            max_results=num_results,
            include_answer=False,       
            include_raw_content=False,
            include_images=False
        )

        results = result.get("results", [])

        return {
            "query": query,
            "results": results, 
            "sources": [
                {"title": r.get("title", ""), "url": r.get("url", "")}
                for r in results
            ]
        }

    except Exception as e:
        return {
            "error": f"Search error: {e}",
            "query": query,
            "results": [],
            "sources": [],
        }

The web_search function runs an internet search for a given query (up to 10 results by default) and returns a structured object with the query, the results and their sources, which the agent then uses as context for answering questions that need real-time information.

Create Tool Schema

The tool schema gives the AI model custom instructions on when it should call a tool. In this case, there are two tools: one for retrieving information from a document and another for performing a web search. The schema also specifies the conditions and actions to be taken when the model calls a tool. A JSON tool schema is defined below based on the OpenAI tool schema structure.

tool_schemas = [
   {
        "type": "function",
        "name": "retrieve_document",
        "description": """
        Search the internal PDF file containing Veo 3 Model Card.
        Use this tool when the user requests information about Veo 3
        that only appear in this document and
        for every answer you give include page-number citations in the form [page. X]. 
        """,
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Query to be searched in the PDF corpus.",
                },
            },
            "required": ["query"],
            "additionalProperties": False
        },
    },
    
   {
        "type": "function",
        "name": "web_search",
        "description": """Execute a web search to fetch up to date information. Synthesize a concise, 
        self-contained answer from the content of the results of the visited pages.
        Fetch pages, extract text, and provide the best available result while citing 1-3 sources (title + URL). 
        If sources conflict, surface the uncertainty and prefer the most recent evidence.
        """,
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Query to be searched on the web.",
                },
            },
            "required": ["query"],
            "additionalProperties": False
        },
    },
]

type: Specifies that the type of tool is a function.
name: The function name to be used for the tool call: the first tool schema has the function name retrieve_document, and the second tool schema has the function name web_search.
description: Describes what the AI model should do when calling a tool. The first tool schema instructs the model to use retrieve_document when it receives a query about the Veo 3 model. The second tool schema instructs the model to search the internet using the web_search function to fetch up-to-date information and extract relevant details to generate the best response.
strict: Set to True, it enables strict mode, so the arguments the model generates for a tool call must match the JSON schema defined under parameters.
parameters: Defines the parameters that will be passed into the functions in the tool schemas. In the first tool schema, query represents the question about the Veo 3 model to be searched in the PDF file, while in the second tool schema, query represents the search term to look up on the internet.
required: Tells the LLM that query is a mandatory parameter for both tools.
additionalProperties: Set to False, meaning that the tool's arguments object cannot include any parameters other than those defined under parameters.properties.
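
To show what the required and additionalProperties rules mean in practice, here is a small hand-written check on a tool call's JSON arguments. The validate_arguments helper is hypothetical and only covers flat schemas with string parameters. It is not part of the OpenAI SDK, which enforces the schema on its side when strict mode is enabled.

```python
import json
from typing import List

# Same shape as the "parameters" object in the tool schemas above
parameters = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
    "additionalProperties": False,
}

def validate_arguments(raw_arguments: str, schema: dict) -> List[str]:
    """Return a list of problems found in a JSON arguments string."""
    args = json.loads(raw_arguments)
    errors: List[str] = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    if schema.get("additionalProperties") is False:
        for name in args:
            if name not in schema["properties"]:
                errors.append(f"unexpected parameter: {name}")
    for name, spec in schema["properties"].items():
        if name in args and spec.get("type") == "string" and not isinstance(args[name], str):
            errors.append(f"{name} must be a string")
    return errors

print(validate_arguments('{"query": "Veo 3 training data"}', parameters))
print(validate_arguments('{"query": "weather", "limit": 3}', parameters))
print(validate_arguments('{}', parameters))
```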

Create Agentic RAG Using GPT-5 and Function Calling

Finally, I will build an agent that we can chat with. It will retrieve information from the PDF file for queries about the Veo 3 model and perform a web search when up-to-date information is needed. I will use GPT-5-mini, a fast and accurate model from OpenAI along with function calling to invoke the tool schemas and the web_search and retrieve_document functions already defined.

from datetime import datetime, timezone
import json

# tracker for the last model’s response id to maintain conversation’s state 
prev_response_id = None

# a list for storing tool’s results from the function call 
tool_results = []

while True:
    # if the tool results is empty prompt message 
    if len(tool_results) == 0:
        user_message = input("User: ")

        # commands for exiting chat 
        if isinstance(user_message, str) and user_message.strip().lower() in {"exit", "q"}:
            print("Exiting chat. Goodbye!")
            break

    else:
        # set the user’s messages to the tool results to be sent to the model 
        user_message = tool_results.copy()
    
        # clear the tool results for the next call 
        tool_results = []

    # obtain current’s date to be passed into the model as an instruction to assist in decision making
    today_date = datetime.now(timezone.utc).date().isoformat()     

    response = openai_client.responses.create(
        model = "gpt-5-mini",
        input = user_message,
        instructions=f"Current date is {today_date}.",
        tools = tool_schemas,
        previous_response_id=prev_response_id,
        text = {"verbosity": "low"},
        reasoning={
            "effort": "low",
        },
        store=True,
        )
    
    prev_response_id = response.id

    # Handles model response’s output 
    for output in response.output:
        
        if output.type == "reasoning":
            print("Assistant: ","Reasoning ....")

            for reasoning_summary in output.summary:
                print("Assistant: ",reasoning_summary)

        elif output.type == "message":
            for item in output.content:
                print("Assistant: ",item.text)

        # checks if the output type is a function call and append the function call’s results to the tool results list
        elif output.type == "function_call":
            # obtain function name 
            function_name = globals().get(output.name)
            # loads function arguments 
            args = json.loads(output.arguments)
            function_response = function_name(**args)
            # append tool results list with the the function call’s id and function’s response 
            tool_results.append(
                {
                    "type": "function_call_output",
                    "call_id": output.call_id,
                    "output": json.dumps(function_response)
                }
            )

Step by Step Code Breakdown

from openai import OpenAI
import os 

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
prev_response_id = None
tool_results = []

Two variables are initialized: prev_response_id and tool_results. prev_response_id stores the ID of the model’s most recent response so that conversation state is maintained, and tool_results is a list that stores the outputs returned from function calls.

Code Walkthrough of the Loop

if len(tool_results) == 0:
    user_message = input("User: ")
    if isinstance(user_message, str) and user_message.strip().lower() in {"exit", "q"}:
        print("Exiting chat. Goodbye!")
        break

else:
    user_message = tool_results.copy()
    tool_results = []

today_date = datetime.now(timezone.utc).date().isoformat()     

response = client.responses.create(
    model = "gpt-5-mini",
    input = user_message,
    instructions=f"Current date is {today_date}.",
    tools = tool_schemas,
    previous_response_id=prev_response_id,
    text = {"verbosity": "low"},
    reasoning={
        "effort": "low",
    },
    store=True,
    )

prev_response_id = response.id

Checks if tool_results is empty:

If it’s empty, ask the user for a message. The user can also type exit or q to quit.
If it’s not empty, set user_message to the collected tool outputs instead, then clear tool_results so they aren’t sent again.

Gets today_date so the model can reason with the current date.

Calls client.responses.create with:

model: gpt-5-mini
input: the user_message
instructions: the current date
tools: the tool schema
previous_response_id: the last response ID, to keep context
text.verbosity: low, for concise replies
reasoning.effort: low, for faster answers (can be high for harder tasks)
store: save this response for continuity

Finally, prev_response_id is updated to the new response ID so the next loop stays in the same conversation.

for output in response.output:
    if output.type == "reasoning":
        print("Assistant: ","Reasoning ....")

        for reasoning_summary in output.summary:
            print("Assistant: ",reasoning_summary)

    elif output.type == "message":
        for item in output.content:
            print("Assistant: ",item.text)

    elif output.type == "function_call":
        # obtain function name 
        function_name = globals().get(output.name)
        # loads function arguments 
        args = json.loads(output.arguments)
        function_response = function_name(**args)
        # append tool results list with the the function call’s id and function’s response 
        tool_results.append(
            {
                "type": "function_call_output",
                "call_id": output.call_id,
                "output": json.dumps(function_response)
            }
        )

Loops through response.output and handles each output type:

reasoning: print “Reasoning ….” and then print each item in the reasoning summary.
message: go through each content item and print its text.
function_call: get the function name, load its arguments, call the function (retrieve_document or web_search), and store the result.

After calling the function, append its response and call_id to tool_results. This lets the next loop send the tool result back to the model.
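The loop looks functions up with globals().get(output.name), which returns None for an unknown name and would then fail when called. Below is a minimal sketch of a more defensive alternative, assuming an explicit registry of allowed tools. TOOL_REGISTRY, run_tool, and the stub tool bodies are hypothetical names used only for illustration; they are not part of the original code.

```python
import json

# Stand-in tools for illustration; the real agent would register the
# retrieve_document and web_search functions defined earlier instead.
def retrieve_document(query: str) -> dict:
    return {"query": query, "results": ["stub passage"]}

def web_search(query: str) -> dict:
    return {"query": query, "results": ["stub web result"]}

# Hypothetical registry: only functions listed here can be called by the model.
TOOL_REGISTRY = {
    "retrieve_document": retrieve_document,
    "web_search": web_search,
}

def run_tool(name: str, arguments: str, call_id: str) -> dict:
    """Run one tool call and wrap the result as a function_call_output item."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        result = {"error": f"Unknown tool: {name}"}
    else:
        try:
            result = func(**json.loads(arguments))
        except (json.JSONDecodeError, TypeError) as e:
            # Malformed JSON, or arguments that do not match the function signature
            result = {"error": f"Bad arguments for {name}: {e}"}
    return {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(result),
    }
```

With a helper like this, the function_call branch could reduce to tool_results.append(run_tool(output.name, output.arguments, output.call_id)), and an error is reported back to the model instead of stopping the chat.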

Below is a sample output from the terminal.

User: What is the model architecture used in Veo3 model?
Assistant:  Reasoning ....
Assistant:  Reasoning ....
Assistant:  Veo 3 uses a latent diffusion architecture — diffusion applied to temporal audio latents and to spatio‑temporal video latents (latent diffusion model). [page. 1]

User: What is Apple’s latest MacBook model?
Assistant:  Reasoning ....
Assistant:  Reasoning ....
Assistant:  As of 2025, Apple’s latest MacBook is the MacBook Air (13-inch, M4, 2025). Source: Apple Newsroom - Apple introduces the new MacBook Air with the M4 chip (https://www.apple.com/newsroom/2025/03/apple-introduces-the-new-macbook-air-with-the-m4-chip-and-a-sky-blue-color/).
User: Write a code to return the product of a list of numbers. 
Assistant:  Reasoning ....
Assistant:  Python (works on all versions):
def product(nums):
    result = 1
    for n in nums:
        result *= n
    return result
# Examples
print(product([2, 3, 4]))  # 24
(If using Python 3.8+, you can also use math.prod: import math; math.prod(nums).)
User: q
Exiting chat. Goodbye!

In the results above, each response generated using a tool (function) call was returned with its corresponding source.

For the query “What is the model architecture used in the Veo 3 model?”, the agent generates:
“Veo 3 uses a latent diffusion architecture — diffusion applied to temporal audio latents and to spatio‑temporal video latents (latent diffusion model). [page. 1]”
It cites page 1 of the Veo 3 Model Card PDF, retrieved by the retrieve_document tool (function).
For the question “What is Apple’s latest MacBook model?”, the agent performs a web search using the web_search tool (function) and returns a response that cites the source URL: “As of 2025, Apple’s latest MacBook is the MacBook Air (13-inch, M4, 2025). Source: Apple Newsroom - Apple introduces the new MacBook Air with the M4 chip (https://www.apple.com/newsroom/2025/03/apple-introduces-the-new-macbook-air-with-the-m4-chip-and-a-sky-blue-color/).”
For the prompt “Write code to return the product of a list of numbers.”, the agent does not call any tool and generates the response directly.

Note

This is the Jupyter notebook for the full code.

Conclusion

In this article I explained RAG, and how agentic RAG improves vanilla RAG by allowing a Large Language Model to decide when to retrieve information and pull from multiple data sources. I walked through building an agentic RAG step by step using function calling, where the model can call two different functions to get information from a local document or from the web. By the end, we had an agentic RAG that can retrieve context from a PDF or the internet and generate a context-aware response for the user.

Check out my GitHub repo, GenAI-Courses, where I have published courses on various Generative AI topics.

Reach me via:

Email: olafenwaayoola@gmail.com
Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

References

https://platform.openai.com/docs/guides/function-calling?api-mode=responses

https://docs.tavily.com/documentation/api-reference/endpoint/search

How to Build a Web Search Agent with Function Calling and GPT-5

Ayoola Olafenwa — Mon, 29 Sep 2025 14:17:30 GMT

AI Agent and Large Language Models (LLMs)

Large language models (LLMs) are advanced AI systems built on deep neural networks such as transformers and trained on vast amounts of text to generate human-like language. LLMs like ChatGPT, Claude, Gemini and Grok can tackle many challenging tasks and are used across fields such as science, healthcare, education, and finance.

An AI agent extends the capabilities of LLMs to solve tasks that are beyond their pre-trained knowledge. An LLM can write a Python tutorial from what it learned during training. If you ask it to book a flight, however, the task requires access to your calendar, web search, and the ability to take actions, all of which fall beyond the LLM’s pre-trained knowledge. Some common examples include:

Weather forecast: The LLM connects to a web search tool to fetch the latest weather forecast.
Booking agent: An AI agent that can check a user’s calendar, search the web and visit a booking site like Expedia to find available flights and hotels, present the options to the user for confirmation, and complete the booking on the user’s behalf.

How an AI Agent Works

An AI agent is a system that uses a large language model to plan, reason, and take steps to interact with its environment, using the tools selected by the model’s reasoning, to solve a particular task.

Basic Structure of an AI Agent

Large Language Model (LLM): The LLM is the brain of an AI agent. It takes a user’s prompt, plans and reasons through the request, and breaks the problem into steps that determine which tools it should use to complete the task.
Tool: A tool is the mechanism that the agent uses to perform an action based on the plan and reasoning from the LLM. If you ask an agent to book a table for you at a restaurant, the tools it might use include a calendar to check your availability and a web search tool to access the restaurant’s website and make a reservation for you.

Illustrated Decision Making of a Booking AI Agent

AI agents can access different tools depending on the task. A tool might be a data store, such as a database. For example, a customer-support agent could access a customer’s account details and purchase history and decide when to retrieve that information to help resolve an issue.

AI agents are used to solve a wide range of tasks, and there are many powerful agents available. Coding agents, particularly agentic IDEs such as Cursor, Windsurf, and GitHub Copilot, help engineers write and debug code faster and build projects quickly. CLI coding agents like Claude Code and Codex CLI can interact with a user’s desktop and terminal to carry out coding tasks. ChatGPT supports agents that can perform actions such as booking reservations on a user’s behalf. Agents are also integrated into customer support workflows to communicate with customers and resolve their issues.

Function Calling

Function calling is a technique for connecting a large language model (LLM) to external tools such as APIs or databases, and it is the mechanism AI agents use to reach their tools. In function calling, each tool is defined as a code function (for example, a function that calls a weather API to fetch the latest forecast) along with a JSON Schema that specifies the function’s parameters and instructs the LLM on when and how to call the function for a given task.
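To make the pairing of a function and its schema concrete, here is a minimal sketch. The get_weather function is a hypothetical stub that returns placeholder data rather than calling a real weather API, and the schema follows the same flat structure as the tool schema used later in this post. No model is called; the arguments string is a hand-written example of what a model might return.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool: a real version would call a weather API."""
    return {"city": city, "forecast": "placeholder forecast"}

# JSON Schema telling the model what the tool does and which arguments it takes.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the latest weather forecast for a city.",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. London."},
        },
        "required": ["city"],
        "additionalProperties": False,
    },
}

# The model does not run the function itself. It returns the tool name and a
# JSON string of arguments; our code parses the string and makes the call.
model_arguments = '{"city": "London"}'  # hand-written example, not real model output
result = get_weather(**json.loads(model_arguments))
print(result)
```

The key point is the division of labour: the schema is what the model sees, while the Python function is what our own code executes.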

The type of function defined depends on the task the agent is designed to perform. For example, for a customer support agent we can define a function that can extract information from unstructured data, such as PDFs containing details about a business’s products.

In this post I will demonstrate how to use function calling to build a simple web search agent using GPT-5 as the large language model.

Basic Structure of Web Search Agent

The main logic behind the web search agent:

Define a code function to handle the web search.
Define custom instructions that guide the large language model in determining when to call the web search function based on the query. For example, if the query asks about the current weather, the web search agent will recognize the need to search the internet to get the latest weather reports. However, if the query asks it to write a tutorial about a programming language like Python, something it can answer from its pre-trained knowledge, it will not call the web search function and will respond directly instead.

Prerequisite

Create an OpenAI account and generate an API key

1: Create an OpenAI Account if you don’t have one

2: Generate an API Key

Set up and Activate Environment

python3 -m venv env
source env/bin/activate

Export OpenAI API Key

export OPENAI_API_KEY="Your Openai API Key"

Set Up Tavily for Web Search

export TAVILY_API_KEY="Your Tavily API Key"

Install Packages

pip3 install openai
pip3 install tavily-python
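Before running the agent, it can help to confirm that both keys are visible to Python. The check below is an optional sketch and not part of the original setup; REQUIRED_KEYS and missing_keys are hypothetical names.

```python
import os

# The two environment variables exported in the steps above
REQUIRED_KEYS = ["OPENAI_API_KEY", "TAVILY_API_KEY"]

def missing_keys(env=os.environ) -> list:
    """Return the names of required keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]

missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All API keys are set.")
```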

Step by Step Building A Web Search Agent with Function Calling

Step 1: Create a Web Search Function with Tavily

A web search function is implemented using Tavily, serving as the tool for function calling in the web search agent.

from tavily import TavilyClient
import os

tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def web_search(query: str, num_results: int = 10):
    try:
        result = tavily.search(
            query=query,
            search_depth="basic",
            max_results=num_results,
            include_answer=False,       
            include_raw_content=False,
            include_images=False
        )

        results = result.get("results", [])

        return {
            "query": query,
            "results": results, 
            "sources": [
                {"title": r.get("title", ""), "url": r.get("url", "")}
                for r in results
            ]
        }

    except Exception as e:
        return {
            "error": f"Search error: {e}",
            "query": query,
            "results": [],
            "sources": [],
        }

Web Search Function Code Breakdown

Tavily is initialized with its API key. In the web_search function, the following steps are performed:

The Tavily search function is called to search the internet and retrieve up to num_results results (10 by default).
The search results and their corresponding sources are returned. If the search fails, an error message is returned with empty results instead.

This returned output will serve as relevant context for the web search agent, which we will define later in this article, so that it can answer queries (prompts) that require real-time data, such as weather forecasts, with up-to-date information.
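web_search passes Tavily’s result dictionaries to the model unchanged. If the context gets long, one option is to trim each result before sending it. The sketch below assumes each result carries title, url, and content keys; that shape is an assumption here, so the code falls back to empty strings when a key is missing. compact_results and max_chars are hypothetical names, not part of the original code.

```python
def compact_results(payload: dict, max_chars: int = 500) -> dict:
    """Keep only title, url, and a truncated content snippet for each result."""
    compact = []
    for r in payload.get("results", []):
        compact.append({
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            # `or ""` guards against a missing or None content field
            "content": (r.get("content") or "")[:max_chars],
        })
    return {"query": payload.get("query", ""), "results": compact}

# Hand-made sample in the same shape that web_search returns
sample = {
    "query": "weather in London",
    "results": [
        {"title": "Example", "url": "https://example.com",
         "content": "x" * 2000, "score": 0.9},
    ],
}
trimmed = compact_results(sample, max_chars=100)
print(len(trimmed["results"][0]["content"]))
```

Trimming trades some detail for a smaller prompt, so the right max_chars depends on how much context the answers need.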

Step 2: Create Tool Schema

The tool schema defines custom instructions that tell an AI model when it should call a tool; in this case, the tool is a web search function. It also specifies the conditions and actions to be taken when the model calls the tool. A JSON tool schema is defined below, based on the OpenAI tool schema structure.

tool_schema = [
    {
        "type": "function",
        "name": "web_search",

        "description": """Execute a web search to fetch up to date information. Synthesize a concise, 
        self-contained answer from the content of the results of the visited pages.
        Fetch pages, extract text, and provide the best available result while citing 1-3 sources (title + URL). 
        If sources conflict, surface the uncertainty and prefer the most recent evidence.
        """,

        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Query to be searched on the web.",
                },
            },
            "required": ["query"],
            "additionalProperties": False
        },
    },
]

Tool Schema Properties

type: Specifies that the type of tool is a function.
name: The name of the function that will be used for the tool call, which is web_search.
description: Describes what the AI model should do when calling the web search tool. It instructs the model to search the internet using the web_search function to fetch up-to-date information and extract relevant details to generate the best response.
strict: Set to true so that the arguments the model generates adhere strictly to the tool schema.
parameters: Defines the parameters that will be passed into the web_search function. In this case, there is only one parameter, query, which represents the search term to look up on the internet.
required: Instructs the LLM that query is a mandatory parameter for the web_search function.
additionalProperties: Set to false, meaning that the tool’s arguments object cannot include any parameters other than those defined under parameters.properties.
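The effect of required and additionalProperties can be illustrated with a small local check. This is only a sketch of what the two fields mean; it is not how the API enforces the schema, and check_arguments is a hypothetical helper.

```python
def check_arguments(schema: dict, args: dict) -> list:
    """Return a list of problems found when comparing args to a tool schema."""
    params = schema["parameters"]
    allowed = set(params["properties"])
    problems = []
    # required: every listed parameter must be present
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    # additionalProperties = False: nothing outside `properties` is allowed
    if params.get("additionalProperties") is False:
        for name in sorted(set(args) - allowed):
            problems.append(f"unexpected parameter: {name}")
    return problems

schema = {  # trimmed copy of the web_search tool schema
    "name": "web_search",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
}

print(check_arguments(schema, {"query": "latest iPhone"}))  # valid arguments
print(check_arguments(schema, {"q": "latest iPhone"}))      # wrong parameter name
```

The first call reports no problems, while the second reports both a missing query and an unexpected q.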

Step 3: Create Web Search Agent Using GPT-5 and Function Calling

Finally, I will build an agent that we can chat with, which can search the web when it needs up-to-date information. I will use GPT-5-mini, a fast and accurate model from OpenAI, along with function calling to connect the model to the tool schema and the web search function already defined.

from datetime import datetime, timezone
import json
from openai import OpenAI
import os 

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# tracker for the last model's response id to maintain conversation's state 
prev_response_id = None

# a list for storing tool's results from the function call 
tool_results = []

while True:
    # if the tool results is empty prompt message 
    if len(tool_results) == 0:
        user_message = input("User: ")

        # commands for exiting chat
        if isinstance(user_message, str) and user_message.strip().lower() in {"exit", "q"}:
            print("Exiting chat. Goodbye!")
            break

    else:
        # set the user's messages to the tool results to be sent to the model 
        user_message = tool_results.copy()
    
        # clear the tool results for the next call 
        tool_results = []

    # obtain current's date to be passed into the model as an instruction to assist in decision making
    today_date = datetime.now(timezone.utc).date().isoformat()     

    response = client.responses.create(
        model = "gpt-5-mini",
        input = user_message,
        instructions=f"Current date is {today_date}.",
        tools = tool_schema,
        previous_response_id=prev_response_id,
        text = {"verbosity": "low"},
        reasoning={
            "effort": "low",
        },
        store=True,
        )
    
    prev_response_id = response.id

    # Handles model response's output 
    for output in response.output:
        
        if output.type == "reasoning":
            print("Assistant: ","Reasoning ....")

            for reasoning_summary in output.summary:
                print("Assistant: ",reasoning_summary)

        elif output.type == "message":
            for item in output.content:
                print("Assistant: ",item.text)

        # checks if the output type is a function call and append the function call's results to the tool results list
        elif output.type == "function_call":
            # obtain function name 
            function_name = globals().get(output.name)
            # loads function arguments 
            args = json.loads(output.arguments)
            function_response = function_name(**args)
            # append tool results list with the the function call's id and function's response 
            tool_results.append(
                {
                    "type": "function_call_output",
                    "call_id": output.call_id,
                    "output": json.dumps(function_response)
                }
            )

Step by Step Code Breakdown

from openai import OpenAI
import os 

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
prev_response_id = None
tool_results = []

The OpenAI client is initialized with an API key.
Two variables are initialized: prev_response_id and tool_results. prev_response_id stores the ID of the model’s most recent response so that conversation state is maintained, and tool_results is a list that stores the outputs returned from web_search function calls.

The chat runs inside the loop. A user enters a message, and the model, which is called with the tool schema, accepts the message, reasons over it, and decides whether to call the web search tool. If it does, the tool’s output is passed back to the model, which then generates a context-aware response. This continues until the user exits the chat.

Code Walkthrough of the Loop

if len(tool_results) == 0:
    user_message = input("User: ")
    if isinstance(user_message, str) and user_message.strip().lower() in {"exit", "q"}:
        print("Exiting chat. Goodbye!")
        break

else:
    user_message = tool_results.copy()
    tool_results = []

today_date = datetime.now(timezone.utc).date().isoformat()     

response = client.responses.create(
    model = "gpt-5-mini",
    input = user_message,
    instructions=f"Current date is {today_date}.",
    tools = tool_schema,
    previous_response_id=prev_response_id,
    text = {"verbosity": "low"},
    reasoning={
        "effort": "low",
    },
    store=True,
    )

prev_response_id = response.id

The code first checks whether tool_results is empty. If it is, the user is prompted to type a message, with the option to quit using exit or q.
If tool_results is not empty, user_message is set to the collected tool outputs so they can be sent to the model. tool_results is then cleared to avoid resending the same tool outputs on the next loop iteration.
The current date (today_date) is obtained so the model can make time-aware decisions.
client.responses.create is called to generate the model’s response. It accepts the following parameters:
- model: set to gpt-5-mini.
- input: the user’s message, or the tool outputs collected in the previous iteration.
- instructions: set to the current date (today_date).
- tools: set to the tool schema that was defined earlier.
- previous_response_id: set to the previous response’s ID so the model can maintain conversation state.
- text: verbosity is set to low to keep the model’s response concise.
- reasoning: GPT-5-mini is a reasoning model, so the reasoning effort is set to low for faster responses. For more complex tasks it can be set to high.
- store: tells the API to store the current response so it can be retrieved later, which helps with conversation continuity.
prev_response_id is set to the current response’s ID so the next API call continues the same conversation.

for output in response.output:
    if output.type == "reasoning":
        print("Assistant: ","Reasoning ....")

        for reasoning_summary in output.summary:
            print("Assistant: ",reasoning_summary)

    elif output.type == "message":
        for item in output.content:
            print("Assistant: ",item.text)

    elif output.type == "function_call":
        # obtain function name 
        function_name = globals().get(output.name)
        # loads function arguments 
        args = json.loads(output.arguments)
        function_response = function_name(**args)
        # append tool results list with the the function call's id and function's response 
        tool_results.append(
            {
                "type": "function_call_output",
                "call_id": output.call_id,
                "output": json.dumps(function_response)
            }
        )

The model’s response output is processed as follows:

If the output type is reasoning, print “Reasoning ....” and then each item in the reasoning summary.
If the output type is message, iterate through the content and print each text item.
If the output type is function_call, obtain the function by name, parse its arguments, and pass them to the function (web_search) to generate a response. In this case, the web search response contains up-to-date information relevant to the user’s message. Finally, the function’s response and the call ID are appended to tool_results. This lets the next loop iteration send the tool result back to the model.
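One consequence of this design is that a question which needs a search takes two model calls: one that returns the function call, and one that receives the tool output and produces the answer. The sketch below walks through that round trip offline. fake_model is a hypothetical stand-in for client.responses.create that makes no network calls, web_search is a stub, and run_turn is a hypothetical helper that mirrors the loop for a single user message.

```python
import json
from types import SimpleNamespace

def web_search(query: str) -> dict:
    # Stub standing in for the Tavily-backed function
    return {"query": query, "results": ["stub result"]}

def fake_model(model_input):
    """Hypothetical stand-in for client.responses.create (no network calls)."""
    if isinstance(model_input, str):
        # First call: the "model" decides it needs a web search
        call = SimpleNamespace(type="function_call", name="web_search",
                               arguments=json.dumps({"query": model_input}),
                               call_id="call_1")
        return SimpleNamespace(output=[call])
    # Second call: the input is a list of tool outputs, so answer from them
    text = SimpleNamespace(text=f"Answer based on {len(model_input)} tool result(s)")
    return SimpleNamespace(output=[SimpleNamespace(type="message", content=[text])])

def run_turn(user_message: str) -> list:
    """Run one user turn and return the assistant messages plus a call count."""
    transcript, model_input, calls = [], user_message, 0
    while True:
        response = fake_model(model_input)
        calls += 1
        tool_results = []
        for output in response.output:
            if output.type == "message":
                transcript.extend(item.text for item in output.content)
            elif output.type == "function_call":
                result = web_search(**json.loads(output.arguments))
                tool_results.append({"type": "function_call_output",
                                     "call_id": output.call_id,
                                     "output": json.dumps(result)})
        if not tool_results:
            # No pending tool outputs, so the turn is complete
            return transcript + [f"model calls: {calls}"]
        model_input = tool_results  # send tool outputs back on the next call

print(run_turn("weather in London"))
```

A stand-in like this can also be handy for testing the loop’s control flow without spending API credits.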

Full Code for Web Search Agent

from datetime import datetime, timezone
import json
from openai import OpenAI
import os 
from tavily import TavilyClient

tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def web_search(query: str, num_results: int = 10):
    try:
        result = tavily.search(
            query=query,
            search_depth="basic",
            max_results=num_results,
            include_answer=False,       
            include_raw_content=False,
            include_images=False
        )

        results = result.get("results", [])

        return {
            "query": query,
            "results": results, 
            "sources": [
                {"title": r.get("title", ""), "url": r.get("url", "")}
                for r in results
            ]
        }

    except Exception as e:
        return {
            "error": f"Search error: {e}",
            "query": query,
            "results": [],
            "sources": [],
        }


tool_schema = [
    {
        "type": "function",
        "name": "web_search",
        "description": """Execute a web search to fetch up to date information. Synthesize a concise, 
        self-contained answer from the content of the results of the visited pages.
        Fetch pages, extract text, and provide the best available result while citing 1-3 sources (title + URL).
        If sources conflict, surface the uncertainty and prefer the most recent evidence.
        """,
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Query to be searched on the web.",
                },
            },
            "required": ["query"],
            "additionalProperties": False
        },
    },
]

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# tracker for the last model's response id to maintain conversation's state 
prev_response_id = None

# a list for storing tool's results from the function call 
tool_results = []

while True:
    # if the tool results is empty prompt message 
    if len(tool_results) == 0:
        user_message = input("User: ")

        # commands for exiting chat
        if isinstance(user_message, str) and user_message.strip().lower() in {"exit", "q"}:
            print("Exiting chat. Goodbye!")
            break

    else:
        # set the user's messages to the tool results to be sent to the model 
        user_message = tool_results.copy()
    
        # clear the tool results for the next call 
        tool_results = []

    # obtain current's date to be passed into the model as an instruction to assist in decision making
    today_date = datetime.now(timezone.utc).date().isoformat()     

    response = client.responses.create(
        model = "gpt-5-mini",
        input = user_message,
        instructions=f"Current date is {today_date}.",
        tools = tool_schema,
        previous_response_id=prev_response_id,
        text = {"verbosity": "low"},
        reasoning={
            "effort": "low",
        },
        store=True,
        )
    
    prev_response_id = response.id


    # Handles model response's output 
    for output in response.output:
        
        if output.type == "reasoning":
            print("Assistant: ","Reasoning ....")

            for reasoning_summary in output.summary:
                print("Assistant: ",reasoning_summary)

        elif output.type == "message":
            for item in output.content:
                print("Assistant: ",item.text)

        # checks if the output type is a function call and append the function call's results to the tool results list
        elif output.type == "function_call":
            # obtain function name 
            function_name = globals().get(output.name)
            # loads function arguments 
            args = json.loads(output.arguments)
            function_response = function_name(**args)
            # append tool results list with the the function call's id and function's response 
            tool_results.append(
                {
                    "type": "function_call_output",
                    "call_id": output.call_id,
                    "output": json.dumps(function_response)
                }
            )

When you run the code, you can easily chat with the agent to ask questions that require the latest information, such as the current weather or the latest product releases. The agent responds with up-to-date information along with the corresponding sources from the internet. Below is a sample output from the terminal.

User: What is the weather like in London today?
Assistant:  Reasoning ....
Assistant:  Reasoning ....
Assistant:  Right now in London: overcast, about 18°C (64°F), humidity ~88%, light SW wind ~16 km/h, no precipitation reported. Source: WeatherAPI (current conditions) — https://www.weatherapi.com/

User: What is the latest iPhone model?
Assistant:  Reasoning ....
Assistant:  Reasoning ....
Assistant:  The latest iPhone models are the iPhone 17 lineup (including iPhone 17, iPhone 17 Pro, iPhone 17 Pro Max) and the new iPhone Air — announced by Apple on Sept 9, 2025. Source: Apple Newsroom — https://www.apple.com/newsroom/2025/09/apple-debuts-iphone-17/

User: Multiply 500 by 12.           
Assistant:  Reasoning ....
Assistant:  6000
User: exit   
Exiting chat. Goodbye!

You can see the results with their corresponding web sources. When you ask the agent to perform a task that doesn’t require up-to-date information, such as a maths calculation or writing code, it responds directly without any web search.

Note: The web search agent is a simple, single-tool agent. Advanced agentic systems orchestrate multiple specialized tools and use efficient memory to maintain context, plan, and solve more complex tasks.

Conclusion

In this post I explained how an AI agent works and how it extends the capabilities of a large language model to interact with its environment, perform actions and solve tasks through the use of tools. I also explained function calling and how it enables LLMs to call tools. I demonstrated how to create a tool schema for function calling that defines when and how an LLM should call a tool to perform an action. I defined a web search function using Tavily to fetch information from the web and then showed step by step how to build a web search agent using function calling and GPT-5-mini as the LLM. In the end, we built a web search agent capable of retrieving up-to-date information from the internet to answer user queries.

Check out my GitHub repo, GenAI-Courses, where I have published more courses on various Generative AI topics. It also includes a guide on building an Agentic RAG using function calling.

Reach me via:

Email: olafenwaayoola@gmail.com
Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

References

https://platform.openai.com/docs/guides/function-calling?api-mode=responses

https://docs.tavily.com/documentation/api-reference/endpoint/search

How Multimodal Large Language Model Works

Tue, 22 Apr 2025 09:05:15 GMT

Source

Multimodal Large Language Model

A multimodal large language model is an advanced large language model (LLM) designed to process and interpret multiple modalities such as text, images, and audio. It is used for tasks that go beyond text processing, such as answering questions based on documents, analyzing speech inputs, and describing image contents. Popular multimodal models include GPT-4 Vision, Gemini, Phi-4 Multimodal, and LLaMA Vision.

How a Basic Multimodal Model Works

Most multimodal models operate in the following manner: there is a base language model, such as GPT, which processes text and integrates other modalities through the interaction of an encoder, an adapter, and language integration mechanisms. A multimodal model that handles text and images typically works as follows:

Image By Author

Image Encoder: This is a deep neural network responsible for converting raw images into high-dimensional feature vectors known as image embeddings. These embeddings contain the information extracted from the image. Commonly used image encoders in multimodal models include CLIP and SigLIP.

Adapter: The image embeddings from the image encoder are not directly compatible with a language model. An adapter converts the image embeddings into a format the language model can accept. The type of image adapter depends on the architecture of the multimodal model. It can be a module such as a linear projection that maps the image embeddings to the dimension the language model expects. The image adapter makes it possible for a language model to treat images as if they were text tokens.

Language Integration: The adapted embeddings are integrated into the language model. This model combines the visual and textual information to produce the desired output.

This approach is used in solving different multimodal tasks such as:

Image Captioning: Generating a textual description of an image.
Visual Question Answering: Answering a question about an image.
Image-Text Retrieval: Matching images with relevant text descriptions.
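
The encoder, adapter and language integration steps described above can be sketched in a few lines. This is a minimal illustration, not how any particular model implements it: the embedding sizes, the random untrained projection and the helper names (linear, make_projection) are all assumptions made for the example.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_projection(in_dim, out_dim, seed=0):
    """Create a random, untrained linear projection purely for illustration."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


IMAGE_DIM = 6  # hypothetical image-encoder embedding size
TEXT_DIM = 4   # hypothetical language-model embedding size

# Pretend the image encoder produced 3 patch embeddings of size IMAGE_DIM.
image_embeddings = [[0.1 * (i + j) for j in range(IMAGE_DIM)] for i in range(3)]
# Pretend the language model embedded 2 text tokens of size TEXT_DIM.
text_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# Adapter: project every image embedding into the language model's dimension.
weight, bias = make_projection(IMAGE_DIM, TEXT_DIM)
adapted = [linear(vec, weight, bias) for vec in image_embeddings]

# Language integration: the model receives one sequence of same-sized vectors.
sequence = adapted + text_embeddings
print(len(sequence), len(sequence[0]))  # 5 4
```

After projection, the image vectors and the text vectors have the same size, which is what allows the language model to process them as a single sequence.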

To gain a deeper understanding of how a multimodal model works, I will review Phi-4 Multimodal, a state-of-the-art open-source multimodal model.

Review of Phi-4 Multimodal

Phi-4-Multimodal is a 5.6-billion-parameter multimodal model built upon Phi-4-Mini, a 3.8-billion-parameter language model. It supports multiple input modalities: text, image and audio.

For vision-related tasks, Phi-4-Multimodal was trained on high-quality multimodal instruction datasets covering tasks like general image understanding, OCR (Optical Character Recognition), chart interpretation, diagram comprehension, and video summarization.

For speech/audio tasks, the model was trained on approximately 2 million hours of anonymized speech-text pairs for automatic speech recognition (ASR), and on diverse audio instruction datasets for tasks like speech summarization, speech question-answering, automatic speech translation (AST), and audio understanding.

Phi-4-Multimodal achieves strong results on both vision and speech benchmarks. It rivals closed-source models such as Gemini and GPT-4o in various vision tasks, and at the time of writing it is ranked number 1 for speech recognition on the Hugging Face OpenASR Leaderboard.

Training Overview

The base language model, Phi-4 Mini, was frozen and new modalities (vision and audio) were incorporated into it using a technique called “Mixture of LoRAs”.

Mixture of LoRAs: LoRA (Low-Rank Adaptation) is a method for adapting a pre-trained model to new tasks or modalities without significantly altering its original parameters. It involves inserting small, trainable low-rank matrices (adapters) into specific layers of the pre-trained Transformer model, such as the linear layers within the attention and feed-forward modules.

In Phi-4 Multimodal, dedicated LoRA adapters were trained specifically for vision and audio tasks. These adapters were integrated into the linear layers of the frozen Phi-4 Mini model, enabling it to process new modalities without changing the original model weights.
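
The LoRA idea described above can be sketched as a forward pass through one linear layer. This is a toy example under stated assumptions: the weights, the rank-1 adapters and the per-modality dictionary are hypothetical values chosen so the arithmetic is easy to follow, and the code is not taken from the Phi-4 implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    rank = len(A)                     # A has shape (rank, in_features)
    base = matvec(W, x)               # output of the frozen layer
    update = matvec(B, matvec(A, x))  # low-rank correction, B has shape (out, rank)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Frozen 3x3 weight of the base model (toy values).
W = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]

# One hypothetical rank-1 adapter per modality; the base weight W is shared.
adapters = {
    "vision": {"A": [[0.5, 0.5, 0.5]], "B": [[1.0], [0.0], [-1.0]]},
    "audio": {"A": [[0.5, 0.5, 0.5]], "B": [[0.0], [0.0], [0.0]]},  # B = 0: untrained
}

x = [1.0, 2.0, 3.0]
vision_out = lora_forward(x, W, adapters["vision"]["A"], adapters["vision"]["B"], alpha=2)
audio_out = lora_forward(x, W, adapters["audio"]["A"], adapters["audio"]["B"], alpha=2)
print(vision_out)  # [7.0, 4.0, 3.0]
print(audio_out)   # [1.0, 4.0, 9.0]
```

With B set to zero the adapter contributes nothing, so the layer behaves exactly like the frozen base model. That is why adding an adapter does not disturb the original weights.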

Architecture Overview

Diagram from Phi-4 Paper

I will explain how the Phi-4 Multimodal model processes image and audio modalities.

Image Modality

Dynamic Cropping Strategy: Phi-4 uses this technique to divide an image into smaller crops so that different image resolutions can be handled before the crops are passed into the image encoder.
Vision (Image) Encoder: Phi-4 uses SigLIP-400M as its image encoder. This model converts cropped images into image embeddings that represent visual content and details.
Vision Projector: A 2-layer MLP (Multilayer Perceptron) module is used by Phi-4 to adapt image embeddings from the image encoder for the language model (Phi-4 Mini). It projects the extracted features from the image encoder into the embedding dimension supported by the language model.
Vision LoRA Adapter: The vision LoRA adapter integrated with the Phi-4 model allows the language model to process the adapted image embeddings from the vision projector and perform vision-language tasks, such as image captioning or visual question answering.

In summary, this is what happens within the image modality:

Step 1: The image is preprocessed via dynamic cropping and passed through the image encoder.

Step 2: The image encoder processes the cropped images, extracts visual information from them and passes it to the vision projector.

Step 3: The vision projector then transforms the image encoder’s output into the embedding dimension supported by the language model (Phi-4 Mini).

Step 4: Finally, the vision LoRA adapter injects the adapted features from the vision projector into the language model’s (Phi-4 mini) processing pipeline, allowing it to generate outputs that incorporate visual context. These connected components enable Phi‑4‑Multimodal to solve image tasks like image captioning and visual question answering.
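
To make the cropping step in the summary above more concrete, here is a simplified sketch that tiles an image into fixed-size crops. It is only an approximation of the idea: the crop size of 448 pixels and the plain grid tiling are assumptions for this example, and the actual dynamic cropping strategy in the technical report also involves resizing and padding.

```python
import math


def crop_grid(width, height, crop_size=448):
    """Return (columns, rows) of crop_size tiles needed to cover the image."""
    cols = math.ceil(width / crop_size)
    rows = math.ceil(height / crop_size)
    return cols, rows


def crop_boxes(width, height, crop_size=448):
    """List (left, top, right, bottom) boxes, clipped to the image borders."""
    cols, rows = crop_grid(width, height, crop_size)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * crop_size, r * crop_size
            right = min(left + crop_size, width)
            bottom = min(top + crop_size, height)
            boxes.append((left, top, right, bottom))
    return boxes


boxes = crop_boxes(1280, 853)
print(len(boxes))  # 6
print(boxes[-1])   # (896, 448, 1280, 853)
```

Each box could be passed to PIL's Image.crop before the crops are sent to the image encoder.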

Audio Modality

Audio Processing: Audio input is first transformed into 80-dimensional log-Mel filter-bank features, a compact representation of audio signals that captures their essential frequency components.
Audio Encoder: It consists of 3 convolution layers and 24 Conformer blocks (Conformer is an architecture specialized for audio processing). The encoder extracts sound information from the processed audio signals.
Audio Projector: Phi-4 uses a 2-layer MLP to project the extracted audio features from the audio encoder to the embedding dimension supported by the language model (Phi-4 Mini).
Audio LoRA Adapter: The audio LoRA adapter integrated with the Phi-4 model allows the language model to process the adapted embeddings from the audio projector and solve tasks like automatic speech recognition (speech-to-text), audio summarization, and audio question answering.

In summary, this is what happens within the audio modality:

Step 1: The audio (speech) input is preprocessed into 80-dimensional log-Mel filter-bank features and passed into the audio encoder.
Step 2: The audio encoder extracts sound information from the processed audio signals and passes it into the audio projector.
Step 3: The audio projector transforms the audio encoder’s output into the embedding dimension supported by the language model (Phi-4 Mini), aligning the audio features with the language model’s input format.
Step 4: Finally, the audio LoRA adapter injects these adapted audio features into the language model’s (Phi-4 Mini) processing pipeline, allowing the language model to generate outputs that incorporate audio context. Together, these connected components enable Phi‑4‑Multimodal to integrate speech data into its text-based generation framework and solve tasks like describing the content of an audio file.
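
The 80-dimensional log-Mel features used in Step 1 can be illustrated with the mel scale itself. The sketch below uses one common variant of the Hz-to-mel formula and assumes a frequency range of 0 to 8000 Hz; the article does not state the exact filter-bank settings used by Phi-4, so treat these numbers as placeholders.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (a widely used variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels=80, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


def log_mel(band_energies, floor=1e-10):
    """Take the log of each band energy, clamping tiny values to avoid log(0)."""
    return [math.log(max(energy, floor)) for energy in band_energies]


centers = mel_band_centers()
print(len(centers))  # 80
# Bands are narrow at low frequencies and wide at high frequencies.
print(centers[1] - centers[0] < centers[-1] - centers[-2])  # True
```

One audio frame therefore becomes a vector of 80 log energies, one per band, and the audio encoder receives a sequence of such vectors.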

Phi-4 Multimodal can process speech (audio) in multiple languages: English, Chinese, German, French, Italian, Japanese, Spanish and Portuguese.

Code for Running Phi-4 Multimodal

Install Required Packages

requirements.txt file

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2

Paste the packages above in a txt file and install using:

pip3 install -r requirements.txt

Load Model

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda",
    torch_dtype="auto",trust_remote_code=True,
    # if you do not use Ampere or later GPUs, change attention to "eager"
    _attn_implementation='flash_attention_2'
).cuda()

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt template
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

Step By Step Code Breakdown

Loaded the Phi-4 multimodal model from Hugging Face Transformers, along with the processor for input preprocessing during inference.
Defined the generation configuration to be used during inference.
Created a prompt template for formatting prompts before passing them to the model.
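
The prompt template above can be wrapped in a small helper so that image and audio prompts are built the same way. This is a sketch only: build_prompt is a hypothetical helper, and placing image tags before audio tags when both are present is an assumption rather than something the article specifies.

```python
USER = "<|user|>"
ASSISTANT = "<|assistant|>"
END = "<|end|>"


def build_prompt(text, num_images=0, num_audios=0):
    """Format a user message with numbered image/audio placeholder tags."""
    image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audio_tags = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"{USER}{image_tags}{audio_tags}{text}{END}{ASSISTANT}"


image_prompt = build_prompt("What is shown in this image?", num_images=1)
audio_prompt = build_prompt("Transcribe the audio to text", num_audios=1)
print(image_prompt)
# <|user|><|image_1|>What is shown in this image?<|end|><|assistant|>
print(audio_prompt)
# <|user|><|audio_1|>Transcribe the audio to text<|end|><|assistant|>
```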

Function for Running an Image

import requests
from PIL import Image
from urllib.parse import urlparse

def process_image(prompt, image_path):
    prompt = f'{user_prompt}<|image_1|>{prompt}{prompt_suffix}{assistant_prompt}'
    
    # Check if the image_path is a URL or a local file
    parsed_url = urlparse(image_path)
    is_url = bool(parsed_url.scheme and parsed_url.netloc)
    
    if is_url:
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)
    
    # Process image
    inputs = processor(text=prompt, images=image,  return_tensors='pt').to('cuda:0')

    # Generate output
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=1000,
        generation_config=generation_config,
    )

    # Retrieve response from the generated output
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    
    return response

Step By Step Code Breakdown

The function process_image accepts a prompt and an image path. It performs the following tasks:

Loads the image and passes both the prompt and image to the processor for preprocessing.
The processed inputs from the processor are passed to the generate function to produce an output based on the image and prompt.
The generated output is post-processed to extract the response, which is then returned.
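
Two small steps inside process_image are easy to miss: deciding whether the path is a URL, and removing the prompt tokens from the generated output. The sketch below isolates both using plain Python lists instead of tensors. The helper names (is_url, strip_prompt_tokens) and the token ids are made up for the example.

```python
from urllib.parse import urlparse


def is_url(path):
    """True when the path has both a scheme (e.g. https) and a network location."""
    parsed = urlparse(path)
    return bool(parsed.scheme and parsed.netloc)


def strip_prompt_tokens(generated_ids, prompt_length):
    """Drop the first prompt_length ids of every row, keeping only new tokens."""
    return [row[prompt_length:] for row in generated_ids]


print(is_url("https://example.com/bird.jpg"))  # True
print(is_url("images/bird.jpg"))               # False

# A batch with one sequence: 4 prompt token ids followed by 2 generated ids.
new_tokens = strip_prompt_tokens([[11, 12, 13, 14, 901, 902]], prompt_length=4)
print(new_tokens)  # [[901, 902]]
```

The slicing mirrors generate_ids[:, inputs['input_ids'].shape[1]:] in the function, which is needed because generate returns the prompt followed by the newly generated tokens.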

Sample Image

Source

image_path = "https://cdn.pixabay.com/photo/2015/11/29/13/08/kingfisher-1068684_1280.jpg"
prompt = "What is shown in this image?"
process_image(prompt, image_path)

Generated Response

The image depicts a kingfisher in mid-flight, with its wings spread wide 
as it dives towards the water. The bird is captured in a moment of action, 
with water droplets visible around it, indicating the speed of its descent. 
The background is blurred, focusing the viewer's attention on the bird and 
the splashing water.

Sample Image 2

Source

image_path = "https://cdn.pixabay.com/photo/2016/11/29/04/54/photographer-1867417_1280.jpg"
prompt = "Describe this image."
process_image(prompt, image_path)

Generated Response

The image shows a person with a blurred face, wearing a black jacket 
and a watch, holding a Sony camera with a lens cap on. 
The background features a body of water and a hilly landscape.

The model was able to generate accurate responses describing the images.

Function for Running on Audio

import soundfile as sf
import io
from urllib.request import urlopen

def process_audio(prompt, audio_path):
    prompt = f'{user_prompt}<|audio_1|>{prompt}{prompt_suffix}{assistant_prompt}'
    # Download and open audio file
    audio, samplerate = sf.read(io.BytesIO(urlopen(audio_path).read()))
    
    # Process with the audio
    inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

    # Generate response
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=1000,
        generation_config=generation_config,
    )
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return response

Step By Step Code Breakdown

The function process_audio accepts a prompt and an audio path. It performs a similar task to the image processing function. It does the following:

Loads the audio file and passes both the prompt and audio to the processor for preprocessing.
The processed inputs from the processor are passed to the generate function to produce an output based on the audio and prompt.
The generated output is post-processed to extract the response, which is then returned.

audio_prompt = """Transcribe the audio to text"""
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
print(process_audio(audio_prompt, audio_url))

Generated Response from Audio

what we do as a society we have to think about where we're moving to i frequently talk to students 
about cognitive enhancing drugs and a lot of students take them for 
studying and exams but other students feel angry about this they feel 
those students are cheating and we have no long-term health and safety 
studies in healthy people and we really need those 
before people start taking them.

Read more about the full technical details of Phi-4 mini and Phi-4 Multimodal from the paper:

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Conclusion

I explained how a basic multimodal model works, including the fundamental logic behind how most multimodal models that process text and images align these modalities, allowing images to be treated similarly to text to solve visual tasks. I reviewed the Phi-4 multimodal model, describing how it was trained to handle image and audio modalities alongside text. Additionally, I explained step by step the architectural design of the Phi-4 multimodal model, detailing how it processes image and audio (speech) inputs and adapts to tasks such as image captioning and audio description. I also provided code samples demonstrating how to run the Phi-4 multimodal model on images and audio to generate responses.

Reach me via:

Email: olafenwaayoola@gmail.com

LinkedIn: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

Deploying Large Language Models: vLLM and Quantization

Ayoola Olafenwa — Fri, 05 Apr 2024 08:49:31 GMT

source

Deployment of Large Language Models (LLMs)

We live in an amazing time of Large Language Models like ChatGPT, GPT-4, and Claude that can perform multiple amazing tasks. In practically every field, ranging from education, healthcare to arts and business, Large Language Models are being used to facilitate efficiency in delivering services. Over the past year, many brilliant open-source Large Language Models, such as Llama, Mistral, Falcon, and Gemma, have been released. These open-source LLMs are available for everyone to use, but deploying them can be very challenging as they can be very slow and require a lot of GPU compute power to run for real-time deployment. Different tools and approaches have been created to simplify the deployment of Large Language Models.

Many deployment tools have been created for serving LLMs with faster inference, such as vLLM, CTranslate2, TensorRT-LLM, and llama.cpp. Quantization techniques are also used to reduce the GPU memory needed to load very large language models. In this article, I will explain how to deploy Large Language Models with vLLM and quantization.

Latency and Throughput

Some of the major factors that affect the speed performance of a Large Language Model are GPU hardware requirements and model size. The larger the size of the model, the more GPU compute power is required to run it. Common benchmark metrics used in measuring the speed performance of a Large Language Model are Latency and Throughput.

Latency: This is the time required for a Large Language Model to generate a response. It is usually measured in seconds or milliseconds.

Throughput: This is the number of tokens generated per second or millisecond from a Large Language Model.
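
The two metrics defined above can be measured with a small timing helper. The sketch below uses a fake generator in place of a real model so it runs anywhere; benchmark and fake_generate are hypothetical names, and the sleep time is arbitrary.

```python
import time


def benchmark(generate, prompt):
    """Return (latency in seconds, tokens per second, tokens) for one call."""
    start = time.perf_counter()
    tokens = generate(prompt)
    latency = time.perf_counter() - start
    throughput = len(tokens) / latency if latency > 0 else float("inf")
    return latency, throughput, tokens


def fake_generate(prompt, num_tokens=20, seconds_per_token=0.001):
    """Stand-in for a real model: sleeps briefly per token, returns dummy tokens."""
    tokens = []
    for i in range(num_tokens):
        time.sleep(seconds_per_token)
        tokens.append(i)
    return tokens


latency, throughput, tokens = benchmark(fake_generate, "Hello")
print(f"Latency: {latency:.4f} seconds")
print(f"Throughput: {throughput:.1f} tokens/second")
```

Replacing fake_generate with a function that calls a real model and returns its generated token ids gives the same two numbers for that model.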

Install Required Packages

Below are the two required packages for running a Large Language Model: Hugging Face transformers and accelerate.

pip3 install transformers
pip3 install accelerate

Run Phi-2 Model

Phi-2 is a state-of-the-art foundation model from Microsoft with 2.7 billion parameters. It was pre-trained with a variety of data sources, ranging from code to textbooks. Learn more about Phi-2 from here.

Benchmarking LLM Latency and Throughput with Hugging Face Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype = "auto", trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = "Generate a python code that accepts a list of numbers and returns the sum."

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)

Loaded Phi-2 model and its tokenizer.
Created the prompt “Generate a python code that accepts a list of numbers and returns the sum.” to be passed to the model
Tokenized the prompt.

start = time.time()
response = model.generate(**inputs, max_new_tokens=200,
                          # temperature only takes effect when sampling is enabled
                          do_sample=True, temperature=0.5)
end = time.time()

latency = end-start
print(f"Latency: {latency} seconds")

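# response[0] contains the prompt tokens followed by the generated tokens,
# so this count (and the throughput below) includes the prompt.
output_tokens = len(response[0])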
through_put = output_tokens / latency
print(f"Throughput: {through_put} tokens/second")

text = tokenizer.decode(response[0])
print(text)

Generated a response from the model.
Obtained the latency by calculating the time required to generate the response.
Obtained the total length of tokens in the response generated, divided it by the latency and calculated the throughput.
Printed the generated response

Generated Output

Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_list([1, 2, 3, 4, 5]))

This model was run on an A1000 (16GB GPU), and it achieves a latency of 2.7 seconds and a throughput of 32 tokens/second.

Deployment of A Large Language Model with vLLM

vLLM is an open source LLM library for serving Large Language Models at low latency and high throughput.

How vLLM works

The transformer is the building block of Large Language Models. The transformer network uses a mechanism called attention to understand the context of words. During generation, the attention mechanism keeps key and value tensors for every token processed so far, often called the KV cache. The memory used by these attention keys and values limits how many requests can be processed at once and therefore affects the speed of the model.

vLLM introduced a new attention mechanism called PagedAttention that efficiently manages the allocation of memory for the transformer’s attention keys and values during the generation of tokens. The memory efficiency of vLLM has proven very useful in running Large Language Models at low latency and high throughput. This is a high-level explanation of how vLLM works. To learn more in-depth technical details, read the vLLM documentation.
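
The memory management idea behind PagedAttention can be illustrated with a toy block allocator. This is a conceptual sketch, not vLLM's implementation: the class name, the block size and the bookkeeping are assumptions, and no actual key or value tensors are stored.

```python
class PagedKVCache:
    """Toy allocator: each sequence's KV cache lives in fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")
cache.append_token("request-b")
print(cache.block_tables)  # {'request-a': [0, 1], 'request-b': [2]}

cache.free("request-a")
print(cache.free_blocks)   # [3, 0, 1]
```

Because blocks are handed out on demand and returned as soon as a request finishes, memory is not reserved up front for the longest possible output, which leaves room to serve more requests at the same time.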

Install vLLM

pip3 install vllm==0.3.3

Run Phi-2 with vLLM

from vllm import LLM, SamplingParams
import time

llm = LLM("microsoft/phi-2", trust_remote_code = True)

sampling_params = SamplingParams(temperature = 0.5, max_tokens = 200)

Imported required packages from vLLM for running Phi-2.
Loaded Phi-2 with vLLM.
Set important parameters for model’s generation.

prompt = "Generate a python code that accepts a list of numbers and returns the sum."

start = time.time()
response = llm.generate(prompt, sampling_params)
end = time.time()

latency = end-start
print(f"Latency: {latency}seconds")

output_tokens = len(response[0].outputs[0].token_ids)

through_put = output_tokens / latency

print(f"Throughput: {through_put}tokens/second")

generated_text = response[0].outputs[0].text
print(generated_text)

Defined the prompt, “Generate a python code that accepts a list of numbers and returns the sum.”
Generated the model’s response using llm.generate and computed the latency.
Obtained the length of total tokens generated from the response.
Divided the length of tokens by the latency to get the throughput
Printed the generated text.

Generated Output

Latency: 1.218436622619629seconds
Throughput: 63.15334836428132tokens/second
 [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))

I ran Phi-2 with vLLM on the same prompt, “Generate a python code that accepts a list of numbers and returns the sum.” On the same GPU, an A1000 (16GB GPU), vLLM produces a latency of 1.2 seconds and a throughput of 63 tokens/second, compared to Hugging Face transformers’ latency of 2.7 seconds and a throughput of 32 tokens/second. Running a Large Language Model with vLLM produces an equally accurate result as using Hugging Face, with much lower latency and higher throughput.

Note

The metrics (latency and throughput) I obtained for vLLM are rough estimates of vLLM performance. The model generation speed depends on many factors, such as the length of the input prompt and the size of the GPU. According to the official vLLM report, running an LLM on a powerful GPU like the A100 in a production setting with vLLM achieves up to 24 times higher throughput than Hugging Face Transformers.

Benchmarking Latency and Throughput in Real Time

The way I calculated the latency and throughput for running Phi-2 is experimental, and I did this to explain how vLLM accelerates a Large Language Model’s performance. In the real-world use case of LLMs, such as a chat-based system where the model outputs a token as it is generated, measuring the latency and throughput is more complex.

A chat-based system is based on streaming output tokens. Some of the major factors that affect the LLM metrics are Time to First Token (the time required for a model to generate the first token), Time Per Output Token (the time spent per output token generated), the input sequence length, the expected output, the total expected output tokens, and the model size. In a chat-based system, the latency is usually a combination of Time to First Token and Time Per Output Token multiplied by the total expected output tokens.
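
The relationship between these streaming metrics can be sketched as follows. The example measures Time to First Token and Time Per Output Token from a fake token stream and applies the latency estimate described above. fake_stream is a hypothetical stand-in for a streaming model, and the numbers passed to estimated_latency are illustrative only.

```python
import time


def estimated_latency(ttft, tpot, output_tokens):
    """Latency estimate from the text: TTFT + TPOT * number of output tokens."""
    return ttft + tpot * output_tokens


def measure_stream(token_stream):
    """Consume a token iterator and return (ttft, tpot, token_count) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        count += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    if count == 0:
        raise ValueError("the stream produced no tokens")
    ttft = first_token_at - start
    # Tokens after the first one share the remaining time equally.
    tpot = (end - first_token_at) / (count - 1) if count > 1 else 0.0
    return ttft, tpot, count


def fake_stream(num_tokens=5, delay=0.002):
    """Hypothetical stand-in for a streaming model: yields dummy tokens slowly."""
    for i in range(num_tokens):
        time.sleep(delay)
        yield f"tok{i}"


ttft, tpot, count = measure_stream(fake_stream())
print(f"TTFT={ttft:.4f}s TPOT={tpot:.4f}s tokens={count}")
print(estimated_latency(ttft=0.2, tpot=0.02, output_tokens=200))  # about 4.2 seconds
```

In this estimate a long answer is dominated by the per-token time, while a short answer is dominated by the time to first token.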

The longer the input sequence length passed into a model, the slower the response. Some of the approaches used in running LLMs in real-time involve batching users’ input requests or prompts to perform inference on the requests concurrently, which helps in improving the throughput. Generally, using a powerful GPU and serving LLMs with efficient tools like vLLM improves both the latency and throughput in real-time.

Run the vLLM deployment on Google Colab

Quantization of Large Language Models

Quantization is the conversion of a machine learning model from a higher precision to a lower precision by storing the model’s weights in fewer bits, usually 8-bit or 4-bit. Deployment tools like vLLM are very useful for serving Large Language Models at low latency and high throughput, but they do not remove the memory needed to hold the weights. We are able to run Phi-2 with Hugging Face and vLLM conveniently on the T4 GPU on Google Colab because it is a smaller LLM with 2.7 billion parameters. In contrast, a 7-billion-parameter model like Mistral 7B cannot be run on Colab with either Hugging Face or vLLM without quantization. When GPU availability is limited and we need to run a very large language model, quantization is a practical way to load it on constrained hardware.
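
A quick back-of-the-envelope calculation shows why precision matters for memory. The sketch below counts only the weights and treats Mistral 7B as exactly 7 billion parameters, which is an approximation; activations and the attention cache need additional memory on top of this.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model weights in {name}: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B model weights in 16-bit: 14.0 GB
# 7B model weights in 8-bit: 7.0 GB
# 7B model weights in 4-bit: 3.5 GB

print(f"Phi-2 weights in 16-bit: {weight_memory_gb(2.7e9, 16):.1f} GB")
# Phi-2 weights in 16-bit: 5.4 GB
```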

BitsandBytes

bitsandbytes is a Python library that provides custom quantization functions for shrinking a model’s weights into lower bits (8-bit and 4-bit).

Install BitsandBytes

pip3 install bitsandbytes

Quantization of Mistral 7B Model

Mistral 7B, a 7-billion-parameter model from MistralAI, is one of the best state-of-the-art open-source Large Language Models. I will go through a step-by-step process of running Mistral 7B with different quantization techniques that can be run on the T4 GPU on Google Colab.

Quantization with 8-bit Precision

This is the conversion of a machine learning model’s weights into 8-bit precision. BitsandBytes has been integrated with Hugging Face transformers, so a language model can be loaded using the same Hugging Face code with minor modifications for quantization.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map = "auto",quantization_config = quant_config)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

Imported the needed packages for running the model, including the BitsAndBytesConfig class.
Defined the quantization config and set the parameter load_in_8bit to True for loading the model’s weights in 8-bit precision.
Passed the quantization config into the function for loading the model, and set the parameter device_map to "auto" so that the model is automatically placed on the available GPU memory.
Loaded the tokenizer.
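
To give an intuition for what storing weights in 8 bits means, here is a simplified absmax quantization round trip. This is not the algorithm bitsandbytes uses, which is more involved and treats outlier values separately; the function names and the sample weights are made up for the example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # [50, -127, 2, 100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # small rounding error
```

Each weight now needs a single byte plus one shared scale, at the cost of a rounding error that is at most half of the scale.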

Quantization with 4-bit Precision

This is the conversion of a machine learning model’s weights into 4-bit precision.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2",device_map = "auto", quantization_config = quant_config)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

The code for loading Mistral 7B in 4-bit precision is similar to that of 8-bit precision except for a few changes:

Changed load_in_8bit to load_in_4bit.
A new parameter, bnb_4bit_compute_dtype, is introduced into the BitsAndBytesConfig to perform the model’s computation in bfloat16. The weights are stored in 4-bit precision, and bfloat16 is the data type used when they are dequantized for computation, which gives faster inference than the default float32. This parameter applies only to 4-bit loading; there is no equivalent bnb_8bit_compute_dtype parameter for 8-bit loading.

NF4(4-bit Normal Float) and Double Quantization

NF4 (4-bit Normal Float) from QLoRA is a 4-bit data type designed for normally distributed weights, and it generally yields better results than standard 4-bit quantization. It is commonly combined with double quantization, where quantization occurs twice: the quantization constants (scaling factors) produced by the first stage of quantization are themselves quantized in a second stage, which reduces the memory needed to store them. According to the QLoRA paper, NF4 with double quantization showed no significant drop in accuracy in the reported experiments. Read more in-depth technical details about NF4 and Double Quantization from the QLoRA paper.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map = "auto", quantization_config = quant_config)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

The code for loading the model with NF4 and double quantization is similar, with extra parameters set in the BitsAndBytesConfig:

load_in_4bit: Loading the model in 4-bit precision is set to True.
bnb_4bit_quant_type: The type of quantization is set to nf4.
bnb_4bit_use_double_quant: Double quantization is set to True.
bnb_4bit_compute_dtype: The bfloat16 computation data type is used for faster inference.

The model’s weights and tokenizer are then loaded as before.
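
The benefit of double quantization mentioned above can be estimated with simple arithmetic on the storage overhead of the quantization constants. The default values below (blocks of 64 weights, 32-bit constants, a second stage with 8-bit constants in groups of 256) follow my reading of the QLoRA paper and should be checked against it before being relied on.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_bits=8, second_bits=32):
    """Extra bits per weight when the constants themselves are quantized."""
    first_level = first_bits / block_size
    second_level = second_bits / (block_size * second_block_size)
    return first_level + second_level


single = constant_overhead_bits()
double = double_quant_overhead_bits()
print(single)  # 0.5
print(double)  # 0.126953125

# Approximate saving for a model with 7 billion weights, in gigabytes.
saving_gb = (single - double) * 7e9 / 8 / 1e9
print(f"{saving_gb:.2f} GB saved")  # 0.33 GB saved
```

The saving per weight is small, but it adds up across billions of weights, which is why it is enabled together with NF4 in the config above.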

Full Code for Model Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

#uncomment for 8bit precision
"""quant_config = BitsAndBytesConfig(
    load_in_8bit=True
)"""

#uncomment for 4bit precision
"""quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
) """

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map = "auto",quantization_config = quant_config)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

prompt = [
    {"role": "user", "content": "What is Natural Language Processing?"}
]
inputs = tokenizer.apply_chat_template(prompt, return_tensors="pt")

model_inputs = inputs.to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=200, do_sample=True)

decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Generated Output

 [INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computer science that deals with the interaction between computers and human language. Its main objective is to read, decipher, 
understand, and make sense of the human language in a valuable way. It can be used for various tasks such as speech recognition, 
text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, name entity recognition, 
summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
 and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.

Quantization is a very good approach for running very Large Language Models on smaller GPUs and can be applied to models such as Llama 70B, Falcon 40B, and mpt-30b. According to the LLM.int8 paper, very Large Language Models suffer less from accuracy drops when quantized compared to smaller ones. Quantization is therefore most beneficial for very Large Language Models, while smaller models may lose more accuracy.

Run Mistral 7B Quantization on Google Colab

Conclusion

In this article, I provided a step-by-step approach to measuring the speed performance of a Large Language Model, explained how vLLM works, and how it can be used to improve the latency and throughput of a Large Language Model. Finally, I explained quantization and how it is used to load Large Language Models on small-scale GPUs.

Reach me via:

Email: olafenwaayoola@gmail.com

LinkedIn: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

References

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

What are Quantized LLMs

Understanding performance benchmarks for LLM inference