AI models are becoming increasingly adept at processing data beyond just text. They can now take in images, audio, video, and even documents like PDFs. This ability to handle multiple types of input is known as multimodality, and it's starting to shape how developers build smarter, more flexible systems.
In LangChain, multimodality is beginning to show up in different parts of the stack, from chat models that can describe images to embedding models that could one day support audio or video search. It’s still early, and the tooling isn’t fully standardized, but support is growing fast.
So, here’s the exciting thing: We are currently working on Langcasts.com, a resource crafted specifically for AI engineers, whether you're just getting started or already deep in the game. We'll be sharing guides, tips, hands-on walkthroughs, and extensive classes to help you master every piece of the puzzle. If you’d like to be notified the moment new materials drop, you can subscribe here to get updates directly.
This guide will walk you through what multimodality looks like in LangChain today. You’ll see what’s possible right now, where the limitations are, and how to start experimenting with models that go beyond plain text.
What Is Multimodality?
Multimodality means working with more than one type of input data. In practice, this could be anything from combining text and images in a single input to building systems that understand video, audio, or entire documents.
For AI engineers, this opens up more natural and powerful ways to interact with models. Instead of describing an image in words, you can show the image. Instead of summarizing a video, you can pass the file directly. The model doesn't just process data; it interprets it in its native form.
This shift matters because most real-world data isn't just text. Whether you’re working on a product recommendation engine, a medical assistant, or a support chatbot, there's value in being able to analyze and respond to different formats.
LangChain is beginning to support these workflows. While the tools are still evolving, the ability to plug in multimodal inputs is already changing how developers think about prompts, context, and retrieval.
Now let’s look at where multimodality is currently supported in LangChain, and what each part of the stack can (and can’t) do.
Where Multimodality Shows Up in LangChain
Multimodal support in LangChain is gradually becoming more available, but it’s not yet consistent across the system.
LangChain’s multimodal support is emerging across three main areas:
| Component | Multimodal Input | Multimodal Output | Notes |
| --- | --- | --- | --- |
| Chat Models | ✅ (images, files) | ⚠️ (limited to audio) | Most flexible area today |
| Embedding Models | ❌ (text only) | ❌ | Support expected in the future |
| Vector Stores | ❌ (text only) | ❌ | Evolving alongside embeddings |
Each layer plays a different role in handling multimodal data. Some are ready now; others are still catching up.
Using Multimodal Chat Models
If you're looking to work with multimodal inputs today, chat models are your best starting point. Some models already let you send not just text, but also images, audio, video, and documents. The exact capabilities depend on the provider, so the first step is knowing what’s supported.
LangChain makes it easier to work with these models by offering a flexible way to define inputs. There isn’t a universal format across providers, but most follow a similar structure using content blocks, small sections of input that specify both type and content.
Supported Inputs
Right now, the most commonly supported input type is images. Many models, including OpenAI’s GPT-4o and Google’s Gemini, allow image inputs alongside text. Gemini goes further by supporting files like PDFs and even video content.
Here’s a basic example of how to send a multimodal input using LangChain:
```ts
import { HumanMessage } from "@langchain/core/messages";

// `model` is any chat model that accepts image input, and `image_url`
// is a string pointing at a publicly accessible image.
const message = new HumanMessage({
  content: [
    { type: "text", text: "describe the weather in this image" },
    { type: "image_url", image_url: { url: image_url } },
  ],
});

const response = await model.invoke([message]);
```
This pattern—mixing text and image content—is becoming the standard for models that support multimodal input. Still, formats vary, so always double-check the model's documentation.
Supported Outputs
Multimodal outputs are still rare. Most chat models respond with text only. One exception is OpenAI’s `gpt-4o-audio-preview`, which can return audio as output. If you’re building an app that speaks back to the user or generates sound, that’s a starting point.

In LangChain, these outputs come through the standard `AIMessage` object, just like any other response.
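If you want to experiment with spoken responses, here’s a rough sketch, assuming a recent `@langchain/openai` release that passes OpenAI’s `modalities` and `audio` options through the `ChatOpenAI` constructor (check the integration docs for the exact fields your version supports):

```ts
import { ChatOpenAI } from "@langchain/openai";

// Sketch: ask the audio-capable model to return both text and speech.
const audioModel = new ChatOpenAI({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
});

const reply = await audioModel.invoke("Read this greeting out loud: hello!");

// The spoken audio (base64-encoded) arrives alongside the usual text fields.
console.log(reply.additional_kwargs.audio);
```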
Tool Use and Workarounds
Today, no model can handle raw multimodal data inside a tool call. But there’s a workaround: pass a reference. Instead of uploading an image or file into the tool call itself, you pass a URL or file path. The tool can then fetch the data and process it on its own.
This approach works with any model that supports tool use. It’s not as smooth as native multimodal support, but it gets the job done when you need to extend capabilities with tools.
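Here’s a rough sketch of that pattern, assuming a chat model instance (`model`) that supports tool calling; the fetch-and-report body is a stand-in for whatever processing your tool actually performs:

```ts
import { tool } from "@langchain/core/tools";
import { z } from "zod";

// The tool receives only a URL; it fetches the bytes itself.
const describeImage = tool(
  async ({ imageUrl }) => {
    const res = await fetch(imageUrl);
    const bytes = await res.arrayBuffer();
    // Real processing (captioning, OCR, etc.) would happen here; for the
    // sketch we just report what we fetched.
    return `Fetched ${bytes.byteLength} bytes of ${res.headers.get("content-type")}`;
  },
  {
    name: "describe_image",
    description: "Fetches an image from a URL and returns a description.",
    schema: z.object({
      imageUrl: z.string().describe("Publicly accessible image URL"),
    }),
  }
);

const modelWithTools = model.bindTools([describeImage]);
```

The model only ever sees the URL; the heavy media handling stays on your side of the tool boundary.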
In the next section, we’ll look at how embedding models are expected to handle multimodal content and where that stands right now.
Embedding Models and Multimodality
Embedding models turn data into vectors, numerical representations that capture meaning and context. These vectors are essential for tasks like similarity search, ranking, and retrieval. Right now, LangChain’s embedding support is focused entirely on text. That works for many applications, but it leaves out a growing number of use cases that involve other types of data.
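For reference, the text-only flow supported today looks something like this minimal sketch, assuming the `@langchain/openai` embeddings integration:

```ts
import { OpenAIEmbeddings } from "@langchain/openai";

// Text-only today: one string in, one vector of floats out.
const embeddings = new OpenAIEmbeddings();
const vector = await embeddings.embedQuery("waterproof hiking boots");

console.log(vector.length); // dimensionality depends on the model
```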
Multimodal embedding is about bringing the same vector-based logic to images, audio, and video. Imagine embedding an image of a product and retrieving similar items, or using a voice recording to find related clips or transcripts. The foundation is the same as with text embeddings: represent content in a way that makes it easy to compare and search.
LangChain doesn’t support these types of embeddings yet, but the direction is clear: as more providers release multimodal embedding APIs, the interface will evolve to handle new input types.
For now, if you’re working with image or audio search, you’ll need to use external tools to generate those embeddings and manage the data manually. It’s a bit more setup, but it gives you a path forward until built-in support lands.
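As a sketch of what that might look like, the `embedImageWithClip` helper and its endpoint below are entirely hypothetical stand-ins for whatever external embedding service or local model you choose:

```ts
// Hypothetical: turn an image into a vector using an external service
// (a CLIP-style model, a vendor API, etc.). LangChain's Embeddings
// interface doesn't cover this yet, so the shape of the call is up to you.
async function embedImageWithClip(imageUrl: string): Promise<number[]> {
  const res = await fetch("https://example.com/embed-image", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: imageUrl }),
  });
  const { vector } = await res.json();
  return vector;
}

const productImageVector = await embedImageWithClip(
  "https://example.com/boots.jpg"
);
```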
Vector Stores and Multimodality
Vector stores are the backbone of retrieval in many AI applications. They hold embeddings and make it possible to search through them efficiently. In a typical setup, you embed some input, store it, and later retrieve relevant items by comparing vectors.
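A minimal sketch of that text-based setup, assuming the `langchain` and `@langchain/openai` packages:

```ts
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// The typical text-only flow: embed, store, retrieve by similarity.
const store = await MemoryVectorStore.fromTexts(
  ["red trail running shoes", "waterproof hiking boots", "wool winter socks"],
  [{ id: 1 }, { id: 2 }, { id: 3 }],
  new OpenAIEmbeddings()
);

const results = await store.similaritySearch("footwear for rainy hikes", 2);
console.log(results.map((doc) => doc.pageContent));
```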
Right now, vector stores in LangChain are designed for text-based embeddings. That makes sense, given the current state of embedding models. But as support for image, audio, and video embeddings grows, vector stores will need to handle more than just text.
Adding multimodal support to vector stores means you could search across different types of data in the same system. For example, you might store both text and image embeddings, then run a search using an image as input and get back relevant text or visual content.
LangChain’s interfaces aren’t quite there yet, but the shift is coming. If you're planning for future use cases like visual search, voice-driven retrieval, or mixed-media chatbots, it's worth keeping this in mind.
Until then, developers can still experiment by using external embedding tools and storing the resulting vectors in a custom setup. It takes a bit more work, but it allows for early exploration of multimodal workflows.
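A minimal sketch of such a custom setup, using nothing but plain cosine similarity over vectors you’ve generated elsewhere:

```ts
// Hand-rolled store for externally generated vectors (e.g. image
// embeddings from a CLIP-style model). No LangChain APIs involved.
type MediaRecord = {
  uri: string;
  kind: "image" | "audio" | "text";
  vector: number[];
};

const records: MediaRecord[] = [];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k records most similar to the query vector.
function search(queryVector: number[], k = 5): MediaRecord[] {
  return [...records]
    .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
    .slice(0, k);
}
```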
Next, we’ll go over how to get started with the current tools available and where to look when you’re ready to try things out.
Getting Started
Before diving into multimodal features, it helps to have a basic understanding of chat models and how they work in LangChain. This guide assumes you're already comfortable with that. If you're new to chat models, it’s worth reviewing the core concepts first, like how messages are structured and how model calls are made.
With that in place, getting started with multimodal models begins with choosing the right one. Since capabilities vary, check the chat model integration table to see which providers support inputs like images, audio, or documents, and whether they can return non-text outputs.
Once you've picked a model, head to the provider’s integration guide. LangChain keeps things flexible, so formats aren't fully standardized. Following the provider’s exact examples will help ensure your inputs are accepted and interpreted correctly.
How to Pass Multimodal Data Directly to Models
Multimodal input is typically passed using content blocks. These are structured objects that let you send text, images, or other supported media as part of a message.
Here’s what that looks like in practice using LangChain:
```ts
import { HumanMessage } from "@langchain/core/messages";

// As before, `model` is a multimodal-capable chat model and `image_url`
// is a string URL for the image you want the model to see.
const message = new HumanMessage({
  content: [
    { type: "text", text: "What’s happening in this image?" },
    { type: "image_url", image_url: { url: image_url } },
  ],
});

const response = await model.invoke([message]);
```
In this example:
- The `content` array contains both a text prompt and an image URL.
- Each part of the message is clearly labeled by type (`text`, `image_url`, etc.).
- The model processes the full message as a single unit.
This format works with models that support OpenAI-style content blocks. Some providers, like Gemini, may have their own formats for other types of inputs like documents or video. The concept is the same—identify the data type, wrap it in the correct structure, and pass it along with the message.
Be aware that not all models handle all types of media. Some only support images, while others are expanding into PDFs, video, or audio. Always check the integration docs to confirm what’s supported and how to format your input.
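For providers that accept OpenAI-style content blocks, one common variation is sending a local file as a base64 data URL rather than a hosted link; the filename below is just a placeholder:

```ts
import { readFile } from "node:fs/promises";
import { HumanMessage } from "@langchain/core/messages";

// Read a local image and encode it as a base64 data URL.
const base64 = (await readFile("weather.jpg")).toString("base64");

const message = new HumanMessage({
  content: [
    { type: "text", text: "What’s happening in this image?" },
    { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64}` } },
  ],
});
```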
A Note on Tools
Currently, chat models can’t take raw multimodal data directly into tool calls. If you’re using tools in your chain, pass a reference instead—like a public URL or file path. The tool can then access and process the data independently, and pass results back to the model.
This pattern gives you flexibility and works well with models that support tool use.
Multimodality is no longer a future concept; it’s starting to show up in the tools developers are using today. With chat models now accepting images or documents, and more models inching toward audio and video, it’s a good time to start experimenting.
LangChain’s support is still growing, and not every piece is fully integrated yet. But the flexibility in its design gives you enough to start building. Whether you’re adding image understanding to a chatbot or preparing for more complex media search, the foundations are already in place.
Start simple. Test what your chosen model supports. Use content blocks to pass mixed inputs. Build around tools when needed. And stay close to the docs, because the landscape is changing quickly.