<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: DenisYay</title>
    <description>The latest articles on Forem by DenisYay (@denisyay).</description>
    <link>https://forem.com/denisyay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1231947%2Fa2e9517f-1da1-4643-bea9-b7f49d8e1445.png</url>
      <title>Forem: DenisYay</title>
      <link>https://forem.com/denisyay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/denisyay"/>
    <language>en</language>
    <item>
      <title>Create a Visual Chatbot on AWS EC2 with LLaVA-1.5, 🤗 Transformers and Runhouse</title>
      <dc:creator>DenisYay</dc:creator>
      <pubDate>Tue, 12 Dec 2023 22:28:53 +0000</pubDate>
      <link>https://forem.com/denisyay/create-a-visual-chatbot-on-aws-ec2-with-llava-15-transformers-and-runhouse-hm1</link>
      <guid>https://forem.com/denisyay/create-a-visual-chatbot-on-aws-ec2-with-llava-15-transformers-and-runhouse-hm1</guid>
      <description>&lt;h4&gt;
  
  
  Get started with multimodal conversational models using the open-source LLaVA-1.5 model, Hugging Face Transformers and Runhouse.
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;The full Python code for this tutorial, including standing up the necessary infrastructure, is publicly available in &lt;a href="https://github.com/DenisYay/llava" rel="noopener noreferrer"&gt;this Github repo&lt;/a&gt; for you to try for yourself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For the first version of this tutorial check out &lt;a href="https://www.run.house/blog/create-a-visual-chatbot-on-aws-ec2-with-llava-1-5" rel="noopener noreferrer"&gt;this post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Multimodal conversational models represent a leap forward from text-only AI, harnessing the strengths of Large Language Models (LLMs) and Reinforcement Learning from Human Feedback (RLHF) to tackle tasks that combine language with additional modalities (e.g. image and text). Visual capabilities in GPT-4V(ision) have been a recent highlight: a single sophisticated model proficient in both language and image comprehension within the same context. Though GPT-4V is undeniably advanced, the proprietary nature of such closed-source models often restricts the scope for research and innovation. The radiology AI industry, for example, often requires FDA approval for new technologies, and relying on a model that produces varying results from month to month presents obvious reproducibility challenges during audits.&lt;/p&gt;

&lt;p&gt;Fortunately, the landscape is evolving with the introduction of open-source alternatives, democratizing access to vision-language models. Deploying these models is not trivial, especially on self-hosted hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://run.house" rel="noopener noreferrer"&gt;Runhouse&lt;/a&gt; is an open-source platform that makes it easy to deploy and run your machine learning application on any cloud or self-hosted infrastructure. In this tutorial, we will guide you step-by-step on how to create your own vision chat assistant that leverages the innovative &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;LLaVA-1.5&lt;/a&gt; (Large Language and Vision Assistant) multimodal model, as described in the &lt;a href="https://arxiv.org/abs/2304.08485" rel="noopener noreferrer"&gt;Visual Instruction Tuning&lt;/a&gt; paper. After a brief overview of the LLaVA-1.5 model, we'll delve into the implementation code to construct a vision chat assistant, utilizing resources from the &lt;a href="https://github.com/haotian-liu/LLaVA" rel="noopener noreferrer"&gt;official code&lt;/a&gt; repository. Runhouse will allow us to stand up the necessary infrastructure and deploy the visual chatbot application in just &lt;em&gt;4 lines of Python code&lt;/em&gt; (!!)&lt;/p&gt;

&lt;h3&gt;
  
  
  What is LLaVA-1.5?
&lt;/h3&gt;

&lt;p&gt;The LLaVA model was introduced in the paper &lt;a href="https://arxiv.org/abs/2304.08485" rel="noopener noreferrer"&gt;Visual Instruction Tuning&lt;/a&gt;, and then further improved in &lt;a href="https://arxiv.org/abs/2310.03744" rel="noopener noreferrer"&gt;Improved Baselines with Visual Instruction Tuning&lt;/a&gt; (also referred to as LLaVA-1.5).&lt;/p&gt;

&lt;p&gt;The core idea behind it is to extract visual embeddings from an image and treat them in the same way as embeddings coming from textual language tokens by feeding them to a Large Language Model (Vicuna). To choose the “right” embeddings, the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The projection operation is accomplished using a vision-language connector, which was originally chosen to be a simple linear layer in the first paper, and later replaced with a more expressive Multilayer Perceptron (MLP) in &lt;a href="https://arxiv.org/abs/2310.03744" rel="noopener noreferrer"&gt;Improved Baselines with Visual Instruction&lt;/a&gt;. The architecture of the model is depicted below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp0bmzsefcnddaen4u63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp0bmzsefcnddaen4u63.png" alt="Llava network architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the advantages of the method is that by using a pre-trained vision encoder and a pre-trained language model, only the lightweight vision-language connector must be learned from scratch.&lt;/p&gt;
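&lt;p&gt;To make the projection step concrete, here is a minimal PyTorch sketch of such a vision-language connector. The dimensions are illustrative assumptions (a CLIP ViT-L/14 encoder at 336px resolution yields 576 patch embeddings of 1024 dims; Vicuna-7B token embeddings are 4096 dims), not values read off the released checkpoints.&lt;/p&gt;

```python
# Sketch of the LLaVA-1.5 vision-language connector: a small MLP that
# projects CLIP visual embeddings into the language model's embedding space.
# Dimensions are illustrative assumptions, not the released checkpoint's config.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.5 replaced the original single linear layer with a 2-layer MLP
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_embeds):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(visual_embeds)

connector = VisionLanguageConnector()
patches = torch.randn(1, 576, 1024)  # 24x24 patches from a 336px image
projected = connector(patches)
print(projected.shape)
```

&lt;p&gt;The projected embeddings are then interleaved with the text token embeddings and fed to the language model as one sequence.&lt;/p&gt;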

&lt;p&gt;One of the resulting model versions, LLaVA-1.5 13B, achieved state-of-the-art results on 11 benchmarks with only simple modifications to the original LLaVA: it uses only public data, completed training in ~1 day on a single 8-A100 node, and surpassed methods that use billion-scale data (&lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Because the LLaVA model is so lightweight to train and fine-tune, novel domain-specific agents can be created in no time. One such example is Microsoft's Large Language and Vision Assistant for bioMedicine, &lt;a href="https://arxiv.org/abs/2306.00890" rel="noopener noreferrer"&gt;LLaVA-Med&lt;/a&gt;. More on it in a subsequent post.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLaVA-1.5 Visual Chatbot Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The full Python code is available in &lt;a href="https://github.com/DenisYay/llava/blob/main/llava_chat/llava_chat_transformers.py" rel="noopener noreferrer"&gt;this Github repo&lt;/a&gt; so that you can try it yourself.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: this code builds upon the amazing &lt;a href="https://github.com/huggingface/transformers/releases/tag/v4.36.0" rel="noopener noreferrer"&gt;Hugging Face Transformers 4.36.0 release&lt;/a&gt; adding support for Mixtral, LLava, BakLLava and more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Creating a multimodal chatbot (e.g. one that can answer follow-up textual questions about a provided image) using the code provided in the official repository is relatively straightforward. The repository provides standardized chat templates for parsing inputs into the right format, and following the format used during training and fine-tuning is crucial for the quality of the answers the model generates. The exact template depends on the specific variant of the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions.

USER: &amp;lt;im_start&amp;gt;&amp;lt;image&amp;gt;&amp;lt;im_end&amp;gt; User's prompt

ASSISTANT: Assistant answer

USER: Another prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first few lines are the general system prompt used by the model. The special tokens &lt;code&gt;&amp;lt;im_start&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;image&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;im_end&amp;gt;&lt;/code&gt; are used to indicate where embeddings representing the image will be placed. The chatbot can be defined in just one simple Python class. Here it is:&lt;br&gt;
&lt;/p&gt;
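&lt;p&gt;For illustration, here is a small hypothetical helper (not part of the official repository) that assembles a prompt in this shape, using the bare &lt;code&gt;&amp;lt;image&amp;gt;&lt;/code&gt; token that the inference code below passes to the model:&lt;/p&gt;

```python
# Hypothetical helper (not part of the official repo) that assembles a
# prompt following the LLaVA-1.5/Vicuna chat template shown above.
IMAGE_TOKEN = "&amp;lt;image&amp;gt;"  # placeholder replaced by the visual embeddings

def build_prompt(question, history=()):
    """history is a sequence of (user, assistant) turns; the image token
    is attached to the first user turn only."""
    lines = []
    first = True
    for user, assistant in history:
        prefix = IMAGE_TOKEN + "\n" if first else ""
        lines.append(f"USER: {prefix}{user}")
        lines.append(f"ASSISTANT: {assistant}")
        first = False
    prefix = IMAGE_TOKEN + "\n" if first else ""
    lines.append(f"USER: {prefix}{question}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)

print(build_prompt("How would I make this dish?"))
```

&lt;p&gt;A second call with a non-empty &lt;code&gt;history&lt;/code&gt; produces the multi-turn layout from the template, with the image token appearing only once.&lt;/p&gt;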

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# imports

class LlavaModel(rh.Module):
    def __init__(self, model_id="llava-hf/llava-1.5-7b-hf", **model_kwargs):
        super().__init__()
        self.model_id, self.model_kwargs = model_id, model_kwargs
        self.model = None

    def load_model(self):
        self.model = pipeline("image-to-text",
                              model=self.model_id,
                              device_map="auto",
                              torch_dtype=torch.bfloat16,
                              model_kwargs=self.model_kwargs)

    def predict(self, img_path, prompt, **inf_kwargs):
        if not self.model:
            self.load_model()
        with torch.no_grad():
            image = Image.open(requests.get(img_path, stream=True).raw)
            return self.model(image, prompt=prompt, generate_kwargs=inf_kwargs)[0]["generated_text"]


if __name__ == "__main__":
    gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")
    remote_llava_model = LlavaModel(load_in_4bit=True).get_or_to(system=gpu,
                                                                 env=rh.env(["transformers==4.36.0"],
                                                                            working_dir="local:./"),
                                                                 name="llava-model")
    ans = remote_llava_model.predict(img_path="https://upcdn.io/kW15bGw/raw/uploads/2023/09/22/file-387X.png",
                                     prompt="USER: &amp;lt;image&amp;gt;\nHow would I make this dish? Step by step please."
                                            "\nASSISTANT:",
                                     max_new_tokens=200)
    print(ans)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s walk through the methods of the class defined above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;load_model&lt;/strong&gt;: loads the model using the Hugging Face &lt;code&gt;pipeline()&lt;/code&gt; abstraction, part of the Hugging Face Transformers framework, with &lt;code&gt;torch.bfloat16&lt;/code&gt; precision and any quantization options passed through &lt;code&gt;model_kwargs&lt;/code&gt;. Loading weights in 16, 8 or 4-bit reduces GPU memory requirements; the 4-bit quantization we use fits into one NVIDIA A10G GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;predict&lt;/strong&gt;: takes an image URL (&lt;code&gt;img_path&lt;/code&gt;) and a text prompt (&lt;code&gt;prompt&lt;/code&gt;) and returns the textual response of the loaded LLaVA model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;main&lt;/strong&gt;: Runhouse-enabled setup that handles all aspects of standing up the remote compute cluster (AWS EC2 with a single A10G).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
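&lt;p&gt;As a back-of-the-envelope check on why quantization matters, consider the weights alone of a 7B-parameter model (activations and the KV cache add more on top; these are rough estimates):&lt;/p&gt;

```python
# Rough weight-only memory footprint of a 7B-parameter model at different
# precisions. Activations and KV cache add more; figures are estimates.
params = 7e9

def weight_gib(bytes_per_param):
    return params * bytes_per_param / 1024**3

fp16_gib = weight_gib(2.0)   # 16-bit: 2 bytes per parameter
int8_gib = weight_gib(1.0)   # 8-bit
int4_gib = weight_gib(0.5)   # 4-bit
print(round(fp16_gib, 1), round(int8_gib, 1), round(int4_gib, 1))
```

&lt;p&gt;Roughly 13 GiB at 16-bit versus about 3.3 GiB at 4-bit, which is why the quantized model fits comfortably within an A10G's 24 GB of memory.&lt;/p&gt;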

&lt;p&gt;Now let’s take a look at the Runhouse-specific code in detail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import runhouse as rh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A prerequisite to the awesomeness in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class LlavaModel(rh.Module):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;LlavaModel&lt;/code&gt; class inherits from the Runhouse &lt;a href="https://www.run.house/docs/api/python/module" rel="noopener noreferrer"&gt;Module&lt;/a&gt; class. Modules represent classes that can be sent to and used on remote clusters and environments. They support remote execution of methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command defines a new on-demand cluster named &lt;code&gt;rh-a10x&lt;/code&gt; with 1 NVIDIA A10G GPU. A prerequisite to this command is setting up at least one cloud provider; see the Runhouse documentation for instructions. For the purposes of this tutorial, we used AWS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;remote_llava_model = LlavaModel(load_in_4bit=True).get_or_to(system=gpu,
env=rh.env(["transformers==4.36.0"],working_dir="local:./"),
name="llava-model")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command requests 4-bit quantization (&lt;code&gt;load_in_4bit=True&lt;/code&gt;) so the model fits into the available GPU memory, defines an environment with the &lt;code&gt;transformers&lt;/code&gt; dependency, and names the deployed module &lt;code&gt;llava-model&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;get_or_to&lt;/code&gt; Runhouse method is an alternative to the simpler &lt;code&gt;to&lt;/code&gt; function that allows us to deploy a LlavaModel instance to the GPU cluster defined above. It provides a way to reuse an existing instance if one is found with the specified name, saving costs for the team.&lt;/p&gt;

&lt;p&gt;Now that we’ve created our multimodal chatbot and deployed it on an on-demand cluster, we’ll walk through running the inference and what an example output might look like!&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Inference on our Visual Chatbot
&lt;/h3&gt;

&lt;p&gt;Now that our model is deployed to an on-demand AWS cluster, we can converse with it using the &lt;code&gt;predict()&lt;/code&gt; method. We’ll ask questions about a matcha hot dog image to test the model’s ability to understand an unnatural image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmadebyai.ai%2Fmatcha-hot-dog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmadebyai.ai%2Fmatcha-hot-dog.png" alt="matcha hot dog generated by madebyai.ai"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ans = remote_llava_model.predict(img_path="https://upcdn.io/kW15bGw/raw/uploads/2023/09/22/file-387X.png",
                                     prompt="USER: &amp;lt;image&amp;gt;\nHow would I make this dish? Step by step please."
                                            "\nASSISTANT:",
                                     max_new_tokens=200)
    print(ans)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;img_path&lt;/code&gt; corresponds to a publicly available link to the hot dog image and the &lt;code&gt;prompt&lt;/code&gt; is an initial question to ask our model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One could implement a &lt;code&gt;continue_chat&lt;/code&gt; method to demonstrate that this is an actual chat interface by asking follow-up questions about the image. Please see the &lt;a href="https://www.run.house/blog/create-a-visual-chatbot-on-aws-ec2-with-llava-1-5" rel="noopener noreferrer"&gt;previous version of the post&lt;/a&gt; for how to do so.&lt;/p&gt;
&lt;/blockquote&gt;
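&lt;p&gt;As a sketch of what such a follow-up turn could look like (the helper below is our assumption, not the official implementation; it assumes the pipeline returns the full generated text, prompt included, as in the sample output further down):&lt;/p&gt;

```python
# Hypothetical follow-up turn: append the model's last output plus a new
# USER turn, keeping the Vicuna chat template intact.
def extend_prompt(previous_output, follow_up):
    """Build the next prompt from the model's last output and a new question."""
    return f"{previous_output}\nUSER: {follow_up}\nASSISTANT:"

# Example first-turn output (abbreviated, for illustration only):
first_turn = ("USER: &amp;lt;image&amp;gt;\nHow would I make this dish? Step by step please."
              "\nASSISTANT: To make this dish, grill the hot dog and prepare the green sauce.")
next_prompt = extend_prompt(first_turn, "What sides would pair well with it?")
print(next_prompt)
# The follow-up call would then be (remote_llava_model is the deployed module from above):
# ans2 = remote_llava_model.predict(img_path=..., prompt=next_prompt, max_new_tokens=200)
```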

&lt;p&gt;After running this the first time to set up the infrastructure, the output should look like this. (Notice how logs and stdout are streamed back to you as if the application were running locally. Thank you, Runhouse!)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m llava_chat.llava_chat_transformers
INFO | 2023-11-28 18:09:42.394662 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-11-28 18:09:42.550811 | Authentication (publickey) successful!
INFO | 2023-11-28 18:09:42.551436 | Connecting to server via SSH, port forwarding via port 32300.
INFO | 2023-11-28 18:09:45.611387 | Checking server rh-a10x
INFO | 2023-11-28 18:09:45.652054 | Server rh-a10x is up.
INFO | 2023-11-28 18:09:45.652610 | Getting llava-model
INFO | 2023-11-28 18:09:46.346477 | Time to get llava-model: 0.69 seconds
INFO | 2023-11-28 18:09:46.347149 | Calling llava-model.predict
base_env servlet: Calling method predict on module llava-model
INFO | 2023-11-28 18:09:57.475645 | Time to call llava-model.predict: 11.13 seconds


... Answer ...
To make this dish, which appears to be a hot dog covered in green sauce and possibly topped with seaweed, follow these steps:

1. Prepare the hot dog: Grill or boil the hot dog until it is cooked through and heated to your desired temperature.

2. Prepare the green sauce: Combine ingredients like mayonnaise, mustard, ketchup, and green food coloring to create a green sauce. You can also add other ingredients like chopped onions, relish, or pickles to enhance the flavor.

3. Prepare the seaweed: Wash and chop the seaweed into small pieces. You can use a food processor or a knife to achieve the desired size.

4. Assemble the hot dog: Place the cooked hot dog in a bun and spread the green sauce evenly over the top. Add the chopped seaweed pieces on top of the sauce, ensuring they are evenly distributed.

5. Serve: Serve the hot dog with a side of your choice, such as chips or a salad, and enjoy your unique and delicious creation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we know by now, a matcha hot dog is a dish best served cold.&lt;/p&gt;

&lt;p&gt;If you want to try it for yourself, this tutorial is hosted in this &lt;a href="https://github.com/DenisYay/llava/blob/main/llava_chat/llava_chat_transformers.py" rel="noopener noreferrer"&gt;public Github repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Visual chat models are a major step forward from text-only AI that introduce vision capabilities to conversations. For certain applications, self-hosting is crucial for auditability, reproducibility, and controlling the accuracy and performance of the application. This can be a hard requirement for certain medical and financial use cases. In addition, deploying with &lt;a href="https://run.house" rel="noopener noreferrer"&gt;Runhouse&lt;/a&gt; can help reduce training and inference costs by automatically selecting cloud providers based on price and availability.&lt;/p&gt;

&lt;p&gt;In a subsequent post we’ll explore use cases leveraging &lt;a href="https://github.com/microsoft/LLaVA-Med" rel="noopener noreferrer"&gt;LLaVA-Med&lt;/a&gt; and other potential medical field (and in particular, radiology AI) machine learning applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>llava</category>
      <category>huggingface</category>
    </item>
  </channel>
</rss>
