<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yemi Adejumobi</title>
    <description>The latest articles on Forem by Yemi Adejumobi (@yemi_adejumobi).</description>
    <link>https://forem.com/yemi_adejumobi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1786281%2F36423309-2602-483b-8c54-9ed73f815f3b.png</url>
      <title>Forem: Yemi Adejumobi</title>
      <link>https://forem.com/yemi_adejumobi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yemi_adejumobi"/>
    <language>en</language>
    <item>
      <title>Run &amp; Debug your LLM Apps locally using Ollama &amp; Llama 3.1</title>
      <dc:creator>Yemi Adejumobi</dc:creator>
      <pubDate>Wed, 14 Aug 2024 19:34:25 +0000</pubDate>
      <link>https://forem.com/yemi_adejumobi/run-debug-your-llm-apps-locally-using-ollama-llama-31-39mc</link>
      <guid>https://forem.com/yemi_adejumobi/run-debug-your-llm-apps-locally-using-ollama-llama-31-39mc</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxh3mmcjfamyf7flb94w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxh3mmcjfamyf7flb94w.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the rapidly evolving landscape of AI and ML, large language models (LLMs) have become increasingly powerful and ubiquitous. However, the costs and complexities associated with running these models in cloud environments can be prohibitive, especially for developers and small teams looking to experiment and innovate.&lt;/p&gt;

&lt;p&gt;Enter Ollama, a game-changing tool that brings the power of LLMs to your local machine. This blog post explores how Ollama can simplify your development process, letting you run LLM applications locally with ease and efficiency. We will also add Langtrace, an open-source observability tool that complements Ollama by providing crucial insights into your LLM application's performance and behavior.&lt;/p&gt;

&lt;p&gt;Whether you're a seasoned AI developer or just starting your journey with language models, this guide will equip you with the knowledge and tools to take your LLM projects to the next level. Let’s dive in. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Ollama?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ollama is a tool that makes it easy to run large language models (LLMs) locally, providing a cost-effective environment for testing and development, so you can experiment and refine your ideas without incurring production costs.&lt;/p&gt;

&lt;p&gt;By running LLMs locally, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced cloud costs&lt;/strong&gt;: Save on cloud computing expenses by running LLMs on your own machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster experimentation&lt;/strong&gt;: Quickly test and iterate on your ideas without relying on remote servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved data privacy&lt;/strong&gt;: Keep your data local and secure, reducing the risk of data breaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Ollama and running LLMs locally&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For this step, we will use Meta's latest open-source model, Llama 3.1. For optimal performance with Ollama, make sure your machine has at least 16GB of RAM, then follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download and install Ollama &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;https://ollama.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download the desired model (e.g., Llama 3.1 or another open-source model). For example, run the following in a terminal window to start &lt;code&gt;llama3.1&lt;/code&gt; locally:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Much like a Docker command, this pulls the llama3.1 model and then runs it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b40rt6kp9hcczhpdek3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b40rt6kp9hcczhpdek3.png" alt="Image description" width="727" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once the pull completes, you will get a terminal prompt where you can start chatting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bvxfsgzqbl1ahxzx9tj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bvxfsgzqbl1ahxzx9tj.png" alt="Image description" width="693" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For further customization, such as using a &lt;code&gt;Modelfile&lt;/code&gt; to define your own custom system prompt, refer to the Ollama documentation.&lt;/p&gt;
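As a quick illustration, a minimal `Modelfile` might look like the following; the model name `fashionista` and the system prompt here are placeholders of my own, not taken from the Ollama docs:

```
# Modelfile
FROM llama3.1
PARAMETER temperature 0.7
SYSTEM "You are a concise assistant with expertise in men's clothing."
```

You would then build and run the custom model with `ollama create fashionista -f Modelfile` followed by `ollama run fashionista`.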

&lt;h2&gt;
  
  
  &lt;strong&gt;Instrumenting Ollama with Langtrace&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that you have a local LLM, suppose you are building a customer service bot and want detailed traces of its LLM requests. This is where Langtrace shines. Langtrace provides a Python SDK that enables observability for Ollama, allowing you to trace LLM calls and gain valuable insights into your application's performance. To instrument Ollama with Langtrace:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate an API key from &lt;a href="http://langtrace.ai" rel="noopener noreferrer"&gt;langtrace.ai&lt;/a&gt; (you can also self-host).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.langtrace.ai/introduction#step-2-install-the-sdk-on-your-project" rel="noopener noreferrer"&gt;Install&lt;/a&gt; the Langtrace Python or TypeScript SDK.&lt;/li&gt;
&lt;li&gt;Import and initialize the SDK.&lt;/li&gt;
&lt;li&gt;Start tracing!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from langtrace_python_sdk import langtrace, with_langtrace_root_span
import ollama
from dotenv import load_dotenv

load_dotenv&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;# langtrace.init(write_spans_to_console=False)&lt;/span&gt;
langtrace.init&lt;span class="o"&gt;(&lt;/span&gt;api_key &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'YOUR_API_KEY'&lt;/span&gt;, &lt;span class="nv"&gt;write_spans_to_console&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False&lt;span class="o"&gt;)&lt;/span&gt;

@with_langtrace_root_span&lt;span class="o"&gt;()&lt;/span&gt;
def give_recs&lt;span class="o"&gt;()&lt;/span&gt;:
  response &lt;span class="o"&gt;=&lt;/span&gt; ollama.chat&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'llama3.1'&lt;/span&gt;, &lt;span class="nv"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s1"&gt;'role'&lt;/span&gt;: &lt;span class="s1"&gt;'user'&lt;/span&gt;,
      &lt;span class="s1"&gt;'content'&lt;/span&gt;: &lt;span class="s1"&gt;'You are an AI assistant with expertise in mens clothing. Help me pick clothing for a black tie dinner at work.'&lt;/span&gt;,
    &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;])&lt;/span&gt;
  print&lt;span class="o"&gt;(&lt;/span&gt;response[&lt;span class="s1"&gt;'message'&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;__name__ &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"__main__"&lt;/span&gt;:
  print&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Running fashionista bot..."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  give_recs&lt;span class="o"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the trace looks like in the Langtrace UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp49e9j0zpsmewxdxw0rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp49e9j0zpsmewxdxw0rv.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a reference &lt;a href="https://github.com/Scale3-Labs/langtrace-recipes/blob/main/integrations/tools/ollama/ollama-fashionista.ipynb" rel="noopener noreferrer"&gt;cookbook&lt;/a&gt; for the Ollama integration with &lt;a href="http://www.langtrace.ai" rel="noopener noreferrer"&gt;Langtrace&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tracing LLM calls&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With Langtrace, you can now trace LLM calls and capture essential metadata, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input, output, and total tokens&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data provides valuable insights into your application's performance, helping you optimize and improve it over time.&lt;/p&gt;
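Langtrace records these metrics automatically, but as a rough illustration of what is being summarized, here is a small hypothetical helper (not part of the Langtrace SDK) that derives the same numbers from an Ollama-style response dict; `prompt_eval_count` and `eval_count` are the token counts Ollama's chat API reports:

```python
def summarize_llm_call(response, latency_seconds):
    """Summarize token usage and latency from an Ollama-style response dict.

    Assumes the dict carries 'prompt_eval_count' (input tokens) and
    'eval_count' (output tokens), as Ollama's chat API reports.
    """
    input_tokens = response.get("prompt_eval_count", 0)
    output_tokens = response.get("eval_count", 0)
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "latency_seconds": round(latency_seconds, 3),
    }

# Example with a stubbed response (no Ollama server needed):
fake_response = {"prompt_eval_count": 26, "eval_count": 112}
print(summarize_llm_call(fake_response, 1.5))
# {'input_tokens': 26, 'output_tokens': 112, 'total_tokens': 138, 'latency_seconds': 1.5}
```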

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1ojyheftey0ou8byqe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1ojyheftey0ou8byqe.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next blog in this series, we will cover how to use Langtrace to perform evaluations on your application’s accuracy and optimize its behavior. &lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Update
&lt;/h2&gt;

&lt;p&gt;I added a UI option to this bot; feel free to check out the code &lt;a href="https://github.com/Scale3-Labs/langtrace-recipes/blob/main/integrations/tools/ollama/ollama-fashionistav2.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. I use Streamlit for the UI, but you can swap it out for Gradio or any other library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxh3mmcjfamyf7flb94w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxh3mmcjfamyf7flb94w.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see this in action, install Streamlit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;streamlit 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the code using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run ollama-fashionistav2.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Next steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In conclusion, combining Ollama's local LLM capabilities with Langtrace's observability features unlocks a powerful toolset for building and optimizing LLM applications. By following the steps outlined in this post, you can leverage the benefits of running LLMs locally with Ollama, including reduced cloud costs, accelerated experimentation, and improved data privacy. &lt;/p&gt;

&lt;p&gt;With &lt;a href="http://www.langtrace.ai" rel="noopener noreferrer"&gt;Langtrace&lt;/a&gt;, you can gain valuable insights into your application's performance, identify bottlenecks, and optimize its behavior. By integrating Ollama and &lt;a href="http://www.langtrace.ai" rel="noopener noreferrer"&gt;Langtrace&lt;/a&gt;, you can build more efficient, effective, and innovative LLM applications. Try out Ollama and &lt;a href="http://www.langtrace.ai" rel="noopener noreferrer"&gt;Langtrace&lt;/a&gt; today and discover the advantages of local LLM development and open-source observability for yourself!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>llama31</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building a Traceable RAG System with Qdrant and Langtrace: A Step-by-Step Guide</title>
      <dc:creator>Yemi Adejumobi</dc:creator>
      <pubDate>Tue, 16 Jul 2024 19:24:48 +0000</pubDate>
      <link>https://forem.com/yemi_adejumobi/building-a-traceable-rag-system-with-qdrant-and-langtrace-a-step-by-step-guide-47ki</link>
      <guid>https://forem.com/yemi_adejumobi/building-a-traceable-rag-system-with-qdrant-and-langtrace-a-step-by-step-guide-47ki</guid>
      <description>&lt;p&gt;Vector databases are the backbone of AI applications, providing the crucial infrastructure for efficient similarity search and retrieval of high-dimensional data. Among these, &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; stands out as one of the most versatile projects. Written in Rust, Qdrant is a vector search database designed for turning embeddings or neural network encoders into full-fledged applications for matching, searching, recommending, and more.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll explore how to leverage &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; in a Retrieval-Augmented Generation (RAG) system and demonstrate how to trace its operations using Langtrace. This combination allows us to build and optimize AI applications that can understand and generate human-like text based on vast amounts of information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complete Code Repository
&lt;/h3&gt;

&lt;p&gt;Before we dive into the details, I'm excited to share that the complete code for this RAG system implementation is available in our GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Scale3-Labs/langtrace-recipes/tree/main/integrations/vector-db/qdrant/rag-tracing-with-qdrant-langtrace" rel="noopener noreferrer"&gt;RAG System with Qdrant and Langtrace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This repository contains all the code examples discussed in this blog post, along with additional scripts, documentation, and setup instructions. Feel free to clone, fork, or star the repository if you find it useful!&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a RAG System?
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language models (LLMs) with external knowledge. The process typically involves three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: Given a query, relevant information is retrieved from a knowledge base (in our case, stored in Qdrant).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Augmentation&lt;/strong&gt;: The retrieved information is combined with the original query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: An LLM uses the augmented input to generate a response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach allows for more accurate and up-to-date responses, as the system can reference specific information rather than relying solely on its pre-trained knowledge.&lt;/p&gt;
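Before bringing in Qdrant and an LLM, the three steps can be sketched in a deliberately tiny, self-contained form. Retrieval here is naive word-overlap scoring rather than vector search, and generation is a stub you would replace with a real LLM call; none of this is Qdrant or OpenAI API, it only illustrates the flow:

```python
import re

def _words(text):
    # Lowercase word tokens, ignoring punctuation
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, k=2):
    """Step 1 (toy): rank documents by word overlap with the query."""
    q = _words(query)
    ranked = sorted(docs, key=lambda d: len(q.intersection(_words(d))), reverse=True)
    return ranked[:k]

def augment(query, context):
    """Step 2: combine the retrieved context with the original query."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

def generate(prompt, llm):
    """Step 3: hand the augmented prompt to an LLM (stubbed via a callable here)."""
    return llm(prompt)

docs = [
    "Qdrant is a vector database for similarity search.",
    "Docker packages applications into containers.",
]
prompt = augment("What is Qdrant?", retrieve("What is Qdrant?", docs, k=1))
answer = generate(prompt, llm=lambda p: "A vector database.")  # stub LLM call
```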

&lt;h2&gt;
  
  
  Implementing a RAG System with Qdrant
&lt;/h2&gt;

&lt;p&gt;Let's walk through the process of implementing a RAG system using Qdrant as our vector database. We'll use OpenAI's GPT model for generation and Langtrace for tracing our system's operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Environment
&lt;/h3&gt;

&lt;p&gt;First, we need to set up our environment with the necessary libraries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

import os
import &lt;span class="nb"&gt;time
&lt;/span&gt;import openai
from qdrant_client import QdrantClient, models
from langtrace_python_sdk import langtrace, with_langtrace_root_span
from typing import List, Dict, Any

&lt;span class="c"&gt;# Initialize environment and clients&lt;/span&gt;
os.environ[&lt;span class="s2"&gt;"OPENAI_API_KEY"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your_openai_api_key_here"&lt;/span&gt;
langtrace.init&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'your_langtrace_api_key_here'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
qdrant_client &lt;span class="o"&gt;=&lt;/span&gt; QdrantClient&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;":memory:"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
openai_client &lt;span class="o"&gt;=&lt;/span&gt; openai.Client&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;os.getenv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"OPENAI_API_KEY"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Initializing the Knowledge Base
&lt;/h3&gt;

&lt;p&gt;Next, we'll create a function to initialize our knowledge base in Qdrant:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

@with_langtrace_root_span&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"initialize_knowledge_base"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
def initialize_knowledge_base&lt;span class="o"&gt;(&lt;/span&gt;documents: List[str]&lt;span class="o"&gt;)&lt;/span&gt; -&amp;gt; None:
    start_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;# Check if collection exists, if not create it&lt;/span&gt;
    collections &lt;span class="o"&gt;=&lt;/span&gt; qdrant_client.get_collections&lt;span class="o"&gt;()&lt;/span&gt;.collections
    &lt;span class="k"&gt;if &lt;/span&gt;not any&lt;span class="o"&gt;(&lt;/span&gt;collection.name &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"knowledge-base"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;collection &lt;span class="k"&gt;in &lt;/span&gt;collections&lt;span class="o"&gt;)&lt;/span&gt;:
        qdrant_client.create_collection&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"knowledge-base"&lt;/span&gt;
        &lt;span class="o"&gt;)&lt;/span&gt;
        print&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Created 'knowledge-base' collection"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    qdrant_client.add&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"knowledge-base"&lt;/span&gt;,
        &lt;span class="nv"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;documents
    &lt;span class="o"&gt;)&lt;/span&gt;
    end_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="s2"&gt;"Knowledge base initialized with {len(documents)} documents in {end_time - start_time:.2f} seconds"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Querying the Vector Database
&lt;/h3&gt;

&lt;p&gt;We'll create a function to query our Qdrant vector database:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

@with_langtrace_root_span&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"query_vector_db"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
def query_vector_db&lt;span class="o"&gt;(&lt;/span&gt;question: str, n_points: int &lt;span class="o"&gt;=&lt;/span&gt; 3&lt;span class="o"&gt;)&lt;/span&gt; -&amp;gt; List[Dict[str, Any]]:
    start_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    results &lt;span class="o"&gt;=&lt;/span&gt; qdrant_client.query&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"knowledge-base"&lt;/span&gt;,
        &lt;span class="nv"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;question,
        &lt;span class="nv"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n_points,
    &lt;span class="o"&gt;)&lt;/span&gt;
    end_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return &lt;/span&gt;results


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Generating LLM Responses
&lt;/h3&gt;

&lt;p&gt;We'll use OpenAI's GPT model to generate responses:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

@with_langtrace_root_span&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"generate_llm_response"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
def generate_llm_response&lt;span class="o"&gt;(&lt;/span&gt;prompt: str, model: str &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"gpt-3.5-turbo"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; -&amp;gt; str:
    start_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    completion &lt;span class="o"&gt;=&lt;/span&gt; openai_client.chat.completions.create&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model,
        &lt;span class="nv"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;
            &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;: &lt;span class="s2"&gt;"user"&lt;/span&gt;, &lt;span class="s2"&gt;"content"&lt;/span&gt;: prompt&lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="o"&gt;]&lt;/span&gt;,
        &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.0,
    &lt;span class="o"&gt;)&lt;/span&gt;
    end_time &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    response &lt;span class="o"&gt;=&lt;/span&gt; completion.choices[0].message.content
    &lt;span class="k"&gt;return &lt;/span&gt;response


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  The RAG Process
&lt;/h3&gt;

&lt;p&gt;Finally, we'll tie it all together in our RAG function:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

@with_langtrace_root_span&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"rag"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
def rag&lt;span class="o"&gt;(&lt;/span&gt;question: str, n_points: int &lt;span class="o"&gt;=&lt;/span&gt; 3&lt;span class="o"&gt;)&lt;/span&gt; -&amp;gt; str:
    print&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="s2"&gt;"Processing RAG for question: {question}"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    context_start &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    context &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;.join&lt;span class="o"&gt;([&lt;/span&gt;r.document &lt;span class="k"&gt;for &lt;/span&gt;r &lt;span class="k"&gt;in &lt;/span&gt;query_vector_db&lt;span class="o"&gt;(&lt;/span&gt;question, n_points&lt;span class="o"&gt;)])&lt;/span&gt;
    context_end &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;

    prompt_start &lt;span class="o"&gt;=&lt;/span&gt; time.time&lt;span class="o"&gt;()&lt;/span&gt;
    metaprompt &lt;span class="o"&gt;=&lt;/span&gt; f&lt;span class="s2"&gt;"""
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "&lt;/span&gt;I don&lt;span class="s1"&gt;'t know".

    Question: {question.strip()}

    Context:
    {context.strip()}

    Answer:
    """
    prompt_end = time.time()

    answer = generate_llm_response(metaprompt)
    print(f"RAG completed, answer length: {len(answer)} characters")
    return answer


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Tracing with Langtrace
&lt;/h2&gt;

&lt;p&gt;As you may have noticed, we've decorated our functions with &lt;code&gt;@with_langtrace_root_span&lt;/code&gt;. This allows us to trace the execution of our RAG system using Langtrace, an open-source LLM observability tool. You can read more about group traces in the Langtrace &lt;a href="https://docs.langtrace.ai/features/grouptraces" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. &lt;/p&gt;
&lt;h3&gt;
  
  
  What is Langtrace?
&lt;/h3&gt;

&lt;p&gt;Langtrace is a powerful, open-source tool designed specifically for LLM observability. It provides developers with the ability to trace, monitor, and analyze the performance and behavior of LLM-based systems. By using Langtrace, we can gain valuable insights into our RAG system's operation, helping us to optimize performance, identify bottlenecks, and ensure the reliability of our AI applications.&lt;/p&gt;

&lt;p&gt;Key features of Langtrace include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy integration with existing LLM applications&lt;/li&gt;
&lt;li&gt;Detailed tracing of LLM operations&lt;/li&gt;
&lt;li&gt;Performance metrics and analytics&lt;/li&gt;
&lt;li&gt;Open-source nature, allowing for community contributions and customizations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our RAG system, each decorated function will create a span in our trace, providing a comprehensive view of the system's execution flow. This level of observability is crucial when working with complex AI systems like RAG, where multiple components interact to produce the final output.&lt;/p&gt;
&lt;h3&gt;
  
  
  Using Langtrace in Our RAG System
&lt;/h3&gt;

&lt;p&gt;Here's how we're using Langtrace in our implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We initialize Langtrace at the beginning of our script:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langtrace_python_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;langtrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;with_langtrace_root_span&lt;/span&gt;
&lt;span class="n"&gt;langtrace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_langtrace_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;We decorate each main function with the root span decorator:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nd"&gt;@with_langtrace_root_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# function implementation
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This setup allows us to create a hierarchical trace of our RAG system's execution, from initializing the knowledge base to generating the final response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the RAG System
&lt;/h2&gt;

&lt;p&gt;Let's test our RAG system with a few sample questions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

def demonstrate_different_queries&lt;span class="o"&gt;()&lt;/span&gt;:
    questions &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"What is Qdrant used for?"&lt;/span&gt;,
        &lt;span class="s2"&gt;"How does Docker help developers?"&lt;/span&gt;,
        &lt;span class="s2"&gt;"What is the purpose of MySQL?"&lt;/span&gt;,
        &lt;span class="s2"&gt;"Can you explain what FastAPI is?"&lt;/span&gt;,
    &lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;question &lt;span class="k"&gt;in &lt;/span&gt;questions:
        try:
            answer &lt;span class="o"&gt;=&lt;/span&gt; rag&lt;span class="o"&gt;(&lt;/span&gt;question&lt;span class="o"&gt;)&lt;/span&gt;
            print&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="s2"&gt;"Question: {question}"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            print&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="s2"&gt;"Answer: {answer}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        except Exception as e:
            print&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="s2"&gt;"Error processing question '{question}': {str(e)}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Initialize knowledge base and run queries&lt;/span&gt;
documents &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"Qdrant is a vector database &amp;amp; vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!"&lt;/span&gt;,
    &lt;span class="s2"&gt;"Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management."&lt;/span&gt;,
    &lt;span class="s2"&gt;"PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing."&lt;/span&gt;,
    &lt;span class="s2"&gt;"MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database."&lt;/span&gt;,
    &lt;span class="s2"&gt;"NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption."&lt;/span&gt;,
    &lt;span class="s2"&gt;"FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints."&lt;/span&gt;,
    &lt;span class="s2"&gt;"SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining."&lt;/span&gt;,
    &lt;span class="s2"&gt;"The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals."&lt;/span&gt;,
&lt;span class="o"&gt;]&lt;/span&gt;
initialize_knowledge_base&lt;span class="o"&gt;(&lt;/span&gt;documents&lt;span class="o"&gt;)&lt;/span&gt;
demonstrate_different_queries&lt;span class="o"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xl5j63ludr7o5vxgg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xl5j63ludr7o5vxgg1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing the Traces
&lt;/h2&gt;

&lt;p&gt;After running our RAG system, we can analyze the traces in the Langtrace dashboard. Here's what to look for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the Langtrace dashboard for a visual representation of the traces.&lt;/li&gt;
&lt;li&gt;Look for the 'rag' root span and its child spans to understand the flow of operations.&lt;/li&gt;
&lt;li&gt;Examine the timing information printed for each operation to identify potential bottlenecks.&lt;/li&gt;
&lt;li&gt;Review any error messages printed to understand and address issues.&lt;/li&gt;
&lt;/ol&gt;
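
&lt;p&gt;As a quick illustration of step 3, here is a minimal timing sketch. The &lt;code&gt;timed&lt;/code&gt; decorator and the &lt;code&gt;retrieve&lt;/code&gt; stand-in are hypothetical helpers for this example, not part of the Langtrace SDK; in practice you would read these per-operation durations from the spans in the Langtrace dashboard rather than printing them yourself.&lt;/p&gt;

```python
import time
from functools import wraps

def timed(operation_name):
    """Print wall-clock time for one pipeline operation, mirroring the
    per-operation timing you would review in the Langtrace dashboard."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"{operation_name} took {elapsed:.3f}s")
        return wrapper
    return decorator

# Stand-in for the tutorial's retrieval step; swap in the real Qdrant query.
@timed("vector_search")
def retrieve(question):
    time.sleep(0.01)  # simulate a vector search round trip
    return ["matched document"]

retrieve("What is Qdrant?")
```

&lt;p&gt;If an operation's printed (or traced) duration is consistently large relative to the others, that is the bottleneck to optimize first.&lt;/p&gt;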

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we've explored how to use Qdrant, a powerful vector database, to build a Retrieval-Augmented Generation (RAG) system. We implemented a complete RAG pipeline, from initializing the knowledge base to generating responses, and added tracing with Langtrace to gain insight into the system's performance. By leveraging open-source tools like Qdrant for vector search and Langtrace for LLM observability, we're not only building powerful AI applications but also contributing to and benefiting from the broader AI development community. These tools empower developers to create, optimize, and understand complex AI systems, paving the way for more reliable AI applications.&lt;/p&gt;

&lt;p&gt;Remember, you can find the complete implementation of this RAG system in our &lt;a href="https://github.com/Scale3-Labs/langtrace-recipes/tree/main/integrations/vector-db/qdrant/rag-tracing-with-qdrant-langtrace" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. We encourage you to explore the code, experiment with it, and adapt it to your specific use cases. If you have any questions or improvements, feel free to open an issue or submit a pull request. Happy coding!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llmops</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
