<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Amanda Ruzza</title>
    <description>The latest articles on Forem by Amanda Ruzza (@amandaruzza).</description>
    <link>https://forem.com/amandaruzza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1246885%2Fd39cfd89-1ab8-4a03-9dd7-6ebe8a2037f7.JPG</url>
      <title>Forem: Amanda Ruzza</title>
      <link>https://forem.com/amandaruzza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/amandaruzza"/>
    <language>en</language>
    <item>
      <title>Pre-Cloud Development Chatbot with Streamlit, Langchain, OpenAI and MongoDB Atlas Vector Search</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Tue, 30 Jul 2024 00:26:40 +0000</pubDate>
      <link>https://forem.com/amandaruzza/pre-cloud-development-chatbot-with-streamlit-langchain-openai-and-mongodb-atlas-vector-search-43l</link>
      <guid>https://forem.com/amandaruzza/pre-cloud-development-chatbot-with-streamlit-langchain-openai-and-mongodb-atlas-vector-search-43l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog, I’ll discuss how I built a Retrieval-Augmented Generation (RAG) system capable of processing and retrieving information from multiple PDFs on my local machine, with the end goal of deploying it at a production level in AWS and GCP.&lt;/p&gt;

&lt;p&gt;With cost, security, and performance in mind, I explored affordable alternatives for handling terabytes of data in a real-world scenario. It's crucial to recognize that not all PDFs are created equal. Developers must handle various PDF text extraction challenges, such as AES encryption, watermarks, or slow processing times, to ensure a smooth user experience.&lt;/p&gt;
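As a minimal illustration of that triage (the helper name and heuristic are my own, not the application's actual code), a small check can decide when plain text extraction should fall back to OCR:

```python
def needs_ocr(extracted_text: str) -> bool:
    """Heuristic: fall back to OCR when extraction yields no usable text.

    Image-only pages, background watermarks, or encrypted content often
    produce empty or whitespace-only output from plain text extraction.
    """
    return not extracted_text.strip()

print(needs_ocr("   \n\t"))    # scanned/image-only page -> True, OCR needed
print(needs_ocr("Chapter 1"))  # real text extracted -> False
```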

&lt;p&gt;While managed AWS and GCP services could handle PDF processing, their cost makes them impractical at production scale. Therefore, I developed a solution using two open-source tools: &lt;code&gt;PyPDF&lt;/code&gt; and &lt;code&gt;PyTesseract&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, I implemented what I call &lt;em&gt;'pre-cloud-development-observability'&lt;/em&gt; features, such as OpenAI token usage and API costs, application execution time, and MongoDB-specific operation metrics, all logged for analysis. &lt;em&gt;After all, who doesn't enjoy delving into log files to optimize performance?&lt;/em&gt; 🙋🏻‍♀️&lt;/p&gt;
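A minimal sketch of the execution-time logging idea (the decorator name and log format are illustrative, not the application's actual code):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_execution_time(func):
    """Log how long a function takes -- simple pre-cloud observability."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logging.info(f"{func.__name__} finished in {elapsed:.2f}s")
    return wrapper

@log_execution_time
def process_documents(count: int) -> int:
    time.sleep(0.1)  # stand-in for PDF processing work
    return count

process_documents(3)  # logs e.g. "process_documents finished in 0.10s"
```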

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;This blog is an in-depth explanation of this application. For the Setup Guide and the Python application script, refer to the &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/tree/master" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;u&gt;Application stack:&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; - Front End&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; - LLM/Foundation Model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langchain&lt;/strong&gt; - NLP Orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB Atlas Vector Search&lt;/strong&gt; - Cloud-based Vector Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dotenv&lt;/strong&gt; - Local secret management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPDF&lt;/strong&gt; - PDF text extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTesseract&lt;/strong&gt; - OCR on AES-encrypted PDFs, or on PDFs with background images that would otherwise yield an empty text extraction
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;



&lt;ul&gt;
&lt;li&gt;API keys and tokens kept out of the codebase in the &lt;code&gt;.env&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Processes multiple files - up to 200MB - in a single upload operation&lt;/li&gt;
&lt;li&gt;Answers questions based on pre-processed documents already stored in the database - no need to re-upload the same PDFs&lt;/li&gt;
&lt;li&gt;Text extraction from AES-encrypted PDFs or those with background images&lt;/li&gt;
&lt;li&gt;Parallel text extraction for PDFs &amp;gt; 5MB for improved performance&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;'Clear Chat History'&lt;/em&gt; button&lt;/li&gt;
&lt;li&gt;Observability/logging features for future Cloud Development considerations:

&lt;ul&gt;
&lt;li&gt;Langchain &lt;code&gt;callback&lt;/code&gt; function that calculates OpenAI token usage&lt;/li&gt;
&lt;li&gt;MongoDB operation specific logs recorded through the &lt;code&gt;pymongo&lt;/code&gt; driver&lt;/li&gt;
&lt;li&gt;Script execution time measurement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Demo Video
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/E0RpmGbmKEg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;



&lt;h2&gt;
  
  
  System Architecture Overview
&lt;/h2&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ez4tgc3pmjtp2lrf276.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ez4tgc3pmjtp2lrf276.png" alt="Architecture Diagram"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;The entire application runs from one Python file named &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/blob/master/chatbot-app.py" rel="noopener noreferrer"&gt;&lt;code&gt;chatbot-app.py&lt;/code&gt;&lt;/a&gt;. The UI, built with Streamlit, processes PDFs using either simple text extraction or OCR. Langchain serves as the application's 'master brain,' creating vector embeddings, sending them to the database, and communicating with OpenAI's foundation model.&lt;/p&gt;

&lt;h2&gt;
  
  
  PDF upload and text extraction
&lt;/h2&gt;

&lt;p&gt;&lt;br&gt;
Two Python packages are used for text extraction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypdf.readthedocs.io/en/stable/user/extract-text.html" rel="noopener noreferrer"&gt;&lt;code&gt;PyPDF&lt;/code&gt;&lt;/a&gt; for regular PDFs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/pytesseract/" rel="noopener noreferrer"&gt;&lt;code&gt;pytesseract&lt;/code&gt;&lt;/a&gt; for OCR on PDFs requiring it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users upload multiple files (up to Streamlit's 200MB limit) in the UI's Sidebar and click 'Process'. Streamlit then invokes the &lt;code&gt;get_pdf_text&lt;/code&gt; function, which is part of the &lt;code&gt;process_pdf&lt;/code&gt; logic. &lt;code&gt;process_pdf&lt;/code&gt; attempts text extraction in the following order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple extraction with &lt;code&gt;PyPDF&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If extraction fails or an error occurs (e.g., due to encryption or a background watermark), &lt;code&gt;ocr_on_pdf&lt;/code&gt; is invoked for OCR processing, using parallel processing for files &amp;gt; 5MB through a &lt;a href="https://www.geeksforgeeks.org/how-to-use-threadpoolexecutor-in-python3/" rel="noopener noreferrer"&gt;&lt;code&gt;ThreadPoolExecutor&lt;/code&gt;&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_pdf(pdf):
    try:
        with tempfile.NamedTemporaryFile(delete=False) as temp_pdf:
            temp_pdf.write(pdf.read())
            temp_pdf_path = temp_pdf.name

        file_size = getsize(temp_pdf_path) / (1024 * 1024)  # Size in MB
        logging.info(f"Processing PDF: {pdf.name}, Size: {file_size:.2f} MB")

        if file_size == 0:
            logging.warning(f"The PDF file '{pdf.name}' is empty.")
            return ""

        pdf_reader = PdfReader(temp_pdf_path)

        try:
            text_from_pdf = "".join(page.extract_text() or "" for page in pdf_reader.pages)
        except Exception as e:
            # Catch specific exception for AES encryption
            if "cryptography&amp;gt;=3.1 is required for AES algorithm" in str(e):
                logging.warning(f"PDF '{pdf.name}' is AES encrypted. Performing OCR.")
                return ocr_on_pdf(temp_pdf_path)
            else:
                raise e

        if not text_from_pdf:
            logging.warning(f"No text extracted from '{pdf.name}'. Performing OCR.")
            return ocr_on_pdf(temp_pdf_path)

        logging.info(f"Processed PDF: {pdf.name}")
        return text_from_pdf

    except Exception as e:
        logging.error(f"Error processing PDF: {pdf.name}. Error: {e}")
        return ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_pdf_text(pdf_docs):
    return "".join(process_pdf(pdf) for pdf in pdf_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Below is the &lt;code&gt;ocr_on_pdf&lt;/code&gt; function with &lt;code&gt;pytesseract&lt;/code&gt; and the &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def ocr_on_pdf(pdf_path):
    try:
        pytesseract.pytesseract.tesseract_cmd = getenv("TESSERACT_PATH")
        images = convert_from_path(pdf_path)
        file_size = path.getsize(pdf_path) / (1024 * 1024)  # Size in MB

        if file_size &amp;gt; 5:  # If file is larger than 5MB
            with ThreadPoolExecutor() as executor:
                extracted_texts = list(executor.map(ocr_single_page, images))
            extracted_text = "\n".join(extracted_texts)
            logging.info(f"Parallel OCR completed for large file: {pdf_path}")
        else:
            extracted_text = "\n".join(ocr_single_page(image) for image in images)
            logging.info(f"Sequential OCR completed for small file: {pdf_path}")

        return extracted_text
    except Exception as e:
        logging.error(f"Error during OCR on PDF: {e}")
        return ""


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Text conversion into vectors, storage and retrieval
&lt;/h2&gt;

&lt;p&gt;Once extracted, &lt;code&gt;langchain&lt;/code&gt; begins to do its &lt;em&gt;'orchestration magic'&lt;/em&gt; by splitting the text into chunks of 1000 characters (with a 200-character overlap) through the &lt;code&gt;CharacterTextSplitter&lt;/code&gt; class:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_text_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The text chunks are vectorized using Langchain's &lt;code&gt;OpenAIEmbeddings&lt;/code&gt; class and stored in the Vector Database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_vectorstore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MongoDBAtlasVectorSearch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mongo_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ATLAS_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mongo_client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MONGODB_DB&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MONGODB_COLLECTION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;vector_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoDBAtlasVectorSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;relevance_score_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
           &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Added &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; embeddings to the vector store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vector_search&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
This is how &lt;em&gt;vectorized texts&lt;/em&gt; appear in MongoDB's GUI:&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feojimki08uzsgrb5n604.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feojimki08uzsgrb5n604.png" alt="mdb-embbedings-1"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
MongoDB Atlas Vector Search stores each text chunk and its vector as a document keyed by an &lt;code&gt;ObjectID&lt;/code&gt;, adhering to the document database model, which simplifies integration with larger applications already using this model:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa7git8vzlse2wrqnkqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa7git8vzlse2wrqnkqk.png" alt="mdb-embbedings-2"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
The &lt;code&gt;get_conversation_chain&lt;/code&gt; function retrieves relevant text from MongoDB and sends it to OpenAI for question answering:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_conversation_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conversation_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConversationalRetrievalChain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conversation_chain&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The geeky 🤓 &lt;em&gt;Cloud Developer&lt;/em&gt; in me was thrilled to see how MongoDB's use of the &lt;a href="https://www.mongodb.com/resources/basics/knn-search" rel="noopener noreferrer"&gt;K-nearest neighbors&lt;/a&gt; (KNN) ML algorithm provided accurate answers. &lt;em&gt;- On a side note, as this algorithm requires a lot of compute power from a database, it would be interesting to explore its performance in a production environment with terabytes of data, but that should be a discussion for another blog&lt;/em&gt;. 📖 👩🏻‍💻 &lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ril1lp650s7gu57aqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ril1lp650s7gu57aqi.png" alt="mdb-embbedings-3"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
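To make the retrieval step concrete, here is a toy sketch of the cosine-similarity scoring behind a KNN vector search (pure Python; the chunk names and 3-dimensional "embeddings" are made up for illustration — real ada-002 embeddings have 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy chunk embeddings (hypothetical values)
chunks = {
    "reset procedure": [0.9, 0.1, 0.0],
    "warranty terms":  [0.1, 0.8, 0.3],
    "install steps":   [0.7, 0.2, 0.1],
}
query = [0.8, 0.1, 0.05]  # the user's question, embedded

# k-nearest neighbors: rank chunks by similarity to the query vector
ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]),
                reverse=True)
print(ranked[0])  # "reset procedure" -- the closest chunk wins
```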

&lt;h2&gt;
  
  
  Streamlit Setup and &lt;em&gt;'Gotchas'&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Throughout the application flow, &lt;code&gt;st.session_state&lt;/code&gt; manages conversation states, vector retrieval, OpenAI token usage, and chat history clearing. Both session state initialization and page configuration must be done at the beginning of the script to avoid potential errors:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;


&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_page_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat with PDF Manuals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_icon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:telephone_receiver:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
In the &lt;code&gt;handle_userinput&lt;/code&gt; function, &lt;code&gt;session_state&lt;/code&gt; manages interactions, tracks OpenAI token usage, and appends to the chat history, giving the user the option to ask follow-up questions:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_userinput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please upload PDFs first or wait until the database is initialized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_openai_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s"&gt;OpenAI Token Usage:&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The &lt;code&gt;clear_chat_history&lt;/code&gt; function, triggered by a button in the main function, resets the conversation state:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clear_chat_history&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Clearing chat history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerun&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Streamlit's default &lt;code&gt;sidebar&lt;/code&gt; in the &lt;code&gt;main&lt;/code&gt; function facilitates multiple PDF uploads:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sidebar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your PDF Manuals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;uploaded_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upload your PDFs here and click on &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Process&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accept_multiple_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;uploaded_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;raw_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pdf_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;text_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_text_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_vectorstore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_conversation_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
While building the UI, I experimented with real-time text extraction display and a progress bar, but these features cluttered the UI. I opted for simplicity, relying on the default &lt;code&gt;st.spinner&lt;/code&gt; for processing feedback.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Understanding application behavior is crucial before deploying to the Cloud. I set up two loggers, both writing to a single &lt;code&gt;.log&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A standard Python logger to observe general application activity&lt;/li&gt;
&lt;li&gt;A dedicated MongoDB performance logger built on the &lt;code&gt;pymongo&lt;/code&gt; &lt;a href="https://pymongo.readthedocs.io/en/stable/api/pymongo/monitoring.html" rel="noopener noreferrer"&gt;monitoring module&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
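&lt;p&gt;A minimal sketch of this dual setup follows; the file name and log format are illustrative placeholders, not the project's exact values:&lt;/p&gt;

```python
import logging

# Route all logging (the application logger plus pymongo's monitoring
# events, which also emit through the logging module) into one .log file.
# The filename is a placeholder.
logging.basicConfig(
    filename="chatbot.log",
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,  # reset any handlers configured earlier
)

app_logger = logging.getLogger(__name__)
app_logger.info("Application started")
```

&lt;p&gt;Because the &lt;code&gt;pymongo&lt;/code&gt; listeners log through the same &lt;code&gt;logging&lt;/code&gt; module, their events land in the same file with no extra wiring.&lt;/p&gt;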



&lt;h3&gt;
  
  
  Application Observability
&lt;/h3&gt;

&lt;p&gt;While processing large PDFs, I monitored the &lt;code&gt;script execution&lt;/code&gt; time by measuring the duration from the start to the end of the &lt;code&gt;main&lt;/code&gt; function. This led to two key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR on PDFs larger than 5MB took considerable time on my M1 MacBook Pro, prompting the addition of 'Parallel Processing' through a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; as a way to avoid performance issues in the Cloud.&lt;/li&gt;
&lt;li&gt;Managed function services like AWS Lambda or GCP Cloud Functions may not be able to handle this application's long processing times. Since I don't plan on maintaining a constantly running VM, this observation indicated that an architecture using &lt;em&gt;Serverless Containers&lt;/em&gt; — such as AWS ECS with Fargate or GCP Cloud Run — would be the optimal deployment approach. These containers would only run when the application is invoked, offering cost-efficiency with the option to autoscale. More on this in future blogs 📝.&lt;/li&gt;
&lt;/ul&gt;
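&lt;p&gt;The 'Parallel Processing' mentioned above can be sketched roughly like this; &lt;code&gt;extract_text&lt;/code&gt; and the file names are placeholders standing in for the real OCR logic:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(pdf_name):
    # Placeholder for the real per-PDF OCR/text-extraction work
    return f"text from {pdf_name}"

pdf_files = ["manual_a.pdf", "manual_b.pdf", "manual_c.pdf"]

# Process several PDFs concurrently instead of one at a time;
# executor.map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(extract_text, pdf_files))
```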

&lt;p&gt;To gauge the cost implications of using OpenAI's foundation model, I tracked API usage using Langchain's &lt;code&gt;get_openai_callback&lt;/code&gt; &lt;a href="https://python.langchain.com/v0.1/docs/modules/model_io/llms/token_usage_tracking/" rel="noopener noreferrer"&gt;functionality&lt;/a&gt;. This made it easier to understand the actual costs associated with each application usage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmsbg56rorp3l8s4l811.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmsbg56rorp3l8s4l811.png" alt="openai-token-usage-mdb-logs-screenshot"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  MongoDB Logs
&lt;/h3&gt;

&lt;p&gt;Coming from a DevOps world 👩🏻‍🏭, and having a passion for understanding databases under the hood, I leveraged this chatbot application to implement &lt;code&gt;pymongo&lt;/code&gt; &lt;code&gt;event_loggers&lt;/code&gt;. I created a class to aggregate the count of successful and failed operations and their average duration each time the program ran:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# MongoDB Event Listeners
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AggregatedCommandLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitoring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommandListener&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;succeeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_micros&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;database_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;database_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Command failed: operation_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, duration_micros=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_micros&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, database_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_and_reset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;avg_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MongoDB operations summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total operations, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;average duration: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_duration&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; microseconds. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Operations per database: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Reset counters
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;


&lt;span class="n"&gt;aggregated_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AggregatedCommandLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;monitoring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregated_logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_mongodb_summary&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;aggregated_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize_and_reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The results aligned with expectations — no errors occurred, and operations between my local machine and MongoDB Atlas were swift and reliable. By building these &lt;code&gt;pymongo.monitoring&lt;/code&gt; &lt;code&gt;event_loggers&lt;/code&gt;, I preemptively simplified potential troubleshooting in a Cloud infrastructure, while also gaining insights into the appropriate MongoDB database size for real-world use.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;All of the environment variables, such as the OpenAI API key, Tesseract CLI location, MongoDB connection string, and database and collection names, were securely stored in the &lt;code&gt;.env&lt;/code&gt; file - &lt;em&gt;I added a sample &lt;code&gt;.env&lt;/code&gt; to the&lt;/em&gt; &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/blob/master/sample-dotenv-file.txt" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;br&gt;
For Cloud deployment, these variables will be managed via a &lt;em&gt;Secrets Manager&lt;/em&gt; — either AWS or GCP — ensuring consistent security practices across environments.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
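&lt;p&gt;For illustration, this is roughly what loading those variables looks like; the snippet below is a hand-rolled stand-in for &lt;code&gt;python-dotenv&lt;/code&gt;'s &lt;code&gt;load_dotenv()&lt;/code&gt;, and the variable names are examples only:&lt;/p&gt;

```python
import os

def load_env(path=".env"):
    # Load KEY=VALUE pairs into the environment,
    # skipping blank lines and comments.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```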

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This application showcases a blend of open-source tools, observability practices, and database management, offering a blueprint for scaling in AWS or GCP. Building it from scratch with a cloud-centric vision helped identify and address potential issues early on. The main challenge was handling different types of PDFs, balancing cost-efficiency, speed, and security.&lt;/p&gt;

&lt;p&gt;Future improvements for the 🤖 include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding a 'Web URL Input' for users to upload a file or provide a PDF URL&lt;/li&gt;
&lt;li&gt;Implementing PDF metadata extraction and storing it in a separate MongoDB Atlas Database, allowing users to track previously vectorized PDFs and ask questions about them&lt;/li&gt;
&lt;li&gt;Introducing a dropdown box in the UI to view available PDF file names&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>pdftextextraction</category>
      <category>python</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Tagging Made Easy: Automating Resource Labeling in AWS with Lambda and Resource Explorer</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Sun, 21 Jan 2024 01:33:11 +0000</pubDate>
      <link>https://forem.com/amandaruzza/tagging-made-easy-automating-resource-labeling-in-aws-with-lambda-and-resource-explorer-1e2l</link>
      <guid>https://forem.com/amandaruzza/tagging-made-easy-automating-resource-labeling-in-aws-with-lambda-and-resource-explorer-1e2l</guid>
      <description>&lt;p&gt;Tags and labels are one of the most important elements for resource inventory, compliance and cost savings.&lt;/p&gt;

&lt;p&gt;Managing tags in a growing AWS infrastructure can be cumbersome and time-consuming. Manually tagging resources often leads to inconsistencies and inaccurate data, hindering cost optimization, resource management, and compliance efforts. Adding new tags across all existing resources in a specific account, especially across multiple regions, can be a daunting task. &lt;/p&gt;

&lt;p&gt;All Cloud providers recommend tagging/labeling resources at creation time; however, there are situations in which an organization might decide to add new tags to its existing resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s a Python script that leverages the power of AWS’ &lt;a href="https://aws.amazon.com/resourceexplorer/"&gt;“Resource Explorer”&lt;/a&gt; - a powerful search and discovery service - and &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/resource-explorer-2.html"&gt;Boto3&lt;/a&gt; - AWS’ SDK for Python - to automate the process of adding missing tags, saving time and ensuring consistency. This solution can be easily implemented by SREs, DevOps, and/or Cloud Engineers looking to improve resource organization, inventory, and cost management.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Scenario:&lt;/strong&gt;&lt;/u&gt;  
&lt;/h2&gt;

&lt;p&gt;In this fictional example, imagine that a University is restructuring its infrastructure and its &lt;em&gt;‘tagging strategy.’&lt;/em&gt; The University's CTO has decided to re-tag an entire existing AWS account for the “College of Liberal Arts and the Philosophy Department.” This Python script searches for two tags:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;in the two AWS Regions in which the &lt;em&gt;College of Liberal Arts and the Philosophy Department&lt;/em&gt; resources are located:&lt;br&gt;
&lt;em&gt;’us-east-1’&lt;/em&gt; and &lt;em&gt;‘us-east-2’&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This script is currently written as an AWS Lambda function, with the &lt;code&gt;Lambda Handler&lt;/code&gt; set up; however, it could easily be refactored into a simple &lt;code&gt;Boto3&lt;/code&gt; script executed from a local machine.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Code Breakdown:&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;This section describes the dependencies and functions that make up the AWS Auto Tagging Solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies:&lt;/strong&gt;&lt;br&gt;
These are the necessary dependencies for the Lambda Function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from botocore.exceptions  import ClientError
from botocore.config import Config
import logging
import json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Logger setup:&lt;/strong&gt;&lt;br&gt;
As a good habit, I added a Logger for possible debugging, either in testing or production stages. This also provides data for further analysis of the script's performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lambda Handler:&lt;/strong&gt;&lt;br&gt;
The core of this script lies in the &lt;code&gt;Resource Explorer Boto3 Client&lt;/code&gt;, which is added to the &lt;code&gt;Lambda Handler&lt;/code&gt;. ‘Resource Explorer’ searches through all the resources in the AWS account and inspects their tags.&lt;br&gt;
If the service does not find the &lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt; tag keys in the &lt;em&gt;’us-east-1’&lt;/em&gt; and &lt;em&gt;‘us-east-2’&lt;/em&gt; regions, it automatically adds them with the &lt;code&gt;apply_tags&lt;/code&gt; function - called inside &lt;code&gt;def lambda_handler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lambda_handler(event, context):
    logger.debug('Incoming Event')
    logger.debug(event)

    resource_explorer_client = boto3.client(
        'resource-explorer-2',
    )

    missing_philosophy_tag = get_resources_missing_tag(resource_explorer_client, 'philosophy')
    missing_liberal_arts_tag = get_resources_missing_tag(resource_explorer_client, 'liberal-arts')

    logger.info(f"# of Resources Missing 'philosophy' {missing_philosophy_tag['Count']['TotalResources']} - Complete List? {missing_philosophy_tag['Count']['Complete']}")
    logger.info(f"# of Resources Missing 'liberal-arts' {missing_liberal_arts_tag['Count']['TotalResources']} - Complete List? {missing_liberal_arts_tag['Count']['Complete']}")

    map_philosophy_arns=[]
    for this_resource in missing_philosophy_tag['Resources']:
        map_philosophy_arns.append(this_resource['Arn'])
    logger.info(f"The Map Philosophy ARN:{map_philosophy_arns}")

    map_liberal_arts_arns=[]
    for this_resource in missing_liberal_arts_tag['Resources']:
        map_liberal_arts_arns.append(this_resource['Arn'])
    logger.info(f"The Map Liberal Arts ARN:{map_liberal_arts_arns}")

    apply_tags(map_philosophy_arns, {'philosophy': 'phil-dept-server'})

    apply_tags(map_liberal_arts_arns, {'liberal-arts': 'la-dept-server'})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Missing Tags:&lt;/strong&gt;&lt;br&gt;
'Resource Explorer' looks for the specified tags in all the resources available in the current account in which the solution is being deployed. Since 'Resource Explorer' will only provide 100 results (resources) at a time, I added a paginator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_resources_missing_tag(client, tag_name): 
    return (
        client.get_paginator('search')
            .paginate(QueryString=f"-tag.key:{tag_name}")
            .build_full_result()
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Applying Tags:&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ResourceGroupsTaggingAPI&lt;/code&gt; efficiently applies the identified missing tags, ensuring consistency and optimizing resource management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def apply_tags(list_of_resources, tag_map):
    resources_by_region = return_resources_by_region(list_of_resources)
    counter = 0
    for this_resource in list_of_resources:
        counter += 1
        logger.info(f"{counter}) Add tag '{tag_map.keys()}' to '{this_resource}'")
    # iterates over regions and applies the tag map in each one:
    regions = ['us-east-1', 'us-east-2']
    for region in regions:
        tagging_client = boto3.client('resourcegroupstaggingapi', region_name=region)
        if resources_by_region[region]:
            tagging_client.tag_resources(
                ResourceARNList=resources_by_region[region],
                Tags=tag_map
            )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Returning a list of tagged resources:&lt;/strong&gt;&lt;br&gt;
The final portion of the script returns a dictionary with a list of all the newly tagged resources, separated by their respective regions - &lt;em&gt;'us-east-1', 'us-east-2'&lt;/em&gt; - and writes these results into a &lt;em&gt;JSON file&lt;/em&gt;. For enhanced tracking and auditing capabilities, a future implementation could leverage &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html"&gt;DynamoDB&lt;/a&gt; to store the &lt;em&gt;JSON file&lt;/em&gt;, providing a detailed history of tagging changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def return_resources_by_region(resources_for_all_regions):
    resources_by_region = dict()
    regions = ['us-east-1', 'us-east-2']
    for region_name in regions:
        resources_by_region[region_name] = [arn for arn in resources_for_all_regions if region_name in arn]

        logger.info(f"The {region_name} resources are: \n {resources_by_region[region_name]} \n") 

    return resources_by_region

def format_in_json(response):
    return json.dumps(response, indent=4, sort_keys=True, default=str)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
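&lt;p&gt;As a quick sanity check, here is how &lt;code&gt;return_resources_by_region&lt;/code&gt; behaves with a couple of made-up ARNs (the function is repeated without its logging line so the snippet is self-contained):&lt;/p&gt;

```python
def return_resources_by_region(resources_for_all_regions):
    # Group ARNs by the region name that appears inside each ARN string
    resources_by_region = dict()
    regions = ['us-east-1', 'us-east-2']
    for region_name in regions:
        resources_by_region[region_name] = [
            arn for arn in resources_for_all_regions if region_name in arn
        ]
    return resources_by_region

arns = [
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc",
    "arn:aws:rds:us-east-2:123456789012:db:phil-dept-db",
]
grouped = return_resources_by_region(arns)
```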



&lt;p&gt;This Lambda function can be scheduled to run once a week as a simple &lt;em&gt;auto-tagging solution&lt;/em&gt;, or be triggered by an event, such as the creation of a new resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag enforcement:&lt;/strong&gt;&lt;br&gt;
A long-term approach to an account-level 'Tagging' requirement would be to implement an &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html"&gt;SCP&lt;/a&gt; (Service Control Policy) once this Lambda function has applied the new tags to the current infrastructure. This SCP would deny the creation of any new resources that don't contain the tags required by the Organization - in this example: &lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt;.&lt;/p&gt;
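&lt;p&gt;A hypothetical SCP statement along these lines would deny launching EC2 instances that lack the &lt;em&gt;‘philosophy’&lt;/em&gt; tag; the action list and condition keys would need to be broadened to cover the Organization's other resource types:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedInstanceCreation",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/philosophy": "true" }
      }
    }
  ]
}
```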

&lt;p&gt;Click &lt;a href="https://github.com/Amanda-Ruzza/aws-lambda-auto-tagger.git"&gt;here&lt;/a&gt; for the full script available on GitHub, plus installation instructions for the required &lt;code&gt;Boto3&lt;/code&gt; and &lt;code&gt;Botocore&lt;/code&gt; packages.&lt;/p&gt;

</description>
      <category>awslambda</category>
      <category>python</category>
      <category>resourceexplorer</category>
      <category>resourcetags</category>
    </item>
    <item>
      <title>My Approach to Passing the Professional Cloud Developer Exam (First Try!)</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Tue, 02 Jan 2024 21:15:18 +0000</pubDate>
      <link>https://forem.com/amandaruzza/my-approach-to-passing-the-professional-cloud-developer-exam-first-try-4pel</link>
      <guid>https://forem.com/amandaruzza/my-approach-to-passing-the-professional-cloud-developer-exam-first-try-4pel</guid>
      <description>&lt;p&gt;I decided to study for this exam as a way for me to improve my overall knowledge in GCP, while looking at different application development approaches for my work and personal projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exam Content:
&lt;/h2&gt;

&lt;p&gt;The 60 questions aligned with the official exam guide and were heavily focused on approaches to deploying, modernizing, or troubleshooting applications following Google’s principles of SRE, DevOps, and Security. Here are some focus topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes and GKE&lt;/li&gt;
&lt;li&gt;Serverless solutions with Pub/Sub, Cloud Functions, and Cloud Run&lt;/li&gt;
&lt;li&gt;Data Modeling for different databases&lt;/li&gt;
&lt;li&gt;Authorization and authentication: IAM, Workload Identity, JWT tokens, etc.&lt;/li&gt;
&lt;li&gt;Integrations with Operations Suite&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;p&gt;As part of my study methodology, I like to use many resources, aiming to understand things from different points of view and make sure that I’m not just ‘memorizing’ the audio from a lecture or the text from the documentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ranga Karanam's PCD course on &lt;a href="https://www.udemy.com/course/google-cloud-certified-professional-cloud-developer/"&gt;Udemy&lt;/a&gt;&lt;br&gt;
Ranga is an excellent teacher. However, certain topics, such as Data Modeling and GKE/Kubernetes, were a bit difficult to grasp due to the lack of visual diagrams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alex Levkovich's practice tests on &lt;a href="https://www.udemy.com/course/full-practice-exams-google-professional-cloud-developer/"&gt;Udemy&lt;/a&gt;&lt;br&gt;
His practice tests were up to date with the current exam content, and I also appreciated being able to write him questions about some of the quizzes; he always wrote back with in-depth explanations of the ‘why’ behind certain solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some of the Google Cloud Skills Boost labs/courses.&lt;br&gt;
Many of the labs on the PCD learning path were outdated and buggy, so I wasn’t able to finish them; the GKE labs, however, were great.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCP guided labs from the official documentation for Cloud Functions, Cloud Build, Firestore, Pub/Sub and Spanner&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even though certain things were outdated, I used the lectures and labs from the &lt;a href="https://www.pluralsight.com/cloud-guru?exp=3"&gt;A Cloud Guru&lt;/a&gt; PCD course&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The following official GCP playlists and videos from YouTube&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=ZLI8sknDNWw"&gt;Modern CI/CD on GCP&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqLOiLXJDlN-wKd0g7hf_9vC&amp;amp;si=fqqv4tS5SHHUIdHS"&gt;Engineering for Reliability&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqL3xfXz5xJvwzFW_tlQB_GB&amp;amp;si=rWFG9yXuXEnqXWGE"&gt;Kubernetes Best Practices&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqKJx6FwJMRcsnFIkkNFtsX9&amp;amp;si=MsSFg7XvlK_StJyD"&gt;Beyond your GCP Bill&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nana’s incredible &lt;a href="https://youtube.com/playlist?list=PLy7NrYWoggjziYQIDorlXjTvvwweTYoNC&amp;amp;si=R0tacN-Yf_h9Ch3g"&gt;Kubernetes tutorials&lt;/a&gt; on YouTube&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anton Putra’s YouTube tutorial on &lt;a href="https://youtu.be/lxc4EXZOOvE?si=jEJ2lkU272wZnzCz"&gt;Kubernetes deployment strategies&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Study Approach:
&lt;/h2&gt;

&lt;p&gt;I strive to be efficient with my studies. As with my previous certification preparation, my main goal wasn’t to ‘pass the exam so I could have a badge.’ I wanted to make sure that I really learned what was featured in the test, and that I could use these concepts and newly acquired knowledge to become a better developer. I went beyond lectures and practice tests, doing hands-on labs and self-created exercises on areas where I felt I needed improvement.&lt;/p&gt;

&lt;p&gt;With that in mind, here was my methodology:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Daily Minimum:&lt;/strong&gt; I carved out at least 1 hour every day, even if it meant dawn sessions or airplane lectures. I found consistency trumps cramming when it comes to certification prep! &lt;em&gt;[ yes, I was attending AWS re:Invent while preparing for this exam. That wasn’t an excuse, though; I still made sure to put in at least 1 hour every day, early in the morning, before heading to the incredible re:Invent sessions! ]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One Take &amp;amp; Reflection:&lt;/strong&gt; I gave each of Ranga's lectures a single watch, no matter how complicated the concepts felt. I'd then pause and ask myself: "What can I use from this in a real project?" This kept me learning actively, not just passively listening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Practice Tests:&lt;/strong&gt; After the initial chapters, I tackled the official GCP sample exam questions. Then I dove into the related docs, taking notes and recording myself explaining answers or picturing solutions. This active interrogation solidified my understanding and made me focus even more on the lectures from Ranga’s course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Hands-on Labs:&lt;/strong&gt; I went through labs and the extra resources mentioned here. It was fun to practice CLI commands, tweaking Kubernetes YAML, deploying Cloud Functions and Cloud Run apps, and modeling Spanner, Firestore, and Bigtable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Problem Solving:&lt;/strong&gt; I dug into practice tests from Udemy and A Cloud Guru, focusing on understanding the ‘problem’ - i.e., the practice question. I'd record voice memos analyzing the "why" and then do labs related to question-specific topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Walking and Visualization:&lt;/strong&gt; My daily walks and errands doubled as study time. I'd listen to my voice memos, replaying and refining my grasp of GCP solutions. By the end, I had 53 recordings (15-25 minutes each) of my own explanations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UfQGQvdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p0qnabif7g6qk2hpdiyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UfQGQvdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p0qnabif7g6qk2hpdiyf.png" alt="Study Tool for GCP Certification" width="800" height="1243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this study process I felt productive and excited about putting all this knowledge into practice in the real world, while also building the confidence to take the exam on December 28th, 2023. I passed it on my first attempt, and felt that this preparation journey was 100% worth my time. I’d totally recommend studying for this certification. My goal with this blog is to show you that, with a mix of different resources, a daily study routine, and practical application, this exam will improve your insight into GCP’s incredible possibilities for application development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HUsXc-Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bifqvh0hlsm30l2kpdms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HUsXc-Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bifqvh0hlsm30l2kpdms.png" alt="Google Professional Cloud Developer Certification" width="792" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>professionalclouddeveloper</category>
      <category>certificationstudytechniques</category>
      <category>gcpcertification</category>
      <category>googlecloud</category>
    </item>
  </channel>
</rss>
