<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Amanda Ruzza</title>
    <description>The latest articles on Forem by Amanda Ruzza (@amandaruzza).</description>
    <link>https://forem.com/amandaruzza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1246885%2Fd39cfd89-1ab8-4a03-9dd7-6ebe8a2037f7.JPG</url>
      <title>Forem: Amanda Ruzza</title>
      <link>https://forem.com/amandaruzza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/amandaruzza"/>
    <language>en</language>
    <item>
      <title>Pre-Cloud Development Chatbot with Streamlit, Langchain, OpenAI and MongoDB Atlas Vector Search</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Tue, 30 Jul 2024 00:26:40 +0000</pubDate>
      <link>https://forem.com/amandaruzza/pre-cloud-development-chatbot-with-streamlit-langchain-openai-and-mongodb-atlas-vector-search-43l</link>
      <guid>https://forem.com/amandaruzza/pre-cloud-development-chatbot-with-streamlit-langchain-openai-and-mongodb-atlas-vector-search-43l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog, I’ll discuss how I built a Retrieval-Augmented Generation (RAG) system capable of processing and retrieving information from multiple PDFs on my local machine, with the end goal of deploying it at a production level in AWS and GCP.&lt;/p&gt;

&lt;p&gt;With cost, security, and performance in mind, I explored affordable alternatives for handling terabytes of data in a real-world scenario. It's crucial to recognize that not all PDFs are created equal. Developers must handle various PDF text extraction challenges, such as AES encryption, watermarks, or slow processing times, to ensure a smooth user experience.&lt;/p&gt;
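As a minimal illustration of that triage (the helper name and heuristic are my own, not the application's actual code), a small check can decide when plain text extraction should fall back to OCR:

```python
def needs_ocr(extracted_text: str) -> bool:
    """Heuristic: fall back to OCR when extraction yields no usable text.

    Image-only pages, background watermarks, or encrypted content often
    produce empty or whitespace-only output from plain text extraction.
    """
    return not extracted_text.strip()

print(needs_ocr("   \n\t"))    # scanned/image-only page -> True, OCR needed
print(needs_ocr("Chapter 1"))  # real text extracted -> False
```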

&lt;p&gt;While managed AWS and GCP services could handle PDF processing, their cost makes them impractical at production scale. Therefore, I developed a solution using two open-source tools: &lt;code&gt;PyPDF&lt;/code&gt; and &lt;code&gt;PyTesseract&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, I implemented what I call &lt;em&gt;'pre-cloud-development-observability'&lt;/em&gt; features, such as OpenAI token usage and API costs, application execution time, and MongoDB-specific operation metrics, all logged for analysis. &lt;em&gt;After all, who doesn't enjoy delving into log files to optimize performance?&lt;/em&gt; 🙋🏻‍♀️&lt;/p&gt;
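A minimal sketch of the execution-time logging idea (the decorator name and log format are illustrative, not the application's actual code):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_execution_time(func):
    """Log how long a function takes -- simple pre-cloud observability."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logging.info(f"{func.__name__} finished in {elapsed:.2f}s")
    return wrapper

@log_execution_time
def process_documents(count: int) -> int:
    time.sleep(0.1)  # stand-in for PDF processing work
    return count

process_documents(3)  # logs e.g. "process_documents finished in 0.10s"
```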

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;This blog is an in-depth explanation of this application. For the Setup Guide and the Python application script, refer to the &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/tree/master" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;u&gt;Application stack:&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; - Front End&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; - LLM/Foundation Model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langchain&lt;/strong&gt; - NLP Orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB Atlas Vector Search&lt;/strong&gt; - Cloud-based Vector Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dotenv&lt;/strong&gt; - Local secret management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPDF&lt;/strong&gt; - PDF text extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTesseract&lt;/strong&gt; - OCR on AES-encrypted PDFs, or on PDFs with background images that would otherwise yield an empty text extraction
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;



&lt;ul&gt;
&lt;li&gt;API keys and tokens kept out of the codebase in the &lt;code&gt;.env&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Processes multiple files - up to 200MB - in a single upload operation&lt;/li&gt;
&lt;li&gt;Answers questions based on pre-processed documents already stored in the database - no need to re-upload the same PDFs&lt;/li&gt;
&lt;li&gt;Text extraction from AES-encrypted PDFs or those with background images&lt;/li&gt;
&lt;li&gt;Parallel text extraction for PDFs &amp;gt; 5MB for improved performance&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;'Clear Chat History'&lt;/em&gt; button&lt;/li&gt;
&lt;li&gt;Observability/logging features for future Cloud Development considerations:

&lt;ul&gt;
&lt;li&gt;Langchain &lt;code&gt;callback&lt;/code&gt; function that calculates OpenAI token usage&lt;/li&gt;
&lt;li&gt;MongoDB operation specific logs recorded through the &lt;code&gt;pymongo&lt;/code&gt; driver&lt;/li&gt;
&lt;li&gt;Script execution time measurement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Demo Video
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/E0RpmGbmKEg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;



&lt;h2&gt;
  
  
  System Architecture Overview
&lt;/h2&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ez4tgc3pmjtp2lrf276.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ez4tgc3pmjtp2lrf276.png" alt="Architecture Diagram"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;The entire application runs from one Python file named &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/blob/master/chatbot-app.py" rel="noopener noreferrer"&gt;&lt;code&gt;chatbot-app.py&lt;/code&gt;&lt;/a&gt;. The UI, built with Streamlit, processes PDFs using either simple text extraction or OCR. Langchain serves as the application's 'master brain,' creating vector embeddings, sending them to the database, and communicating with OpenAI's foundation model.&lt;/p&gt;

&lt;h2&gt;
  
  
  PDF upload and text extraction
&lt;/h2&gt;

&lt;p&gt;&lt;br&gt;
Two Python packages are used for text extraction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypdf.readthedocs.io/en/stable/user/extract-text.html" rel="noopener noreferrer"&gt;&lt;code&gt;PyPDF&lt;/code&gt;&lt;/a&gt; for regular PDFs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/pytesseract/" rel="noopener noreferrer"&gt;&lt;code&gt;pytesseract&lt;/code&gt;&lt;/a&gt; for OCR on PDFs requiring it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users upload multiple files (up to Streamlit's 200MB limit) in the UI's Sidebar and click 'Process'. Streamlit then invokes the &lt;code&gt;get_pdf_text&lt;/code&gt; function, which is part of the &lt;code&gt;process_pdf&lt;/code&gt; logic. &lt;code&gt;process_pdf&lt;/code&gt; attempts text extraction in the following order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple extraction with &lt;code&gt;PyPDF&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If extraction fails or an error occurs (e.g., due to encryption or a background watermark), &lt;code&gt;ocr_on_pdf&lt;/code&gt; is invoked for OCR processing, using parallel processing for files &amp;gt; 5MB through a &lt;a href="https://www.geeksforgeeks.org/how-to-use-threadpoolexecutor-in-python3/" rel="noopener noreferrer"&gt;&lt;code&gt;ThreadPoolExecutor&lt;/code&gt;&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_pdf(pdf):
    try:
        with tempfile.NamedTemporaryFile(delete=False) as temp_pdf:
            temp_pdf.write(pdf.read())
            temp_pdf_path = temp_pdf.name

        file_size = getsize(temp_pdf_path) / (1024 * 1024)  # Size in MB
        logging.info(f"Processing PDF: {pdf.name}, Size: {file_size:.2f} MB")

        if file_size == 0:
            logging.warning(f"The PDF file '{pdf.name}' is empty.")
            return ""

        pdf_reader = PdfReader(temp_pdf_path)

        try:
            text_from_pdf = "".join(page.extract_text() or "" for page in pdf_reader.pages)
        except Exception as e:
            # Catch specific exception for AES encryption
            if "cryptography&amp;gt;=3.1 is required for AES algorithm" in str(e):
                logging.warning(f"PDF '{pdf.name}' is AES encrypted. Performing OCR.")
                return ocr_on_pdf(temp_pdf_path)
            else:
                raise e

        if not text_from_pdf:
            logging.warning(f"No text extracted from '{pdf.name}'. Performing OCR.")
            return ocr_on_pdf(temp_pdf_path)

        logging.info(f"Processed PDF: {pdf.name}")
        return text_from_pdf

    except Exception as e:
        logging.error(f"Error processing PDF: {pdf.name}. Error: {e}")
        return ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_pdf_text(pdf_docs):
    return "".join(process_pdf(pdf) for pdf in pdf_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Below is the &lt;code&gt;ocr_on_pdf&lt;/code&gt; function with &lt;code&gt;pytesseract&lt;/code&gt; and the &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def ocr_on_pdf(pdf_path):
    try:
        pytesseract.pytesseract.tesseract_cmd = getenv("TESSERACT_PATH")
        images = convert_from_path(pdf_path)
        file_size = path.getsize(pdf_path) / (1024 * 1024)  # Size in MB

        if file_size &amp;gt; 5:  # If file is larger than 5MB
            with ThreadPoolExecutor() as executor:
                extracted_texts = list(executor.map(ocr_single_page, images))
            extracted_text = "\n".join(extracted_texts)
            logging.info(f"Parallel OCR completed for large file: {pdf_path}")
        else:
            extracted_text = "\n".join(ocr_single_page(image) for image in images)
            logging.info(f"Sequential OCR completed for small file: {pdf_path}")

        return extracted_text
    except Exception as e:
        logging.error(f"Error during OCR on PDF: {e}")
        return ""


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Text conversion into vectors, storage and retrieval
&lt;/h2&gt;

&lt;p&gt;Once extracted, &lt;code&gt;langchain&lt;/code&gt; begins to do its &lt;em&gt;'orchestration magic'&lt;/em&gt; by splitting the text into chunks of 1000 characters (with a 200-character overlap) through the &lt;code&gt;CharacterTextSplitter&lt;/code&gt; class:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_text_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The text chunks are vectorized using Langchain's &lt;code&gt;OpenAIEmbeddings&lt;/code&gt; class and stored in the Vector Database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_vectorstore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MongoDBAtlasVectorSearch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mongo_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ATLAS_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mongo_client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MONGODB_DB&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MONGODB_COLLECTION&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;vector_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoDBAtlasVectorSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;relevance_score_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
           &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Added &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; embeddings to the vector store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vector_search&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
This is how &lt;em&gt;vectorized texts&lt;/em&gt; appear in MongoDB's GUI:&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feojimki08uzsgrb5n604.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feojimki08uzsgrb5n604.png" alt="mdb-embbedings-1"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
MongoDB Atlas Vector Search stores each text chunk and its vector as a document keyed by an &lt;code&gt;ObjectID&lt;/code&gt;, adhering to the document database model, which simplifies integration with larger applications already using this model:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa7git8vzlse2wrqnkqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa7git8vzlse2wrqnkqk.png" alt="mdb-embbedings-2"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
The &lt;code&gt;get_conversation_chain&lt;/code&gt; function retrieves relevant text from MongoDB and sends it to OpenAI for question answering:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_conversation_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conversation_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConversationalRetrievalChain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conversation_chain&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The geeky 🤓 &lt;em&gt;Cloud Developer&lt;/em&gt; in me was thrilled to see how MongoDB's use of the &lt;a href="https://www.mongodb.com/resources/basics/knn-search" rel="noopener noreferrer"&gt;K-nearest neighbors&lt;/a&gt; (KNN) ML algorithm provided accurate answers. &lt;em&gt;- On a side note, as this algorithm requires a lot of compute power from a database, it would be interesting to explore its performance in a production environment with terabytes of data, but that should be a discussion for another blog&lt;/em&gt;. 📖 👩🏻‍💻 &lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ril1lp650s7gu57aqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56ril1lp650s7gu57aqi.png" alt="mdb-embbedings-3"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
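To make the retrieval step concrete, here is a toy sketch of the cosine-similarity scoring behind a KNN vector search (pure Python; the chunk names and 3-dimensional "embeddings" are made up for illustration — real ada-002 embeddings have 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy chunk embeddings (hypothetical values)
chunks = {
    "reset procedure": [0.9, 0.1, 0.0],
    "warranty terms":  [0.1, 0.8, 0.3],
    "install steps":   [0.7, 0.2, 0.1],
}
query = [0.8, 0.1, 0.05]  # the user's question, embedded

# k-nearest neighbors: rank chunks by similarity to the query vector
ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]),
                reverse=True)
print(ranked[0])  # "reset procedure" -- the closest chunk wins
```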

&lt;h2&gt;
  
  
  Streamlit Setup and &lt;em&gt;'Gotchas'&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Throughout the application flow, &lt;code&gt;st.session_state&lt;/code&gt; manages conversation states, vector retrieval, OpenAI token usage, and chat history clearing. Both session state initialization and page configuration must be done at the beginning of the script to avoid potential errors:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;


&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_page_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat with PDF Manuals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_icon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:telephone_receiver:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
In the &lt;code&gt;handle_userinput&lt;/code&gt; function, &lt;code&gt;session_state&lt;/code&gt; manages interactions, tracks OpenAI token usage, and appends to the chat history, giving the user the option to ask follow-up questions:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_userinput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please upload PDFs first or wait until the database is initialized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_openai_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s"&gt;OpenAI Token Usage:&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The &lt;code&gt;clear_chat_history&lt;/code&gt; function, triggered by a button in the main function, resets the conversation state:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clear_chat_history&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Clearing chat history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerun&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Streamlit's default &lt;code&gt;sidebar&lt;/code&gt; in the &lt;code&gt;main&lt;/code&gt; function facilitates multiple PDF uploads:&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sidebar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your PDF Manuals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;uploaded_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upload your PDFs here and click on &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Process&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accept_multiple_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;uploaded_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;raw_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pdf_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;text_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_text_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_vectorstore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_conversation_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
While building the UI, I experimented with real-time text extraction display and a progress bar, but these features cluttered the UI. I opted for simplicity, relying on the default &lt;code&gt;st.spinner&lt;/code&gt; for processing feedback.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Understanding application behavior is crucial before deploying to the Cloud. I set up two loggers, both writing to a single &lt;code&gt;.log&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A standard Python logger to observe general application activity&lt;/li&gt;
&lt;li&gt;A dedicated MongoDB performance logger built on the &lt;code&gt;pymongo&lt;/code&gt; &lt;a href="https://pymongo.readthedocs.io/en/stable/api/pymongo/monitoring.html" rel="noopener noreferrer"&gt;monitoring module&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
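&lt;p&gt;A minimal sketch of this dual setup follows; the file name and log format are illustrative placeholders, not the project's exact values:&lt;/p&gt;

```python
import logging

# Route all logging (the application logger plus pymongo's monitoring
# events, which also emit through the logging module) into one .log file.
# The filename is a placeholder.
logging.basicConfig(
    filename="chatbot.log",
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,  # reset any handlers configured earlier
)

app_logger = logging.getLogger(__name__)
app_logger.info("Application started")
```

&lt;p&gt;Because the &lt;code&gt;pymongo&lt;/code&gt; listeners log through the same &lt;code&gt;logging&lt;/code&gt; module, their events land in the same file with no extra wiring.&lt;/p&gt;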



&lt;h3&gt;
  
  
  Application Observability
&lt;/h3&gt;

&lt;p&gt;While processing large PDFs, I monitored the &lt;code&gt;script execution&lt;/code&gt; time by measuring the duration from the start to the end of the &lt;code&gt;main&lt;/code&gt; function. This led to two key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR on PDFs larger than 5MB took considerable time on my M1 MacBook Pro, prompting the addition of 'Parallel Processing' through a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; as a way to avoid performance issues in the Cloud.&lt;/li&gt;
&lt;li&gt;Managed function services like AWS Lambda or GCP Cloud Functions may not be able to handle this application's long processing times. Since I don't plan on maintaining a constantly running VM, this observation indicated that an architecture using &lt;em&gt;Serverless Containers&lt;/em&gt; — such as AWS ECS with Fargate or GCP Cloud Run — would be the optimal deployment approach. These containers would only run when the application is invoked, offering cost-efficiency with the option to autoscale. More on this in future blogs 📝.&lt;/li&gt;
&lt;/ul&gt;
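&lt;p&gt;The 'Parallel Processing' mentioned above can be sketched roughly like this; &lt;code&gt;extract_text&lt;/code&gt; and the file names are placeholders standing in for the real OCR logic:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(pdf_name):
    # Placeholder for the real per-PDF OCR/text-extraction work
    return f"text from {pdf_name}"

pdf_files = ["manual_a.pdf", "manual_b.pdf", "manual_c.pdf"]

# Process several PDFs concurrently instead of one at a time;
# executor.map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(extract_text, pdf_files))
```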

&lt;p&gt;To gauge the cost implications of using OpenAI's foundation model, I tracked API usage using Langchain's &lt;code&gt;get_openai_callback&lt;/code&gt; &lt;a href="https://python.langchain.com/v0.1/docs/modules/model_io/llms/token_usage_tracking/" rel="noopener noreferrer"&gt;functionality&lt;/a&gt;. This made it easier to understand the actual costs associated with each application usage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmsbg56rorp3l8s4l811.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmsbg56rorp3l8s4l811.png" alt="openai-token-usage-mdb-logs-screenshot"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  MongoDB Logs
&lt;/h3&gt;

&lt;p&gt;Coming from a DevOps world 👩🏻‍🏭, and having a passion for understanding databases under the hood, I leveraged this chatbot application to implement &lt;code&gt;pymongo&lt;/code&gt; &lt;code&gt;event_loggers&lt;/code&gt;. I created a class to aggregate the count of successful and failed operations and their average duration each time the program ran:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# MongoDB Event Listeners
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AggregatedCommandLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitoring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommandListener&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;succeeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_micros&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;database_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;database_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Command failed: operation_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, duration_micros=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_micros&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, database_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_and_reset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;avg_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MongoDB operations summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total operations, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;average duration: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_duration&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; microseconds. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Operations per database: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Reset counters
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operation_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;


&lt;span class="n"&gt;aggregated_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AggregatedCommandLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;monitoring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregated_logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_mongodb_summary&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;aggregated_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize_and_reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The results aligned with expectations — no errors occurred, and operations between my local machine and MongoDB Atlas were swift and reliable. By building these &lt;code&gt;pymongo.monitoring&lt;/code&gt; &lt;code&gt;event_loggers&lt;/code&gt;, I preemptively simplified potential troubleshooting in a Cloud infrastructure, while also gaining insights into the appropriate MongoDB database size for real-world use.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;All of the environment variables, such as the OpenAI API key, Tesseract CLI location, MongoDB connection string, and database and collection names, were securely stored in the &lt;code&gt;.env&lt;/code&gt; file - &lt;em&gt;I added a sample &lt;code&gt;.env&lt;/code&gt; to the&lt;/em&gt; &lt;a href="https://github.com/Amanda-Ruzza/rag-pdf-mongodb-local/blob/master/sample-dotenv-file.txt" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;br&gt;
For Cloud deployment, these variables will be managed via a &lt;em&gt;Secrets Manager&lt;/em&gt; — either AWS or GCP — ensuring consistent security practices across environments.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
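&lt;p&gt;For illustration, this is roughly what loading those variables looks like; the snippet below is a hand-rolled stand-in for &lt;code&gt;python-dotenv&lt;/code&gt;'s &lt;code&gt;load_dotenv()&lt;/code&gt;, and the variable names are examples only:&lt;/p&gt;

```python
import os

def load_env(path=".env"):
    # Load KEY=VALUE pairs into the environment,
    # skipping blank lines and comments.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```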

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This application showcases a blend of open-source tools, observability practices, and database management, offering a blueprint for scaling in AWS or GCP. Building it from scratch with a cloud-centric vision helped identify and address potential issues early on. The main challenge was handling different types of PDFs, balancing cost-efficiency, speed, and security.&lt;/p&gt;

&lt;p&gt;Future improvements for the 🤖 include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding a 'Web URL Input' for users to upload a file or provide a PDF URL&lt;/li&gt;
&lt;li&gt;Implementing PDF metadata extraction and storing it in a separate MongoDB Atlas Database, allowing users to track previously vectorized PDFs and ask questions about them&lt;/li&gt;
&lt;li&gt;Introducing a dropdown box in the UI to view available PDF file names&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>pdftextextraction</category>
      <category>python</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Tagging Made Easy: Automating Resource Labeling in AWS with Lambda and Resource Explorer</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Sun, 21 Jan 2024 01:33:11 +0000</pubDate>
      <link>https://forem.com/amandaruzza/tagging-made-easy-automating-resource-labeling-in-aws-with-lambda-and-resource-explorer-1e2l</link>
      <guid>https://forem.com/amandaruzza/tagging-made-easy-automating-resource-labeling-in-aws-with-lambda-and-resource-explorer-1e2l</guid>
      <description>&lt;p&gt;Tags and labels are one of the most important elements for resource inventory, compliance and cost savings.&lt;/p&gt;

&lt;p&gt;Managing tags in a growing AWS infrastructure can be cumbersome and time-consuming. Manually tagging resources often leads to inconsistencies and inaccurate data, hindering cost optimization, resource management, and compliance efforts. Adding new tags across all existing resources in a specific account, especially across multiple regions, can be a daunting task. &lt;/p&gt;

&lt;p&gt;All Cloud providers recommend tagging/labeling resources at creation time; however, there are situations in which an organization might decide to add new tags to its existing resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s a Python script that leverages the power of AWS’ &lt;a href="https://aws.amazon.com/resourceexplorer/"&gt;“Resource Explorer”&lt;/a&gt; - a powerful search and discovery service - and &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/resource-explorer-2.html"&gt;Boto3&lt;/a&gt; - AWS’ SDK for Python - to automate the process of adding missing tags, saving time and ensuring consistency. This solution can be easily implemented by SREs, DevOps, and/or Cloud Engineers looking to improve resource organization, inventory, and cost management.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Scenario:&lt;/strong&gt;&lt;/u&gt;  
&lt;/h2&gt;

&lt;p&gt;In this fictional example, imagine that a University is restructuring its infrastructure and its &lt;em&gt;‘tagging strategy.’&lt;/em&gt; The University's CTO has decided to re-tag an entire existing AWS account for the “College of Liberal Arts and the Philosophy Department.” This Python script searches for two tags:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;in the two AWS Regions in which the &lt;em&gt;College of Liberal Arts and the Philosophy Department&lt;/em&gt; resources are located:&lt;br&gt;
&lt;em&gt;’us-east-1’&lt;/em&gt; and &lt;em&gt;‘us-east-2’&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This script is currently written as an AWS Lambda function, with the &lt;code&gt;Lambda Handler&lt;/code&gt; set up; however, it could easily be refactored into a simple &lt;code&gt;Boto3&lt;/code&gt; script executed from a local machine.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Code Breakdown:&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;This section describes the dependencies and functions that make up the AWS Auto Tagging Solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies:&lt;/strong&gt;&lt;br&gt;
These are the necessary dependencies for the Lambda Function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from botocore.exceptions  import ClientError
from botocore.config import Config
import logging
import json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Logger setup:&lt;/strong&gt;&lt;br&gt;
As a good habit, I added a Logger for possible debugging, either in testing or production stages. This also provides data for further analysis of the script's performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lambda Handler:&lt;/strong&gt;&lt;br&gt;
The core of this script lies in the &lt;code&gt;Resource Explorer Boto3 Client&lt;/code&gt;, which is added to the &lt;code&gt;Lambda Handler&lt;/code&gt;. ‘Resource Explorer’ searches through all the resources in the AWS account and inspects their tags.&lt;br&gt;
If the service does not find the &lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt; tag keys in the &lt;em&gt;’us-east-1’&lt;/em&gt; and &lt;em&gt;‘us-east-2’&lt;/em&gt; regions, it automatically adds them with the &lt;code&gt;apply_tags&lt;/code&gt; function - called inside &lt;code&gt;def lambda_handler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lambda_handler(event, context):
    logger.debug('Incoming Event')
    logger.debug(event)

    resource_explorer_client = boto3.client(
        'resource-explorer-2',
    )

    missing_philosophy_tag = get_resources_missing_tag(resource_explorer_client, 'philosophy')
    missing_liberal_arts_tag = get_resources_missing_tag(resource_explorer_client, 'liberal-arts')

    logger.info(f"# of Resources Missing 'philosophy' {missing_philosophy_tag['Count']['TotalResources']} - Complete List? {missing_philosophy_tag['Count']['Complete']}")
    logger.info(f"# of Resources Missing 'liberal-arts' {missing_liberal_arts_tag['Count']['TotalResources']} - Complete List? {missing_liberal_arts_tag['Count']['Complete']}")

    map_philosophy_arns=[]
    for this_resource in missing_philosophy_tag['Resources']:
        map_philosophy_arns.append(this_resource['Arn'])
    logger.info(f"The Map Philosophy ARN:{map_philosophy_arns}")

    map_liberal_arts_arns=[]
    for this_resource in missing_liberal_arts_tag['Resources']:
        map_liberal_arts_arns.append(this_resource['Arn'])
    logger.info(f"The Map Liberal Arts ARN:{map_liberal_arts_arns}")

    apply_tags(map_philosophy_arns, {'philosophy': 'phil-dept-server'})

    apply_tags(map_liberal_arts_arns, {'liberal-arts': 'la-dept-server'})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Missing Tags:&lt;/strong&gt;&lt;br&gt;
'Resource Explorer' looks for the specified tags in all the resources available in the current account in which the solution is being deployed. Since 'Resource Explorer' will only provide 100 results (resources) at a time, I added a paginator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_resources_missing_tag(client, tag_name): 
    return (
        client.get_paginator('search')
            .paginate(QueryString=f"-tag.key:{tag_name}")
            .build_full_result()
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Applying Tags:&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ResourceGroupsTaggingAPI&lt;/code&gt; efficiently applies the identified missing tags, ensuring consistency and optimizing resource management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def apply_tags(list_of_resources, tag_map):
    resources_by_region = return_resources_by_region(list_of_resources)
    counter = 0
    for this_resource in list_of_resources:
        counter += 1
        logger.info(f"{counter}) Add tag '{tag_map.keys()}' to '{this_resource}'")
    # iterates over regions and applies the tag map in each one:
    regions = ['us-east-1', 'us-east-2']
    for region in regions:
        tagging_client = boto3.client('resourcegroupstaggingapi', region_name=region)
        if resources_by_region[region]:
            tagging_client.tag_resources(
                ResourceARNList=resources_by_region[region],
                Tags=tag_map
            )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Returning a list of tagged resources:&lt;/strong&gt;&lt;br&gt;
The final portion of the script returns a dictionary with a list of all the newly tagged resources, separated by their respective regions - &lt;em&gt;'us-east-1', 'us-east-2'&lt;/em&gt; - and writes these results into a &lt;em&gt;JSON file&lt;/em&gt;. For enhanced tracking and auditing capabilities, a future implementation could leverage &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html"&gt;DynamoDB&lt;/a&gt; to store the &lt;em&gt;JSON file&lt;/em&gt;, providing a detailed history of tagging changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def return_resources_by_region(resources_for_all_regions):
    resources_by_region = dict()
    regions = ['us-east-1', 'us-east-2']
    for region_name in regions:
        resources_by_region[region_name] = [arn for arn in resources_for_all_regions if region_name in arn]

        logger.info(f"The {region_name} resources are: \n {resources_by_region[region_name]} \n") 

    return resources_by_region

def format_in_json(response):
    return json.dumps(response, indent=4, sort_keys=True, default=str)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
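&lt;p&gt;As a quick sanity check, here is how &lt;code&gt;return_resources_by_region&lt;/code&gt; behaves with a couple of made-up ARNs (the function is repeated without its logging line so the snippet is self-contained):&lt;/p&gt;

```python
def return_resources_by_region(resources_for_all_regions):
    # Group ARNs by the region name that appears inside each ARN string
    resources_by_region = dict()
    regions = ['us-east-1', 'us-east-2']
    for region_name in regions:
        resources_by_region[region_name] = [
            arn for arn in resources_for_all_regions if region_name in arn
        ]
    return resources_by_region

arns = [
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc",
    "arn:aws:rds:us-east-2:123456789012:db:phil-dept-db",
]
grouped = return_resources_by_region(arns)
```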



&lt;p&gt;This Lambda function can be scheduled to run once a week as a simple &lt;em&gt;auto-tagging solution&lt;/em&gt;, or be triggered by an event, such as the creation of a new resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag enforcement:&lt;/strong&gt;&lt;br&gt;
A long-term approach to an account-level 'Tagging' requirement would be to implement an &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html"&gt;SCP&lt;/a&gt; (Service Control Policy) once this Lambda function has applied the new tags to the current infrastructure. This SCP would deny the creation of any new resources that don't contain the tags required by the Organization - in this example: &lt;em&gt;‘philosophy’&lt;/em&gt; and &lt;em&gt;‘liberal-arts’&lt;/em&gt;.&lt;/p&gt;
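&lt;p&gt;A hypothetical SCP statement along these lines would deny launching EC2 instances that lack the &lt;em&gt;‘philosophy’&lt;/em&gt; tag; the action list and condition keys would need to be broadened to cover the Organization's other resource types:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedInstanceCreation",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/philosophy": "true" }
      }
    }
  ]
}
```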

&lt;p&gt;Click &lt;a href="https://github.com/Amanda-Ruzza/aws-lambda-auto-tagger.git"&gt;here&lt;/a&gt; for the full script available on GitHub, plus installation instructions for the required &lt;code&gt;Boto3&lt;/code&gt; and &lt;code&gt;Botocore&lt;/code&gt; packages.&lt;/p&gt;

</description>
      <category>awslambda</category>
      <category>python</category>
      <category>resourceexplorer</category>
      <category>resourcetags</category>
    </item>
    <item>
      <title>My Approach to Passing the Professional Cloud Developer Exam (First Try!)</title>
      <dc:creator>Amanda Ruzza</dc:creator>
      <pubDate>Tue, 02 Jan 2024 21:15:18 +0000</pubDate>
      <link>https://forem.com/amandaruzza/my-approach-to-passing-the-professional-cloud-developer-exam-first-try-4pel</link>
      <guid>https://forem.com/amandaruzza/my-approach-to-passing-the-professional-cloud-developer-exam-first-try-4pel</guid>
      <description>&lt;p&gt;I decided to study for this exam as a way for me to improve my overall knowledge in GCP, while looking at different application development approaches for my work and personal projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exam Content:
&lt;/h2&gt;

&lt;p&gt;The 60 questions aligned with the official exam guide and were heavily focused on approaches to deploying, modernizing, or troubleshooting applications following Google’s principles of SRE, DevOps, and Security. Here are some focus topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes and GKE&lt;/li&gt;
&lt;li&gt;Serverless solutions with Pub/Sub, Cloud Functions, and Cloud Run&lt;/li&gt;
&lt;li&gt;Data Modeling for different databases&lt;/li&gt;
&lt;li&gt;Authorization and authentication: IAM, Workload Identity, JWT tokens, etc.&lt;/li&gt;
&lt;li&gt;Integrations with Operations Suite&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;p&gt;As part of my study methodology, I like to use many resources, aiming to understand things from different points of view and make sure that I’m not just ‘memorizing’ the audio from a lecture or the text from the documentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ranga Karanam's PCD course on &lt;a href="https://www.udemy.com/course/google-cloud-certified-professional-cloud-developer/"&gt;Udemy&lt;/a&gt;&lt;br&gt;
Ranga is an excellent teacher. However, certain topics, such as Data Modeling and GKE/Kubernetes, were a bit difficult to grasp due to the lack of visual diagrams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alex Levkovich's practice tests on &lt;a href="https://www.udemy.com/course/full-practice-exams-google-professional-cloud-developer/"&gt;Udemy&lt;/a&gt;&lt;br&gt;
His practice tests were up to date with the current exam content, and I also appreciated being able to write him questions about some of the quizzes; he always wrote back with in-depth explanations of the ‘why’ behind certain solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some of the Google Cloud Skills Boost labs/courses.&lt;br&gt;
Many of the labs on the PCD learning path were outdated and buggy, so I wasn’t able to finish them; the GKE labs, however, were great.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCP guided labs from the official documentation for Cloud Functions, Cloud Build, Firestore, Pub/Sub and Spanner&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even though certain things were outdated, I used the lectures and labs from the &lt;a href="https://www.pluralsight.com/cloud-guru?exp=3"&gt;A Cloud Guru&lt;/a&gt; PCD course&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The following official GCP playlists and videos from YouTube&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=ZLI8sknDNWw"&gt;Modern CI/CD on GCP&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqLOiLXJDlN-wKd0g7hf_9vC&amp;amp;si=fqqv4tS5SHHUIdHS"&gt;Engineering for Reliability&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqL3xfXz5xJvwzFW_tlQB_GB&amp;amp;si=rWFG9yXuXEnqXWGE"&gt;Kubernetes Best Practices&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtube.com/playlist?list=PLIivdWyY5sqKJx6FwJMRcsnFIkkNFtsX9&amp;amp;si=MsSFg7XvlK_StJyD"&gt;Beyond your GCP Bill&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nana’s incredible &lt;a href="https://youtube.com/playlist?list=PLy7NrYWoggjziYQIDorlXjTvvwweTYoNC&amp;amp;si=R0tacN-Yf_h9Ch3g"&gt;Kubernetes tutorials&lt;/a&gt; on YouTube&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anton Putra’s YouTube tutorial on &lt;a href="https://youtu.be/lxc4EXZOOvE?si=jEJ2lkU272wZnzCz"&gt;Kubernetes deployment strategies&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Study Approach:
&lt;/h2&gt;

&lt;p&gt;I strive to be efficient with my studies. As with my previous certification preparation, my main goal wasn’t to ‘pass the exam so I could have a badge.’ I wanted to make sure that I really learned what was featured in the test, and that I could use these concepts and newly acquired knowledge to become a better developer. I went beyond lectures and practice tests, doing hands-on labs and self-created exercises on areas where I felt I needed improvement.&lt;/p&gt;

&lt;p&gt;With that in mind, here was my methodology:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Daily Minimum:&lt;/strong&gt; I carved out at least 1 hour every day, even if it meant dawn sessions or airplane lectures. I found consistency trumps cramming when it comes to certification prep! &lt;em&gt;[ yes, I was attending AWS re:Invent while preparing for this exam. That wasn’t an excuse, though; I still made sure to put in at least 1 hour every day, early in the morning, before heading to the incredible re:Invent sessions! ]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One Take &amp;amp; Reflection:&lt;/strong&gt; I gave each of Ranga's lectures a single watch, no matter how complicated the concepts felt. I'd then pause and ask myself: "What can I use from this in a real project?" This kept me learning actively, not just passively listening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Practice Tests:&lt;/strong&gt; After the initial chapters, I tackled the official GCP sample exam questions. Then I dove into the related docs, taking notes and recording myself explaining answers or picturing solutions. This active interrogation solidified my understanding and made me focus even more on the lectures from Ranga’s course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Hands-on Labs:&lt;/strong&gt; I went through labs and the extra resources mentioned here. It was fun to practice CLI commands, tweaking Kubernetes YAML, deploying Cloud Functions and Cloud Run apps, and modeling Spanner, Firestore, and Bigtable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Problem Solving:&lt;/strong&gt; I dug into practice tests from Udemy and A Cloud Guru, focusing on understanding the ‘problem’ - i.e., the practice question. I'd record voice memos analyzing the "why" and then do labs related to question-specific topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Walking and Visualization:&lt;/strong&gt; My daily walks and errands doubled as study time. I'd listen to my voice memos, replaying and refining my grasp of GCP solutions. By the end, I had 53 recordings (15-25 minutes each) of my own explanations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UfQGQvdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p0qnabif7g6qk2hpdiyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UfQGQvdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p0qnabif7g6qk2hpdiyf.png" alt="Study Tool for GCP Certification" width="800" height="1243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this study process I felt productive and excited about putting all this knowledge into practice in the real world, while also building the confidence to take the exam on December 28th, 2023. I passed it on my first attempt, and felt that this preparation journey was 100% worth my time. I’d totally recommend studying for this certification. My goal with this blog is to show you that, with a mix of different resources, a daily study routine, and practical application, this exam will improve your insight into GCP’s incredible possibilities for application development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HUsXc-Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bifqvh0hlsm30l2kpdms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HUsXc-Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bifqvh0hlsm30l2kpdms.png" alt="Google Professional Cloud Developer Certification" width="792" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>professionalclouddeveloper</category>
      <category>certificationstudytechniques</category>
      <category>gcpcertification</category>
      <category>googlecloud</category>
    </item>
  </channel>
</rss>
