<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Carolina Monte</title>
    <description>The latest articles on Forem by Carolina Monte (@carolinamonte).</description>
    <link>https://forem.com/carolinamonte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1329410%2F1b547875-8287-4110-b6db-017f38dcf382.jpeg</url>
      <title>Forem: Carolina Monte</title>
      <link>https://forem.com/carolinamonte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/carolinamonte"/>
    <language>en</language>
    <item>
      <title>Introduction to Semantic Search with Python and OpenAI API</title>
      <dc:creator>Carolina Monte</dc:creator>
      <pubDate>Wed, 01 May 2024 17:13:39 +0000</pubDate>
      <link>https://forem.com/carolinamonte/introduction-to-semantic-search-with-python-and-openai-api-efg</link>
      <guid>https://forem.com/carolinamonte/introduction-to-semantic-search-with-python-and-openai-api-efg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Semantic search&lt;/strong&gt; represents a significant leap over traditional keyword-based search methods. Instead of merely matching keywords or phrases, semantic search &lt;em&gt;comprehends&lt;/em&gt; the context and underlying meaning behind a query, providing more relevant and intelligent search results. This approach leverages advanced Natural Language Processing (NLP) techniques and is particularly useful in applications where understanding user intent and content relevance is crucial.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll dive into the basics of implementing semantic search using Python 🐍 and the OpenAI API. We'll focus on document embeddings to demonstrate how text can be converted into numerical vectors that machines can understand and process.&lt;/p&gt;

&lt;h2&gt;Understanding Document Embeddings&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfdsncvtnu280x6w41ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfdsncvtnu280x6w41ck.png" alt="Embedding" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Document embeddings are like the secret sauce powering semantic search technologies, transforming how we interact with text-based information. They're not just about finding exact word matches; they dive into the deeper layers of meaning within documents to help systems understand relevance in context.&lt;/p&gt;

&lt;p&gt;To grasp the concept of document embeddings, let's unpack the core elements that drive their functionality:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picture document embeddings as points scattered across a vast landscape, each dimension capturing a different facet of meaning. These embeddings live in high-dimensional spaces, often spanning hundreds or thousands of dimensions. Think of it like having hundreds of different lenses on the same text, each revealing a unique angle. For example, &lt;strong&gt;OpenAI's text-embedding-ada-002 model&lt;/strong&gt; produces embeddings in a &lt;strong&gt;1,536-dimensional space&lt;/strong&gt;. This multidimensional representation captures the richness and depth of textual content, including nuances that go beyond simple word matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Elements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the heart of every embedding vector are floating-point numbers – the building blocks of meaning. These numbers aren't just random; they're carefully crafted to encode the essence of the text. They capture everything from the words themselves to the grammar and syntax, weaving together a rich tapestry of meaning. &lt;strong&gt;By translating language into numbers&lt;/strong&gt;, document embeddings make it easier for computers to understand and compare textual information, almost like speaking their language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normalization is like the referee ensuring a fair game in the world of document embeddings. It's all about making sure that different embeddings play by the same rules. Through normalization, we standardize the length of embedding vectors, ensuring they're all on an even playing field. This consistency makes it easier to compare vectors using techniques like &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt;&lt;/strong&gt;, helping us measure how closely related different pieces of text are. By leveling the playing field, normalization helps semantic search algorithms find those subtle connections that might otherwise be missed.&lt;/p&gt;
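&lt;p&gt;To make this concrete, here's a tiny NumPy sketch using made-up 3-dimensional "embeddings" (real ones span hundreds of dimensions). Once vectors are normalized to unit length, cosine similarity reduces to a simple dot product:&lt;/p&gt;

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(normalize(a), normalize(b)))

# Toy 3-dimensional "embeddings" - real models use hundreds of dimensions.
cat = np.array([1.0, 0.9, 0.1])
kitten = np.array([0.9, 1.0, 0.2])
car = np.array([0.1, 0.2, 1.0])

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))     # much lower: unrelated meaning
```

&lt;p&gt;The vectors here are invented for illustration; real embeddings come from a model, as we'll see below.&lt;/p&gt;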

&lt;h2&gt;And now for the fun part: step-by-step implementation&lt;/h2&gt;

&lt;p&gt;How can we visualize all of these concepts and see them in action? 👀&lt;/p&gt;

&lt;p&gt;Let's walk through a simple Python script to implement a basic semantic search using embeddings from the OpenAI API.&lt;/p&gt;

&lt;p&gt;You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a valid OpenAI API Key&lt;br&gt;
&lt;em&gt;Please be advised that use of the OpenAI API may require payment for access to certain features and services. Make sure to review and understand the terms and conditions associated with each plan before proceeding.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;an environment where you can run Python (locally or on a platform such as Google Colab)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setting Up Your Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, ensure you have Python installed along with the &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; libraries. You can install both packages using pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Authenticating with the OpenAI API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;'your-openai-api-key'&lt;/code&gt; in the script with your actual API key.&lt;/p&gt;
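&lt;p&gt;Rather than hard-coding the key in the script, it's safer to read it from an environment variable so it never ends up in version control. A minimal sketch (the variable name &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; is a common convention, not a requirement):&lt;/p&gt;

```python
import os

# Read the key from the environment; fall back to a placeholder so a
# missing key fails loudly at request time rather than on import.
api_key = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")

def auth_headers(key):
    """Build the HTTP headers the OpenAI embeddings endpoint expects."""
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```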

&lt;p&gt;&lt;strong&gt;Step 3: Fetching Embeddings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how to define a function in Python to fetch embeddings from the OpenAI API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-openai-api-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.openai.com/v1/embeddings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Implementing the Search Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The search function calculates cosine similarities between the query embedding and each document embedding, returning the most relevant document. OpenAI embeddings are normalized to unit length, so the dot product of two embeddings already equals their cosine similarity; a plain &lt;code&gt;np.dot&lt;/code&gt; is all we need to rank the documents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="n"&gt;document_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
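&lt;p&gt;One caveat: the dot product here is only a true cosine similarity because the API returns unit-length vectors. If you ever swap in embeddings that aren't normalized, divide by the norms explicitly. A defensive variant (the helper name is my own, not part of any API):&lt;/p&gt;

```python
import numpy as np

def cosine_scores(document_embeddings, query_embedding):
    """Cosine similarity that also works for non-normalized embeddings."""
    doc_norms = np.linalg.norm(document_embeddings, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    return document_embeddings @ query_embedding / (doc_norms * query_norm)
```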



&lt;p&gt;&lt;strong&gt;Step 5: Running Your Semantic Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define some sample documents and a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dave Grohl likes to play drums.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Artificial intelligence is reshaping industries and societies.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The mitochondrion is known as the powerhouse of the cell.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mate is a very popular beverage in Argentina&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Czech Republic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s population in 2022 is 10.67 million.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What can I drink today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search Result:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, since we asked &lt;em&gt;"What can I drink today?"&lt;/em&gt;, the result will most likely be &lt;strong&gt;"Mate is a very popular beverage in Argentina"&lt;/strong&gt;. 🧉&lt;/p&gt;
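&lt;p&gt;If you'd rather see the top few matches than a single winner, sort the scores instead of taking one &lt;code&gt;argmax&lt;/code&gt;. A small sketch that ranks documents by score (the scores below are made up for illustration; in practice they'd come from the dot products computed in the search function):&lt;/p&gt;

```python
import numpy as np

def top_k(documents, scores, k=3):
    """Return the k documents with the highest similarity scores."""
    ranked = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in ranked]

# Made-up scores standing in for real embedding similarities.
docs = ["mate", "AI", "mitochondria"]
print(top_k(docs, np.array([0.91, 0.40, 0.35]), k=2))
```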

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Semantic search is transforming how we interact with data. By understanding the deeper meaning of text, systems can provide more accurate and useful results. This tutorial introduced the basics of building a semantic search engine with Python and the OpenAI API, a foundation you can extend into more complex and robust applications.&lt;br&gt;
By using embeddings and calculating similarities, developers can create systems that are not only efficient but also capable of understanding the nuances of human language, making technologies more intuitive and user-friendly.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>semanticsearch</category>
      <category>openai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
