<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Emmett McFarlane</title>
    <description>The latest articles on Forem by Emmett McFarlane (@emcf).</description>
    <link>https://forem.com/emcf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1517377%2F2c7e9085-5f0e-4158-9943-1ba4e177a283.png</url>
      <title>Forem: Emmett McFarlane</title>
      <link>https://forem.com/emcf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/emcf"/>
    <language>en</language>
    <item>
      <title>How to use OpenAI’s new Structured Outputs API (with code)</title>
      <dc:creator>Emmett McFarlane</dc:creator>
      <pubDate>Fri, 09 Aug 2024 04:04:31 +0000</pubDate>
      <link>https://forem.com/emcf/how-to-use-openais-new-structured-outputs-api-with-code-2enl</link>
      <guid>https://forem.com/emcf/how-to-use-openais-new-structured-outputs-api-with-code-2enl</guid>
      <description>&lt;h3&gt;
  
  
  OpenAI has recently released a game-changing feature for devs looking to build more reliable systems.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5la8gz1dby2hpv57rwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5la8gz1dby2hpv57rwm.png" alt="Image description" width="800" height="508"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The new model, gpt-4o-2024–08–06, with Structured Outputs scores a perfect 100% on OpenAI's structured extraction evaluation. In comparison, gpt-4–0613 scores less than 40%. Source: OpenAI's blog post&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This new feature ensures that the model's output will exactly match the JSON Schemas provided by developers, making it easier to build powerful assistants and extract structured data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ If you want to use this technique with GPT-4o or other LLMs to extract clean structured data from any PDF, Word doc, or website, check out &lt;a href="https://thepi.pe/" rel="noopener noreferrer"&gt;this open source extractor tool&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;Under the hood, OpenAI uses a technique called constrained sampling or constrained decoding. Instead of allowing the model to select any token from the vocabulary, it constrains the output to only tokens that are valid according to the supplied schema. This is done dynamically, so the model can still generate flexible and diverse responses while adhering to the specified structure.&lt;br&gt;
The constrained decoding approach used by OpenAI involves dynamically determining which tokens are valid after each token is generated, based on the previously generated tokens and the rules within the context-free grammar (CFG) that indicates which tokens are valid next. This ensures that the model's output always adheres to the specified schema.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;I'm more of a "learn-by-example" man, so I'll give the simplest possible example I could come up with.&lt;br&gt;
In this example, we define a Person class using Pydantic, which has two fields: name and age&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then create an OpenAI client and use the chat.completions.parse method to send our request. The messages parameter includes a system message instructing the model to extract the names and ages, and a user message with the text we want to extract data from. The tools parameter includes our Person class, specifying that we want the model to extract data in this format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-2024-08-06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract the names and ages of the people mentioned in the following text.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John is 30 years old and his sister Alice is 25.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pydantic_function_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsed_arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;completion.choices[0].message.tool_calls[0].function.parsed_arguments is a dictionary that contains the parsed arguments returned by the model, which will match the structure defined by your Pydantic class. In this case, since we defined a Person class with name and age fields, the parsed arguments will also have those fields.&lt;br&gt;
Here's an example of what the parsed arguments might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    'name': 'John',
    'age': 30
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's say you want to use these parsed arguments in your code instead of just printing them. Here's how you could create a new Person object using the parsed arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a new Person object using the parsed arguments
&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;parsed_arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Output: John
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Output: 30
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's pretty neat!&lt;/p&gt;

&lt;h2&gt;
  
  
  What models can I use?
&lt;/h2&gt;

&lt;p&gt;Structured ouputs are supported on all models that include gpt-4 and later, and response formats available on gpt-4o-mini and gpt-4o-2024–08–06.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and Restrictions
&lt;/h2&gt;

&lt;p&gt;There are a few limitations to keep in mind when using Structured Outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured Outputs only supports a subset of JSON Schema, as detailed in OpenAI's documentation.&lt;/li&gt;
&lt;li&gt;The first API response with a new schema will incur additional latency due to the preprocessing of the schema. Subsequent responses will be faster with no latency penalty.&lt;/li&gt;
&lt;li&gt;The model may refuse unsafe requests or stop generating before completing the schema if it reaches max_tokens or another stop condition.&lt;/li&gt;
&lt;li&gt;Structured Outputs doesn't prevent all kinds of model mistakes within the values of the JSON object.&lt;/li&gt;
&lt;li&gt;Structured Outputs is not compatible with parallel function calls.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>openai</category>
      <category>webdev</category>
      <category>chatgpt</category>
      <category>ai</category>
    </item>
    <item>
      <title>Extracting Data from Tricky PDFs with Google Gemini in 10 lines of Python</title>
      <dc:creator>Emmett McFarlane</dc:creator>
      <pubDate>Thu, 18 Jul 2024 14:20:52 +0000</pubDate>
      <link>https://forem.com/emcf/extracting-data-from-tricky-pdfs-with-google-gemini-in-10-lines-of-python-7ni</link>
      <guid>https://forem.com/emcf/extracting-data-from-tricky-pdfs-with-google-gemini-in-10-lines-of-python-7ni</guid>
      <description>&lt;p&gt;In this guide, I'll show you how to extract structured data from PDFs using vision-language models (VLMs) like Gemini Flash or GPT-4o.&lt;/p&gt;

&lt;p&gt;Gemini, Google's latest series of vision-language models, has shown state of the art performance in text and image understanding. This improved multimodal capability and long context window makes it particularly useful for processing visually complex PDF data that traditional extraction models struggle with, such as figures, charts, tables, and diagrams.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;By doing so, you can easily build your own data extraction tool for visual file and web extraction. Here's how:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gemini's long context window and multimodal capability makes it particularly useful for processing visually complex PDF data where traditional extraction models struggle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting Up Your Environment
&lt;/h2&gt;

&lt;p&gt;Before we dive into extraction, let's set up our development environment. This guide assumes you have Python installed on your system. If not, download and install it from &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;https://www.python.org/downloads/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚠️ Note that, if you don't want to use Python, you can use the cloud platform at thepi.pe to upload your files and download your result as a CSV without writing any code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Required Libraries
&lt;/h3&gt;

&lt;p&gt;Open your terminal or command prompt and run the following commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install git+https://github.com/emcf/thepipe
pip install pandas


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For those new to Python, pip is the package installer for Python, and these commands will download and install the necessary libraries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up Your API Key
&lt;/h3&gt;

&lt;p&gt;To use thepipe, you need an API key.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclaimer: While thepi.pe is a free an open source tool, the API has a cost, roughly $0.00002 per token. If you want to avoid such costs, check out the local setup instructions on GitHub. Note that you will still have to pay your LLM provider of choice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's how to get and set it up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://thepi.pe/platform/" rel="noopener noreferrer"&gt;https://thepi.pe/platform/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create an account or log in&lt;/li&gt;
&lt;li&gt;Find your API key in the settings page&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b8wk67vvo2apns48znb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b8wk67vvo2apns48znb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you need to set this as an environment variable. The process varies depending on your operating system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the API key from the settings menu on thepi.pe Platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Windows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for "Environment Variables" in the Start menu&lt;/li&gt;
&lt;li&gt;Click "Edit the system environment variables"&lt;/li&gt;
&lt;li&gt;Click the "Environment Variables" button&lt;/li&gt;
&lt;li&gt;Under "User variables", click "New"&lt;/li&gt;
&lt;li&gt;Set the variable name as THEPIPE_API_KEY and the value as your API key&lt;/li&gt;
&lt;li&gt;Click "OK" to save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For macOS and Linux:&lt;br&gt;
Open your terminal and add this line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export THEPIPE_API_KEY=your_api_key_here


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, reload your configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

source ~/.bashrc # or ~/.zshrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Defining Your Extraction Schema
&lt;/h2&gt;

&lt;p&gt;The key to successful extraction is defining a clear schema for the data you want to pull out. Let's say we're extracting data from a Bill of Quantity document:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulrg5gpm0gsbp5dbp6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulrg5gpm0gsbp5dbp6x.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An example of a page from the Bill of Quantity document. The data on each page is independent of the other pages, so we do our extraction "per page". There are multiple pieces of data to extract per page, so we set multiple extractions to True&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking at the column names, we might want to extract a schema like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can modify the schema to your liking on thepi.pe Platform. Clicking "View Schema" will give you a schema you can copy and paste for use with the Python API&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8qsh3hwdgkdzj9mhr63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8qsh3hwdgkdzj9mhr63.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting Data from PDFs
&lt;/h2&gt;

&lt;p&gt;Now, let's use extract_from_file to pull data from a PDF:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;thepipe.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_from_file&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_from_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bill_of_quantity.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ai_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-flash-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;chunking_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_by_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, we've &lt;code&gt;chunking_method="chunk_by_page"&lt;/code&gt; because we want to send each page to the AI model individually (the PDF is too large to feed all at once). We also set &lt;code&gt;multiple_extractions=True&lt;/code&gt; because the PDF pages each contain multiple rows of data. Here's what a page from the PDF looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw74igyfw04zbd701cxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw74igyfw04zbd701cxk.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The results of the extraction for the Bill of Quantity PDF as viewed on thepi.pe Platform&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Processing the Results
&lt;/h2&gt;

&lt;p&gt;The extraction results are returned as a list of dictionaries. We can process these results to create a pandas DataFrame:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Display the first few rows of the DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a DataFrame with all the extracted information, including textual content and descriptions of visual elements like figures and tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exporting to Different Formats
&lt;/h2&gt;

&lt;p&gt;Now that we have our data in a DataFrame, we can easily export it to various formats. Here are some options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Exporting to Excel
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_excel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_research_data.xlsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sheet_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates an Excel file named "extracted_research_data.xlsx" with a sheet named "Research Data". The &lt;code&gt;index=False&lt;/code&gt; parameter prevents the DataFrame index from being included as a separate column.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exporting to CSV
&lt;/h3&gt;

&lt;p&gt;If you prefer a simpler format, you can export to CSV:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_research_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a CSV file that can be opened in Excel or any text editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ending Notes
&lt;/h2&gt;

&lt;p&gt;The key to successful extraction lies in defining a clear schema and utilizing the AI model's multimodal capabilities. As you become more comfortable with these techniques, you can explore more advanced features like custom chunking methods, custom extraction prompts, and integrating the extraction process into larger data pipelines.&lt;/p&gt;

</description>
      <category>pdf</category>
      <category>chatgpt</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>Web Extraction with Vision-LLMs: SQL-Ready Data From Any URL with GPT-4o</title>
      <dc:creator>Emmett McFarlane</dc:creator>
      <pubDate>Wed, 22 May 2024 22:18:40 +0000</pubDate>
      <link>https://forem.com/emcf/web-extraction-with-vision-llms-done-the-right-way-structured-data-from-any-url-with-gpt-4o-1al8</link>
      <guid>https://forem.com/emcf/web-extraction-with-vision-llms-done-the-right-way-structured-data-from-any-url-with-gpt-4o-1al8</guid>
      <description>&lt;h2&gt;
  
  
  Let's talk about GPT-4o
&lt;/h2&gt;

&lt;p&gt;GPT-4o, OpenAI's latest vision-language model, excels in handling images compared to its predecessor language model GPT-4. This improved multimodal capability makes it particularly useful for processing visually complex web data that traditional scrapers often struggle with. Whether it's extracting information from blogs, live feeds, news articles, youtube videos, etc, using vision-language models provides a significant advantage over standard language models for unstructured extraction "in the wild".&lt;/p&gt;

&lt;p&gt;In this guide, we'll explore how to scrape visual and text content from webpages and prepare the data for use with multimodal language models like GPT-4o. Then, we'll use the scraped visual and text prompt to extract specific structured data from the articles on the webpage. Lastly, we'll show how to validate the data and upsert it to a PostgreSQL database. &lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up Your APIs
&lt;/h3&gt;

&lt;p&gt;Ensure you have set the &lt;code&gt;THEPIPE_API_KEY&lt;/code&gt; environment variable with your API key. If you don't have an API key, you can get one &lt;a href="https://thepi.pe/pricing"&gt;here&lt;/a&gt; or you can use The Pipe on your own server by following the local setup instructions in the documentation. Additionally, set your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; for accessing GPT-4o. Don't have that either? Get it &lt;a href="https://openai.com/index/openai-api/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For Windows users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;setx THEPIPE_API_KEY &lt;span class="s2"&gt;"your_api_key"&lt;/span&gt;
setx OPENAI_API_KEY &lt;span class="s2"&gt;"your_openai_api_key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Mac users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;THEPIPE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_api_key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_openai_api_key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your terminal for the changes to take effect. You'll now need to install The Pipe API. Open your terminal and run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;thepipe_api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extract Content from a Webpage
&lt;/h3&gt;

&lt;p&gt;Use The Pipe API to extract text and images from a webpage. Here's an example of how to do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;thepipe_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;thepipe&lt;/span&gt;

&lt;span class="c1"&gt;# Extract multimodal content from a webpage
&lt;/span&gt;&lt;span class="n"&gt;webpage_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thepipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.bbc.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Pipe API can handle dynamic content that adapts as you scroll (as many modern webpages contain) and automatic scrolling, ensuring that all relevant text and images are captured. This is particularly important for visual web data, which traditional scrapers often miss or misinterpret.&lt;/p&gt;

&lt;p&gt;The result of the extraction will be a list of dictionaries, each containing the extracted content from the webpage as a hosted browser scrolls through the page. The content will include both text and images, making it suitable for use with multimodal language models like GPT-4o:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dwoqst6sc99k77aywt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dwoqst6sc99k77aywt.png" alt="Web Extraction" width="496" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare the Input for GPT-4o
&lt;/h3&gt;

&lt;p&gt;Next, prepare the input prompt by combining the extracted content with a user query. This will help GPT-4o understand what you want to achieve with the extracted data. In our case, we will perform a structured data extraction task from the webpage, looking to grab the articles, their contents, and any images associated with them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add a user query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Please extract each article from the given webpage. Do this by returning a JSON object with the key, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, containing a list of all the articles in the given page. For each article, provide a JSON dictionary containing the following keys: 
            title (required string),
            extracted_plaintext (required string),
            topics (required list),
            sentiment (required string),
            language (required string),
            image_description (optional string).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;span class="c1"&gt;# Combine the content to create the input prompt for GPT-4o
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webpage_content&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Send the Input to GPT-4o
&lt;/h3&gt;

&lt;p&gt;With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; set in your environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the OpenAI client
&lt;/span&gt;&lt;span class="n"&gt;openai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Send the unstructured visuals and text to GPT-4o
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Extract the structured JSON
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Print the result
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT-4o will process the input prompt and return a structured JSON object containing the extracted articles, their titles, extracted plaintext, topics, sentiment, language, and image descriptions (if available). This structured data can be used for further analysis or processing (see images below for visualizations using the Beta version of The Pipe API portal).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv730mp1jn4i479z0fcqe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv730mp1jn4i479z0fcqe.png" alt="GPT-4o Output" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gavsd1fvjl10il9k2qb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gavsd1fvjl10il9k2qb.gif" alt="GPT-4o json" width="764" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting the Data to Use
&lt;/h3&gt;

&lt;p&gt;Now that you have the structured data, you can use it for various purposes, such as content analysis, summarization, sentiment analysis, or even generating new content based on the extracted information. The structured data can be easily pushed into a SQL table, a NoSQL database, or any other data storage system for further processing. For example, here, I am pushing the extracted data into a PostgreSQL database hosted on Supabase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;supabase&lt;/span&gt;
&lt;span class="n"&gt;sb_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SUPABASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SUPABASE_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extracted_plaintext&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extracted_plaintext&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;topics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;topics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;sb_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demo_db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Viola. We just went from this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://bbc.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t0sugujr99iltc4fmwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t0sugujr99iltc4fmwx.png" alt="Supabase" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without having to parse any HTML, use Document Layout Analysis models, deal with complex CSS selectors, or make custom scrapers for dynamically loaded visuals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Token Limits
&lt;/h3&gt;

&lt;p&gt;When dealing with vision models, you might need to handle token limits sooner than text models. The Pipe API allows you to extract text-only content in extreme cases to avoid exceeding token limits. For more details, you can &lt;a href="https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6"&gt;click here for a discussion on token limits for GPT-4-Vision&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Extract text-only content from a webpage
&lt;/span&gt;&lt;span class="n"&gt;webpage_content_text_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thepipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;messages_text_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webpage_content_text_only&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Congratulations!
&lt;/h3&gt;

&lt;p&gt;You've successfully scraped a webpage and extracted visually complicated unstructured data from it. This process can be extended to other types of content, such as PDFs, videos, and more, using The Pipe API. For more details on GPT-4o, check out the &lt;a href="https://openai.com/blog/hello-gpt-4o"&gt;OpenAI announcement&lt;/a&gt;. If you're a developer, feel free to contribute to The Pipe on &lt;a href="https://github.com/emcf/thepipe"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Happy coding! 🚀&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>chatgpt</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
