Forem: Emmett McFarlane

How to use OpenAI’s new Structured Outputs API (with code)

Emmett McFarlane — Fri, 09 Aug 2024 04:04:31 +0000

OpenAI has recently released a game-changing feature for devs looking to build more reliable systems.

The new model, gpt-4o-2024-08-06, with Structured Outputs scores a perfect 100% on OpenAI's structured extraction evaluation. In comparison, gpt-4-0613 scores less than 40%. Source: OpenAI's blog post

This new feature ensures that the model's output will exactly match the JSON Schemas provided by developers, making it easier to build powerful assistants and extract structured data.

⚠️ If you want to use this technique with GPT-4o or other LLMs to extract clean structured data from any PDF, Word doc, or website, check out this open source extractor tool

How does it work?

Under the hood, OpenAI uses a technique called constrained sampling or constrained decoding. Instead of allowing the model to select any token from the vocabulary, it constrains the output to only tokens that are valid according to the supplied schema. This is done dynamically, so the model can still generate flexible and diverse responses while adhering to the specified structure.
The constrained decoding approach used by OpenAI involves dynamically determining which tokens are valid after each token is generated, based on the previously generated tokens and the rules within the context-free grammar (CFG) that indicates which tokens are valid next. This ensures that the model's output always adheres to the specified schema.
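To make the idea concrete, here is a minimal toy sketch of constrained decoding. It is not OpenAI's implementation: the tiny vocabulary, the hand-written ALLOWED_NEXT table (standing in for a real grammar), and the random fake_logits scorer are all made up for illustration. At each step, tokens the grammar does not allow are masked out before the highest-scoring remaining token is picked.

```python
import json
import random

# Toy vocabulary; a real tokenizer has a far larger one.
VOCAB = ["{", "}", '"name"', '"age"', ":", ",", '"John"', '"Alice"', "30", "25", "hello"]

# Hand-written stand-in for a grammar describing {"name": <string>, "age": <int>}.
# Maps a state (number of tokens emitted so far) to the set of valid next tokens.
ALLOWED_NEXT = {
    0: {"{"},
    1: {'"name"'},
    2: {":"},
    3: {'"John"', '"Alice"'},
    4: {","},
    5: {'"age"'},
    6: {":"},
    7: {"30", "25"},
    8: {"}"},
}

def fake_logits(rng):
    """Stand-in for a language model: a random score for every token."""
    return {tok: rng.random() for tok in VOCAB}

def constrained_decode(seed=0):
    rng = random.Random(seed)
    output = []
    while len(output) in ALLOWED_NEXT:
        logits = fake_logits(rng)
        valid = ALLOWED_NEXT[len(output)]
        # Mask: only tokens the grammar allows at this step may be chosen.
        masked = {tok: score for tok, score in logits.items() if tok in valid}
        output.append(max(masked, key=masked.get))
    return "".join(output)

result = constrained_decode()
parsed = json.loads(result)  # always parses, whatever the random scores were
print(parsed)
```

Even though the "model" here scores tokens at random, the output always parses as JSON with the expected keys, because invalid tokens such as hello can never be selected.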

How to use it

I'm more of a "learn-by-example" man, so I'll give the simplest possible example I could come up with.
In this example, we define a Person class using Pydantic, which has two fields: name and age.

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

We then create an OpenAI client and use the beta.chat.completions.parse method to send our request. The messages parameter includes a system message instructing the model to extract the names and ages, and a user message with the text we want to extract data from. The tools parameter includes our Person class, specifying that we want the model to extract data in this format.

import openai  # needed for openai.pydantic_function_tool below
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "Extract the names and ages of the people mentioned in the following text."
        },
        {
            "role": "user",
            "content": "John is 30 years old and his sister Alice is 25."
        }
    ],
    tools=[
        openai.pydantic_function_tool(Person)
    ]
)

print(completion.choices[0].message.tool_calls[0].function.parsed_arguments)

completion.choices[0].message.tool_calls[0].function.parsed_arguments holds the arguments returned by the model, already parsed to match the structure defined by your Pydantic class. In this case, since we defined a Person class with name and age fields, the parsed arguments will also have those fields. Note that in the SDK versions I'm aware of, this value is a Person instance rather than a plain dictionary, so the printed form may look like name='John' age=30.
Here's an example of the fields the parsed arguments might contain:

{
    'name': 'John',
    'age': 30
}

Now, let's say you want to use these parsed arguments in your code instead of just printing them. Here's how you could create a Person object from the parsed arguments:

# Grab the parsed arguments from the first tool call
parsed_arguments = completion.choices[0].message.tool_calls[0].function.parsed_arguments
# Build a validated Person; model_validate (Pydantic v2) accepts a dict or an existing Person
person = Person.model_validate(parsed_arguments)
print(person.name)  # Output: John
print(person.age)   # Output: 30

That's pretty neat!

What models can I use?

Structured Outputs with function calling is supported on GPT-4 and later models, while Structured Outputs with response formats is available on gpt-4o-mini and gpt-4o-2024-08-06.

Limitations and Restrictions

There are a few limitations to keep in mind when using Structured Outputs:

Structured Outputs only supports a subset of JSON Schema, as detailed in OpenAI's documentation.
The first API response with a new schema will incur additional latency due to the preprocessing of the schema. Subsequent responses will be faster with no latency penalty.
The model may refuse unsafe requests or stop generating before completing the schema if it reaches max_tokens or another stop condition.
Structured Outputs doesn't prevent all kinds of model mistakes within the values of the JSON object.
Structured Outputs is not compatible with parallel function calls.
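Since a response can be a refusal or get cut off, it is worth checking for both before trusting the parsed output. Below is a minimal sketch of such a check. The helper name check_choice is my own, and it runs against stand-in objects rather than a live API call; I am assuming the response exposes finish_reason on the choice and refusal on the message, as in OpenAI's Chat Completions responses.

```python
from types import SimpleNamespace

def check_choice(choice):
    """Return 'refusal', 'truncated', or 'ok' for a chat completion choice.

    Hypothetical helper: assumes `choice.message.refusal` and
    `choice.finish_reason`, as in OpenAI's Chat Completions responses.
    """
    if getattr(choice.message, "refusal", None):
        return "refusal"      # the model declined, so there is no schema-conforming output
    if choice.finish_reason == "length":
        return "truncated"    # max_tokens was hit, so the JSON may be incomplete
    return "ok"

# Stand-in objects that mimic the shape of a real response choice
ok = SimpleNamespace(finish_reason="stop", message=SimpleNamespace(refusal=None))
cut = SimpleNamespace(finish_reason="length", message=SimpleNamespace(refusal=None))
refused = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(refusal="I can't help with that."),
)

statuses = [check_choice(c) for c in (ok, cut, refused)]
print(statuses)
```

With a real request you would pass completion.choices[0] to the helper and only read the parsed output when it returns "ok".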

Extracting Data from Tricky PDFs with Google Gemini in 10 lines of Python

Emmett McFarlane — Thu, 18 Jul 2024 14:20:52 +0000

In this guide, I'll show you how to extract structured data from PDFs using vision-language models (VLMs) like Gemini Flash or GPT-4o.

Gemini, Google's latest series of vision-language models, has shown state-of-the-art performance in text and image understanding. Its improved multimodal capability and long context window make it particularly useful for processing visually complex PDF data that traditional extraction models struggle with, such as figures, charts, tables, and diagrams.

By doing so, you can easily build your own data extraction tool for visual file and web extraction. Here's how:


Setting Up Your Environment

Before we dive into extraction, let's set up our development environment. This guide assumes you have Python installed on your system. If not, download and install it from https://www.python.org/downloads/

⚠️ Note that, if you don't want to use Python, you can use the cloud platform at thepi.pe to upload your files and download your result as a CSV without writing any code.

Install Required Libraries

Open your terminal or command prompt and run the following commands:



pip install git+https://github.com/emcf/thepipe
pip install pandas

For those new to Python, pip is the package installer for Python, and these commands will download and install the necessary libraries.

Set Up Your API Key

To use thepipe, you need an API key.

Disclaimer: While thepi.pe is a free and open source tool, the API has a cost, roughly $0.00002 per token. If you want to avoid such costs, check out the local setup instructions on GitHub. Note that you will still have to pay your LLM provider of choice.

Here's how to get and set it up:

Visit https://thepi.pe/platform/
Create an account or log in
Find your API key in the settings page

Now, you need to set this as an environment variable. The process varies depending on your operating system:

Copy the API key from the settings menu on thepi.pe Platform

For Windows:

Search for "Environment Variables" in the Start menu
Click "Edit the system environment variables"
Click the "Environment Variables" button
Under "User variables", click "New"
Set the variable name as THEPIPE_API_KEY and the value as your API key
Click "OK" to save

For macOS and Linux:
Open your terminal and add this line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):



export THEPIPE_API_KEY=your_api_key_here

Then, reload your configuration:



source ~/.bashrc # or ~/.zshrc
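Before going further, it can help to confirm from Python that the variable is actually visible. This is a small sketch of my own, not part of thepipe: get_api_key is a hypothetical helper, and the demo below uses a stand-in dictionary instead of your real environment.

```python
import os

def get_api_key(env=os.environ):
    """Return the key, or raise a readable error if it is missing or empty."""
    key = env.get("THEPIPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("THEPIPE_API_KEY is not set; check your environment variables.")
    return key

# Demo with a stand-in environment rather than your real one
demo_key = get_api_key({"THEPIPE_API_KEY": "your_api_key_here"})
print(demo_key)
```

Calling get_api_key() with no arguments checks your real environment. If it raises, open a new terminal and try again, since environment changes usually only apply to sessions started afterwards.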

Defining Your Extraction Schema

The key to successful extraction is defining a clear schema for the data you want to pull out. Let's say we're extracting data from a Bill of Quantity document:

An example of a page from the Bill of Quantity document. The data on each page is independent of the other pages, so we do our extraction "per page". There are multiple pieces of data to extract per page, so we set multiple extractions to True

Looking at the column names, we might want to extract a schema like this:



schema = {
  "item": "string",
  "unit": "string",
  "quantity": "int",
}

You can modify the schema to your liking on thepi.pe Platform. Clicking "View Schema" will give you a schema you can copy and paste for use with the Python API
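Because the schema is a plain dictionary of field names to type names, it is easy to sanity-check extracted rows against it. The sketch below is my own illustration, not part of thepipe: the TYPE_MAP table and row_matches_schema helper are hypothetical, the sample rows are made up, and I'm assuming each extracted row is a dictionary keyed by the schema's field names.

```python
# Map the schema's type names to Python types (assumed mapping, for illustration)
TYPE_MAP = {"string": str, "int": int, "float": float, "bool": bool}

schema = {
    "item": "string",
    "unit": "string",
    "quantity": "int",
}

def row_matches_schema(row, schema):
    """True if the row has every schema field with a value of the expected type."""
    for field, type_name in schema.items():
        if field not in row:
            return False
        expected = TYPE_MAP[type_name]
        value = row[field]
        # bool is a subclass of int in Python, so reject it for "int" fields
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True

# Made-up sample rows
good_row = {"item": "Concrete", "unit": "m3", "quantity": 12}
bad_row = {"item": "Rebar", "unit": "kg", "quantity": "twelve"}

good_ok = row_matches_schema(good_row, schema)
bad_ok = row_matches_schema(bad_row, schema)
print(good_ok, bad_ok)
```

A check like this is a cheap way to catch rows where the model returned a quantity as text instead of a number.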

Extracting Data from PDFs

Now, let's use extract_from_file to pull data from a PDF:



from thepipe.extract import extract_from_file
results = extract_from_file(
  file_path = "bill_of_quantity.pdf",
  schema = schema,
  ai_model = "google/gemini-flash-1.5b",
  chunking_method = "chunk_by_page",
  multiple_extractions = True
)

Here, we've set chunking_method="chunk_by_page" because we want to send each page to the AI model individually (the PDF is too large to feed all at once). We also set multiple_extractions=True because the PDF pages each contain multiple rows of data. Here's what a page from the PDF looks like:

The results of the extraction for the Bill of Quantity PDF as viewed on thepi.pe Platform

Processing the Results

The extraction results are returned as a list of dictionaries. We can process these results to create a pandas DataFrame:



import pandas as pd
df = pd.DataFrame(results)
# Display the first few rows of the DataFrame
print(df.head())

This creates a DataFrame with all the extracted information, including textual content and descriptions of visual elements like figures and tables.

Exporting to Different Formats

Now that we have our data in a DataFrame, we can easily export it to various formats. Here are some options:

Exporting to Excel



df.to_excel("extracted_research_data.xlsx", index=False, sheet_name="Research Data")

This creates an Excel file named "extracted_research_data.xlsx" with a sheet named "Research Data". The index=False parameter prevents the DataFrame index from being included as a separate column.

Exporting to CSV

If you prefer a simpler format, you can export to CSV:



df.to_csv("extracted_research_data.csv", index=False)

This creates a CSV file that can be opened in Excel or any text editor.
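If you'd rather not depend on pandas for a simple export, Python's built-in csv module can write a list of dictionaries directly. Here's a minimal sketch: the rows are made up to match the schema from earlier, rows_to_csv is a hypothetical helper, and it writes to an in-memory buffer so you can inspect the text before saving it.

```python
import csv
import io

# Made-up rows in the shape of the schema from earlier
rows = [
    {"item": "Concrete", "unit": "m3", "quantity": 12},
    {"item": "Rebar", "unit": "kg", "quantity": 340},
]

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows, ["item", "unit", "quantity"])
print(csv_text)
```

To save the result, write csv_text to a file opened in text mode.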

Ending Notes

The key to successful extraction lies in defining a clear schema and utilizing the AI model's multimodal capabilities. As you become more comfortable with these techniques, you can explore more advanced features like custom chunking methods, custom extraction prompts, and integrating the extraction process into larger data pipelines.

Web Extraction with Vision-LLMs: SQL-Ready Data From Any URL with GPT-4o

Emmett McFarlane — Wed, 22 May 2024 22:18:40 +0000

Let's talk about GPT-4o

GPT-4o, OpenAI's latest vision-language model, excels at handling images compared to its predecessor, GPT-4. This improved multimodal capability makes it particularly useful for processing visually complex web data that traditional scrapers often struggle with. Whether it's extracting information from blogs, live feeds, news articles, or YouTube videos, using vision-language models provides a significant advantage over standard language models for unstructured extraction "in the wild".

In this guide, we'll explore how to scrape visual and text content from webpages and prepare the data for use with multimodal language models like GPT-4o. Then, we'll use the scraped visual and text prompt to extract specific structured data from the articles on the webpage. Lastly, we'll show how to validate the data and upsert it to a PostgreSQL database.

Set Up Your APIs

Ensure you have set the THEPIPE_API_KEY environment variable with your API key. If you don't have an API key, you can get one here or you can use The Pipe on your own server by following the local setup instructions in the documentation. Additionally, set your OPENAI_API_KEY for accessing GPT-4o. Don't have that either? Get it here.

For Windows users:

setx THEPIPE_API_KEY "your_api_key"
setx OPENAI_API_KEY "your_openai_api_key"

For Mac and Linux users (add these lines to ~/.zshrc or ~/.bashrc so they persist across terminal sessions):

export THEPIPE_API_KEY="your_api_key"
export OPENAI_API_KEY="your_openai_api_key"

Restart your terminal for the changes to take effect. You'll now need to install The Pipe API. Open your terminal and run the following command:

pip install thepipe_api

Extract Content from a Webpage

Use The Pipe API to extract text and images from a webpage. Here's an example of how to do this:

from thepipe_api import thepipe

# Extract multimodal content from a webpage
webpage_content = thepipe.extract("https://www.bbc.com/")

The Pipe API can handle dynamic content that loads as you scroll (common on modern webpages) by scrolling automatically, ensuring that all relevant text and images are captured. This is particularly important for visual web data, which traditional scrapers often miss or misinterpret.

The result of the extraction will be a list of dictionaries, each containing the extracted content from the webpage as a hosted browser scrolls through the page. The content will include both text and images, making it suitable for use with multimodal language models like GPT-4o.
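Since this list is later concatenated with a prompt and passed to GPT-4o as messages, I'm assuming each dictionary follows OpenAI's chat message format, with a content list of text and image_url parts. The sample below is hand-written, not real output from The Pipe, and count_parts is a hypothetical helper for checking how much text and how many images you're about to send.

```python
# Hand-written stand-in for extracted content (assumed OpenAI chat message format)
sample_content = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Headline: Example story"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "More text further down the page"}],
    },
]

def count_parts(messages):
    """Count text characters and images across all message parts."""
    text_chars, images = 0, 0
    for message in messages:
        for part in message["content"]:
            if part["type"] == "text":
                text_chars += len(part["text"])
            elif part["type"] == "image_url":
                images += 1
    return text_chars, images

text_chars, images = count_parts(sample_content)
print(text_chars, images)
```

Running a count like this on the real extraction result gives a rough sense of how large the prompt will be before you send it.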

Prepare the Input for GPT-4o

Next, prepare the input prompt by combining the extracted content with a user query. This will help GPT-4o understand what you want to achieve with the extracted data. In our case, we will perform a structured data extraction task from the webpage, looking to grab the articles, their contents, and any images associated with them.

# Add the extraction instructions as a system message
query = [{
    "role": "system", 
    "content": [{
            "type": "text", 
            "text": """Please extract each article from the given webpage. Do this by returning a JSON object with the key, "articles", containing a list of all the articles in the given page. For each article, provide a JSON dictionary containing the following keys: 
            title (required string),
            extracted_plaintext (required string),
            topics (required list),
            sentiment (required string),
            language (required string),
            image_description (optional string)."""
    }]
}]

# Combine the content to create the input prompt for GPT-4o
messages = webpage_content + query

Send the Input to GPT-4o

With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your OPENAI_API_KEY set in your environment variables.

from openai import OpenAI
import json

# Initialize the OpenAI client
openai_client = OpenAI()

# Send the unstructured visuals and text to GPT-4o
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"},
    temperature=0,
)
# Extract the structured JSON
response = response.choices[0].message.content
response = json.loads(response)
# Print the result
print(response)

GPT-4o will process the input prompt and return a structured JSON object containing the extracted articles, their titles, extracted plaintext, topics, sentiment, language, and image descriptions (if available). This structured data can be used for further analysis or processing (see images below for visualizations using the Beta version of The Pipe API portal).
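JSON mode guarantees syntactically valid JSON, not that the keys you asked for are present, so it is worth validating each article before storing it. Here's a minimal sketch: validate_article is a hypothetical helper, the sample response is made up, and the required keys mirror the prompt above.

```python
# Required keys and their expected Python types, mirroring the prompt
REQUIRED_KEYS = {
    "title": str,
    "extracted_plaintext": str,
    "topics": list,
    "sentiment": str,
    "language": str,
}

def validate_article(article):
    """Return a list of problems; an empty list means the article looks valid."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in article:
            problems.append(f"missing key: {key}")
        elif not isinstance(article[key], expected_type):
            problems.append(f"wrong type for {key}")
    # image_description is optional, but must be a string when present
    description = article.get("image_description")
    if description is not None and not isinstance(description, str):
        problems.append("wrong type for image_description")
    return problems

# Made-up response in the shape requested by the prompt
sample_response = {"articles": [
    {"title": "Example", "extracted_plaintext": "Body text", "topics": ["news"],
     "sentiment": "neutral", "language": "en"},
    {"title": "Broken", "topics": "news"},
]}

valid = [a for a in sample_response["articles"] if not validate_article(a)]
broken_problems = validate_article(sample_response["articles"][1])
print(len(valid), broken_problems)
```

Only the articles that come back with an empty problem list should move on to the database step.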

Putting the Data to Use

Now that you have the structured data, you can use it for various purposes, such as content analysis, summarization, sentiment analysis, or even generating new content based on the extracted information. The structured data can be easily pushed into a SQL table, a NoSQL database, or any other data storage system for further processing. For example, here, I am pushing the extracted data into a PostgreSQL database hosted on Supabase:

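import os
import supabase

# Read Supabase credentials from environment variables (variable names assumed)
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
sb_client = supabase.create_client(SUPABASE_URL, SUPABASE_KEY)

# The prompt asked for a top-level "articles" key
for article in response['articles']:
    try:
        entry = {
            'title': article['title'],
            'extracted_plaintext': article['extracted_plaintext'],
            'topics': article['topics'],
            'sentiment': article['sentiment'],
            'language': article['language'],
            'image_description': article.get('image_description')
        }
        sb_client.table('demo_db').insert(entry).execute()
    except Exception as e:
        # Skip malformed articles, but report them instead of failing silently
        print(f"Skipping article: {e}")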

Voilà. We just went from this:

https://bbc.com

to this:

Without having to parse any HTML, use Document Layout Analysis models, deal with complex CSS selectors, or make custom scrapers for dynamically loaded visuals.

Handling Token Limits

When dealing with vision models, you may hit token limits sooner than with text-only models. The Pipe API allows you to extract text-only content in extreme cases to avoid exceeding token limits. For more details, you can click here for a discussion on token limits for GPT-4-Vision.

# Extract text-only content from a webpage
webpage_content_text_only = thepipe.extract("https://example.com", text_only=True)
messages_text_only = webpage_content_text_only + query

Congratulations!

You've successfully scraped a webpage and extracted visually complicated unstructured data from it. This process can be extended to other types of content, such as PDFs, videos, and more, using The Pipe API. For more details on GPT-4o, check out the OpenAI announcement. If you're a developer, feel free to contribute to The Pipe on GitHub!

Happy coding! 🚀