Forem: Tho

PDF Query using OpenAI API

Tho — Thu, 29 Jun 2023 04:03:44 +0000

Context

The OpenAI API offers a new, programmable way to retrieve data from raw text using AI. We create a prompt by combining the raw text content with the data fields we wish to extract, then send this prompt to OpenAI as a string. With a well-crafted prompt we can also specify the desired response format, such as JSON or YAML, which makes the extraction much more convenient. For example:

Prompt:

Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s

Note that %s is replaced by the raw text content. Here is a sample output:

{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
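
As a minimal sketch of the two ends of this flow, the snippet below fills the template and parses the reply. The function names build_prompt and parse_fields are made up for illustration, the call to the OpenAI API itself is left out, and the parsing assumes the model really returned a bare JSON object, which is not guaranteed.

```python
import json

# Prompt template from the example above; %s is the placeholder for the raw text.
PROMPT_TEMPLATE = """Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
Return result in JSON format without any explanation.
The PO content is as follows:
%s"""


def build_prompt(raw_text):
    """Substitute the raw document text into the template."""
    return PROMPT_TEMPLATE % raw_text


def parse_fields(model_reply):
    """Parse the model's reply, assuming it is a bare JSON object."""
    return json.loads(model_reply.strip())
```

If the model wraps the JSON in extra prose, json.loads raises a ValueError, so real code would want to handle that case.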

Problem

There is a limit on the number of tokens that can be sent to the OpenAI API, and the same is true of other LLMs. Roughly speaking, a token is a word or a piece of a word. For example, the token limit for gpt-35-turbo is 4096 tokens. That means the above approach doesn't work for a large text file (e.g. a file with 100 pages, or even fewer).

Solution

Split the file into chunks that each fit within the token limit.
Utilize vector databases such as FAISS or Chroma to store these chunks.
For each data retrieval request, search the database for the related chunks and have the LLM summarize them to obtain the final result.

Refer to chat.py for the details.
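
To make the first and third steps concrete without any external service, here is a rough, standard-library-only sketch. It is not what chat.py does: word counts stand in for token counts, and a bag-of-words cosine similarity stands in for the embeddings and vector database (FAISS or Chroma). The final LLM summarization call is omitted, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into windows of at most max_words words, overlapping by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def bag_of_words(text):
    """Count lowercase alphanumeric terms; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, bag_of_words(c)), reverse=True)
    return ranked[:k]
```

The chunks returned by top_chunks would then be placed into the prompt in place of the whole file, which keeps the request under the token limit.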

Run sample code

Prerequisites:

Python 3.6+
Virtualenv

Open a terminal, move to the root directory and run the following commands:

python -m venv .env
source .env/bin/activate
pip install -r /path/to/requirements.txt

Note that:

the first command only needs to run once, to create the .env folder in the root directory of the sample code
the second command activates the virtual env and needs to run at the beginning of each test session
the last command installs the dependencies and needs to run once, unless the dependencies change

Next steps:

update chat.py to add your OpenAI API key

os.environ["OPENAI_API_KEY"] = "YOUR-OPEN-AI-API-KEY"

run the code and ask questions

python chat.py 
Enter a query: what is the name of the document?
The name of the document is the "Microsoft Partner Agreement".
Enter a query: what is the legal entity in this document?
The legal entity in this document is Microsoft.

Source code

pdf-query

Writing an Interpreter: Implementation

Tho — Sun, 04 Jun 2023 11:29:00 +0000

Part 1 can be found here

Lexer

The Lexer serves as the most basic element. Its primary function involves iterating through the characters present in the source code. It may combine certain characters to create a single token, and subsequently generate a token object with its associated type. This object is then added to the resulting list.

More in-depth information regarding the implementation can be found here
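
For a feel of what that loop looks like, here is a minimal sketch in Python rather than the language of the linked implementation. The token type names and the small keyword and symbol tables are assumptions chosen for illustration, not the ones the real lexer uses.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    type: str
    text: str


# Hypothetical token tables: just enough for small examples.
KEYWORDS = {"print": "PRINT", "if": "IF", "else": "ELSE"}
SYMBOLS = {"+": "PLUS", "-": "MINUS", "*": "ASTERISK", "/": "SLASH",
           "(": "LPAREN", ")": "RPAREN", "{": "LBRACE", "}": "RBRACE",
           ";": "SEMICOLON"}


def tokenize(source: str) -> List[Token]:
    """Walk the characters once, grouping digits and letters into single tokens."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(Token("INT", source[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            # Keywords get their own type; everything else is an identifier.
            tokens.append(Token(KEYWORDS.get(word, "IDENT"), word))
        elif ch in SYMBOLS:
            tokens.append(Token(SYMBOLS[ch], ch))
            i += 1
        else:
            raise SyntaxError("unexpected character %r at position %d" % (ch, i))
    return tokens
```

Calling tokenize("print 10 + 2;") yields five tokens with the types PRINT, INT, PLUS, INT and SEMICOLON.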

Parser

The parser is the most complex component in an interpreter. Before we delve into it, let's understand the difference between an expression and a statement:

Expression: An expression is a combination of values, variables, operators, and function calls that come together to give a single value. It's like a calculation that produces a result.
Statement: A statement is a complete section of code that carries out an action or a series of actions. It represents an instruction or a command that the program follows. Expressions can be part of a statement.

When we parse an expression, we generate an Abstract Syntax Tree (AST). This is the trickiest part because we need to handle operator precedence (which operations happen first) and the types of operators (like unary or ternary). It becomes even more challenging when expressions involve variables and function calls with their own arguments, which can also be expressions.

Some examples:

1 + 2;
1 + (2 - 3) * 3;
1 + (2 - a) * 3 + sum(2, 3 + 2 * b);
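
To show how precedence ends up encoded in the tree, here is a small hand-built AST for an expression like the ones above. This is a sketch in Python with made-up node classes (Number, Variable, Infix); the real implementation's node types may differ.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Number:
    value: int


@dataclass
class Variable:
    name: str


@dataclass
class Infix:
    left: "Expression"
    operator: str
    right: "Expression"


Expression = Union[Number, Variable, Infix]


def to_prefix(node: Expression) -> str:
    """Render the tree in prefix form so the nesting is visible."""
    if isinstance(node, Number):
        return str(node.value)
    if isinstance(node, Variable):
        return node.name
    return "(%s %s %s)" % (node.operator, to_prefix(node.left), to_prefix(node.right))


# 1 + (2 - a) * 3: `*` binds tighter than `+`, so the product is the right child of `+`.
tree = Infix(Number(1), "+", Infix(Infix(Number(2), "-", Variable("a")), "*", Number(3)))
print(to_prefix(tree))  # (+ 1 (* (- 2 a) 3))
```

The parentheses from the source do not appear as nodes: grouping is expressed purely by which node is a child of which.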

Pratt parser

The Pratt parser is a flexible and efficient method for parsing expressions, especially in languages with complex rules for operator precedence and associativity. It relies on Pratt parsing functions, which are specialized parsing functions associated with operators. Each operator in the language has its own parsing function that determines how expressions involving that operator should be parsed.

Here are some advantages of the Pratt parser:

Operator Precedence: Each operator is assigned a precedence level, ensuring that expressions are evaluated correctly according to the language's precedence rules.
Left-Associativity and Right-Associativity: The parser can handle both types of associativity. For example, in the expression 3 ^ 2 ^ 3, it correctly interprets it as 3 ^ (2 ^ 3) instead of (3 ^ 2) ^ 3.
Extensibility: It is easy to add new operators by specifying their precedence and implementing their respective parsing function.

Additional resources for more information:

Pratt parser: Implementation

The Pratt parser may sound complicated, but its implementation is actually quite straightforward. You can find the complete implementation here. Here's a summary of how it works:

First, we set up the parser functions for operators: prefixParsers for prefix operators and infixParsers for infix ones. Suffix operators are doable too, but the implementation doesn't support them for now.
We configure the precedence levels for operators using the precedences setup.
With the above configurations in place, we can now implement the parseExpression function, which parses expressions based on the defined operator rules.

That's the basic idea behind the Pratt parser. You can check out the provided link for a more detailed implementation.
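
The three pieces can be sketched in a few dozen lines. The following is a Python approximation, not the linked implementation: the names prefix_parsers, infix_parsers, precedences and parse_expression mirror the ones mentioned above, while the precedence values, the tuple-based AST and the regex tokenizer are assumptions made to keep the example self-contained. Function calls such as sum(2, 3) are not handled.

```python
import re

# Precedence levels: higher binds tighter. The concrete values are assumptions.
LOWEST, SUM, PRODUCT, PREFIX, POWER = 0, 1, 2, 3, 4
precedences = {"+": SUM, "-": SUM, "*": PRODUCT, "/": PRODUCT, "^": POWER}
RIGHT_ASSOCIATIVE = {"^"}


class Parser:
    def __init__(self, source):
        self.tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/^()]", source)
        self.pos = 0
        self.prefix_parsers = {"-": self.parse_negation, "(": self.parse_group}
        self.infix_parsers = {op: self.parse_infix for op in precedences}

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def parse_expression(self, precedence=LOWEST):
        token = self.advance()
        prefix = self.prefix_parsers.get(token)
        if prefix:
            left = prefix()
        elif token[0].isalnum() or token[0] == "_":
            left = token  # numbers and identifiers are leaves
        else:
            raise SyntaxError("unexpected token %r" % token)
        # Keep absorbing infix operators that bind tighter than our caller's operator.
        while self.peek() in self.infix_parsers and precedences[self.peek()] > precedence:
            operator = self.advance()
            left = self.infix_parsers[operator](left, operator)
        return left

    def parse_negation(self):
        return ("-", self.parse_expression(PREFIX))

    def parse_group(self):
        inner = self.parse_expression(LOWEST)
        if self.advance() != ")":
            raise SyntaxError("expected ')'")
        return inner

    def parse_infix(self, left, operator):
        precedence = precedences[operator]
        if operator in RIGHT_ASSOCIATIVE:
            precedence -= 1  # lets the same operator continue on the right-hand side
        return (operator, left, self.parse_expression(precedence))


def parse(source):
    parser = Parser(source)
    tree = parser.parse_expression()
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % parser.peek())
    return tree
```

With these settings, parse("1 + 2 * 3") returns ('+', '1', ('*', '2', '3')) and parse("3 ^ 2 ^ 3") returns ('^', '3', ('^', '2', '3')). Adding an operator means adding one entry to precedences and, if it needs special handling, one parsing function.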

Parse if-else statement

Now that we have implemented parseExpression, parsing functions for statements are relatively straightforward, as long as they have clear syntax. Let's take a look at how to parse an if-else statement. Assuming we have the following syntax in Backus-Naur Form (BNF):

<if-statement> ::= if ( <expression> ) { <statement> }
                | if ( <expression> ) { <statement> } else { <statement> }

The parse function for the if statement would look like this:

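private Statement parseIfStatement() {
    // The current token is `if`; it becomes the token of the If node.
    var retVal = new If(lexer.currentToken());
    // Expect `(`, then step onto the first token of the condition.
    assertPeekTokenThenNext(TokenType.LPAREN);
    lexer.nextToken();
    retVal.setCondition(parseExpression());
    // Expect `)` followed by `{`, which opens the if body.
    assertPeekTokenThenNext(TokenType.RPAREN);
    assertPeekTokenThenNext(TokenType.LBRACE);
    retVal.setIfBody(parseBlockStatement());
    // Optional else branch: consume `else`, then require `{` as the grammar does.
    if (peekTokenIs(TokenType.ELSE)) {
        lexer.nextToken();
        assertPeekTokenThenNext(TokenType.LBRACE);
        retVal.setElseBody(parseBlockStatement());
    }
    return retVal;
}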

The implementation closely reflects the defined syntax in BNF, making it easy to understand and follow.
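
The helpers peekTokenIs and assertPeekTokenThenNext are not shown in this post. As a guess at their behavior, here is a self-contained Python sketch of the same routine over a plain list of token strings. Everything in it is hypothetical: the TokenStream class, the one-token stand-in for parseExpression, and a parse_block that simply collects tokens up to the closing brace without handling nesting.

```python
class TokenStream:
    """A cursor over a list of token strings with one token of lookahead."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def current(self):
        return self.tokens[self.pos]

    def peek(self):
        nxt = self.pos + 1
        return self.tokens[nxt] if nxt < len(self.tokens) else None

    def next_token(self):
        self.pos += 1


def peek_token_is(stream, expected):
    return stream.peek() == expected


def assert_peek_token_then_next(stream, expected):
    """Fail unless the next token is `expected`; otherwise move onto it."""
    if not peek_token_is(stream, expected):
        raise SyntaxError("expected %r but found %r" % (expected, stream.peek()))
    stream.next_token()


def parse_block(stream):
    """Current token is `{`; collect raw tokens up to `}` (no nesting, for brevity)."""
    body = []
    while stream.peek() != "}":
        if stream.peek() is None:
            raise SyntaxError("missing '}'")
        stream.next_token()
        body.append(stream.current())
    stream.next_token()  # the current token is now `}`
    return body


def parse_if_statement(stream):
    """Current token is `if`; follows the BNF rule token by token."""
    assert_peek_token_then_next(stream, "(")
    stream.next_token()
    condition = stream.current()  # stand-in for parseExpression(): a one-token condition
    assert_peek_token_then_next(stream, ")")
    assert_peek_token_then_next(stream, "{")
    if_body = parse_block(stream)
    else_body = None
    if peek_token_is(stream, "else"):
        stream.next_token()
        assert_peek_token_then_next(stream, "{")
        else_body = parse_block(stream)
    return {"condition": condition, "if_body": if_body, "else_body": else_body}
```

For the tokens of `if ( a ) { x } else { y }` this returns a condition of "a", an if body of ["x"] and an else body of ["y"].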

Evaluator

Details can be found here

References:

Writing an Interpreter: High-Level Overview

Tho — Sun, 04 Jun 2023 11:28:49 +0000

Introduction

Creating a new language for domain-specific logic is sometimes unavoidable. Some use cases:

We want to develop a highly dynamic application that enables non-technical business users and system admins to customize the system and its workflow by writing code. The aim is to make the application highly configurable without the need for a complex user interface. Such a language is commonly referred to as a domain-specific language (DSL).
Code Generation and Automation: Building a custom language can facilitate code generation and automation for repetitive or complex tasks. By defining a language that closely matches the problem domain, developers can write code at a higher level of abstraction and generate boilerplate or repetitive code automatically.

This article aims to offer simple instructions on constructing an interpreter for a custom programming language.

Interpreter and compiler

An interpreter is a software component or program that directly reads and executes code without prior compilation. It is often used with scripting languages and high-level programming languages. Notable interpreted languages include Python, JavaScript, and PHP.

In contrast, a compiler takes code as input and generates either machine code that runs directly on the hardware or bytecode that requires a virtual machine for execution. Examples of compiled languages are C/C++, Java, C#, Go, and Rust.

Compiled languages tend to offer higher performance compared to interpreted ones because compilers perform extensive optimization during the compilation process.

Main components

Lexer, parser and evaluator

There are 3 main components in an interpreter:

Lexer (or Tokenizer): responsible for breaking down the source code into individual tokens. It scans the code character by character and groups them into meaningful units such as keywords, identifiers, operators, literals, and symbols.
Parser: takes the tokens produced by the lexer and analyzes their structure based on a defined grammar or syntax rules. It ensures that the code follows the correct syntax and creates a data structure called an Abstract Syntax Tree (AST) or parse tree, which represents the hierarchical structure of the code.
Evaluator: is responsible for executing the code and producing the desired results. It takes the parsed and analyzed code, such as an Abstract Syntax Tree (AST) or an intermediate representation, and performs the necessary computations and operations specified by the code.

While advanced interpreters may include additional components like the Semantic Analyzer and IR Generator, for the purpose of understanding how to write an interpreter, we won't discuss those specifics here.

Example

Given the code

print 10 + 2;

The components do the following:

The Lexer breaks down the code and represents it as a list of typed token objects. In this example, there are 4 types: print, int number (10 and 2), plus and semicolon.
The Parser iterates through the list of tokens to form statement objects. In this case, it's a PrintStatement whose expression is an infix AST.
The Evaluator iterates through the list of statements from the Parser, evaluates them and gives the final result. In this case, it prints 12 to the console.
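
The whole pipeline for this one statement fits in a short script. The sketch below is a deliberately tiny Python version that only understands print statements with integer sums. The function names and the tuple shapes used for tokens and statements are assumptions for illustration, not the structure of the real implementation.

```python
import re

TOKEN_SPEC = [("PRINT", r"print\b"), ("INT", r"\d+"), ("PLUS", r"\+"),
              ("SEMICOLON", r";"), ("SKIP", r"\s+")]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def lex(source):
    """Lexer: turn the source text into (type, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if not match:
            raise SyntaxError("unexpected character %r" % source[pos])
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_int(tokens, pos):
    if pos >= len(tokens) or tokens[pos][0] != "INT":
        raise SyntaxError("expected an integer")
    return int(tokens[pos][1]), pos + 1


def parse_sum(tokens, pos):
    """Parse `INT (+ INT)*` into nested ("+", left, right) tuples."""
    left, pos = parse_int(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        right, pos = parse_int(tokens, pos + 1)
        left = ("+", left, right)
    return left, pos


def parse(tokens):
    """Parser: build one ("print", expression) statement per `print ...;`."""
    statements, pos = [], 0
    while pos < len(tokens):
        if tokens[pos][0] != "PRINT":
            raise SyntaxError("expected 'print'")
        expression, pos = parse_sum(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos][0] != "SEMICOLON":
            raise SyntaxError("expected ';'")
        statements.append(("print", expression))
        pos += 1
    return statements


def evaluate(expression):
    """Evaluator for expressions: integers are leaves, tuples are additions."""
    if isinstance(expression, int):
        return expression
    _, left, right = expression
    return evaluate(left) + evaluate(right)


def run(source, out=print):
    """Run every statement; `out` receives each printed value."""
    for _, expression in parse(lex(source)):
        out(evaluate(expression))


run("print 10 + 2;")  # prints 12
```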

Part 2: Implementation

Utilize OpenAI API to extract information from PDF files

Tho — Sun, 29 Jan 2023 03:10:01 +0000

Why is it hard to extract information from PDF files?

PDF, or Portable Document Format, is a popular file format that is widely used for documents such as invoices, purchase orders, and other business documents. However, extracting information from PDFs can be a challenging task for developers.

One reason why it is difficult to extract information from PDFs is that the format is not structured. Unlike HTML, which has a specific format for tables and headers that developers can easily identify, PDFs do not have a consistent layout for information. This makes it harder for developers to know where to find the specific information they need.

Another reason why it is difficult to extract information from PDFs is that there is no standard layout for information. Each system generates invoices and purchase orders differently, so developers must often write custom code to extract information from each individual document. This can be a time-consuming and error-prone process.

Additionally, PDFs can contain both text and images, making it difficult for developers to programmatically extract information from the document. OCR (optical character recognition) can be used to extract text from images, but this adds complexity to the process and may result in errors if the OCR software is not accurate.

Existing solutions

Existing solutions for extracting information from PDFs include:

Using regex: to match patterns in text after converting the PDF to plain text. Examples include invoice2data and traprange-invoice. However, this method requires knowledge of the format of the data fields.
AI-based cloud services: utilize machine learning to extract structured data from PDFs. Examples include pdftables and docparser, but these are not open source.
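
To illustrate why the regex approach depends on knowing the format in advance, here is a small Python sketch. The two patterns are hypothetical and tuned to one imagined layout (a "PO-" prefix followed by digits, and a "Date....:" label); they are not taken from invoice2data or traprange-invoice.

```python
import re

# Hypothetical patterns for one vendor's layout; another layout needs new patterns.
PATTERNS = {
    "PO Number": re.compile(r"\bPO-\d+\b"),
    "Date": re.compile(r"Date\.*:\s*(\d{1,2}/\d{1,2}/\d{4})"),
}


def extract_fields(text):
    """Return the first match for each known field, skipping fields that do not match."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the pattern has one, else the whole match.
            result[name] = match.group(match.lastindex or 0)
    return result
```

A document from a different system, say one that writes "Order no. 42 dated 2020-10-05", matches neither pattern, and the function returns an empty dictionary until someone writes patterns for that layout too.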

Yet another solution for PDF data extraction: using OpenAI

One solution to extract information from PDF files is to use OpenAI's natural language processing capabilities to understand the content of the document. However, OpenAI is not able to work with PDF or image formats directly, so the first step is to convert the PDF to text while retaining the relative positions of the text items.

One way to achieve this is to use the PDFLayoutTextStripper library, which uses PDFBox to read through all text items in the PDF file and organize them in lines, keeping the relative positions the same as in the original PDF file. This is important because, for example, in an invoice's items table, if the amount ends up in the same column as the quantity, queries for the total amount and total quantity will return incorrect values. Here is an example of the output from the stripper:


                                                                                                *PO-003847945*                                           

                                                                                      Page.........................: 1    of    1                        





                Address...........:     Aeeee  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 1234                                                                                                     
                                        Dooo,                                      PO-003847945                                                          
                                        ABC                                       TL-00074                                   

                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  


               100225                Aaaaaa  Eeeeee                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               AAAA BBBB CCCCCAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     


                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          

                                                1.25                                                                                                                        
(truncated...)
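
The idea of keeping relative positions can be illustrated with a toy renderer. This Python sketch is not PDFLayoutTextStripper's algorithm: the (x, y, text) item format and the two scale factors are invented for the example. It groups items into rows by their y coordinate and pads each item out to a column derived from its x coordinate.

```python
def render_layout(items, chars_per_unit=0.2, line_height=12.0):
    """Place (x, y, text) items on a character grid; y grows downward."""
    rows = {}
    for x, y, text in items:
        # Items whose y values round to the same row index share a line.
        rows.setdefault(round(y / line_height), []).append((x, text))
    lines = []
    for row in sorted(rows):
        line = ""
        for x, text in sorted(rows[row]):
            column = int(x * chars_per_unit)
            if line:
                # Pad to the target column, keeping at least one space between items.
                line = line.ljust(max(column, len(line) + 1))
            else:
                line = " " * column
            line += text
        lines.append(line)
    return "\n".join(lines)
```

Rendering the items (0, 0, "Qty"), (100, 0, "Amount"), (0, 12, "5.00") and (100, 12, "270.00") puts "Amount" and "270.00" at the same column on consecutive lines, so the values stay under their headers.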

Once the PDF has been converted to text, the next step is to call the OpenAI API and pass the text along with queries such as "Extract fields: 'PO Number', 'Total Amount'". The response will be in JSON format, and GSON can be used to parse it and extract the final results. This two-step process of converting the PDF to text and then using OpenAI's natural language processing capabilities can be an effective solution for extracting information from PDF files.

The query is as simple as follows with %s replaced by PO text content:

private static final String QUERY = """
    Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s
    """;

The query consists of two components:

specifying the desired fields
formatting the field values as JSON data for easy retrieval from API response.

And here is the example response from OpenAI:

{
  "object": "text_completion",
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n{\n  \"PO Number\": \"PO-003847945\",\n  \"Total Amount\": \"1,485.00\",\n  \"Delivery Address\": \"Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT\"\n}",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  // ... some more fields
}

Decoding the text field's JSON string yields the following desired fields:

{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
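
The decoding is a two-level JSON parse: first the API response, then the JSON string inside the text field. The sample project does this in Java with GSON. The sketch below shows the same step in Python with the standard json module; the function name decode_completion is made up, and it assumes the first choice holds a bare JSON object.

```python
import json


def decode_completion(response_body):
    """Parse the API response, then parse the JSON carried in choices[0].text."""
    response = json.loads(response_body)
    text = response["choices"][0]["text"]
    # The text starts with a newline; json.loads ignores surrounding whitespace.
    return json.loads(text)
```

If the model adds any explanation around the JSON, the second json.loads raises a ValueError, which is why the prompt asks for JSON "without any explanation".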

Run sample code

Prerequisites:

Java 16+
Maven

Steps:

Create an OpenAI account
Log in and generate an API key
Replace OPENAI_API_KEY in Main.java with your key
Update SAMPLE_PDF_FILE if needed
Execute the code and view the results from the output

Check out the code here https://github.com/thoqbk/openai-pdf