<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tho</title>
    <description>The latest articles on Forem by Tho (@thoqbk).</description>
    <link>https://forem.com/thoqbk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F142914%2F381b70a3-f967-44ca-a771-345e442cb590.jpeg</url>
      <title>Forem: Tho</title>
      <link>https://forem.com/thoqbk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thoqbk"/>
    <language>en</language>
    <item>
      <title>PDF Query using OpenAI API</title>
      <dc:creator>Tho</dc:creator>
      <pubDate>Thu, 29 Jun 2023 04:03:44 +0000</pubDate>
      <link>https://forem.com/thoqbk/pdf-query-using-openai-api-3ma1</link>
      <guid>https://forem.com/thoqbk/pdf-query-using-openai-api-3ma1</guid>
      <description>&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;The OpenAI API offers a programmable way to extract data from raw text using AI. We build a prompt by combining the raw text content with the data fields we want to extract, then send that prompt to OpenAI as a string. A well-crafted prompt can also specify the response format, such as JSON or YAML, which makes the result easy to parse. For example:&lt;/p&gt;

&lt;p&gt;Prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;%s&lt;/code&gt; is replaced by the raw text content. A sample output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
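
&lt;p&gt;As a rough sketch of the idea (not code from the original project), the prompt can be filled in and the reply parsed with the Python standard library alone. The names &lt;code&gt;build_prompt&lt;/code&gt; and &lt;code&gt;parse_fields&lt;/code&gt; are hypothetical, the reply string is the sample output above rather than a live API response, and the sketch assumes the model really does return a bare JSON object.&lt;/p&gt;

```python
import json

PROMPT_TEMPLATE = """Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
Return result in JSON format without any explanation.
The PO content is as follows:
%s
"""


def build_prompt(raw_text: str) -> str:
    """Substitute the raw document text for the %s placeholder."""
    return PROMPT_TEMPLATE % raw_text


def parse_fields(reply: str) -> dict:
    """Parse the model's reply, assuming it is a bare JSON object."""
    return json.loads(reply.strip())


# The sample output shown above, standing in for a real API reply.
sample_reply = """
{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
"""

prompt = build_prompt("(raw PO text goes here)")
fields = parse_fields(sample_reply)
print(fields["PO Number"])
```

&lt;p&gt;In practice the reply may contain extra text around the JSON, so a real caller would likely want to handle &lt;code&gt;json.JSONDecodeError&lt;/code&gt;.&lt;/p&gt;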



&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;The number of tokens that can be sent to the OpenAI API is limited, and the same is true of other LLMs. Roughly speaking, a token is a word or a piece of a word. For example, the token limit for &lt;code&gt;gpt-35-turbo&lt;/code&gt; is 4096 tokens, shared between the prompt and the response. That means the approach above doesn't work for a large text file (e.g. a file with 100 pages, or even fewer).&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Split the file into chunks that each fit within the token limit.&lt;/li&gt;
&lt;li&gt;Utilize vector databases such as &lt;code&gt;FAISS&lt;/code&gt; or &lt;code&gt;Chroma&lt;/code&gt; to store these chunks.&lt;/li&gt;
&lt;li&gt;For each data retrieval request, use the LLM to find the related chunks in the database and summarize them to obtain the final result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to &lt;a href="https://github.com/thoqbk/pdf-query"&gt;chat.py&lt;/a&gt; for the details.&lt;/p&gt;
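
&lt;p&gt;The split-store-search steps can be sketched without any external service. This is only an illustration and not what &lt;code&gt;chat.py&lt;/code&gt; does: word counts with cosine similarity stand in for a real embedding model and a vector database such as FAISS or Chroma, the sample text is made up, and all function names are hypothetical. The selected chunks would then be sent to the LLM together with the question.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_words: int = 200, overlap: int = 20) -> list:
    """Split text into word-based chunks that overlap slightly."""
    words = text.split()
    step = chunk_words - overlap  # assumes chunk_words is larger than overlap
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)]


def vectorize(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(chunks: list, query: str, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(vectorize(c), q), reverse=True)[:k]


# Made-up sample text for illustration.
document = (
    "This agreement is the Microsoft Partner Agreement. "
    "Payment terms are net sixty days. "
    "The governing law is the law of the State of Washington."
)
chunks = split_into_chunks(document, chunk_words=8, overlap=2)
best = top_chunks(chunks, "what are the payment terms?", k=1)
print(best[0])
```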

&lt;h2&gt;
  
  
  Run sample code
&lt;/h2&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.6+&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/venv.html"&gt;Virtualenv&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open a terminal, move to the root directory and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv .env
source .env/bin/activate
pip install -r /path/to/requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the first command only needs to be run once, to create the &lt;code&gt;.env&lt;/code&gt; folder in the root directory of the sample code&lt;/li&gt;
&lt;li&gt;the second command activates the virtual environment and needs to be run at the beginning of each test session&lt;/li&gt;
&lt;li&gt;the last command installs the dependencies and only needs to be run again when the dependencies change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;update &lt;code&gt;chat.py&lt;/code&gt; to add your OpenAI API key
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;os.environ["OPENAI_API_KEY"] = "YOUR-OPEN-AI-API-KEY"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;run the code and ask questions
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python chat.py 
Enter a query: what is the name of the document?
The name of the document is the "Microsoft Partner Agreement".
Enter a query: what is the legal entity in this document?
The legal entity in this document is Microsoft.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Source code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/thoqbk/pdf-query"&gt;pdf-query&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>pdf</category>
      <category>dataretrieval</category>
    </item>
    <item>
      <title>Writing an Interpreter: Implementation</title>
      <dc:creator>Tho</dc:creator>
      <pubDate>Sun, 04 Jun 2023 11:29:00 +0000</pubDate>
      <link>https://forem.com/thoqbk/writing-an-interpreter-implementation-3hf8</link>
      <guid>https://forem.com/thoqbk/writing-an-interpreter-implementation-3hf8</guid>
      <description>&lt;p&gt;Part 1 can be found &lt;a href="https://dev.to/thoqbk/writing-an-interpreter-high-level-overview-2eo"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lexer
&lt;/h2&gt;

&lt;p&gt;The Lexer serves as the most basic element. Its primary function involves iterating through the characters present in the source code. It may combine certain characters to create a single token, and subsequently generate a token object with its associated type. This object is then added to the resulting list.&lt;/p&gt;

&lt;p&gt;More in-depth information regarding the implementation can be found &lt;a href="https://github.com/thoqbk/tholangforfun/blob/master/src/main/java/io/thoqbk/tholangforfun/Lexer.java"&gt;here&lt;/a&gt;&lt;/p&gt;
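
&lt;p&gt;To make the idea concrete, here is a minimal lexer sketch. It is written in Python rather than the project's Java, and the token type names and the small keyword and symbol tables are assumptions for illustration, not the definitions used in &lt;code&gt;Lexer.java&lt;/code&gt;. It shows the two behaviours described above: iterating over characters, and combining several characters (digits, letters) into a single typed token.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: str
    text: str


# Hypothetical keyword and symbol tables, kept small on purpose.
KEYWORDS = {"print", "if", "else"}
SYMBOLS = {"+": "PLUS", "-": "MINUS", "*": "ASTERISK",
           "(": "LPAREN", ")": "RPAREN", ";": "SEMICOLON"}


def tokenize(source: str) -> list:
    """Walk the source character by character and emit typed tokens."""
    tokens = []
    i = 0
    while i != len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            # Combine consecutive digits into one INT token.
            start = i
            while i != len(source) and source[i].isdigit():
                i += 1
            tokens.append(Token("INT", source[start:i]))
        elif ch.isalpha():
            # Combine letters and digits into a keyword or identifier.
            start = i
            while i != len(source) and source[i].isalnum():
                i += 1
            word = source[start:i]
            tokens.append(Token(word.upper() if word in KEYWORDS else "IDENT", word))
        elif ch in SYMBOLS:
            tokens.append(Token(SYMBOLS[ch], ch))
            i += 1
        else:
            raise ValueError(f"Unexpected character {ch!r} at position {i}")
    return tokens


print([t.type for t in tokenize("print 10 + 2;")])
```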

&lt;h2&gt;
  
  
  Parser
&lt;/h2&gt;

&lt;p&gt;The parser is the most complex component in an interpreter. Before we delve into it, let's understand the difference between an expression and a statement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expression: An expression is a combination of values, variables, operators, and function calls that come together to give a single value. It's like a calculation that produces a result.&lt;/li&gt;
&lt;li&gt;Statement: A statement is a complete section of code that carries out an action or a series of actions. It represents an instruction or a command that the program follows. Expressions can be part of a statement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we parse an expression, we generate an Abstract Syntax Tree (AST). This is the trickiest part because we need to handle operator precedence (which operations happen first) and the types of operators (like unary or ternary). It becomes even more challenging when expressions involve variables and function calls with their own arguments, which can also be expressions.&lt;/p&gt;

&lt;p&gt;Some examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pratt parser
&lt;/h3&gt;

&lt;p&gt;The Pratt parser is a flexible and efficient method for parsing expressions, especially in languages with complex rules for operator precedence and associativity. It relies on Pratt parsing functions, which are specialized parsing functions associated with operators. Each operator in the language has its own parsing function that determines how expressions involving that operator should be parsed.&lt;/p&gt;

&lt;p&gt;Here are some advantages of the Pratt parser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operator Precedence: Each operator is assigned a precedence level, ensuring that expressions are evaluated correctly according to the language's precedence rules.&lt;/li&gt;
&lt;li&gt;Left-Associativity and Right-Associativity: The parser can handle both types of associativity. For example, in the expression 3 ^ 2 ^ 3, it correctly interprets it as 3 ^ (2 ^ 3) instead of (3 ^ 2) ^ 3.&lt;/li&gt;
&lt;li&gt;Extensibility: It is easy to add new operators by specifying their precedence and implementing their respective parsing function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additional resources for more information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tdop.github.io/"&gt;Top Down Operator Precedence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/"&gt;Pratt Parsers: Expression Parsing Made Easy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pratt parser: Implementation
&lt;/h3&gt;

&lt;p&gt;The Pratt parser may sound complicated, but its implementation is actually quite straightforward. You can find the complete implementation &lt;a href="https://github.com/thoqbk/tholangforfun/blob/master/src/main/java/io/thoqbk/tholangforfun/Parser.java"&gt;here&lt;/a&gt;. Here's a summary of how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we need to set up parser functions for operators: &lt;code&gt;prefixParsers&lt;/code&gt; for prefix operators and &lt;code&gt;infixParsers&lt;/code&gt; for infix ones. Postfix operators would be doable too, but the implementation doesn't support them for now.&lt;/li&gt;
&lt;li&gt;We configure the precedence levels for operators in &lt;code&gt;precedences&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;With the above configurations in place, we can now implement the &lt;code&gt;parseExpression&lt;/code&gt; function, which parses expressions based on the defined operator rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the basic idea behind the Pratt parser. You can check out the provided link for a more detailed implementation.&lt;/p&gt;
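
&lt;p&gt;The three pieces can be sketched in a few dozen lines. The sketch below is in Python rather than the project's Java and is not a port of &lt;code&gt;Parser.java&lt;/code&gt;: the precedence numbers, the tuple-based AST, and the regex tokenizer (which silently skips characters it does not recognize) are all assumptions made to keep it short. The names mirror &lt;code&gt;prefixParsers&lt;/code&gt;, &lt;code&gt;infixParsers&lt;/code&gt;, &lt;code&gt;precedences&lt;/code&gt; and &lt;code&gt;parseExpression&lt;/code&gt; from the summary above.&lt;/p&gt;

```python
import re

# Higher number binds tighter. These levels are illustrative.
PRECEDENCES = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
PREFIX_PRECEDENCE = 4


class Parser:
    def __init__(self, source: str):
        self.tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/^()]", source)
        self.pos = 0
        # Prefix parsers run when a token starts an expression.
        self.prefix_parsers = {"-": self.parse_negation, "(": self.parse_group}
        # Infix parsers run when a token appears between two operands.
        self.infix_parsers = {op: self.parse_binary for op in PRECEDENCES}

    def peek(self):
        return self.tokens[self.pos] if self.pos != len(self.tokens) else None

    def next(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        self.pos += 1
        return token

    def parse_expression(self, precedence: int = 0):
        token = self.next()
        prefix = self.prefix_parsers.get(token)
        left = prefix() if prefix else self.parse_atom(token)
        # Keep consuming infix operators that bind tighter than the caller's level.
        while self.peek() in self.infix_parsers and PRECEDENCES[self.peek()] > precedence:
            operator = self.next()
            left = self.infix_parsers[operator](left, operator)
        return left

    def parse_atom(self, token):
        if token.isdigit():
            return int(token)
        if token.isidentifier():
            return token
        raise SyntaxError(f"Unexpected token {token!r}")

    def parse_negation(self):
        return ("neg", self.parse_expression(PREFIX_PRECEDENCE))

    def parse_group(self):
        inner = self.parse_expression(0)
        if self.next() != ")":
            raise SyntaxError("Expected ')'")
        return inner

    def parse_binary(self, left, operator):
        level = PRECEDENCES[operator]
        # Right-associative operators let an equal-precedence operator bind on the right.
        right = self.parse_expression(level - 1 if operator in RIGHT_ASSOCIATIVE else level)
        return (operator, left, right)


def parse(source: str):
    return Parser(source).parse_expression()


print(parse("1 + (2 - a) * 3"))
print(parse("3 ^ 2 ^ 3"))
```

&lt;p&gt;Adding an operator in this sketch means adding one entry to the precedence table and, if it needs special handling, one parsing function, which is the extensibility property mentioned earlier.&lt;/p&gt;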

&lt;h3&gt;
  
  
  Parse if-else statement
&lt;/h3&gt;

&lt;p&gt;Now that we have implemented &lt;code&gt;parseExpression&lt;/code&gt;, parsing functions for statements are relatively straightforward, as long as they have clear syntax. Let's take a look at how to parse an &lt;code&gt;if-else&lt;/code&gt; statement. Assuming we have the following syntax in Backus-Naur Form (BNF):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;if-statement&amp;gt; ::= if ( &amp;lt;expression&amp;gt; ) { &amp;lt;statement&amp;gt; }
                | if ( &amp;lt;expression&amp;gt; ) { &amp;lt;statement&amp;gt; } else { &amp;lt;statement&amp;gt; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The parse function for the if statement would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="nf"&gt;parseIfStatement&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;retVal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;If&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lexer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentToken&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;assertPeekTokenThenNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TokenType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LPAREN&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;lexer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextToken&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;retVal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCondition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parseExpression&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;assertPeekTokenThenNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TokenType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RPAREN&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;assertPeekTokenThenNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TokenType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LBRACE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;retVal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setIfBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parseBlockStatement&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;peekTokenIs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TokenType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ELSE&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;lexer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextToken&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;lexer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextToken&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;retVal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setElseBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parseBlockStatement&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retVal&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation closely reflects the defined syntax in BNF, making it easy to understand and follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluator
&lt;/h2&gt;

&lt;p&gt;Details can be found &lt;a href="https://github.com/thoqbk/tholangforfun/blob/master/src/main/java/io/thoqbk/tholangforfun/Evaluator.java"&gt;here&lt;/a&gt;&lt;/p&gt;
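
&lt;p&gt;As a rough illustration of what an evaluator does, the sketch below walks a tiny AST recursively. It is in Python, not the project's Java, and it does not reflect &lt;code&gt;Evaluator.java&lt;/code&gt;: the tuple-based node shapes, the &lt;code&gt;let&lt;/code&gt; statement, and collecting printed values in a list are all assumptions chosen to keep the example short.&lt;/p&gt;

```python
import operator

BINARY_OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def evaluate(node, env: dict):
    """Evaluate an expression node: int literal, variable name, or (op, left, right)."""
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        if node not in env:
            raise NameError(f"Undefined variable {node!r}")
        return env[node]
    op, left, right = node
    return BINARY_OPERATIONS[op](evaluate(left, env), evaluate(right, env))


def run(statements: list, env: dict) -> list:
    """Execute statements in order and collect what 'print' would output."""
    output = []
    for kind, *args in statements:
        if kind == "print":
            output.append(evaluate(args[0], env))
        elif kind == "let":
            name, expression = args
            env[name] = evaluate(expression, env)
        elif kind == "if":
            condition, if_body, else_body = args
            output.extend(run(if_body if evaluate(condition, env) else else_body, env))
        else:
            raise ValueError(f"Unknown statement {kind!r}")
    return output


# Hypothetical program: let a = 10 + 2; if (a) { print a * 2; } else { print 0; }
program = [
    ("let", "a", ("+", 10, 2)),
    ("if", "a", [("print", ("*", "a", 2))], [("print", 0)]),
]
print(run(program, {}))
```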

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tdop.github.io/"&gt;Top Down Operator Precedence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/"&gt;Pratt Parsers: Expression Parsing Made Easy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Book &lt;a href="https://interpreterbook.com/"&gt;Writing An Interpreter In Go&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>interpreter</category>
      <category>compiler</category>
    </item>
    <item>
      <title>Writing an Interpreter: High-Level Overview</title>
      <dc:creator>Tho</dc:creator>
      <pubDate>Sun, 04 Jun 2023 11:28:49 +0000</pubDate>
      <link>https://forem.com/thoqbk/writing-an-interpreter-high-level-overview-2eo</link>
      <guid>https://forem.com/thoqbk/writing-an-interpreter-high-level-overview-2eo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Creating a new language for a specific kind of logic is sometimes unavoidable. Some use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We want to develop a highly dynamic application that enables non-technical business users and system admins to customize the system and its workflow by writing code. The aim is to make the application highly configurable without the need for a complex user interface. A language built for this purpose is commonly referred to as a domain-specific language (DSL).&lt;/li&gt;
&lt;li&gt;Code Generation and Automation: Building a custom language can facilitate code generation and automation for repetitive or complex tasks. By defining a language that closely matches the problem domain, developers can write code at a higher level of abstraction and generate boilerplate or repetitive code automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article aims to offer simple instructions on constructing an interpreter for a custom programming language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreter and compiler
&lt;/h2&gt;

&lt;p&gt;An interpreter is a software component or program that directly reads and executes code without prior compilation. It is often used with scripting languages and high-level programming languages. Notable interpreted languages include Python, JavaScript, and PHP.&lt;/p&gt;

&lt;p&gt;In contrast, compilers take code as input and generate either machine code that runs directly on the hardware or bytecode that requires a virtual machine for execution. Examples of compiled languages are C/C++, Java, C#, Go, and Rust.&lt;/p&gt;

&lt;p&gt;Compiled languages tend to offer higher performance compared to interpreted ones because compilers perform extensive optimization during the compilation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lexer, parser and evaluator
&lt;/h3&gt;

&lt;p&gt;There are 3 main components in an interpreter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lexer (or Tokenizer): responsible for breaking down the source code into individual tokens. It scans the code character by character and groups them into meaningful units such as keywords, identifiers, operators, literals, and symbols.&lt;/li&gt;
&lt;li&gt;Parser: takes the tokens produced by the lexer and analyzes their structure based on a defined grammar or syntax rules. It ensures that the code follows the correct syntax and creates a data structure called an Abstract Syntax Tree (AST) or parse tree, which represents the hierarchical structure of the code.&lt;/li&gt;
&lt;li&gt;Evaluator: responsible for executing the code and producing the desired results. It takes the parsed and analyzed code, such as an Abstract Syntax Tree (AST) or an intermediate representation, and performs the computations and operations specified by the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While advanced interpreters may include additional components like the Semantic Analyzer and IR Generator, for the purpose of understanding how to write an interpreter, we won't discuss those specifics here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Given the code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print 10 + 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following happens in each component:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HBE0Gjcj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p6m2foj8e23y3yahk3un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HBE0Gjcj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p6m2foj8e23y3yahk3un.png" alt="Main components" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lexer breaks down the code and represents it as a list of typed token objects. In this example, there are 4 token types: print, int number (10 and 2), plus and semicolon.&lt;/li&gt;
&lt;li&gt;Parser iterates through the list of tokens to form statement objects. In this case, it's a PrintStatement whose expression is an infix AST.&lt;/li&gt;
&lt;li&gt;Evaluator iterates through the list of statements produced by the Parser, evaluates them and gives the final result. In this case, it prints 12 to the console.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part 2: &lt;a href="https://dev.to/thoqbk/writing-an-interpreter-implementation-3hf8"&gt;Implementation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>interpreter</category>
      <category>compiler</category>
    </item>
    <item>
      <title>Utilize OpenAI API to extract information from PDF files</title>
      <dc:creator>Tho</dc:creator>
      <pubDate>Sun, 29 Jan 2023 03:10:01 +0000</pubDate>
      <link>https://forem.com/thoqbk/utilize-openai-api-to-extract-information-from-pdf-files-4ml1</link>
      <guid>https://forem.com/thoqbk/utilize-openai-api-to-extract-information-from-pdf-files-4ml1</guid>
      <description>&lt;h2&gt;
  
  
  Why is it hard to extract information from PDF files?
&lt;/h2&gt;

&lt;p&gt;PDF, or Portable Document Format, is a popular file format that is widely used for documents such as invoices, purchase orders, and other business documents. However, extracting information from PDFs can be a challenging task for developers.&lt;/p&gt;

&lt;p&gt;One reason why it is difficult to extract information from PDFs is that the format is not structured. Unlike HTML, which has specific elements for tables and headings that developers can easily identify, PDFs do not mark up the structure of their content. This makes it harder for developers to know where to find the specific information they need.&lt;/p&gt;

&lt;p&gt;Another reason why it is difficult to extract information from PDFs is that there is no standard layout for information. Each system generates invoices and purchase orders differently, so developers must often write custom code to extract information from each individual document. This can be a time-consuming and error-prone process.&lt;/p&gt;

&lt;p&gt;Additionally, PDFs can contain both text and images, making it difficult for developers to programmatically extract information from the document. OCR (optical character recognition) can be used to extract text from images, but this adds complexity to the process and may result in errors if the OCR software is not accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Existing solutions
&lt;/h2&gt;

&lt;p&gt;Existing solutions for extracting information from PDFs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using regex: to match patterns in text after converting the PDF to plain text. Examples include &lt;a href="https://github.com/invoice-x/invoice2data"&gt;invoice2data&lt;/a&gt; and &lt;a href="https://github.com/thoqbk/traprange/blob/master/_Docs/invoice/README.md"&gt;traprange-invoice&lt;/a&gt;. However, this method requires knowledge of the format of the data fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI-based cloud services: utilize machine learning to extract structured data from PDFs. Examples include &lt;a href="https://pdftables.com/"&gt;pdftables&lt;/a&gt; and &lt;a href="https://docparser.com/"&gt;docparser&lt;/a&gt;, but these are closed-source commercial services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Yet another solution for PDF data extraction: using OpenAI
&lt;/h2&gt;

&lt;p&gt;One solution to extract information from PDF files is to use OpenAI's natural language processing capabilities to understand the content of the document. However, OpenAI is not able to work with PDF or image formats directly, so the first step is to convert the PDF to text while retaining the relative positions of the text items.&lt;/p&gt;

&lt;p&gt;One way to achieve this is to use the PDFLayoutTextStripper library, which uses PDFBox to read all text items in the PDF file and arrange them in lines, keeping their relative positions the same as in the original PDF file. This matters because, for example, if values in an invoice's items table shift so that the amount lands in the quantity column, queries for the total amount and total quantity will return incorrect values. Here is an example of the output from the stripper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
                                                                                                *PO-003847945*                                           

                                                                                      Page.........................: 1    of    1                        





                Address...........:     Aeeee  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 1234                                                                                                     
                                        Dooo,                                      PO-003847945                                                          
                                        ABC                                       TL-00074                                   

                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  


               100225                Aaaaaa  Eeeeee                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               AAAA BBBB CCCCCAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     


                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          

                                                1.25                                                                                                                        
(truncated...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
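
&lt;p&gt;To illustrate why position-preserving output keeps table columns apart, here is a toy sketch in Python. It is not how PDFLayoutTextStripper is implemented: the function name, the made-up coordinates, and the fixed &lt;code&gt;units_per_char&lt;/code&gt; scale are assumptions. Each text item is padded out to a column derived from its x position, so values with the same x line up under the same header.&lt;/p&gt;

```python
def render_layout(items: list, units_per_char: float = 5.0) -> str:
    """Place (x, y, text) items on a character grid, one output line per y value."""
    rows = {}
    for x, y, text in items:
        rows.setdefault(y, []).append((x, text))

    lines = []
    for y in sorted(rows):
        line = ""
        for x, text in sorted(rows[y]):
            column = int(x / units_per_char)
            # Pad up to the target column; keep at least one space between items.
            padding = max(column - len(line), 1 if line else 0)
            line += " " * padding + text
        lines.append(line)
    return "\n".join(lines)


# Made-up coordinates: x grows to the right, y grows down the page.
items = [
    (0, 0, "Description"), (100, 0, "Quantity"), (160, 0, "Amount"),
    (0, 10, "CRET ZERO 350ML"), (100, 10, "5.00"), (160, 10, "270.00"),
]
print(render_layout(items))
```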



&lt;p&gt;Once the PDF has been converted to text, the next step is to call the OpenAI API and pass the text along with queries such as "Extract fields: 'PO Number', 'Total Amount'". The response will be in JSON format, and GSON can be used to parse it and extract the final results. This two-step process of converting the PDF to text and then using OpenAI's natural language processing capabilities can be an effective solution for extracting information from PDF files.&lt;/p&gt;

&lt;p&gt;The query is as simple as the following, with &lt;code&gt;%s&lt;/code&gt; replaced by the PO text content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;QUERY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
    Want to extract fields: "&lt;/span&gt;&lt;span class="no"&gt;PO&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="nc"&gt;Total&lt;/span&gt; &lt;span class="nc"&gt;Amount&lt;/span&gt;&lt;span class="s"&gt;" and "&lt;/span&gt;&lt;span class="nc"&gt;Delivery&lt;/span&gt; &lt;span class="nc"&gt;Address&lt;/span&gt;&lt;span class="s"&gt;".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s
    """&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query consists of two components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;specifying the desired fields&lt;/li&gt;
&lt;li&gt;formatting the field values as JSON data for easy retrieval from the API response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here is the example response from OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text_completion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-davinci-003&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;choices&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n{&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n  &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="nx"&gt;PO&lt;/span&gt; &lt;span class="nb"&gt;Number&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="nx"&gt;PO&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;003847945&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n  &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="nx"&gt;Total&lt;/span&gt; &lt;span class="nx"&gt;Amount&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;485.00&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n  &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="nx"&gt;Delivery&lt;/span&gt; &lt;span class="nx"&gt;Address&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span 
class="dl"&gt;"&lt;/span&gt;&lt;span class="nx"&gt;Peera&lt;/span&gt; &lt;span class="nx"&gt;Consumer&lt;/span&gt; &lt;span class="nx"&gt;Good&lt;/span&gt; &lt;span class="nx"&gt;Co&lt;/span&gt;&lt;span class="p"&gt;.(&lt;/span&gt;&lt;span class="nx"&gt;QSC&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;O&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Box&lt;/span&gt; &lt;span class="mi"&gt;3371&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dohe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;QAT&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;index&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logprobs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;finish_reason&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// ... some more fields&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parsing the JSON string in the &lt;code&gt;text&lt;/code&gt; field yields the desired fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PO Number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PO-003847945&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Total Amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1,485.00&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Delivery Address&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
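
&lt;p&gt;To illustrate that last step without adding a dependency, here is a rough sketch that pulls a single field out of the decoded &lt;code&gt;text&lt;/code&gt; value with a regular expression from the Java standard library. This is a stand-in for a real JSON parser, which would be the more robust choice in practice: it assumes a flat object whose values are strings without escaped quotes. &lt;code&gt;FieldExtractor&lt;/code&gt; and &lt;code&gt;extractField&lt;/code&gt; are hypothetical names, not part of the sample repository:&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // Hypothetical helper: returns the string value of a top-level field, or null if absent.
    // Assumes flat JSON whose values are strings containing no escaped quotes.
    static String extractField(String json, String fieldName) {
        Pattern pattern = Pattern.compile(
            "\"" + Pattern.quote(fieldName) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Decoded "text" value from the sample response in this article.
        String text = """
            {
              "PO Number": "PO-003847945",
              "Total Amount": "1,485.00",
              "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
            }
            """;
        System.out.println(extractField(text, "PO Number"));    // prints PO-003847945
        System.out.println(extractField(text, "Total Amount")); // prints 1,485.00
    }
}
```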



&lt;h2&gt;
  
  
  Run sample code
&lt;/h2&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java 16+&lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an OpenAI account&lt;/li&gt;
&lt;li&gt;Log in and generate an API key&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in Main.java with your key&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;SAMPLE_PDF_FILE&lt;/code&gt; if needed&lt;/li&gt;
&lt;li&gt;Run the code and check the results in the output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out the code here: &lt;a href="https://github.com/thoqbk/openai-pdf"&gt;https://github.com/thoqbk/openai-pdf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>java</category>
      <category>pdf</category>
    </item>
  </channel>
</rss>
