Forem: praveenr

UV

praveenr — Fri, 27 Feb 2026 08:54:49 +0000

curl -LsSf https://astral.sh/uv/install.sh | sh

uv python install 3.12

uv venv --python 3.12 --prompt my-project

source .venv/bin/activate

To create pyproject.toml

uv init

To create lock file

uv lock

To install a library into the environment only (pyproject.toml is not updated)

uv pip install package_name

To add a library to pyproject.toml and install it

uv add package_name

To sync environments

uv sync

To uninstall a package from the environment only

uv pip uninstall package_name

Remove package from toml and lock files

uv remove package_name

Expose API Using FastAPI

praveenr — Wed, 21 Aug 2024 14:21:44 +0000

In the previous blogs we saw how to install neo4j, load data into it and query it using natural language. This is the final blog in the series, where we create a simple FastAPI app to expose the setup as an API.

You can find the code here - https://github.com/praveenr2998/Creating-Lightweight-RAG-Systems-With-Graphs/blob/main/fastapi_app/app.py

from fastapi import FastAPI
from pydantic import BaseModel
from query_engine import GraphQueryEngine


# Pydantic model
class QueryRequest(BaseModel):
    query: str


app = FastAPI()


@app.post("/process-query/")
async def process_query(request: QueryRequest):
    query_engine = GraphQueryEngine()
    cypher_queries = query_engine.get_response(request.query)
    cypher_queries = query_engine.populate_embedding_in_query(request.query, cypher_queries)
    fetched_data = query_engine.fetch_data(cypher_queries)
    response = query_engine.get_final_response(request.query, fetched_data)
    return {"response": response}

To run this file use the command

uvicorn app:app --reload

The terminal shows the address the app is served on, in my case http://127.0.0.1:8000/. Append docs to it to open Swagger - http://127.0.0.1:8000/docs
Click on the Try it out option and enter your question as the value of the query key in the input JSON to get a response.

CURL COMMAND

curl -X 'POST' \
  'http://127.0.0.1:8000/process-query/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "do you have headphones within the price of 25000"
}'
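The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the app is running at the local address shown above. The names build_request and ask are illustrative and are not part of the repository.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/process-query/"  # assumed local uvicorn address


def build_request(question, url=API_URL):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the question and return the 'response' field (needs the server running)."""
    with urllib.request.urlopen(build_request(question)) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the server up, ask("do you have headphones within the price of 25000") should return the text that Swagger shows in the response body.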

Hope this helps... !!!

LinkedIn - https://www.linkedin.com/in/praveenr2998/

Building A Simple Graph Query Engine

praveenr — Wed, 21 Aug 2024 13:53:47 +0000

In the last 2 blogs we saw how to install neo4j and load data into it. In this blog we are going to see how to build a simple graph query engine which answers our question by retrieving data from neo4j.

Step 1 : BUILD CYPHER QUERY

To build a Cypher query we need to give GPT the schema information and property information along with our question. Using this metadata, GPT returns the query.
I have structured the prompt to return 3 queries for every user input:

Regular expressions - This query uses regex patterns to match data in the graph DB.
Levenshtein similarity - This query uses Levenshtein similarity with a threshold score greater than 0.5 to match and fetch data from the graph DB.
Embedding based match - We have already pushed embeddings into our database, so this query uses the embedding of the user query to reorder the complete list by cosine similarity score. Maybe this could be improved to return only the top 5.
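To get a feel for what the second and third strategies compute, here is a pure Python sketch. It is not the APOC or GDS implementation: the similarity is assumed to be normalised as 1 - distance / length of the longer string, and rank_by_cosine is a hypothetical helper that reorders (name, embedding) pairs by cosine score.

```python
import math


def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_embedding, named_embeddings, top_k=None):
    """Order (name, embedding) pairs by similarity to the query, best first."""
    ranked = sorted(named_embeddings,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

Passing top_k=5 to rank_by_cosine gives the "top 5" variant mentioned above, while leaving it as None reorders the complete list.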

class GraphQueryEngine:
    def __init__(self):
        self.client = OpenAI(api_key="")
        self.url = "bolt://localhost:7687"
        self.auth = ("neo4j", "neo4j@123")

    def get_response(self, user_input):
        """Used to get cypher queries from user input"""
        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are an expert in generating Cypher queries for a Neo4j database. Your task is to understand the input and generate only Cypher read queries. Do not return anything other than the Cypher queries, as the returned result will be executed directly in the database."},
                {"role": "user",
                 "content": f"""
                 Schema Information:
                 NODES: Product_type - Contains the distinct types of products such as headphones/mobiles/laptops/washing machines, Product_details - Contains products within a product_type for example apple, samsung within mobiles, DELL within laptops 
                 NODE PROPERTIES: In node Product_type there are name(name of the product type - String), embedding(embedding of the name), and in node Product_details there are name(name of the product - string), price(price of the product - integer), description(description of the product), product_release_date(when product was release on - date), available_stock(stock left - integer), review_rating(product review - float) 
                 DIRECTION OF RELATIONSHIPS: Node Product_type is connected to node Product_details using relationship CONTAINS

                 Based on the schema, generate three read-only Cypher queries related to Product_type (e.g., chairs, headphones, fridge) or Product_details (e.g., name, description) or combination of both. Ensure that product category uses Product_type and product name/ price 

                 Query 1: Use regular expressions (avoid 'contains') - Exclude the 'embedding' property from the result.
                 Query 2: Use `apoc.text.levenshteinSimilarity > 0.5` - Exclude the 'embedding' property from the result.
                 Query 3: Use `gds.similarity.cosine()` to reorder nodes based on similarity scores. The query must include a `%s` placeholder for embedding input but exclude the 'embedding' property in the result.

                 Generate targeted queries using relationships only when necessary. The embedding property should only be used in the logic and must not appear in the query results.

                 Strictly return only the Cypher queries with no embeddings. The returned result will be executed directly in the database.

                 {user_input}
                 """},
            ],
        )

        response = completion.choices[0].message.content

        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are an expert in parsing generating Cypher queries."},
                {"role": "user",
                 "content": f"""Use this input - {response} and parse and return only the cypher queries from the input, ensure that in the cypher query if it returns embeddings then remove the embeddings alone from the query"""},
            ],
            response_format=CypherQuery,
        )
        event = completion.choices[0].message.parsed
        cypher_queries = event.cypher_queries
        print("################################## CYPHER QUERIES ######################################")
        for query in cypher_queries:
            print(query)
        return cypher_queries

STEP 2 - POPULATE EMBEDDINGS IN THE THIRD QUERY

The 3rd query uses gds.similarity.cosine(), so we convert the user query to an embedding and substitute it into the %s placeholder of the 3rd query.

    def populate_embedding_in_query(self, user_input, cypher_queries):
        """Used to add embeddings of the user input in the 3rd query"""
        model = "text-embedding-3-small"
        user_input = user_input.replace("\n", " ")
        embeddings = self.client.embeddings.create(input=[user_input], model=model).data[0].embedding
        cypher_queries[2] = cypher_queries[2] % embeddings
        return cypher_queries
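The substitution works because Python formats a list as text that is also a valid Cypher list literal. Here is a small sketch of that step in isolation. fill_embedding is a hypothetical helper and the query is only an example of the shape the prompt asks for. It uses str.replace instead of the % operator, on the assumption that a generated query might contain another % character somewhere (for instance inside a regex), which would make % formatting raise an error.

```python
def fill_embedding(query, embedding):
    """Substitute the first %s placeholder with the embedding as a list literal."""
    if "%s" not in query:
        raise ValueError("query has no %s placeholder for the embedding")
    return query.replace("%s", str(list(embedding)), 1)


# Hypothetical third query, shaped like the one the prompt asks GPT to produce.
query = (
    "MATCH (p:Product_type) "
    "RETURN p.name AS name, gds.similarity.cosine(p.embedding, %s) AS score "
    "ORDER BY score DESC"
)
filled = fill_embedding(query, [0.25, -0.5, 1.0])
print(filled)
```

For a query with a single %s and no other % characters, the result is identical to query % embedding.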

STEP 3 - QUERY THE DB

Query the DB using the prepared cypher queries

    def execute_read_query(self, query):
        """Execute the cypher query"""
        results = []

        with GraphDatabase.driver(self.url, auth=self.auth) as driver:
            with driver.session() as session:
                try:
                    result = session.run(query)
                    # Collect the result from the read query
                    records = [record.data() for record in result]
                    if records:
                        results.append(records)
                except Exception as error:
                    print(f"Error in executing query: {error}")

        return results

    def fetch_data(self, cypher_queries):
        """Return the fetched data from DB post formatting"""
        results = None
        for idx in range(len(cypher_queries)):
            try:
                results = self.execute_read_query(cypher_queries[idx])
                if results:
                    if idx == len(cypher_queries) - 1:
                        results = results[0][:10]
                    break
            except Exception:
                pass
        return results
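The fallback idea in fetch_data (try the queries in order, stop at the first one that returns something, and cap the embedding query at 10 rows) can be sketched without a database. This is a simplified variant, not the repository code: first_non_empty and fake_run are hypothetical names, and unlike fetch_data it always returns a flat list of rows.

```python
def first_non_empty(queries, run_query, last_query_limit=10):
    """Run queries in order and return the rows of the first one that matches."""
    for index, query in enumerate(queries):
        try:
            rows = run_query(query)
        except Exception as error:
            print(f"Query {index + 1} failed: {error}")
            continue
        if rows:
            is_last = index == len(queries) - 1
            # Only the last (embedding) query returns the whole list, so cap it.
            return rows[:last_query_limit] if is_last else rows
    return []


def fake_run(query):
    """Stand-in for a database call: only the second query finds anything."""
    if query == "levenshtein query":
        return [{"name": "Headphone A", "price": 1999}]
    return []


rows = first_non_empty(["regex query", "levenshtein query", "embedding query"], fake_run)
print(rows)
```

In the real engine, run_query would be a function that sends the Cypher text to neo4j and returns the records.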

STEP 4 - AUGMENTED GENERATION

Using the fetched data, call GPT to generate a response to the user query, with the fetched data supplied as augmenting context.

    def get_final_response(self, user_input, fetched_data):
        """Augmented generation using data fetched from DB"""
        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are a chatbot for an ecommerce website, you help users to identify their desired products"},
                {"role": "user", "content": f"""User query - {user_input}
                Use the below metadata to answer my query
                {fetched_data}     
            """},
            ],
        )

        response = completion.choices[0].message.content
        return response
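The prompt assembly in this step can be looked at separately from the API call. Below is a minimal sketch, assuming the fetched rows are plain dicts. build_messages is a hypothetical helper, and serialising the rows with json.dumps (with default=str for values such as dates) is my own choice here, not something the repository does.

```python
import json

SYSTEM_PROMPT = ("You are a chatbot for an ecommerce website, "
                 "you help users to identify their desired products")


def build_messages(user_input, fetched_data):
    """Assemble the chat messages for the augmented-generation call."""
    # default=str keeps non-JSON values (for example dates) printable.
    context = json.dumps(fetched_data, indent=2, default=str)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"User query - {user_input}\n"
                    f"Use the below metadata to answer my query\n{context}"},
    ]


messages = build_messages("headphones under 25000",
                          [{"name": "Headphone A", "price": 1999}])
print(messages[1]["content"])
```

The resulting list has the same two-message shape as the messages argument used in get_final_response.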

COMPLETE CODE

from openai import OpenAI
from pydantic import BaseModel
from typing import List
from neo4j import GraphDatabase


class CypherQuery(BaseModel):
    cypher_queries: List[str]


class GraphQueryEngine:
    def __init__(self):
        self.client = OpenAI(api_key="")
        self.url = "bolt://localhost:7687"
        self.auth = ("neo4j", "neo4j@123")

    def populate_embedding_in_query(self, user_input, cypher_queries):
        """Used to add embeddings of the user input in the 3rd query"""
        model = "text-embedding-3-small"
        user_input = user_input.replace("\n", " ")
        embeddings = self.client.embeddings.create(input=[user_input], model=model).data[0].embedding
        cypher_queries[2] = cypher_queries[2] % embeddings
        return cypher_queries

    def execute_read_query(self, query):
        """Execute the cypher query"""
        results = []

        with GraphDatabase.driver(self.url, auth=self.auth) as driver:
            with driver.session() as session:
                try:
                    result = session.run(query)
                    # Collect the result from the read query
                    records = [record.data() for record in result]
                    if records:
                        results.append(records)
                except Exception as error:
                    print(f"Error in executing query: {error}")

        return results

    def get_response(self, user_input):
        """Used to get cypher queries from user input"""
        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are an expert in generating Cypher queries for a Neo4j database. Your task is to understand the input and generate only Cypher read queries. Do not return anything other than the Cypher queries, as the returned result will be executed directly in the database."},
                {"role": "user",
                 "content": f"""
                 Schema Information:
                 NODES: Product_type - Contains the distinct types of products such as headphones/mobiles/laptops/washing machines, Product_details - Contains products within a product_type for example apple, samsung within mobiles, DELL within laptops 
                 NODE PROPERTIES: In node Product_type there are name(name of the product type - String), embedding(embedding of the name), and in node Product_details there are name(name of the product - string), price(price of the product - integer), description(description of the product), product_release_date(when product was release on - date), available_stock(stock left - integer), review_rating(product review - float) 
                 DIRECTION OF RELATIONSHIPS: Node Product_type is connected to node Product_details using relationship CONTAINS

                 Based on the schema, generate three read-only Cypher queries related to Product_type (e.g., chairs, headphones, fridge) or Product_details (e.g., name, description) or combination of both. Ensure that product category uses Product_type and product name/ price 

                 Query 1: Use regular expressions (avoid 'contains') - Exclude the 'embedding' property from the result.
                 Query 2: Use `apoc.text.levenshteinSimilarity > 0.5` - Exclude the 'embedding' property from the result.
                 Query 3: Use `gds.similarity.cosine()` to reorder nodes based on similarity scores. The query must include a `%s` placeholder for embedding input but exclude the 'embedding' property in the result.

                 Generate targeted queries using relationships only when necessary. The embedding property should only be used in the logic and must not appear in the query results.

                 Strictly return only the Cypher queries with no embeddings. The returned result will be executed directly in the database.

                 {user_input}
                 """},
            ],
        )

        response = completion.choices[0].message.content

        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are an expert in parsing generating Cypher queries."},
                {"role": "user",
                 "content": f"""Use this input - {response} and parse and return only the cypher queries from the input, ensure that in the cypher query if it returns embeddings then remove the embeddings alone from the query"""},
            ],
            response_format=CypherQuery,
        )
        event = completion.choices[0].message.parsed
        cypher_queries = event.cypher_queries
        print("################################## CYPHER QUERIES ######################################")
        for query in cypher_queries:
            print(query)
        return cypher_queries

    def get_final_response(self, user_input, fetched_data):
        """Augmented generation using data fetched from DB"""
        completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system",
                 "content": "You are a chatbot for an ecommerce website, you help users to identify their desired products"},
                {"role": "user", "content": f"""User query - {user_input}
                Use the below metadata to answer my query
                {fetched_data}     
            """},
            ],
        )

        response = completion.choices[0].message.content
        return response

    def fetch_data(self, cypher_queries):
        """Return the fetched data from DB post formatting"""
        results = None
        for idx in range(len(cypher_queries)):
            try:
                results = self.execute_read_query(cypher_queries[idx])
                if results:
                    if idx == len(cypher_queries) - 1:
                        results = results[0][:10]
                    break
            except Exception:
                pass
        return results

LET'S TRY IT

user_input = input("Enter your question : ")
query_engine = GraphQueryEngine()
cypher_queries = query_engine.get_response(user_input)
cypher_queries = query_engine.populate_embedding_in_query(user_input, cypher_queries)
fetched_data = query_engine.fetch_data(cypher_queries)
response = query_engine.get_final_response(user_input, fetched_data)

OUTPUT

In the next blog we'll build a simple FastAPI app to expose this setup as an API.

Hope this helps... !!!

LinkedIn - https://www.linkedin.com/in/praveenr2998/
Github - https://github.com/praveenr2998/Creating-Lightweight-RAG-Systems-With-Graphs/blob/main/fastapi_app/query_engine.py

Load Data Into Neo4j

praveenr — Sun, 18 Aug 2024 03:11:52 +0000

In the previous blog we saw how to install and set up neo4j locally with 2 plugins, APOC and the Graph Data Science Library (GDS). In this blog I am going to take a toy dataset (products on an e-commerce website) and store it in Neo4j.

Allocating Sufficient Memory For Neo4j

Before loading the data, if your use case involves a large amount of data, make sure sufficient memory is allocated to neo4j. To do that:

Click on the three dots to the right of open

Click on Open folder -> Configuration

Click on neo4j.conf

Search for heap in neo4j.conf, uncomment lines 77 and 78 and change 256m to 2048m. This allocates 2048 MB of heap memory to neo4j.

Creating Nodes

Graphs have two primary components, nodes and relationships. Let's create the nodes first and establish the relationships later.
The data I am using is present here - data
Use the requirements.txt present here to create a Python virtual environment - requirements.txt
Let's define the functions that push the data.
Importing the necessary libraries

import pandas as pd
from neo4j import GraphDatabase
from openai import OpenAI

We are going to use openai to generate embeddings

client = OpenAI(api_key="")
product_data_df = pd.read_csv('../data/product_data.csv')

To generate embeddings

def get_embedding(text):
    """
    Used to generate embeddings using OpenAI embeddings model
    :param text: str - text that needs to be converted to embeddings
    :return: embedding
    """
    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

Our dataset gives us two node labels: Product_type (the type/category of a product) and Product_details (an individual product). Let's create the Product_type nodes first. Neo4j lets a node carry properties, which you can think of as metadata for that node. Here name and embedding are the properties, so we store the name of each category and its corresponding embedding in the DB.

def create_product_type(product_data_df):
    """
    Used to generate queries for creating product type nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for category
    """
    cat_query = """CREATE (a:Product_type {name: '%s', embedding: %s})"""
    distinct_product_types = product_data_df['Category'].unique()
    query_list = []
    for type_ in distinct_product_types:
        embedding = get_embedding(type_)
        query_list.append(cat_query % (type_, embedding))
    return query_list
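Because the values are interpolated into the query text with %, an apostrophe in the data can break the quoting. An alternative is to keep the query fixed and send the values as query parameters, which the neo4j Python driver accepts in session.run. The sketch below only builds the (query, parameters) pairs so it runs without a database. product_type_statements and fake_embed are hypothetical names, and this is not how the repository does it.

```python
PRODUCT_TYPE_QUERY = "CREATE (a:Product_type {name: $name, embedding: $embedding})"


def product_type_statements(product_types, embed):
    """Pair the parameterised query with one parameter dict per distinct type."""
    statements = []
    for type_ in dict.fromkeys(product_types):  # distinct values, order preserved
        parameters = {"name": type_, "embedding": embed(type_)}
        statements.append((PRODUCT_TYPE_QUERY, parameters))
    return statements


def fake_embed(text):
    """Stand-in for get_embedding so the sketch runs without an API key."""
    return [float(len(text)), 0.0]


statements = product_type_statements(
    ["Headphones", "Laptops", "Headphones", "Kid's Toys"], fake_embed)
# With a live session, each pair would be sent as: session.run(query, parameters)
for query, parameters in statements:
    print(parameters["name"])
```

Since the values never become part of the query text, a name like Kid's Toys goes through unchanged.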

Similarly, we can create the Product_details nodes. Here the properties are name, description, price, warranty_period, available_stock, review_rating, product_release_date and embedding.

def create_product_details(product_data_df):
    """
    Used to generate queries for creating product_details nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for product
    """
    product_query = """CREATE (a:Product_details {name: '%s', description: '%s', price: %d, warranty_period: %d, 
    available_stock: %d, review_rating: %f, product_release_date: date('%s'), embedding: %s})"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        embedding = get_embedding(row['Product Name'] + " - " + row['Description'])
        query_list.append(product_query % (row['Product Name'], row['Description'], int(row['Price (INR)']),
                                           int(row['Warranty Period (Years)']), int(row['Stock']),
                                           float(row['Review Rating']), str(row['Product Release Date']), embedding))
    return query_list

Now let's create another function to execute the queries generated by the above 2 functions. Update your username and password appropriately.

def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")

Complete code

import pandas as pd
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI(api_key="")
product_data_df = pd.read_csv('../data/product_data.csv')


def preprocessing(df, columns_to_replace):
    """
    Used to preprocess certain column in dataframe
    :param df: pandas dataframe - data
    :param columns_to_replace: list - column name list
    :return: df: pandas dataframe - processed data
    """
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'s", "s"))
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'", ""))
    return df


def get_embedding(text):
    """
    Used to generate embeddings using OpenAI embeddings model
    :param text: str - text that needs to be converted to embeddings
    :return: embedding
    """
    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def create_product_type(product_data_df):
    """
    Used to generate queries for creating product type nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for category
    """
    cat_query = """CREATE (a:Product_type {name: '%s', embedding: %s})"""
    distinct_product_types = product_data_df['Category'].unique()
    query_list = []
    for type_ in distinct_product_types:
        embedding = get_embedding(type_)
        query_list.append(cat_query % (type_, embedding))
    return query_list


def create_product_details(product_data_df):
    """
    Used to generate queries for creating product_details nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for product
    """
    product_query = """CREATE (a:Product_details {name: '%s', description: '%s', price: %d, warranty_period: %d, 
    available_stock: %d, review_rating: %f, product_release_date: date('%s'), embedding: %s})"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        embedding = get_embedding(row['Product Name'] + " - " + row['Description'])
        query_list.append(product_query % (row['Product Name'], row['Description'], int(row['Price (INR)']),
                                           int(row['Warranty Period (Years)']), int(row['Stock']),
                                           float(row['Review Rating']), str(row['Product Release Date']), embedding))
    return query_list


def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")


# PREPROCESSING
product_data_df = preprocessing(product_data_df, ['Product Name', 'Description'])

# CREATE PRODUCT TYPE
query_list = create_product_type(product_data_df)
execute_bulk_query(query_list)

# CREATE PRODUCT DETAIL
query_list = create_product_details(product_data_df)
execute_bulk_query(query_list)

Creating Relationships

We are going to create relationships between Product_type and Product_details and the name of the relationship would be CONTAINS

from neo4j import GraphDatabase
import pandas as pd

product_data_df = pd.read_csv('../data/product_data.csv')


def preprocessing(df, columns_to_replace):
    """
    Used to preprocess certain column in dataframe
    :param df: pandas dataframe - data
    :param columns_to_replace: list - column name list
    :return: df: pandas dataframe - processed data
    """
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'s", "s"))
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'", ""))
    return df


def create_type_detail_relationship_query(product_data_df):
    """
    Used to create relationship between Product_type and Product_details
    :param product_data_df: dataframe - data
    :return: query_list: list - cypher queries
    """
    query = """MATCH (c:Product_type {name: '%s'}), (p:Product_details {name: '%s'}) CREATE (c)-[:CONTAINS]->(p)"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        query_list.append(query % (row['Category'], row['Product Name']))
    return query_list


def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")


# PREPROCESSING
product_data_df = preprocessing(product_data_df, ['Product Name', 'Description'])

# PRODUCT TYPE - PRODUCT DETAILS RELATIONSHIP
query_list = create_type_detail_relationship_query(product_data_df)
execute_bulk_query(query_list)

By using a MATCH query to find the already created nodes, we establish relationships between them.
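A variation on this step, sketched under two assumptions of my own: MERGE is used instead of CREATE so that re-running the script does not duplicate relationships, and all rows are sent in one UNWIND query instead of one query per row. Neither is what the repository does, and relationship_rows is a hypothetical helper.

```python
RELATIONSHIP_QUERY = """
UNWIND $rows AS row
MATCH (c:Product_type {name: row.category}), (p:Product_details {name: row.product})
MERGE (c)-[:CONTAINS]->(p)
"""


def relationship_rows(records):
    """Turn (Category, Product Name) records into the $rows parameter."""
    seen = set()
    rows = []
    for record in records:
        pair = (record["Category"], record["Product Name"])
        if pair not in seen:  # skip repeated pairs
            seen.add(pair)
            rows.append({"category": pair[0], "product": pair[1]})
    return rows


records = [
    {"Category": "Headphones", "Product Name": "Headphone A"},
    {"Category": "Laptops", "Product Name": "Laptop B"},
    {"Category": "Headphones", "Product Name": "Headphone A"},
]
rows = relationship_rows(records)
# With a live session this would be sent as: session.run(RELATIONSHIP_QUERY, rows=rows)
print(rows)
```

With pandas, the records list can come from product_data_df.to_dict("records").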

Visualizing The Created Nodes

Hover over the open icon and click on neo4j browser to visualize the nodes that we have created.

And our data is loaded into neo4j along with their embeddings.

In the upcoming blogs we'll see how to build a graph query engine using Python and use the fetched data for augmented generation.

Hope this helps... See you !!!

LinkedIn - https://www.linkedin.com/in/praveenr2998/
Github - https://github.com/praveenr2998/Creating-Lightweight-RAG-Systems-With-Graphs/tree/main/push_data_to_db

Installing Neo4j In Ubuntu

praveenr — Fri, 16 Aug 2024 10:20:57 +0000

The first blog in this series covers installing neo4j (desktop version) and a few plugins that will help us build an application. I am using Ubuntu 22.04.4 LTS.

STEP 1

Go to https://neo4j.com/deployment-center/?desktop-gdb and download the Neo4j Desktop (AppImage)

If you are a mac or windows user, you can try other options available

You will have to fill these details to download the file

Copy the neo4j desktop activation key to a text file, this will be used later

STEP 2

Open your terminal, go to the directory where you downloaded the file and run the following command. This makes the downloaded file executable.

chmod +x DOWNLOADED_FILE_NAME

To run the application, stay in the same directory and run the following command

./DOWNLOADED_FILE_NAME

Once you run the command you'll see the application window. Paste the key that you copied into the software key text box.

STEP 3

Click on new project

Click on add in the top right corner and click on Local DBMS if you want to host the DB locally or Remote Connection if you are using a managed service somewhere. I am going to have it locally.

STEP 4

Add a DB name and password (I have set them to praveen_blog and neo4j@123) and click on the create option at the bottom. The default username is neo4j.

Tadaaa your instance is up and running

STEP 5

A few plugins, such as APOC and the Graph Data Science Library (GDS), are really useful and we'll be using them in this blog series.
To install them, click on praveen_blog and a small panel opens on the right. In the plugins tab you'll see APOC and Graph Data Science Library; click on install for both of them.

Once they are installed you are good to go. Create your graphs using Cypher queries and enjoy.

In the upcoming blogs we'll see how to load data into neo4j using Python and build a lightweight application for augmented generation, also called RAG.

Hope this helps... See you !!!

LinkedIn - https://www.linkedin.com/in/praveenr2998/

Tools and Tool_Choice - Azure GPT4

praveenr — Thu, 13 Jun 2024 09:43:25 +0000

When integrating GPT into our products, especially when a chain of logical decisions is made based on GPT's result, we have to worry about the unstructured nature of GPT's response.

There are several ways to solve this issue:

Prompt Engineering - Emphasizing in the prompt that a structured output, for example JSON, should be returned. This technique might work, but sometimes the result can still be unstructured.
Langchain, Llamaindex and DSPy - These frameworks offer several functionalities to generate structured output. The techniques are usually robust but not native.

In this blog we are going to see a native way to get structured output from GPT4. The control is granular enough that the data type of each returned parameter can be specified.

Let's look at an example ...

We are going to ask GPT4 a few problems to solve, and the expected result should have

formula - formula used to solve the problem
substitution - substitute the values from the problem in the formula
result - final answer post substitution
explanation - a simple explanation on what the problem is and how to solve it
difficulty - on a scale of 1-10 how difficult is this problem for an engineering student

If we are not going to use any frameworks, few-shot examples with a properly defined output structure in the prompt might help us get output in the desired format. Natively, however, we have tools and tool_choice to make the output structured. These features were released so that the output of GPT could be obtained as parameters, and these parameters could then be used to call a function.

Let's look at some code

Installation

pip install openai

Defining the output structure that we want

tools = [
        {
            "type": "function",
            "function": {
                "name": "problem_solver",
                "description": "Used to solve the problem",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "formula": {
                            "type": "string",
                            "description": "formula used to solve the problem",
                        },
                        "substitution": {
                            "type": "string",
                            "description": "substitute the values present in the problem into the formula used to solve the problem",
                        },
                        "result": {
                            "type": "string",
                            "description": "the final answer for the problem in float",
                        },
                        "explanation": {
                            "type": "string",
                            "description": "explanation on how the problem is solved in simple words",
                        },
                        "difficulty": {
                            "type": "integer",
                            "description": "on a scale of 1-10, how difficult is the problem to solve for an engineering student",
                        },
                    },
                    "required": ["formula", "substitution", "result", "explanation", "difficulty"],
                },
            },
        }
    ]

Name of the function is problem_solver
The 5 parameters that we want in the output are formula, substitution, result, explanation, difficulty which are defined inside parameters ==> properties.
For each of the above parameters the type, description should be defined which specifies the data type of the returned value and simple description of what is to be returned.
In the required key which is a list we have to specify the mandatory parameters that have to be returned otherwise GPT might consider it to be optional and might not return it.

Defining a function which is used to call Azure GPT model

from openai import AzureOpenAI


def solve_problem(messages_list):
    api_key = 'your_api_key'
    api_base = 'your_api_base_url'
    api_version = '2024-02-01'
    model = 'your_deployment_name'
    client = AzureOpenAI(
        azure_endpoint = api_base,
        api_key=api_key,
        api_version=api_version
        )
    response = client.chat.completions.create(
            model=model,
            temperature=0.8,
            max_tokens=500,
            messages=messages_list,
            tools=tools,
            tool_choice={"type": "function", "function": {"name": "problem_solver"}} 
        )
    res=response.choices[0].message
    return res

In the client.chat.completions.create we have two parameters tools and tool_choice, for the tools parameter we can pass the tools list we created before.
The tool_choice parameter accepts three values

'auto' - This is the default value when tools are passed. With 'auto', GPT decides whether to call one of the functions defined in tools or to reply with a normal message, so there is a bit of uncertainty: sometimes GPT might not choose our function and parameters.
'none' - This is the default value when no tools are defined. It tells the model not to call any function.
Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that function. In our case this would be {"type": "function", "function": {"name": "problem_solver"}}. This forces GPT to return the parameters defined under problem_solver.

Let's try asking a question

import json

result = solve_problem([{
    "role":"user",
    "content": "An airplane accelerates down a runway at 3.20 m/s2 for 32.8 s until it finally lifts off the ground. Determine the distance traveled before takeoff"
}])
tool_calls = result.tool_calls
# function.arguments is a JSON string, so parse it with json.loads instead of eval
parameters = json.loads(tool_calls[0].function.arguments)
print(parameters)

OUTPUT

{'formula': 'd = v_i * t + (1/2) * a * t^2',
 'substitution': 'd = 0 * 32.8 + (1/2) * 3.20 * (32.8)^2',
 'result': '1721.472 m',
 'explanation': 'Since the airplane starts from rest, its initial velocity (v_i) is 0. The acceleration (a) is 3.20 m/s2 and the time (t) is 32.8 seconds. Using the kinematic equation for distance (d), where the first term is zero because the initial velocity is zero, the second term is (1/2) * acceleration * time squared gives the distance. After calculating, the distance comes out to be 1721.472 meters.',
 'difficulty': 3}

We can observe that the output is now a proper Python dictionary which is easily parsable and could be used to call any function.
If you want more customization in the parameter data types, refer to https://json-schema.org/understanding-json-schema/reference/type.
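Before passing the parsed parameters to a real function, it can help to validate them. The sketch below is only an illustration: raw_arguments is a hand-written stand-in for tool_calls[0].function.arguments (with a shortened explanation), and parse_tool_arguments is a hypothetical helper, not part of the openai library. It also recomputes the distance from the substitution as a sanity check.

```python
import json

# Hand-written stand-in for tool_calls[0].function.arguments (a JSON string)
raw_arguments = json.dumps({
    "formula": "d = v_i * t + (1/2) * a * t^2",
    "substitution": "d = 0 * 32.8 + (1/2) * 3.20 * (32.8)^2",
    "result": "1721.472 m",
    "explanation": "The airplane starts from rest, so only the acceleration term contributes.",
    "difficulty": 3,
})

REQUIRED_KEYS = ["formula", "substitution", "result", "explanation", "difficulty"]


def parse_tool_arguments(raw):
    """Parse the JSON string and check that every required key is present."""
    parameters = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in parameters]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(parameters["difficulty"], int):
        raise TypeError("difficulty should be an integer")
    return parameters


parameters = parse_tool_arguments(raw_arguments)

# Recompute d = v_i * t + (1/2) * a * t^2 for the airplane problem
v_i, a, t = 0.0, 3.20, 32.8
distance = v_i * t + 0.5 * a * t ** 2
print(parameters["result"], "vs recomputed", round(distance, 3), "m")
```

The recomputed value comes to about 1721.344 m, slightly different from the result string in the output above, so numeric fields returned by the model are worth double-checking even when the structure is correct.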

We can define multiple functions and their parameters in tools and let GPT decide which function to use, based on the prompt and the descriptions provided, by passing 'auto' as the tool_choice value. Alternatively, we can enforce the use of a particular function and its parameters by specifying it in tool_choice.
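When several functions are defined, the name GPT picks has to be mapped back to real Python code. Here is a minimal sketch of one way to do that, assuming a simple lookup table. TOOL_REGISTRY, dispatch and unit_converter are hypothetical names made up for this illustration; only problem_solver comes from the tools list above.

```python
import json


def problem_solver(formula, substitution, result, explanation, difficulty):
    """Stand-in for the real function that the structured output would feed."""
    return f"[difficulty {difficulty}] {formula} -> {result}"


def unit_converter(value, from_unit, to_unit):
    """Hypothetical second tool, only here to show the lookup."""
    return f"convert {value} {from_unit} to {to_unit}"


# Map the tool names declared in `tools` to Python callables
TOOL_REGISTRY = {
    "problem_solver": problem_solver,
    "unit_converter": unit_converter,
}


def dispatch(tool_name, raw_arguments):
    """Look up the function GPT picked and call it with the parsed arguments."""
    if tool_name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOL_REGISTRY[tool_name](**json.loads(raw_arguments))


# With a real response these would come from tool_calls[0].function.name
# and tool_calls[0].function.arguments
print(dispatch("unit_converter", '{"value": 1721, "from_unit": "m", "to_unit": "km"}'))
```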

Hope this helps :))
LINKED IN : https://www.linkedin.com/in/praveenr2998/

My Embeddings Stay Close To Each Other, What About Yours?

praveenr — Thu, 29 Feb 2024 03:44:52 +0000

This blog will help you generate embeddings for your datasets such that semantically related sentences stay close to each other. In other words, it will help you fine-tune commonly available SBERT (Sentence BERT) models from Hugging Face on your own dataset.

LITTLE BACKGROUND ABOUT SBERT

Sentence BERT was first introduced in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In this paper, the authors have proposed a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

This blog is not about how SBERT works but rather how to finetune a pre-trained SBERT, so let's go ahead.

WHY FINETUNE

Sometimes when you try to retrieve information using a similarity measure like cosine similarity, the retriever might fetch unintended information because that information is closer to your query in vector space.

In the above image, your question vector and the irrelevant vector are close to each other. Why does this happen?
A few reasons might be

Wrong choice of embedding model - The model might be trained on a dataset from a different domain.
The terms or words that you use might be unseen during model training

SO WHAT'S THE SOLUTION

If you find that your use case has some unseen words or you have better datasets which you believe could make the model generate quality embeddings you could go for fine-tuning.

FINE-TUNING SENTENCE BERT FROM HUGGING FACE

We are going to use all-MiniLM-L6-v2 model from hugging face.

Required Libraries

pip3 install torch
pip3 install pandas
pip3 install -U sentence-transformers

Little Bit Of Clarity

By fine-tuning, we mean asking the model to treat the pairs of sentences that we send as training data points as close to each other. There are several ways to organize your training data, and a table explaining them is given below

image credits

In this blog, we are going to use a pair of positive sentences without label for each training data point and the sentence pair denotes closely related sentences. The corresponding loss function would be MultipleNegativesRankingLoss
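To get a feel for what this loss rewards, here is a rough pure-Python sketch of the idea as I understand it: for each source sentence, its own target is treated as the correct answer, and every other target in the batch acts as a negative. The function names, the toy 2-dimensional vectors and the scale of 20 are assumptions for illustration; this is not the sentence-transformers implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def in_batch_ranking_loss(sources, targets, scale=20.0):
    """Cross-entropy over scaled cosine scores; targets[i] is the correct class for sources[i]."""
    total = 0.0
    for i, source in enumerate(sources):
        scores = [scale * cosine(source, target) for target in targets]
        log_sum_exp = math.log(sum(math.exp(s) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(sources)


# Toy embeddings: in `aligned` each target sits next to its source, in `swapped` it does not
sources = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_ranking_loss(sources, aligned)
loss_swapped = in_batch_ranking_loss(sources, swapped)
print("aligned pairs:", loss_aligned, "swapped pairs:", loss_swapped)
```

The loss is small when each source is closest to its own target and large otherwise, which is why only positive pairs are needed as training data.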

TRAINING

import pandas as pd
import os
from sentence_transformers import SentenceTransformer
from sentence_transformers import InputExample
from sentence_transformers import losses
from torch.utils.data import DataLoader

class trainSBERT:
    def prepare_training_data(self, source_sentence_list, target_sentence_list):
        """
        Each training data point must have two similar sentences inside a list
        Eg - [sentence 1, sentence 2]

        INPUT
        source_sentence_list - List : All source sentences
        target_sentence_list - List : All target sentences

        RETURNS
        train_dataloader - Pytorch dataloader object 
        """
        train_data_list = []
        for source, target in zip(source_sentence_list, target_sentence_list):
            print(source, target)
            train_data_list.append(InputExample(texts=[source, target]))

        train_dataloader = DataLoader(train_data_list, shuffle=True, batch_size=64)
        return train_dataloader

    def train_sbert(self, model_name_list, n_epochs, source_sentence_list, target_sentence_list, path_to_save_model):
        """
        Used to train various sentence bert model

        INPUT
        model_name_list - List : List of model names from hugging face to be trained
        n_epochs - Int : Epochs to be trained for
        source_sentence_list - List : All source sentences
        target_sentence_list - List : All target sentences
        path_to_save_model - String : Path to save trained model

        RETURNS
        None
        """
        train_dataloader = self.prepare_training_data(source_sentence_list, target_sentence_list)
        for model_name in model_name_list:
            sbert_model = SentenceTransformer(model_name)

            train_loss = losses.MultipleNegativesRankingLoss(model=sbert_model)
            warmup_steps = int(len(train_dataloader) * n_epochs * 0.1)  # warm up for 10% of the total training steps

            sbert_model.fit(train_objectives=[(train_dataloader, train_loss)],
                    epochs=n_epochs,
                    warmup_steps=warmup_steps) 

            # exist_ok=True avoids an error when the folder already exists from an earlier run
            os.makedirs(f"{path_to_save_model}/{model_name.replace('/', '_')}", exist_ok=True)
            sbert_model.save(f"{path_to_save_model}/{model_name.replace('/', '_')}")

We are creating a class with 2 functions
prepare_training_data - Used to convert training data into pytorch data loader format.
train_sbert - Used to train sbert models and save them in your local directory.

This is what your training data CSV file should look like: a source_sentence column and a target_sentence column, with one closely related pair per row

df = pd.read_csv('training_data.csv')
obj = trainSBERT()
obj.train_sbert(['sentence-transformers/all-MiniLM-L6-v2'], 500, df['source_sentence'].tolist(), df['target_sentence'].tolist(), "/Users/praveen/Desktop/praveen/github/training/model/sbert")

After 500 epochs the trained model will be saved to /Users/praveen/Desktop/praveen/github/training/model/sbert/sentence-transformers_all-MiniLM-L6-v2

All the below files will be saved to your local directory inside sentence-transformers_all-MiniLM-L6-v2 folder

HOW TO USE THE TRAINED MODEL TO GENERATE EMBEDDINGS

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('/Users/praveen/Desktop/praveen/github/training/model/sbert/sentence-transformers_all-MiniLM-L6-v2')
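# Placeholder texts; replace them with your own question and answer
question = "your question text"
answer = "your answer text"
question_embeddings = model.encode([question], convert_to_tensor=True)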
answer_embeddings = model.encode([answer], convert_to_tensor=True)
print("Question Embeddings : ", question_embeddings)
print("Answer Embeddings : ", answer_embeddings)

Now you can compare these two using cosine-similarity to calculate how close they are.

Hope this helps :))
LINKED IN : https://www.linkedin.com/in/praveenr2998/

Bayes' Theorem In Layman Terms

praveenr — Mon, 28 Aug 2023 07:55:14 +0000

DEFINITION

Bayes' Theorem states that the conditional probability of an event, given the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event, divided by the probability of the second event.

Blah blah blah - I find definitions strange; I only understand a definition after I understand the concept.

Let's break it down and understand it one step at a time...

Marginal Probability - P(A)

The probability of an event irrespective of the outcomes of other random variables. In simple words, it's like looking at the probability of something occurring without taking into account any other factors.

Joint Probability - P(A,B)

The probability of 2 or more simultaneous events happening together. Eg Probability of watching TV and Eating.

Conditional Probability - P(A|B)

Probability of one (or more) events given the occurrence of another event. Eg the probability of your father having dessert, given that he is having a diabetes test tomorrow, is very low. Notice that if there were no diabetes test tomorrow, the probability would have been almost 100%.

Expressing Joint Probability In Terms of Conditional Probability

P (A, B) = P (A ∣ B) * P (B)

Note : P(A,B) = P(B,A) (Symmetrical)

Expressing Conditional Probability In Terms of Joint Probability

P (A ∣ B) = P (A, B) / P (B)

Note : P(A|B) != P(B|A) (Not Symmetrical)
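A small numeric check can make these relations concrete. The counts below are invented purely for illustration (100 imaginary evenings, with A = watching TV and B = eating); the point is only that the two formulas agree and that P(A|B) and P(B|A) come out different.

```python
# Invented counts over 100 evenings: A = watching TV, B = eating
counts = {
    ("tv", "eat"): 30,
    ("tv", "no_eat"): 10,
    ("no_tv", "eat"): 40,
    ("no_tv", "no_eat"): 20,
}
total = sum(counts.values())

p_joint = counts[("tv", "eat")] / total                                # P(A, B)
p_tv = (counts[("tv", "eat")] + counts[("tv", "no_eat")]) / total      # P(A)
p_eat = (counts[("tv", "eat")] + counts[("no_tv", "eat")]) / total     # P(B)

p_tv_given_eat = p_joint / p_eat   # P(A | B) = P(A, B) / P(B)
p_eat_given_tv = p_joint / p_tv    # P(B | A) = P(A, B) / P(A)

print("P(A, B) =", p_joint)
print("P(A | B) =", p_tv_given_eat)
print("P(B | A) =", p_eat_given_tv)
```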

Finally our Bayes' Theorem using the above equations

P (A ∣ B) = (P (B ∣ A) * P (A)) / P (B)

The numerator P(B|A) * P(A) is the joint probability P(A, B), written using the joint probability equation given above
P(A|B) ===> Posterior Probability
P(B|A) ===> Likelihood
P(A) ===> Prior Probability
P(B) ===> Evidence

To understand the above equation with an example

Question: What is the probability that there is fire given that there is smoke?

P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke)

P(Fire|Smoke) ===> Posterior Probability
P(Smoke|Fire) ===> Likelihood
P(Fire) ===> Prior Probability
P(Smoke) ===> Evidence

The probability of fire given that there is smoke is equal to the likelihood of smoke given fire multiplied by the probability of fire, divided by the probability of smoke. This is Bayes' theorem; to understand its use case better, read on.

Where is Bayes' Theorem used and why Bayes' Theorem

One very common space where you can find the theorem applied is in the evaluation of medical diagnostic tests.

Let us consider a diagnostic test that determines whether a person has a lesion that is malignant or not.

From observation, it is given that

P(Test=Positive | Malignant=True) = 0.85

The above statement means that the probability of the diagnostic test result being Positive, given that the person has a malignant tumour, is 85%.

What will be a normal person's understanding of the above probability???
If a person takes this diagnostic test and the result turns out to be Positive, then, since the test correctly detects 85 per cent of people with a malignant tumour, there is a good chance that the person assumes he/she has a malignant tumour, and that's scary, right?

Now let's look at what Bayes has got to say about it.
P(Malignant=True | Test=Positive) is what we are going to analyse using Bayes' theorem

There are a few assumptions that we have to make

P(Malignant=True) = 0.0002

P(Test=Positive) = 0.05016

These assumptions mean that on average only 1 in 5000 people will have a malignant tumour, and that the probability of the test being positive, regardless of whether the person has a malignant tumour or not, is 0.05016.

P(Malignant=True | Test=Positive) = P(Test=Positive | Malignant=True) * P(Malignant=True) / P(Test=Positive)

Plugging in the values that we have

P(Malignant=True | Test=Positive) = 0.85 * 0.0002 / 0.05016

P(Malignant=True | Test=Positive) = 0.003389

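Wait what... this looks like a terrible diagnostic test, because the above probability shows that if the test for a malignant tumour turns out to be Positive, the probability of the person actually having a malignant tumour is only about 0.34 per cent. The main reason is the very low prior: malignant tumours are so rare that false positives far outnumber true positives.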
Note: This result was obtained on a few assumptions and if those assumptions are verified and updated the result could change.
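The calculation can be reproduced in a few lines. One caveat: the post does not say where 0.05016 comes from, so the sketch below assumes a 5% false-positive rate (p_pos_given_healthy), which happens to reproduce that value through the law of total probability. Treat that number as my assumption, not as part of the original data.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence


p_pos_given_malignant = 0.85   # likelihood, from the post
p_malignant = 0.0002           # prior, from the post
p_pos_given_healthy = 0.05     # ASSUMPTION: false-positive rate chosen to reproduce 0.05016

# Evidence via the law of total probability:
# P(Pos) = P(Pos | M) * P(M) + P(Pos | not M) * P(not M)
p_positive = (p_pos_given_malignant * p_malignant
              + p_pos_given_healthy * (1 - p_malignant))

p_malignant_given_pos = posterior(p_pos_given_malignant, p_malignant, p_positive)

print("P(Test=Positive) =", round(p_positive, 5))
print("P(Malignant=True | Test=Positive) =", round(p_malignant_given_pos, 6))
```

Changing the prior or the false-positive rate in this sketch is a quick way to see how strongly the posterior depends on those assumptions.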

CONCLUSION

Bayes' theorem is a significant contribution to the field of statistics and is widely used in machine learning. Bayes' Theorem provides a systematic way to update prior probabilities with new information or evidence. In other words, it helps us adjust our beliefs about the likelihood of an event occurring based on the data we observe.

LinkedIn : https://www.linkedin.com/in/praveenr2998

Semantic Search Using Vectors/Embeddings For Noobs

praveenr — Wed, 17 May 2023 15:12:29 +0000

If you are someone like me who started hearing about semantic search, vectors and embeddings only after LLMs (Large Language Models) were launched, and you find these terms confusing, then I hope this blog brings you some clarity.

What is Semantic Search

Semantic search in Natural Language Processing (NLP) refers to the process of understanding the meaning or intent behind a user's search query and retrieving relevant information based on that understanding. Unlike traditional keyword-based search, which matches queries to documents based on exact word matches, semantic search aims to comprehend the context and semantics of the query to generate more accurate and contextually relevant results.

The next question is how to make computers understand semantic information. Humans have very high cognitive capabilities, so they can easily understand semantics in multiple languages, but making a computer understand semantics is challenging.

In this blog, we are going to see how semantic information is captured using vectors/embeddings. In my previous blog, I showed how CountVectorizer and TFIDF work; now we are going to see a more advanced yet simple and easy way to do semantic search

What is a vector

Mathematically, a vector is a quantity which has both magnitude and direction.

Here vectors A, B, C, D have magnitudes 4, 2, and A, B, D have the same direction but C has a different direction. These are one-dimensional vectors.

In mathematics unlike in physics, there could be n dimensions for a vector and these are called multi-dimensional vectors(each arrow in the above figure is a dimension). The all-MiniLM-L6-v2 model that we are going to use in this blog generates a vector with dimension 384. The information stored in these dimensions is used to find semantic similarity.

Sentence Transformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. It provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs and images.

Now we are going to see how to generate embeddings and do a semantic search using a pre-trained model from sentence transformers.

Pretrained Sentence Transformer Model - all-MiniLM-L6-v2

We have a Python library to access the model

pip install -U sentence-transformers

We are going to use the all-MiniLM-L6-v2 model which is a lightweight yet powerful model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

Now we can take a few question-answer sentences and generate embeddings from them and do a semantic search on them.

# Q&A sentences
question_answers = [
           "Q : What is this software used for? A : This software is used to handle your finances and provide useful suggestions",
           "Q : How much does it cost per year? A : It costs 5000 rupees per year",
           "Q : Is there a premium version available? A : Yes it is available for a cost of 7000 rupees per year",
           "Q : Why should I choose this rather than product Y? A : Our product outperforms in W and Z"  
          ]

#Sentences are encoded by calling model.encode()
question_answer_embeddings = model.encode(question_answers, convert_to_tensor=True)

The encode function generates the embeddings, and convert_to_tensor=True returns them as a PyTorch tensor.

That was easy !!!!

Now we will ask a question and find semantically relevant content from the embeddings generated.

question = ['Can you explain the use of this software']
question_embeddings = model.encode(question, convert_to_tensor=True)

We should also generate embedding for our question...

from sentence_transformers.util import semantic_search
hits = semantic_search(question_embeddings, question_answer_embeddings, top_k=1)

We are using a utility function called semantic_search, which by default uses cosine similarity to compare the query embeddings with the corpus embeddings. For each query it returns a list of hits, each with a corpus_id and a similarity score. You can also use another metric for comparing the vectors, such as dot product.
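To show roughly what such a search does, here is a tiny stand-in written in plain Python. It is not the library's implementation: top_k_search and the toy 3-dimensional vectors are made up for this sketch, and the only thing borrowed from the real function is the shape of a hit (a dict with corpus_id and score).

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_k_search(query_vec, corpus_vecs, top_k=1):
    """Score every corpus vector against the query and keep the best top_k hits."""
    scored = [
        {"corpus_id": idx, "score": cos_sim(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
query = [1.0, 0.2, 0.0]

hits = top_k_search(query, corpus, top_k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 4))
```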

print([question_answers[hits[0][i]['corpus_id']] for i in range(len(hits[0]))])

question = ['How much do you charge?']
question_embeddings = model.encode(question, convert_to_tensor=True)

from sentence_transformers.util import semantic_search
hits = semantic_search(question_embeddings, question_answer_embeddings, top_k=1)

print([question_answers[hits[0][i]['corpus_id']] for i in range(len(hits[0]))])

You can observe from the above examples that the question asked does not exactly match any entry in question_answers, but we are still able to find the one that most closely matches our input.

There are many other models that generate even more powerful embeddings, and the quality of the search results depends directly on the quality of the embeddings.
Happy Learning :))
www.linkedin.com/in/praveenr2998

CountVectorizer vs TFIDF - Logistic Regression

praveenr — Tue, 28 Mar 2023 10:35:37 +0000

Recently I have become curious about how Natural Language Processing(NLP) works. If you are someone like me then this blog could be really helpful.

When beginning with M.L., we would have observed how tabular data is used to train a model: most of the columns are numeric, and the remaining text columns usually hold a single word, which is converted to numbers using techniques like one-hot encoding.

There are cases where columns have sentences or even paragraphs, so new techniques are needed to convert raw text data into a computer-usable form. In this blog we are going to see 2 such techniques.

What is Vectorization and why Vectorization

Many machine learning algorithms and almost all deep learning algorithms are not capable of processing text in its raw form; they need numerical inputs. The process of converting text data to numerical data is called vectorization. In the NLP world the resulting vectors are often referred to as embeddings.

CountVectorizer

When we use CountVectorizer we create a sparse matrix that stores the count of every word in our corpus. This is a simple but efficient way of converting text to numerical data.
Each row in the sparse matrix corresponds to one line (document) and holds the count of each word in that line.

In the above diagram, you could observe that in the x-axis all distinct words are present and in the y-axis the sentence index/line index is present(which is represented as doc), this is the sparse matrix representation and this is used to train the M.L model.
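The same idea can be sketched by hand with nothing but the standard library. This is only an illustration of the counting, using a simple whitespace split and a dense list of lists; it is not how scikit-learn builds its sparse matrix, and the variable names are made up.

```python
from collections import Counter

docs = [
    "hello how are you",
    "you know football is a wonderful sport what do you think",
]

# Sorted vocabulary: one column per distinct word
vocabulary = sorted({word for doc in docs for word in doc.split()})
word_to_column = {word: idx for idx, word in enumerate(vocabulary)}

# One row per document, holding the count of each vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most entries in each row are 0, which is why the real implementation stores the result as a sparse matrix.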

Fortunately scikit-learn has an inbuilt module that we could use to generate the sparse matrix and is super easy to use.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

In addition to importing CountVectorizer, we are also importing word_tokenize, a function from NLTK that splits each sentence into individual word tokens. Tokenization is a preprocessing step in almost all NLP algorithms.

Let's create our own sample input

corpus = [
    "hello, how are you?",
    "You know football is a wonderful sport, what do you think?",
    "Opensource is something that everyone should appreciate, what do you think?"
]

This is our sample input

ctv = CountVectorizer(tokenizer = word_tokenize, token_pattern=None)

We are creating an object, and the tokenizer argument is set to word_tokenize, so all words and all special characters are treated as separate tokens.

ctv.fit(corpus)
corpus_transformed = ctv.transform(corpus)

The corpus_transformed variable now holds the sparse matrix, now we can visualize the sparse matrix.

Our input contains 3 lines, so the first number in each tuple-like structure is the line index (0, 1 or 2), and the other number is the unique id of the word in the corpus.

print("Unique index assigned to each word : ",ctv.vocabulary_)

Unique index assigned to each word : {'hello': 8, ',': 0, 'how': 9, 'are': 4, 'you': 20, '?': 1, 'know': 11, 'football': 7, 'is': 10, 'a': 2, 'wonderful': 19, 'sport': 15, 'what': 18, 'do': 5, 'think': 17, 'opensource': 12, 'something': 14, 'that': 16, 'everyone': 6, 'should': 13, 'appreciate': 3}

Term Frequency Inverse Document Frequency (TFIDF)

The above picture explains the TFIDF formula. This method also creates a sparse matrix, but instead of the count, each token holds the value produced by the TFIDF formula, which is a float.
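For readers who prefer code to formulas, here is a minimal sketch of the common textbook version, tf * log(N / df). The tf_idf helper and the pre-tokenized toy documents are assumptions for this example. As far as I know, scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Toy corpus, already split into tokens
docs = [
    ["hello", "how", "are", "you"],
    ["you", "know", "football", "is", "a", "wonderful", "sport"],
    ["opensource", "is", "something", "everyone", "should", "appreciate"],
]


def tf_idf(term, doc, all_docs):
    """Textbook TF-IDF: term frequency in the doc times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)               # how often the term appears in this doc
    df = sum(1 for d in all_docs if term in d)       # in how many docs the term appears
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf


# "football" appears in one document, "you" in two, so "football" scores higher
print("football:", tf_idf("football", docs[1], docs))
print("you:", tf_idf("you", docs[1], docs))
```

A word that appears in fewer documents gets a larger weight, which is the main difference from plain counts.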

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

Importing the necessary modules

corpus = [
    "hello, how are you?",
    "You know football is a wonderful sport, what do you think?",
    "Opensource is something that everyone should appreciate, what do you think?"
]

Input data that we are going to use

# To also include special characters while creating sparse matrix
tfidf = TfidfVectorizer(tokenizer = word_tokenize, token_pattern=None)
tfidf.fit(corpus)

Creating an object and fitting it on our input data.

# TFIDF vectorizer
corpus_transformed = tfidf.transform(corpus)

corpus_transformed contains the sparse matrix generated which could be used to train the model.

print("Sparse Matrix Representation : ", corpus_transformed)

print("Unique index assigned to each word : ", tfidf.vocabulary_)

Unique index assigned to each word : {'hello': 8, ',': 0, 'how': 9, 'are': 4, 'you': 20, '?': 1, 'know': 11, 'football': 7, 'is': 10, 'a': 2, 'wonderful': 19, 'sport': 15, 'what': 18, 'do': 5, 'think': 17, 'opensource': 12, 'something': 14, 'that': 16, 'everyone': 6, 'should': 13, 'appreciate': 3}

Let's use the sparse matrix generated in Logistic Regression

Let's use a Kaggle dataset to perform logistic regression. The dataset that we are going to use is the IMDB movie review dataset, for sentiment classification (positive/negative)

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer

Importing the necessary modules

if __name__ == "__main__":
    df = pd.read_csv("/home/praveen/Desktop/Projects/Approching_Almost_Any_ML_Prob_Book/NLP/data/IMDB Dataset.csv")

    # Converting sentiment to 1 and 0
    df.sentiment = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

    df["kfold"] = -1

    df = df.sample(frac=1).reset_index(drop=True)

    y = df.sentiment.values

    kf = model_selection.StratifiedKFold(n_splits=5)

    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f

We are converting the sentiment column, which is our target column, to 0 and 1. We are going to take the k-fold cross-validation approach: the dataset is split into k folds (5 in this code), and in each round 4 folds are used for training and 1 fold is used for validation. This is a better way to evaluate our model's performance than a single train/test split.

StratifiedKFold makes sure that every fold has roughly the same distribution of classes as the full dataset.
We are adding a new column to the dataframe called kfold to represent which fold a particular data point or row belongs to.
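The idea behind stratified folds can be sketched without any library. The helper below, assign_stratified_folds, is a hypothetical name and deliberately simple: it deals the rows of each class out to the folds in round-robin order. It is not scikit-learn's algorithm, only an illustration of why every fold ends up with a balanced mix of classes.

```python
from collections import defaultdict


def assign_stratified_folds(labels, n_splits):
    """Deal the row indices of each class out to the folds in round-robin order."""
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    folds = [-1] * len(labels)
    for rows in rows_by_class.values():
        for position, row in enumerate(rows):
            folds[row] = position % n_splits
    return folds


# 10 positive (1) and 10 negative (0) toy reviews
labels = [1, 0] * 10
kfold = assign_stratified_folds(labels, n_splits=5)

for fold in range(5):
    rows = [i for i, f in enumerate(kfold) if f == fold]
    positives = sum(labels[i] for i in rows)
    print(f"Fold {fold}: {len(rows)} rows, {positives} positive")
```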

    accuracy_list = []
    for fold_ in range(5):
        train_df = df[df.kfold != fold_].reset_index(drop=True)
        test_df = df[df.kfold == fold_].reset_index(drop=True)

        count_vec = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)
        count_vec.fit(train_df.review)

        xtrain = count_vec.transform(train_df.review)
        xtest = count_vec.transform(test_df.review)

        model = linear_model.LogisticRegression()

        model.fit(xtrain, train_df.sentiment)

        preds = model.predict(xtest)

        accuracy = metrics.accuracy_score(test_df.sentiment, preds)
        accuracy_list.append(accuracy)

        print(f"Fold : {fold_}")
        print(f"Accuracy : {accuracy}")
        print("")

    # Print the accuracy of every fold (the original range(0, 4) skipped the last one)
    for i in range(len(accuracy_list)):
        print(f"Fold : {i+1}, Accuracy : {accuracy_list[i]}")

We are using CountVectorizer to transform the reviews into vectors and logistic regression to classify the sentiment. We are using accuracy to evaluate our model's performance.

We can observe that the model is able to reach almost 90 percent accuracy. That was easy, right?

Same implementation using TFIDF

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == "__main__":
    df = pd.read_csv("/home/praveen/Desktop/Projects/Approching_Almost_Any_ML_Prob_Book/NLP/data/IMDB Dataset.csv")

    # Converting sentiment to 1 and 0
    df.sentiment = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

    df["kfold"] = -1

    df = df.sample(frac=1).reset_index(drop=True)

    y = df.sentiment.values

    kf = model_selection.StratifiedKFold(n_splits=5)

    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f


    accuracy_list = []
    for fold_ in range(5):
        train_df = df[df.kfold != fold_].reset_index(drop=True)
        test_df = df[df.kfold == fold_].reset_index(drop=True)

        tfidf_vec = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
        tfidf_vec.fit(train_df.review)

        xtrain = tfidf_vec.transform(train_df.review)
        xtest = tfidf_vec.transform(test_df.review)

        model = linear_model.LogisticRegression()

        model.fit(xtrain, train_df.sentiment)

        preds = model.predict(xtest)

        accuracy = metrics.accuracy_score(test_df.sentiment, preds)
        accuracy_list.append(accuracy)

        print(f"Fold : {fold_}")
        print(f"Accuracy : {accuracy}")
        print("")

    # Print the accuracy of every fold (the original range(0, 4) skipped the last one)
    for i in range(len(accuracy_list)):
        print(f"Fold : {i+1}, Accuracy : {accuracy_list[i]}")

TFIDF is also close to 90 percent accuracy.

With this I conclude. In this blog we have seen how a sentence or word is converted to a vector and how that vector is used to tackle a classification problem.

Github : https://github.com/praveenr2998/Approching_Almost_Any_ML_Prob_Book/tree/main/NLP
Book : https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf (a generous author :)))
Linked in : profile

See you in the next blog, bye...

Named Entity Recognition(NER) Using ChatGPT

praveenr — Tue, 14 Feb 2023 14:38:41 +0000

This is not like every other ChatGPT blog. Here we are going to understand how Promptify can be used along with LLMs (Large Language Models) like ChatGPT to perform named entity recognition (NER), and how this method is much more robust than using ChatGPT directly.

Interesting takeaways

How Promptify is used for prompt engineering
Simple Named Entity Recognition(NER) using Promptify+ChatGPT
Custom Labels for Named Entity Recognition(NER) using Promptify+ChatGPT
One Shot Named Entity Recognition(NER) using Promptify+ChatGPT
Named Entity Recognition(NER) with domain knowledge using Promptify+ChatGPT

Let's begin ...

What is Prompt Engineering and Prompting

Prompt engineering is a natural language processing (NLP) concept that involves discovering/creating inputs that yield desirable or useful results. Prompting is the equivalent of telling the Genie in the magic lamp what to do. In this case, the magic lamp is ChatGPT, ready to give answers to any of our questions, and Promptify is used to build and structure our questions in such a way that LLMs like ChatGPT understand the questions better and provide desirable results.

What is Named Entity Recognition(NER)

Entity could be defined as the key information in the text. An entity could be a single word or a group of words. Named Entity Recognition(NER) could be defined as the process of identifying and classifying entities(key information) in text.

Example

Here Person, Country and Designation are the group/class to which the entities belong and the process of identifying these entities and the group to which they belong is called named entity recognition(NER).

What exactly does Promptify do?

The input and output of LLMs like ChatGPT are generally plain unstructured text. When you pass your input through Promptify along with certain parameters (many of which are optional), Promptify sends these LLMs a structured input, which is equivalent to asking a properly structured question that helps the LLMs understand it better. The output from the LLMs is then returned as a Python object.

OUTPUT - Plain ChatGPT vs Promptify+ChatGPT

We are going to ask ChatGPT to perform named entity recognition on plain text, we are also going to tell which domain the input sentence belongs to, then we are going to try giving the same input using promptify and let's observe the response.

Plain ChatGPT

There is a good chance that the output structure will change from one query to the next, which makes this output comparatively hard to use in an application.

Promptify+ChatGPT

The entity(E) and its corresponding class/type(T) are returned as Python objects from Promptify. You can also observe that more entities are recognized when the input is passed through Promptify. This output is much more consistent in structure and can be used in an application easily.

Now let's check out the python implementation of promptify, the code implementation is done using google colab to help explain better...

Promptify - ChatGPT

%%capture
!git clone https://github.com/promptslab/Promptify.git
!pip3 install openai

Clone the promptify repository and install openai library

# Define the API key for the OpenAI model
api_key  = ""

Paste the API key generated by following this blog How to generate API secret key

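# Imports (path assumed from the Promptify README of that period; adjust it to the version you cloned)
from promptify import OpenAI, Prompter
from pprint import pprint
# Create an instance of the OpenAI model. All OpenAI models are currently supported; more generative models from Hugging Face and other platforms are planned
model = OpenAI(api_key)
nlp_prompter = Prompter(model)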

Create an instance of the OpenAI model and pass it to Promptify's Prompter. You now have an object to which you can pass your prompt with the required parameters.

# Example sentence that is sent to GPT
sent = "The patient is a 93-year-old female with a medical history of chronic right hip pain, osteoporosis, hypertension, depression, and chronic atrial fibrillation admitted for evaluation and management of severe nausea and vomiting and urinary tract infection"

This sample input is related to the medical domain, and it is about a patient's medical condition.

NAMED ENTITY RECOGNITION(NER) WITH 2 LINES OF CODE

# Named Entity Recognition with No labels, no description, no oneshot, no examples
# Simple prompt with instructions
# domain name gives more info to model for better result generation, the parameter is optional
# Output will be python object -> [ {'E' : Entity Name, 'T': Type of Entity } ]


result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent, 
                          labels      = None)

# Output
pprint(eval(result['text']))

In the output
E - Entity
T - Type/Class the entity belongs to
If you observe the output, the output is a python object and is well structured compared to the raw output from ChatGPT. This functionality/feature from promptify could be extremely useful while integrating LLMs with applications. The domain parameter is optional and passing a domain to your prompt would result in a better-refined response.
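The snippets here convert the returned text with eval, which executes whatever the model sends back. A more cautious sketch, assuming the text is a plain Python literal in the [ {'E': ..., 'T': ...} ] shape described above, could use ast.literal_eval from the standard library. The raw_text string and the parse_entities helper are hypothetical stand-ins, not real model output or part of Promptify.

```python
import ast
from pprint import pprint

# Hypothetical stand-in for result['text']; real model output will differ.
raw_text = "[{'E': 'hypertension', 'T': 'Medical Condition'}, {'E': '93-year-old', 'T': 'Age'}]"


def parse_entities(text):
    """Parse the model's text into a list of dicts without executing it."""
    parsed = ast.literal_eval(text)  # accepts only Python literals, not arbitrary code
    if not isinstance(parsed, list):
        raise ValueError("expected a list of entity dictionaries")
    return parsed


entities = parse_entities(raw_text)
pprint(entities)
```

If the model returns something that is not a valid literal, ast.literal_eval raises an exception instead of running it, so the application can catch the error and retry.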

That's not all about promptify....

CUSTOM LABELS FOR NAMED ENTITY RECOGNITION(NER)

You can also provide custom labels so that the custom labels and their corresponding entities would be identified from the prompt.

# If you want to perform NER with custom tags only (handles out-of-bounds predictions)


result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent, 
                          labels      = ["SYMPTOM", "DISEASE"])

# Output
pprint(eval(result['text']))

You could observe from the output that entities that belong to the custom labels provided were identified.

ONE SHOT - NAMED ENTITY RECOGNITION(NER)

One-shot learning, as the name suggests, is the ability of a model to learn a task from a single labelled example. Fascinating, right? With a powerful LLM like GPT and the help of Promptify, you can actually do it.

one_shot_training_data = "Leptomeningeal metastases (LM) occur in patients with breast cancer (BC) and lung cancer (LC). The cerebrospinal fluid (CSF) tumour microenvironment (TME) of LM patients is not well defined at a single-cell level. We did an analysis based on single-cell RNA sequencing (scRNA-seq) data and four patient-derived CSF samples of idiopathic intracranial hypertension (IIH)"

one_shot_labelled_training_data = [[one_shot_training_data, [{'E': 'DISEASE', 'W': 'Leptomeningeal metastases'}, {'E': 'DISEASE', 'W': 'breast cancer'}, {'E': 'DISEASE', 'W': 'lung cancer'}, {'E': 'BIOMARKER', 'W': 'cerebrospinal fluid'}, {'E': 'DISEASE', 'W': 'tumour microenvironment'}, {'E': 'TEST', 'W': 'single-cell RNA sequencing'}, {'E': 'DISEASE', 'W': 'idiopathic intracranial hypertension'}]]]

result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent,
                          examples    = one_shot_labelled_training_data,
                          labels      = ["SYMPTOM", "DISEASE"])


pprint(eval(result['text']))

Here you have provided just one labelled example to the model, along with the labels SYMPTOM and DISEASE, where
E - Label/Class to which the entity belongs to
W - Entity

You could observe from the output that entities which belong to specific labels(SYMPTOM, DISEASE) that were provided in the one-shot training data were accurately identified along with their corresponding labels.
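Writing the nested one-shot structure by hand is error-prone, so a small helper can build it from (label, entity) pairs. This is only a sketch: build_one_shot_example and the short example text are hypothetical and not part of Promptify. It reproduces the [[text, [{'E': label, 'W': entity}, ...]]] shape used above.

```python
def build_one_shot_example(text, annotations):
    """Return [[text, [{'E': label, 'W': entity}, ...]]] from (label, entity) pairs."""
    labelled = []
    for label, entity in annotations:
        # Guard against annotations that do not actually occur in the text.
        if entity not in text:
            raise ValueError(f"{entity!r} does not occur in the example text")
        labelled.append({"E": label, "W": entity})
    return [[text, labelled]]


# Hypothetical short example text, for illustration only.
example_text = "Patients with breast cancer underwent single-cell RNA sequencing."
examples = build_one_shot_example(
    example_text,
    [("DISEASE", "breast cancer"), ("TEST", "single-cell RNA sequencing")],
)
print(examples)
```

The resulting list has the same shape as the value passed to the examples parameter.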

NAMED ENTITY RECOGNITION - WITH DOMAIN KNOWLEDGE

# If you want to give some domain knowledge and a description in the prompt to enhance the output

result = nlp_prompter.fit('ner.jinja',
                          domain      = 'clinical',
                          text_input  = sent,
                          examples    = one_shot_labelled_training_data,
                          description = "Below Paragraph is from discharge summary of a patient. The Paragraph describes the condition and symptoms of patient.",
                          labels      = ["SYMPTOM", "DISEASE"])

pprint(eval(result['text']))

If you have domain knowledge (the clinical domain in the above case) and a short description of what the data is about, you can pass the description in the description parameter, which can further improve the accuracy of the output.

You have now just scratched the surface of what LLMs can do when used along with Promptify. A lot more blogs to come.

Github Repo : Promptify
Colab Notebook : Notebook

HeapSort - MinHeap

praveenr — Sun, 22 Jan 2023 07:46:22 +0000

Definition

The heap data structure is an array object that we can view as a nearly complete binary tree. Each node of the tree corresponds to an element of the array. The tree is completely filled on all levels except possibly the lowest, which is filled from the left up to a point.

The above image represents a min-heap

The tree structure in the diagram is imaginary; the whole structure is implemented in an array. The important point is how we link the elements of the array so that they logically form a tree, which makes operations such as fetching the smallest element more efficient than on a plain unsorted array.

A basic heap with a parent, one left child and one right child

Visualising The Working Of Heap Sort Using Min Heap

Let's take an array of size 5 and build a min-heap out of it and finally, remove the least element one by one(sort by ascending order). In the process, you'll understand how a min-heap is formed and how sorting by ascending order is done using the heap sort algorithm.

1. Insert 5 elements one by one

In the above image you can observe that as soon as 5(i.e 0005) was inserted it shifted upwards and 7(i.e 0007) became the left child of 5. This shifting is done to maintain the min-heap property.

Next, 6(i.e 0006) was inserted. As it is greater than 5(i.e 0005), 5 was not replaced; instead 6 was added as the right child of 5.
According to the min-heap property, the smallest element in the array always remains at the top.

When we insert 4(i.e 0004), it replaces the root node 5 as 4 is smaller than 5, and 5 replaces 7 as 5 is smaller than 7. One important thing to note here is that it is not only the root node that maintains the min-heap property; any swap that happens at any level of the heap follows the min-heap property, like the swaps we observed above.

Finally, the last element 8(i.e 0008) is inserted and gets added as the right child of 5.

2. Delete/Remove element from heap
Now let's try to delete or remove the root node from the heap(this operation is nothing but getting the smallest element in the heap)

We can observe that 5(i.e 0005) takes up the new root node position, 7 moves up and 8 becomes the left child of 7.
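The walkthrough above can be replayed with Python's built-in heapq module, which also keeps a min-heap inside a plain list. This sketch assumes the insertion order 7, 5, 6, 4, 8 implied by the steps above; it prints the underlying array after every insert and after removing the root.

```python
import heapq

heap = []
states = []  # snapshot of the array after each insert

# Insertion order assumed from the walkthrough: 7, 5, 6, 4, 8.
for value in [7, 5, 6, 4, 8]:
    heapq.heappush(heap, value)
    states.append(list(heap))
    print(f"insert {value}: {heap}")

# Removing the root returns the smallest element and restores the heap order.
smallest = heapq.heappop(heap)
print(f"removed {smallest}: {heap}")
```

After the removal the array reads 5, 7, 6, 8: 5 is the root, 7 and 6 are its children, and 8 is the left child of 7, which matches the description above.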

What we imagined and what actually happened...

We performed all the operations that build the min-heap and maintain the min-heap property while imagining that everything happened on the heap (the tree-like structure in the above diagrams). In reality, all these operations happen on an array. There are 3 formulae that link the array and the imaginary heap structure in our heads.

Parent index = (index - 1)//2
Left Child index = 2*index+1
Right Child index = 2*index+2

(NOTE - Indexing in Python starts from 0, so the index of the element 10 in the array is 0)
In the above figure, let's consider the elements 23, 32 and 38, whose indexes are 1, 3 and 4 respectively. Now
Parent(element 38) i.e Parent(index 4) is
(index-1)//2 i.e (4-1)//2 = 1
The element in index 1 is 23

LeftChild(element 23) i.e LeftChild(index 1) is
2*index+1 i.e (2*1)+1 = 3
The element in index 3 is 32

RightChild(element 23) i.e RightChild(index 1) is
2*index+2 i.e (2*1)+2 = 4
The element in index 4 is 38
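The three formulae can be checked directly in Python. In this sketch the values at indexes 0, 1, 3 and 4 come from the figure described above, while the value at index 2 is a made-up placeholder, since it is not mentioned in the text.

```python
def parent_index(index):
    return (index - 1) // 2


def left_child_index(index):
    return 2 * index + 1


def right_child_index(index):
    return 2 * index + 2


# Indexes 0, 1, 3, 4 hold the values from the text; index 2 (36) is a placeholder.
heap_array = [10, 23, 36, 32, 38]

print(heap_array[parent_index(4)])       # parent of 38
print(heap_array[left_child_index(1)])   # left child of 23
print(heap_array[right_child_index(1)])  # right child of 23

# Floor division makes the "parent" of the root -1, i.e. the root has no parent.
print(parent_index(0))
```

The last line is why a check such as "parent index >= 0" is enough to detect the root in Python.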

Building A Min-Heap Using Python

The first step is to build a min heap using the elements of an array.
Let's take an unsorted array with 5 elements

unsorted_array = [2,10,9,12,11]

We are going to code the whole min-heap data structure as a single class (The code snippets below may not follow proper indentation, the whole code is attached as a single block at the end).

class MinHeap:
    def __init__(self, capacity) -> None:
        self.storage = [0]*capacity
        self.capacity = capacity
        self.size = 0

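storage - Array that holds the heap elements; its length equals the capacity
capacity - Maximum number of elements the heap can hold (the size of the unsorted array)
size - Number of elements currently in the heap; it also marks the position where the next element is inserted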

Helper Functions

We will have a few helper functions which would help in building the heap and sorting the elements in it.

def getParentIndex(self, index):
    return (index - 1)//2

def getLeftChildIndex(self, index):
    return 2*index+1

def getRightChildIndex(self, index):
    return 2*index+2

The above three functions are used to get the parent index, left child index and right child index.

def hasParent(self, index):
    return self.getParentIndex(index) >= 0

def hasRightChild(self, index):
    return self.getRightChildIndex(index) < self.size

def hasLeftChild(self, index):
    return self.getLeftChildIndex(index) < self.size

The above functions are used to check whether a node has a parent, a right child and a left child.

def parent(self, index):
    return self.storage[self.getParentIndex(index)]

def leftChild(self, index):
    return self.storage[self.getLeftChildIndex(index)]

def rightChild(self, index):
    return self.storage[self.getRightChildIndex(index)]

The above functions are used to get the parent element, left child element and right child element.

def isFull(self):
    return self.size == self.capacity

The above function is used to check if the heap is full or not.

def swap(self, index1, index2):
    temp = self.storage[index1]
    self.storage[index1] = self.storage[index2]
    self.storage[index2] = temp

The above function is used to swap two elements in the heap(array).

Now come 4 important functions that help to build the heap, maintain the min-heap property and get elements in ascending order from the heap.

def heapifyUp(self, index):
    if self.hasParent(index) and (self.parent(index) > self.storage[index]):
        self.swap(self.getParentIndex(index), index)
        self.heapifyUp(self.getParentIndex(index))

heapifyUp function is used while building the min-heap. When we try to insert an element into the array(heap), this function checks if that index has a parent index(parent node) and whether the parent is smaller or not, if not then a swap happens, and a recursive call is made with the parent index. These operations are done to build the min-heap and maintain the min-heap property.
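The same upward movement can also be written as a loop instead of a recursive call. This is a standalone sketch on a plain list, not the class method from this post; sift_up is a hypothetical name.

```python
def sift_up(heap, index):
    """Move heap[index] up until its parent is no larger (min-heap order)."""
    while index > 0:
        parent = (index - 1) // 2
        if heap[parent] <= heap[index]:
            break  # min-heap property already holds here
        heap[parent], heap[index] = heap[index], heap[parent]
        index = parent


heap = [5, 7, 6, 4]  # 4 has just been appended at the end
sift_up(heap, len(heap) - 1)
print(heap)
```

Here 4 first swaps with its parent 7 and then with the root 5, so it ends up at index 0.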

def insert(self, data):
    if self.isFull():
        raise Exception("Heap Is Full")
    self.storage[self.size] = data
    self.size += 1
    self.heapifyUp(self.size - 1)

insert function is used along with heapifyUp to build the heap. It checks whether the heap is already full and raises an error if it is; otherwise it adds the element to storage (the heap array), increments size and calls heapifyUp to restore the min-heap property.

def heapifyDown(self, index):
    smallest = index
    if self.hasLeftChild(index) and self.storage[smallest] > self.leftChild(index):
        smallest = self.getLeftChildIndex(index)
    if self.hasRightChild(index) and self.storage[smallest] > self.rightChild(index):
        smallest = self.getRightChildIndex(index)
    if smallest != index:
        self.swap(index, smallest)
        self.heapifyDown(smallest)

heapifyDown is used while removing elements from the heap to maintain the min-heap property. Starting from the given index, it compares the node with its children and stores the index of the smallest of the three in smallest. Let's take a scenario where the left child of the root node is smaller than the root node: smallest is assigned the left child index. If the right child is even smaller than the left child, smallest is then assigned the right child index and a swap happens between the root and the right child. A recursive call is then made on the swapped position, and this repeats until the node is no larger than its children, which leaves the smallest element of the heap at the root.
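The downward movement can be sketched as a loop on a plain list as well. This is a standalone illustration rather than the class method from this post; sift_down and its size parameter are hypothetical names.

```python
def sift_down(heap, index, size):
    """Move heap[index] down until neither child is smaller (min-heap order)."""
    while True:
        smallest = index
        left, right = 2 * index + 1, 2 * index + 2
        if left < size and heap[left] < heap[smallest]:
            smallest = left
        if right < size and heap[right] < heap[smallest]:
            smallest = right
        if smallest == index:
            return  # the node is no larger than its children
        heap[index], heap[smallest] = heap[smallest], heap[index]
        index = smallest


# State right after the root was removed and the last element (8) took its place.
heap = [8, 5, 6, 7]
sift_down(heap, 0, len(heap))
print(heap)
```

In this run 8 swaps with 5 and then with 7, so 5 becomes the root and 8 ends up as the left child of 7.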

def removeMin(self):
    if self.size == 0:
        raise Exception("Heap is empty")
    data =  self.storage[0]
    self.storage[0] = self.storage[self.size - 1]
    self.size -= 1
    self.heapifyDown(0)
    return data

removeMin is used along with heapifyDown to remove the smallest element(root node from the heap). It removes the smallest element and then assigns the last element as the new root node, reduces the size of the heap array by 1, calls the heapifyDown to restore the min-heap property and returns the removed element.

Complete Implementation Of Min-Heap

class MinHeap:
    def __init__(self, capacity) -> None:
        self.storage = [0]*capacity
        self.capacity = capacity
        self.size = 0

    def getParentIndex(self, index):
        return (index - 1)//2

    def getLeftChildIndex(self, index):
        return 2*index+1

    def getRightChildIndex(self, index):
        return 2*index+2

    def hasParent(self, index):
        return self.getParentIndex(index) >= 0

    def hasLeftChild(self, index):
        return self.getLeftChildIndex(index) < self.size

    def hasRightChild(self, index):
        return self.getRightChildIndex(index) < self.size

    def parent(self, index):
        return self.storage[self.getParentIndex(index)]

    def leftChild(self, index):
        return self.storage[self.getLeftChildIndex(index)]

    def rightChild(self, index):
        return self.storage[self.getRightChildIndex(index)]

    def isFull(self):
        return self.size == self.capacity

    def swap(self, index1, index2):
        temp = self.storage[index1]
        self.storage[index1] = self.storage[index2]
        self.storage[index2] = temp


    def heapifyUp(self, index):
        if self.hasParent(index) and (self.parent(index) > self.storage[index]):
            self.swap(self.getParentIndex(index), index)
            self.heapifyUp(self.getParentIndex(index))

    def insert(self, data):
        if self.isFull():
            raise Exception("Heap Is Full")
        self.storage[self.size] = data
        self.size += 1
        self.heapifyUp(self.size - 1)

    def heapifyDown(self, index):
        smallest = index
        if self.hasLeftChild(index) and self.storage[smallest] > self.leftChild(index):
            smallest = self.getLeftChildIndex(index)
        if self.hasRightChild(index) and self.storage[smallest] > self.rightChild(index):
            smallest = self.getRightChildIndex(index)
        if smallest != index:
            self.swap(index, smallest)
            self.heapifyDown(smallest)

    def removeMin(self):
        if self.size == 0:
            raise Exception("Heap is empty")
        data =  self.storage[0]
        self.storage[0] = self.storage[self.size - 1]
        self.size -= 1
        self.heapifyDown(0)
        return data


###########TESTING WITH A SAMPLE DATA###############

unsorted_array = [2,10,9,12,11]
obj = MinHeap(len(unsorted_array))
for element in unsorted_array:
    obj.insert(element)

# Ascending Order
for i in range(len(unsorted_array)):
    print(obj.removeMin())

TIME COMPLEXITY

The above diagram shows the time complexity of heap sort and how it compares with other sorting algorithms.
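Heap sort runs in O(n log n) time, because each of the n removals costs O(log n). As a cross-check, here is a minimal sketch of the same idea built on Python's heapq module instead of the MinHeap class; heap_sort is a hypothetical helper name.

```python
import heapq


def heap_sort(values):
    """Return the values in ascending order using a min-heap."""
    heap = list(values)   # copy so the input is left untouched
    heapq.heapify(heap)   # build the heap in O(n)
    # n removals of the smallest element, O(log n) each
    return [heapq.heappop(heap) for _ in range(len(heap))]


unsorted_array = [2, 10, 9, 12, 11]
print(heap_sort(unsorted_array))
```

The result should match what the removeMin loop above prints, one element per line.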

Min-Heap Visualisation Tool