<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ambarish Ganguly</title>
    <description>The latest articles on Forem by Ambarish Ganguly (@ambarishg).</description>
    <link>https://forem.com/ambarishg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F509885%2Fed5d0768-08c9-44bb-a7d8-1136baa97129.jpg</url>
      <title>Forem: Ambarish Ganguly</title>
      <link>https://forem.com/ambarishg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ambarishg"/>
    <language>en</language>
    <item>
      <title>Understanding BBC News Q&amp;A with Advanced RAG and Microsoft Phi3</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Sat, 18 May 2024 11:09:10 +0000</pubDate>
      <link>https://forem.com/ambarishg/understanding-bbc-news-qa-with-advanced-rag-and-microsoft-phi3-425n</link>
      <guid>https://forem.com/ambarishg/understanding-bbc-news-qa-with-advanced-rag-and-microsoft-phi3-425n</guid>
      <description>&lt;p&gt;In this blog, we would be doing question and answering on a news data feed.&lt;/p&gt;

&lt;p&gt;For this we use the &lt;a href="https://www.kaggle.com/datasets/gpreda/bbc-news/versions/801" rel="noopener noreferrer"&gt;BBC News Dataset&lt;/a&gt;. This is a &lt;code&gt;self-updating dataset&lt;/code&gt; that is refreshed daily.&lt;/p&gt;

&lt;p&gt;Through this blog, we will learn Simple and Advanced RAG [&lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;] using the small language model &lt;strong&gt;Phi-3 mini 128K instruct&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will ask questions like &lt;code&gt;What is the news in Ukraine&lt;/code&gt;, and the application will provide the &lt;strong&gt;answers&lt;/strong&gt; using this technique.&lt;/p&gt;

&lt;p&gt;The Phi-3-Mini-128K-Instruct is a &lt;code&gt;3.8 billion-parameter&lt;/code&gt;, lightweight, state-of-the-art open model trained using the Phi-3 datasets. By comparison, GPT-4 is reported to have more than a trillion parameters, and the smallest Llama 3 model has 8 billion.&lt;/p&gt;

&lt;p&gt;RAG has three major components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Querying&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generation &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Ingestion
&lt;/h2&gt;




&lt;p&gt;For ingestion, the key steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Read the Data Source&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert the read text into manageable chunks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert the manageable chunks into embeddings. This is a technique in which you convert text into an array of numbers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store the embeddings into a vector database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store metadata such as the filename, text, and other relevant fields in the vector database&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
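&lt;p&gt;As a toy illustration of step 2, text can be split into overlapping chunks before embedding. This is a minimal sketch only; the helper name, chunk size, and overlap are our own choices, and the notebook below actually embeds whole news titles without chunking.&lt;/p&gt;

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks for embedding."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# 500 characters of toy text -> 4 overlapping chunks
print(len(chunk_text("word " * 100)))
```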

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2Fvcccw0V.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2Fvcccw0V.png" alt="Ingestion"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Query the data using Simple RAG
&lt;/h2&gt;




&lt;p&gt;The query stage requires three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Orchestrating application&lt;/code&gt;, which coordinates the interactions between the other components: the user, the vector database, and the language model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vector Database which stores the information&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Language model, which generates the answer after it has been provided &lt;strong&gt;contextual&lt;/strong&gt; information&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Data Flow of a Simple RAG
&lt;/h2&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The user inputs the question. Example: &lt;code&gt;What is the news in Ukraine&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The orchestrating application uses an &lt;strong&gt;encoder&lt;/strong&gt; to transform the text into an embedding. We have used the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; Sentence Transformer model as the encoder&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The embedding is searched in the vector database. In this case we have used &lt;strong&gt;Qdrant&lt;/strong&gt; as the vector database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search results are obtained from the vector database. We get the top K results; the number of results to be obtained is configurable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A consolidated body of text, popularly called the &lt;strong&gt;context&lt;/strong&gt;, is prepared from the results. In our implementation this is done by concatenating the search results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This context is sent to the language model to generate answers relevant to the context. In our implementation we have used the small language model &lt;strong&gt;Phi3&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
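&lt;p&gt;The "search" in step 3 is a nearest-neighbour lookup by vector similarity. Here is a minimal sketch using cosine similarity over toy 3-dimensional vectors; real &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings have 384 dimensions, and the tiny database and vectors below are made up for illustration.&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector database" of (text, embedding) pairs
db = [
    ("Ukraine peace talks resume", [0.9, 0.1, 0.0]),
    ("Football cup final tonight", [0.0, 0.8, 0.6]),
    ("Ukraine grain exports rise", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedded question

# Top-K search: rank entries by similarity to the query vector
top_k = sorted(db, key=lambda item: cosine(query, item[1]), reverse=True)[:2]
print([text for text, _ in top_k])
```

Qdrant performs the same ranking at scale, using approximate nearest-neighbour indexes instead of a full scan.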

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2Fb2xtcFG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2Fb2xtcFG.png" alt="Simple RAG"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Flow of an Advanced RAG
&lt;/h2&gt;




&lt;p&gt;The steps remain the same, except for the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 4&lt;/code&gt; - Search results are obtained from the vector database. We get the top K2 results, where K2 is larger than K. The number of results to be obtained is configurable.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 4A&lt;/code&gt; - The results obtained are passed into a new type of block known as the &lt;strong&gt;cross-encoder&lt;/strong&gt;, which distills the results down to a smaller set with high similarity between each result and the query. This smaller set becomes the top K results.&lt;/p&gt;
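&lt;p&gt;Step 4A can be sketched as: score each of the K2 candidates jointly with the query, then keep only the top K. The candidate texts and scores below are made-up stand-ins for real cross-encoder outputs.&lt;/p&gt;

```python
# K2 = 5 candidates retrieved from the vector database (toy data)
candidates = ["doc a", "doc b", "doc c", "doc d", "doc e"]
# Stand-in for cross_encoder.predict([[query, doc], ...]); higher = more relevant
scores = [0.2, 0.9, 0.1, 0.7, 0.4]

K = 2  # keep only the K best of the K2 candidates
ranked = sorted(zip(scores, candidates), reverse=True)
top_k = [doc for _, doc in ranked[:K]]
print(top_k)
```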

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FEM41f5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FEM41f5k.png" alt="Advanced RAG"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation details
&lt;/h2&gt;




&lt;p&gt;For this implementation, we have used the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Dataset - &lt;strong&gt;BBC News&lt;/strong&gt; dataset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vector Database - Qdrant. We have used an in-memory version of Qdrant for demonstration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Language Model - Small language model &lt;code&gt;Phi3&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Orchestrator application - Kaggle notebook&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install the python libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! pip install -U qdrant-client --quiet
! pip install -U sentence-transformers --quiet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Imports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer,CrossEncoder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sentence Transformer Encoder
&lt;/h3&gt;

&lt;p&gt;Instantiate the sentence transformer encoder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoder = SentenceTransformer("all-MiniLM-L6-v2")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the Qdrant Collection
&lt;/h3&gt;

&lt;p&gt;We are creating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An in-memory Qdrant collection&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The collection name is BBC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The size of the vector embedding to be inserted is the dimension of the encoder. In this case, the dimension when evaluated is &lt;code&gt;384&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distance of similarity is &lt;code&gt;cosine&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qdrant = QdrantClient(":memory:")

qdrant.recreate_collection(
    collection_name="BBC",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Read the Dataset
&lt;/h3&gt;

&lt;p&gt;Read the BBC News Dataset&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LIMIT = 500
df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
docs = df[:LIMIT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FySY37Kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FySY37Kx.png" alt="BBC News Dataset Rows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Upload the documents into Qdrant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import uuid
%%capture --no-display
qdrant.upload_points(
    collection_name="BBC",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()), 
            vector=encoder.encode(row[1]["title"]),
            payload={ "title":row[1]["title"] ,
                     "description":row[1]["description"] }
        )
        for row in docs.iterrows()
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify the documents have been uploaded into Qdrant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qdrant.count(
    collection_name="BBC",
    exact=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have reached this point, congratulations 👌. You have completed the &lt;strong&gt;Data Ingestion into Qdrant&lt;/strong&gt; part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query the Qdrant database
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Query for the user
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_string = "Describe the news for Ukraine"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Search Qdrant for the query
&lt;/h3&gt;

&lt;p&gt;For searching, note how we convert the user input into an embedding:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;encoder.encode(query_string).tolist()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hits = qdrant.search(
    collection_name="BBC",
    query_vector=encoder.encode(query_string).tolist(),
    limit=35,
)

for hit in hits:
    print(hit.payload, "score:", hit.score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Refine the result with the CrossEncoder
&lt;/h3&gt;

&lt;p&gt;We refine the results using a CrossEncoder.&lt;/p&gt;

&lt;p&gt;We have got K2 = 35 results from Qdrant in our implementation. We use the cross encoder &lt;code&gt;cross-encoder/ms-marco-MiniLM-L-6-v2&lt;/code&gt; to refine the results. After passing them through the cross encoder, we keep the refined top K = 5 results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CROSSENCODER_MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
RANKER_RESULTS_LIMIT = 5

user_input = query_string

contexts_list = []
for result in hits:
    contexts_list.append(result.payload['description'])

cross_encoder = CrossEncoder(CROSSENCODER_MODEL_NAME)
cross_inp = [[user_input, hit] for hit in contexts_list]
cross_scores = cross_encoder.predict(cross_inp)

cross_scores_text = []
cross_scores_length = len(cross_scores)
for i in range(cross_scores_length):
    d = {}
    d['score'] = cross_scores[i]
    d['text'] = contexts_list[i]
    cross_scores_text.append(d)

hits_selected = sorted(cross_scores_text, key=lambda x: x['score'], reverse=True)
hits = hits_selected[:RANKER_RESULTS_LIMIT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the context
&lt;/h3&gt;

&lt;p&gt;We create the Context for RAG using the search results&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contexts =""
for i in range(len(hits)):
    contexts  +=  hits[i]['text']+"\n---\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have reached this point, congratulations 👌 👌 again. You have completed the &lt;strong&gt;Getting Results from Qdrant [Vector Database]&lt;/strong&gt; part.&lt;/p&gt;




&lt;h2&gt;
  
  
  Generate the answer with the Small Language Model
&lt;/h2&gt;




&lt;p&gt;Now that we have the context from the vector database, Qdrant, we send it to our small language model &lt;strong&gt;Phi3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Specifically, we use the &lt;strong&gt;microsoft/Phi-3-mini-128k-instruct&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;From the Hugging Face model card &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. This dataset includes both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family with the Mini version in two variants 4K and 128K which is the context length (in tokens) that it can support.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the &lt;a href="https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/" rel="noopener noreferrer"&gt;Microsoft blog&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Thanks to their smaller size, Phi-3 models can be used in compute-limited inference environments. Phi-3-mini, in particular, can be used on-device, especially when further optimized with ONNX Runtime for cross-platform availability. The smaller size of Phi-3 models also makes fine-tuning or customization easier and more affordable. In addition, their lower computational needs make them a lower cost option with much better latency. The longer context window enables taking in and reasoning over large text content—documents, web pages, code, and more. Phi-3-mini demonstrates strong reasoning and logic capabilities, making it a good candidate for analytical tasks.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the prompt
&lt;/h3&gt;

&lt;p&gt;The prompt is created from 2 components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context which we created in the section &lt;code&gt;Create the context&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The user input, i.e. the question asked by the user
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt = f"""Answer based on context:\n\n{contexts}\n\n{user_input}"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the message template
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;messages = [
     {"role": "user", "content": prompt},
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generate the message
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%time
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs =  model_inputs.to('cuda')
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Print the answer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(decoded[0].split("&amp;lt;|assistant|&amp;gt;")[-1].split("&amp;lt;|end|&amp;gt;")[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
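&lt;p&gt;The split in the cell above extracts only the assistant's reply from the decoded text. With a toy decoded string that mimics Phi-3's chat-template markers, it behaves like this:&lt;/p&gt;

```python
# Toy decoded output imitating Phi-3 chat-template markers
decoded = ("<|user|>What is the news in Ukraine<|end|>"
           "<|assistant|>Peace talks resumed today.<|end|>")

# Keep only the text between the assistant marker and the next end marker
answer = decoded.split("<|assistant|>")[-1].split("<|end|>")[0]
print(answer)
```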



&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The code can be found in the &lt;strong&gt;Kaggle&lt;/strong&gt; notebook &lt;br&gt;
&lt;a href="https://www.kaggle.com/code/ambarish/bbc-news-advanced-rag-phi3" rel="noopener noreferrer"&gt;BBC NEWS Advanced RAG PHI3&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Spark architecture</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Fri, 30 Sep 2022 21:02:07 +0000</pubDate>
      <link>https://forem.com/ambarishg/apache-spark-architecture-5ehh</link>
      <guid>https://forem.com/ambarishg/apache-spark-architecture-5ehh</guid>
<description>&lt;p&gt;#apachespark #spark #sparkarchitecture&lt;/p&gt;

&lt;p&gt;01 - Spark Architecture Basics in 6 mins. Concepts explained&lt;br&gt;
📓 Application&lt;br&gt;
📓 Driver&lt;br&gt;
📓 Executor&lt;br&gt;
📓 Partition&lt;br&gt;
📓 Job &lt;br&gt;
📓 Stage&lt;br&gt;
📓 Tasks&lt;br&gt;
📓 Slots&lt;br&gt;
📓 Lazy evaluation&lt;br&gt;
📓 Narrow and wide Transformations&lt;br&gt;
📓 Actions&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/dSZ8-Ounnvs"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>spark</category>
    </item>
    <item>
      <title>Paddy Disease Classification using Azure AI</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Sat, 23 Jul 2022 06:13:00 +0000</pubDate>
      <link>https://forem.com/ambarishg/paddy-doctor-paddy-disease-classification-1b7i</link>
      <guid>https://forem.com/ambarishg/paddy-doctor-paddy-disease-classification-1b7i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Rice (Oryza sativa)&lt;/strong&gt; is one of the staple foods worldwide.    &lt;/p&gt;

&lt;p&gt;Paddy, the raw grain before removal of husk, is cultivated in tropical climates, mainly in Asian countries. Paddy cultivation requires consistent supervision because several diseases and pests might affect the paddy crops, leading to up to 70% yield loss. Expert supervision is usually necessary to mitigate these diseases and prevent crop loss. With the limited availability of crop protection experts, manual disease diagnosis is tedious and expensive. Thus, it is increasingly &lt;strong&gt;important to automate the disease identification process by leveraging computer vision-based techniques&lt;/strong&gt; that achieved promising results in various domains.   &lt;/p&gt;

&lt;h1&gt;
  
  
  Data
&lt;/h1&gt;

&lt;p&gt;Data is taken from the &lt;a href="https://www.kaggle.com/competitions/paddy-disease-classification" rel="noopener noreferrer"&gt;Paddy Disease Classification Dataset from Kaggle&lt;/a&gt;. We have taken a subset of the data provided in the dataset to demonstrate the power of Azure Cognitive Services.&lt;/p&gt;

&lt;p&gt;I have made a small dataset [ &lt;strong&gt;1000 images - around 100 images of 10 classes&lt;/strong&gt; ] from the parent dataset in Kaggle mentioned above for quick experimentation. The data has around 100 images of each of the classes &lt;code&gt;bacterial_leaf_blight, bacterial_leaf_streak, bacterial_panicle_blight, blast, brown spot, dead heart, downy mildew, hispa, normal and tungro&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The steps to model and predict for this problem are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Custom Vision AI project&lt;/li&gt;
&lt;li&gt;Add Images to the project&lt;/li&gt;
&lt;li&gt;Train on the images and create the model&lt;/li&gt;
&lt;li&gt;Publish the model and expose the endpoint for use by other clients&lt;/li&gt;
&lt;li&gt;Use the exposed endpoint and predict using new images
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Create a Custom Vision AI project
&lt;/h1&gt;

&lt;p&gt;Navigate to &lt;a href="https://www.customvision.ai/projects" rel="noopener noreferrer"&gt;https://www.customvision.ai/projects&lt;/a&gt; to create a custom vision project.&lt;/p&gt;

&lt;p&gt;We created a project with&lt;/p&gt;

&lt;p&gt;Name - paddy  &lt;/p&gt;

&lt;p&gt;Project Type - &lt;strong&gt;Classification&lt;/strong&gt;, since we are classifying each image as one of bacterial_leaf_blight, bacterial_leaf_streak, bacterial_panicle_blight, blast, brown spot, dead heart, downy mildew, hispa, normal or tungro.&lt;/p&gt;

&lt;p&gt;Classification Type - &lt;strong&gt;Multiclass&lt;/strong&gt;. There are 2 choices here, Multiclass and Multilabel. We choose Multiclass since each image is associated with only one class (bacterial_leaf_blight, bacterial_leaf_streak, bacterial_panicle_blight, blast, brown spot, dead heart, downy mildew, hispa, normal or tungro). A single image is not associated with multiple classes.&lt;/p&gt;

&lt;p&gt;If a single image were associated with multiple classes, we would have to choose the Classification Type Multilabel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5xub5dyo7puiokd7hx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5xub5dyo7puiokd7hx6.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Add Images
&lt;/h1&gt;

&lt;p&gt;We upload the &lt;strong&gt;bacterial_leaf_blight&lt;/strong&gt; images and also tag them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6lacyv76ex6pqrr7bkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6lacyv76ex6pqrr7bkq.png" alt="Image description"&gt;&lt;/a&gt;             &lt;/p&gt;

&lt;p&gt;We also add images of other classes&lt;br&gt;&lt;br&gt;
&lt;em&gt;bacterial_leaf_streak , bacterial_panicle_blight , blast , brown spot , dead heart , downy mildew , hispa , normal and tungro&lt;/em&gt;        &lt;/p&gt;

&lt;h1&gt;
  
  
  Train the images
&lt;/h1&gt;

&lt;p&gt;We train the model by clicking the &lt;strong&gt;Train button&lt;/strong&gt; in the portal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim8jjvq6loaycul8sczs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim8jjvq6loaycul8sczs.png" alt="Image description"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;h1&gt;
  
  
  Training
&lt;/h1&gt;

&lt;p&gt;We can select &lt;strong&gt;Quick Training&lt;/strong&gt; or &lt;strong&gt;Advanced Training&lt;/strong&gt; for training the images     &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32km85fygpc0u86uvdo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32km85fygpc0u86uvdo2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We choose &lt;strong&gt;Advanced Training&lt;/strong&gt; to train the model. Each of the 10 classes has around 100 images.&lt;/p&gt;

&lt;p&gt;We do model training and we can see the various iterations    &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80flotqb852xlpk8ficu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80flotqb852xlpk8ficu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the output of the &lt;strong&gt;4th Iteration&lt;/strong&gt; [ Advanced Training ] . In Advanced Training, we can limit the budget by specifying the time duration    &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrf0qkjyws0ugitvvlsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrf0qkjyws0ugitvvlsa.png" alt="Image description"&gt;&lt;/a&gt;         &lt;/p&gt;

&lt;h1&gt;
  
  
  Publish
&lt;/h1&gt;

&lt;p&gt;We can now Publish the model so that we can use the endpoint of the model for the prediction of unseen images.       &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs89d2bp3tn1fqtzwjr9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs89d2bp3tn1fqtzwjr9k.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Project details
&lt;/h1&gt;

&lt;p&gt;We display the Azure Custom Vision project, which has the &lt;strong&gt;project id and the published endpoint&lt;/strong&gt;. These will be used for predicting the unseen test images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6er1p35dogdpfkhxhnq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6er1p35dogdpfkhxhnq7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nauaapo4dq5gd783cic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nauaapo4dq5gd783cic.png" alt="Image description"&gt;&lt;/a&gt;             &lt;/p&gt;

&lt;h1&gt;
  
  
  Prediction
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Import the libraries
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.cognitiveservices.vision.customvision.training&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomVisionTrainingClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.cognitiveservices.vision.customvision.prediction&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomVisionPredictionClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.cognitiveservices.vision.customvision.training.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ImageFileCreateBatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageFileCreateEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Region&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;msrest.authentication&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApiKeyCredentials&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set the parameters
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;training_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR training_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prediction_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR prediction_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prediction_resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR prediction_resource_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR project_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;publish_iteration_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR publish_iteration_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Complete the prediction
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base_image_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train_images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Now there is a trained endpoint that can be used to make a prediction
&lt;/span&gt;&lt;span class="n"&gt;prediction_credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApiKeyCredentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prediction_key&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CustomVisionPredictionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prediction_credentials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_image_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blast/110406.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;image_contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;publish_iteration_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Display the results.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {0:.2f}%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;probability&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/competitions/paddy-disease-classification/overview" rel="noopener noreferrer"&gt;Paddy Disease Classification Dataset&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.microsoft.com/en-us/azure/cognitive-services/custom-vision-service/quickstarts/image-classification?tabs=visual-studio&amp;amp;pivots=programming-language-python" rel="noopener noreferrer"&gt;Azure Custom Vision&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>azure</category>
      <category>azurecustomvision</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Hidden Gems Book</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Mon, 16 May 2022 03:43:30 +0000</pubDate>
      <link>https://forem.com/ambarishg/hidden-gems-book-1342</link>
      <guid>https://forem.com/ambarishg/hidden-gems-book-1342</guid>
      <description>&lt;p&gt;If you are interested in &lt;br&gt;
🌟 Data Visualization &lt;br&gt;
🌟 Text Mining&lt;br&gt;
🌟 Network Graphs &lt;br&gt;
🌟 Cosine Similarity Recommenders &lt;br&gt;
🌟 Topic Modelling &lt;br&gt;
🌟 Dimension Reduction using Principal Component Analysis&lt;/p&gt;

&lt;p&gt;please check out the notebook written in book format in &lt;a href="https://ambarishg.github.io/hiddengems/what-is-kaggle.html"&gt;Hidden Gems Book&lt;/a&gt;          &lt;/p&gt;

&lt;h1&gt;
  
  
  What is Kaggle
&lt;/h1&gt;

&lt;p&gt;From the Kaggle website            &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kaggle offers a no-setup, customizable, Jupyter Notebooks environment. Access GPUs at no cost to you and a huge repository of community published data &amp;amp; code.       &lt;/p&gt;

&lt;p&gt;Inside Kaggle you’ll find all the code &amp;amp; data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.         &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kernels, in Kaggle terminology, are scripts and notebooks shared on Kaggle for the community to view.&lt;/p&gt;

&lt;p&gt;Kaggle also hosts &lt;strong&gt;data science competitions, datasets, and notebooks&lt;/strong&gt; shared by the community, along with a number of wonderful &lt;strong&gt;courses for learning hands-on data science&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hidden Gems
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Heads or Tails&lt;/strong&gt; [&lt;code&gt;Martin Henze&lt;/code&gt;] has compiled a list of 300 kernels over a period of 100 weeks that he believes are &lt;strong&gt;Hidden Gems&lt;/strong&gt;: kernels that are gems but did not get their due recognition. Thanks, Heads or Tails, for this wonderful effort for the Kaggle community.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Photo by Dan Farrell on Unsplash&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>eventdriven</category>
      <category>recommender</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Logistic Regression and Classification</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Tue, 29 Mar 2022 06:10:37 +0000</pubDate>
      <link>https://forem.com/aws-builders/logistic-regression-2i31</link>
      <guid>https://forem.com/aws-builders/logistic-regression-2i31</guid>
      <description>&lt;p&gt;💎 Concept of Logistic Regression&lt;br&gt;
💎 Applications of Classification&lt;br&gt;
💎 Concepts of True Positive, True Negative, False Positive , False Negative &lt;br&gt;
💎 Concepts of Sensitivity and Specificity&lt;br&gt;
💎 Concepts of Precision and Recall and when to apply what&lt;br&gt;
💎 Concepts of F [ Beta ] Score and F1 Score&lt;/p&gt;
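&lt;p&gt;As a quick companion to the video, here is a minimal sketch of how precision, recall, and the F-beta score are computed; the confusion counts are hypothetical, purely for illustration.&lt;/p&gt;

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    # Precision: of all positive predictions, how many were correct?
    precision = tp / (tp + fp)
    # Recall (sensitivity): of all actual positives, how many were found?
    recall = tp / (tp + fn)
    # F-beta blends the two; beta > 1 weights recall more heavily,
    # and beta = 1 gives the familiar F1 score.
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical confusion counts: 80 true positives, 20 false positives,
# 10 false negatives
p, r, f1 = precision_recall_fbeta(80, 20, 10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.89 0.84
```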

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s3SYWRD3i0g"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>classification</category>
    </item>
    <item>
      <title>PCA with Cricket Analytics and AWS SageMaker in 15 minutes</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Sun, 20 Mar 2022 16:23:03 +0000</pubDate>
      <link>https://forem.com/aws-builders/pca-with-cricket-analytics-and-aws-sagemaker-in-15-minutes-4fj6</link>
      <guid>https://forem.com/aws-builders/pca-with-cricket-analytics-and-aws-sagemaker-in-15-minutes-4fj6</guid>
      <description>&lt;p&gt;🎁 Best Bowler in Indian Premier League [ Cricket ] 2020 season using PCA &lt;br&gt;
🎁 Principal Component Analysis  concepts&lt;br&gt;
🎁 Usage of AWS Sagemaker and PCA&lt;br&gt;
🎁 Code  and Data Files in the comments&lt;br&gt;
&lt;a href="https://github.com/ambarishg/sagemaker"&gt;https://github.com/ambarishg/sagemaker&lt;/a&gt; has the code and data files&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/InTBvFRzz-0?start=42"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sagemaker</category>
      <category>datascience</category>
      <category>pca</category>
    </item>
    <item>
      <title>Central Limit Theorem</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Tue, 15 Mar 2022 15:46:50 +0000</pubDate>
      <link>https://forem.com/aws-builders/central-limit-theorem-892</link>
      <guid>https://forem.com/aws-builders/central-limit-theorem-892</guid>
      <description>&lt;p&gt;The following video on Central Limit Theorem in 5 minutes discusses the following topics        &lt;/p&gt;

&lt;p&gt;✅ Population&lt;br&gt;&lt;br&gt;
✅ Sample&lt;br&gt;&lt;br&gt;
✅ Central Limit Theorem&lt;br&gt;&lt;br&gt;
✅ Central Limit Theorem example      &lt;/p&gt;
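&lt;p&gt;As a minimal illustration of the theorem (a hypothetical uniform population, not taken from the video): means of repeated samples cluster around the population mean, with a spread that shrinks like sigma / sqrt(n).&lt;/p&gt;

```python
import random
import statistics

# A uniform(0, 1) population; its mean is 0.5 and its standard
# deviation is 1/sqrt(12) ~ 0.289.
random.seed(0)
population = [random.uniform(0, 1) for _ in range(100_000)]

# Draw many samples of size n = 50 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(2_000)
]

print(round(statistics.mean(sample_means), 2))   # close to 0.5
print(round(statistics.stdev(sample_means), 3))  # close to 0.289 / sqrt(50)
```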

&lt;p&gt;Hope you find it useful in your data science and machine learning journey.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4nUhy9kc2VU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Binomial Distribution and Case studies</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Sun, 31 Oct 2021 18:39:04 +0000</pubDate>
      <link>https://forem.com/ambarishg/binomial-distribution-and-case-studies-md3</link>
      <guid>https://forem.com/ambarishg/binomial-distribution-and-case-studies-md3</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Mti755NyFNc"&gt;
&lt;/iframe&gt;
            &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yyqoggEqMG8"&gt;
&lt;/iframe&gt;
           &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wYY52gc-QFc"&gt;
&lt;/iframe&gt;
  &lt;/p&gt;
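&lt;p&gt;The videos above walk through the binomial distribution and case studies; as a minimal sketch of the core formula, the probability mass function can be computed directly (the coin-flip numbers below are a hypothetical example, not from the videos).&lt;/p&gt;

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k): probability of exactly k successes in n independent
    # trials, each succeeding with probability p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical case: exactly 3 heads in 5 fair coin flips
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```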

</description>
      <category>statistics</category>
    </item>
    <item>
      <title>Descriptive Statistics Part 2</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Sun, 24 Oct 2021 15:43:16 +0000</pubDate>
      <link>https://forem.com/ambarishg/descriptive-statistics-part-2-51hb</link>
      <guid>https://forem.com/ambarishg/descriptive-statistics-part-2-51hb</guid>
      <description>&lt;p&gt;&lt;code&gt;statistics&lt;/code&gt; &lt;br&gt;
Chebyshev Theorem, Skewness, Kurtosis, Percentiles explained&lt;br&gt;&lt;br&gt;
I enjoyed creating this video and hope all of you will like it.         &lt;/p&gt;
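&lt;p&gt;Of the topics above, Chebyshev's Theorem has a particularly compact statement; a minimal sketch of the bound it guarantees:&lt;/p&gt;

```python
def chebyshev_lower_bound(k):
    # Chebyshev's theorem: for ANY distribution, at least 1 - 1/k^2
    # of the values lie within k standard deviations of the mean (k > 1)
    return 1 - 1 / k ** 2

print(chebyshev_lower_bound(2))  # 0.75 -> at least 75% within 2 std devs
print(chebyshev_lower_bound(3))  # at least ~88.9% within 3 std devs
```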

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/h_bc3eirBcM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>statistics</category>
    </item>
    <item>
      <title>Descriptive Statistics Part 1</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Thu, 09 Sep 2021 04:58:54 +0000</pubDate>
      <link>https://forem.com/ambarishg/descriptive-statistics-part-1-3pke</link>
      <guid>https://forem.com/ambarishg/descriptive-statistics-part-1-3pke</guid>
      <description>&lt;p&gt;This is the 1st video in the Descriptive Statistics playlist. This is an introduction to Descriptive Statistics in a very simple manner.We will discuss the following&lt;br&gt;
💎 Mean , Median , Mode&lt;br&gt;&lt;br&gt;
💎 Variance and Standard Deviation&lt;br&gt;&lt;br&gt;
💎 Covariance and Correlation &lt;/p&gt;
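&lt;p&gt;The first few of these measures can be tried out directly with Python's standard library; the toy sample below is purely for illustration.&lt;/p&gt;

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # toy sample for illustration

print(statistics.mean(data))       # 5
print(statistics.median(data))     # 4.5
print(statistics.mode(data))       # 4
print(statistics.pvariance(data))  # population variance: 4
print(statistics.pstdev(data))     # population standard deviation: 2.0
```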

&lt;p&gt;I enjoyed creating this video and hope all of you will like it.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2rk-dpfOGTU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Single Neuron video</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Fri, 02 Jul 2021 19:00:09 +0000</pubDate>
      <link>https://forem.com/ambarishg/single-neuron-video-4i7a</link>
      <guid>https://forem.com/ambarishg/single-neuron-video-4i7a</guid>
      <description>&lt;p&gt;This is the first video. We explain the single neuron here.&lt;br&gt;&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/B8wrXc9QvH8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
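&lt;p&gt;As a minimal sketch of what the video explains: a single neuron computes a weighted sum of its inputs plus a bias, then applies an activation function. The weights and bias below are hypothetical, purely for illustration, and sigmoid is one common choice of activation.&lt;/p&gt;

```python
import math

def neuron(inputs, weights, bias):
    # A single neuron: weighted sum of the inputs plus a bias,
    # squashed into (0, 1) by a sigmoid activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

# Hypothetical weights and bias: z = 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1
print(round(neuron([1.0, 2.0], [0.5, -0.25], 0.1), 3))  # sigmoid(0.1) ~ 0.525
```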

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Convolution in 1 dimension</title>
      <dc:creator>Ambarish Ganguly</dc:creator>
      <pubDate>Mon, 26 Apr 2021 10:48:58 +0000</pubDate>
      <link>https://forem.com/ambarishg/convolution-in-1-dimension-42l9</link>
      <guid>https://forem.com/ambarishg/convolution-in-1-dimension-42l9</guid>
      <description>&lt;h1&gt;
  
  
  Basics
&lt;/h1&gt;

&lt;p&gt;The &lt;strong&gt;Convolutional&lt;/strong&gt; block is one of the basic building blocks used in deep learning. We go in-depth with Convolution in 1 dimension and understand the basics of convolution, strides, and padding. We explain visually and also through PyTorch code to verify our concepts.   &lt;/p&gt;

&lt;p&gt;The kernel takes an input and produces an output, which is sometimes referred to as a &lt;code&gt;feature map&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cao5kxt96zowgh1kjwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cao5kxt96zowgh1kjwn.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kernel is made up of many things; this is a very simplified picture of them. The weights, biases, strides, and padding are some of its key parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawd103mfeb2bxq02vcxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawd103mfeb2bxq02vcxx.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Kernel Size =  1 , Stride = 1
&lt;/h1&gt;

&lt;p&gt;Here the size of the kernel is 1. It has a single weight and bias.               &lt;/p&gt;

&lt;p&gt;Input is [ 2, 3, 4 ]    &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Stride is 1&lt;/code&gt;, therefore the kernel moves 1 slot after every operation.&lt;/p&gt;

&lt;p&gt;Outputs are    &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;2 * weight +  bias
&lt;/li&gt;
&lt;li&gt;3 * weight +  bias . The kernel moves 1 slot and operates on 3
&lt;/li&gt;
&lt;li&gt;4 * weight +  bias. The kernel moves 1 slot and operates on 4
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ryyg64gs3ho46f7gggg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ryyg64gs3ho46f7gggg.png" alt="image"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;p&gt;We implemented this in PyTorch and obtained the same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stride&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([[[&lt;/span&gt;&lt;span class="mf"&gt;2.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;3.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;4.&lt;/span&gt;&lt;span class="p"&gt;,]]])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Kernel Size =  2 , Stride = 1
&lt;/h1&gt;

&lt;p&gt;Here the size of the kernel is 2. It has &lt;strong&gt;2 weights&lt;/strong&gt; and a bias.   &lt;/p&gt;

&lt;p&gt;Input is [ 2, 3, 4 ]&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 1&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The weights w0 and w1 operate on inputs 2, 3. This provides the output 2 * w0 + 3 * w1 +  bias           &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 2&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The weights w0 and w1 operate on inputs 3, 4. This provides the output 3 * w0 + 4 * w1 +  bias   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcidd5y3rkwa29c710myt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcidd5y3rkwa29c710myt.png" alt="image"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;p&gt;We implemented this in PyTorch and obtained the same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stride&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;

&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; 
&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; 

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Kernel Size =  2 , Stride = 2
&lt;/h1&gt;

&lt;p&gt;Here the size of the kernel is 2. It has &lt;strong&gt;2 weights&lt;/strong&gt; and a bias.   &lt;/p&gt;

&lt;p&gt;Input is [ 2, 3, 4 ]&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 1&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The weights w0 and w1 operate on inputs 2, 3. This provides the output 2 * w0 + 3 * w1 +  bias           &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 2&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The kernel moves &lt;strong&gt;2&lt;/strong&gt; slots. Therefore, the kernel cannot operate on 4.    &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fxf50byzy2ujynf59fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fxf50byzy2ujynf59fo.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stride&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;

&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; 

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
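&lt;p&gt;The cases above all follow the standard Conv1d output-length formula, floor((n_in + 2*padding - kernel_size) / stride) + 1; a small sketch to verify the counts for our length-3 input:&lt;/p&gt;

```python
import math

def conv1d_out_len(n_in, kernel_size, stride=1, padding=0):
    # Standard Conv1d output-length formula (dilation omitted for simplicity)
    return math.floor((n_in + 2 * padding - kernel_size) / stride) + 1

# Input [2, 3, 4] has length 3
print(conv1d_out_len(3, kernel_size=1, stride=1))             # 3 outputs
print(conv1d_out_len(3, kernel_size=2, stride=1))             # 2 outputs
print(conv1d_out_len(3, kernel_size=2, stride=2))             # 1 output
print(conv1d_out_len(3, kernel_size=2, stride=2, padding=1))  # 2 outputs
```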



&lt;h1&gt;
  
  
  Kernel Size =  2 , Stride = 2 , Padding = 1
&lt;/h1&gt;

&lt;p&gt;Here the size of the kernel is 2. It has &lt;strong&gt;2 weights&lt;/strong&gt; and a bias.&lt;/p&gt;

&lt;p&gt;With padding = 1, a zero is added on each side of the input, as you can see in the figure.&lt;/p&gt;

&lt;p&gt;Input is [ 2, 3, 4 ]&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 1&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The weights w0 and w1 operate on inputs 0, 2. This provides the output 0 * w0 + 2 * w1 +  bias           &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Step 2&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
The kernel moves &lt;strong&gt;2&lt;/strong&gt; slots. &lt;br&gt;
The weights w0 and w1 operate on inputs 3, 4. This provides the output 3 * w0 + 4 * w1 +  bias &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkejb1yjvag15mknnio4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkejb1yjvag15mknnio4.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stride&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;padding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
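&lt;p&gt;The two steps above can also be cross-checked against the general output-length formula for a 1-D convolution, L_out = floor((L_in + 2 * padding - kernel_size) / stride) + 1. The helper function below is an illustrative sketch added here (it is not part of the original notebook):&lt;/p&gt;

```python
import torch
import torch.nn as nn

def conv1d_output_length(l_in, kernel_size, stride, padding):
    # L_out = floor((L_in + 2*padding - kernel_size) / stride) + 1
    return (l_in + 2 * padding - kernel_size) // stride + 1

# For the example above: input length 3, kernel_size 2, stride 2, padding 1
# -> floor((3 + 2 - 2) / 2) + 1 = 2 output positions (the two steps shown)
print(conv1d_output_length(3, kernel_size=2, stride=2, padding=1))  # 2

# Cross-check against PyTorch: the convolution output has the same length
m = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=2, stride=2, padding=1)
x = torch.tensor([[[2.0, 3.0, 4.0]]])
print(m(x).shape[-1])  # 2
```

&lt;p&gt;This confirms why the walkthrough has exactly two steps: the padded input [ 0, 2, 3, 4, 0 ] has length 5, and a kernel of size 2 moving with stride 2 fits in exactly 2 positions.&lt;/p&gt;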



&lt;h1&gt;
  
  
  Kaggle notebook link
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/ambarish/conv1d-deep-dive" rel="noopener noreferrer"&gt;Convolution in 1 dimension deep dive&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>pytorch</category>
    </item>
  </channel>
</rss>
