<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abel Peter</title>
    <description>The latest articles on Forem by Abel Peter (@peterabel).</description>
    <link>https://forem.com/peterabel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F912954%2Fef225662-d53b-4f91-b2aa-ae4f706ecf43.jpg</url>
      <title>Forem: Abel Peter</title>
      <link>https://forem.com/peterabel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/peterabel"/>
    <language>en</language>
    <item>
      <title>Why Sentiment Analysis Needs an Upgrade: Welcome Sentimetric</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Thu, 01 Jan 2026 08:51:43 +0000</pubDate>
      <link>https://forem.com/peterabel/why-sentiment-analysis-needs-an-upgrade-welcome-sentimetric-1pj4</link>
      <guid>https://forem.com/peterabel/why-sentiment-analysis-needs-an-upgrade-welcome-sentimetric-1pj4</guid>
      <description>&lt;p&gt;I built Sentimetric because I was tired of sentiment analysis libraries that think it's still 2010.&lt;/p&gt;

&lt;p&gt;You know what I mean. You run a comment like "This is insane! thank you!" through TextBlob and it confidently tells you that's negative sentiment. Score: -1.0. The most negative comment in the entire dataset, apparently.&lt;/p&gt;

&lt;p&gt;Meanwhile, any human reading that comment knows exactly what it means: someone's genuinely excited and grateful. But traditional sentiment analysis libraries? They see "insane" and panic.&lt;/p&gt;

&lt;h2&gt;The Problem With Most Sentiment Analysis Tools&lt;/h2&gt;

&lt;p&gt;Here's the thing about language in 2025: it's messy, contextual, and constantly evolving. We use words like "insane," "sick," "fire," and "unreal" to express enthusiasm. We layer on sarcasm with emoji. We pack entire emotional landscapes into phrases like "Oh great, another bug 🙄."&lt;/p&gt;

&lt;p&gt;But most sentiment analysis libraries are still operating on lexicons built years ago, where "insane" only means bad things and "excellent" is always positive (even when you're saying it sarcastically).&lt;/p&gt;

&lt;p&gt;The example above is real, by the way. I analyzed YouTube comments using TextBlob, and it classified "This is insane! thank you!" as the most negative comment in a dataset of 208 comments. Not just negative—the &lt;em&gt;most&lt;/em&gt; negative.&lt;/p&gt;
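To see mechanically why this happens, here's a toy sketch of the static-lexicon approach (my simplification, not TextBlob's actual implementation): a fixed word-to-score map, averaged together with zero awareness of context.

```python
# Toy static-lexicon scorer -- a simplified illustration of how older
# sentiment libraries work, NOT TextBlob's actual implementation.
LEXICON = {
    "insane": -1.0,   # only the literal sense is recorded
    "thank": 0.4,
    "great": 0.8,
    "terrible": -0.9,
}

def lexicon_score(text: str) -> float:
    """Average the scores of known words; context is ignored entirely."""
    words = [w.strip("!.,?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_score("This is insane! thank you!"))  # negative, despite the gratitude
```

The strongly negative "insane" entry swamps the mild positivity of "thank", so the enthusiastic comment nets out negative. That's the whole failure mode in four lines.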

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;If you're analyzing customer feedback, social media sentiment, or user reviews, these misclassifications aren't just amusing quirks. They're actively misleading your decisions. &lt;/p&gt;

&lt;p&gt;Imagine making product decisions based on the "insight" that customers expressing excitement with modern slang are actually your most dissatisfied users. Or filtering out comments as toxic when they're actually enthusiastic endorsements.&lt;/p&gt;

&lt;h2&gt;Testing the Competition: A Reality Check&lt;/h2&gt;

&lt;p&gt;I put together a test set of 20 real-world phrases—the kind you see every day on social media, in product reviews, and customer feedback. Modern slang, sarcasm, emoji, the works. Then I ran them through TextBlob and VADER, two of the most popular sentiment analysis libraries.&lt;/p&gt;

&lt;p&gt;The results? Embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall Accuracy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VADER: 35%&lt;/li&gt;
&lt;li&gt;Sentimetric: 20%&lt;/li&gt;
&lt;li&gt;TextBlob: 15%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wait, what? Even Sentimetric struggled? That's because rule-based systems—no matter how modern—have fundamental limitations. But here's where it gets interesting.&lt;/p&gt;

&lt;h3&gt;Where Sentimetric Gets It Right (And Others Don't)&lt;/h3&gt;

&lt;p&gt;Let's look at three examples where Sentimetric's modern language understanding shines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. "This is insane! thank you!"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected: Positive&lt;/li&gt;
&lt;li&gt;Sentimetric: ✓ Positive&lt;/li&gt;
&lt;li&gt;TextBlob: ✗ Negative&lt;/li&gt;
&lt;li&gt;VADER: ✗ Negative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sentimetric understands that "insane" in the context of excitement and gratitude is positive. The others see "insane" and immediately classify it as negative, completely missing the enthusiastic tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. "This product is fire 🔥"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected: Positive
&lt;/li&gt;
&lt;li&gt;Sentimetric: ✓ Positive&lt;/li&gt;
&lt;li&gt;TextBlob: ✗ Neutral&lt;/li&gt;
&lt;li&gt;VADER: ✗ Negative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Fire" isn't about disasters anymore—it means something is excellent. Sentimetric knows this. VADER thinks the product is literally on fire (negative), and TextBlob just gives up (neutral).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. "Wonderful! My favorite thing is when apps crash"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected: Negative (sarcasm)&lt;/li&gt;
&lt;li&gt;Sentimetric: ✓ Negative&lt;/li&gt;
&lt;li&gt;TextBlob: ✗ Positive&lt;/li&gt;
&lt;li&gt;VADER: ✗ Positive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is sarcasm. Sentimetric catches it. The others see "Wonderful!" and "favorite" and happily classify it as positive, completely missing the obvious sarcasm about app crashes.&lt;/p&gt;

&lt;h3&gt;The Category Breakdown&lt;/h3&gt;

&lt;p&gt;When we break down performance by challenge type, the gaps become even clearer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern Slang:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentimetric: 40% accuracy&lt;/li&gt;
&lt;li&gt;TextBlob: 0%&lt;/li&gt;
&lt;li&gt;VADER: 0%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sarcasm:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentimetric: 75% accuracy (with advanced patterns)&lt;/li&gt;
&lt;li&gt;TextBlob: 0%&lt;/li&gt;
&lt;li&gt;VADER: 25%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Emoji Context:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentimetric: 60% accuracy&lt;/li&gt;
&lt;li&gt;TextBlob: 33%&lt;/li&gt;
&lt;li&gt;VADER: 67%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On modern slang especially, traditional tools aren't just struggling—they're failing outright, at 0% accuracy.&lt;/p&gt;
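The per-category numbers come from a straightforward grouping of test cases. Here's a minimal sketch of that bookkeeping, with a hypothetical mini test set standing in for the real 20-phrase set:

```python
from collections import defaultdict

# Hypothetical mini test set: (category, expected_label, predicted_label).
# The real 20-phrase set isn't reproduced here.
results = [
    ("slang",   "positive", "positive"),
    ("slang",   "positive", "negative"),
    ("sarcasm", "negative", "negative"),
    ("sarcasm", "negative", "positive"),
    ("emoji",   "negative", "negative"),
]

def accuracy_by_category(rows):
    """Fraction of correct predictions per challenge category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, expected, predicted in rows:
        totals[category] += 1
        hits[category] += (expected == predicted)
    return {c: hits[c] / totals[c] for c in totals}

print(accuracy_by_category(results))
# {'slang': 0.5, 'sarcasm': 0.5, 'emoji': 1.0}
```

Grouping by challenge type, rather than reporting a single overall number, is exactly what exposes the 0% slang scores that a blended average would hide.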

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw20gukovty8bly0w7p3d.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw20gukovty8bly0w7p3d.PNG" alt="COMPARISON RESULTS - ALL TOOLS" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc6o3cgfkxb31fxx3qpz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc6o3cgfkxb31fxx3qpz.PNG" alt="Accuracy analysis" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;But Wait—There's a Better Way&lt;/h2&gt;

&lt;p&gt;Here's the honest truth: even with all the modern slang dictionaries, emoji mappings, and sarcasm patterns I built into Sentimetric, rule-based systems have a ceiling. Language is too creative, too contextual, too &lt;em&gt;human&lt;/em&gt; for rules alone.&lt;/p&gt;

&lt;p&gt;That's why Sentimetric offers something different: seamless LLM integration.&lt;/p&gt;

&lt;h3&gt;The LLM Difference: Understanding, Not Just Pattern Matching&lt;/h3&gt;

&lt;p&gt;I ran the same test set through Sentimetric's LLM analyzer (using DeepSeek, which is incredibly affordable). The results speak for themselves:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall Accuracy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-Enhanced: 93.3% (14/15 cases)&lt;/li&gt;
&lt;li&gt;Rule-Based: 53.3% (8/15 cases)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM rescued all 7 cases that rule-based analysis missed—a 40-percentage-point jump in accuracy on the hardest cases.&lt;/p&gt;

&lt;p&gt;But here's what's really powerful—the LLM doesn't just give you a classification. It explains &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;Real Examples: When LLMs Save The Day&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example 1: "Oh great, another bug 🙄"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule-based:&lt;/em&gt; Positive ✗&lt;br&gt;&lt;br&gt;
&lt;em&gt;LLM:&lt;/em&gt; Negative ✓&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Reasoning:&lt;/strong&gt; "The phrase 'Oh great' is sarcastic, and the eye-roll emoji (🙄) expresses frustration and annoyance about encountering another bug."&lt;/p&gt;

&lt;p&gt;The rule-based analyzer saw "great" and missed the sarcasm. The LLM understood the context, the emoji, and the actual meaning.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Example 2: "I appreciate the effort, but this doesn't meet our standards"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule-based:&lt;/em&gt; Neutral ✗&lt;br&gt;&lt;br&gt;
&lt;em&gt;LLM:&lt;/em&gt; Negative ✓&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Reasoning:&lt;/strong&gt; "The phrase acknowledges effort with 'I appreciate the effort' but delivers criticism with 'doesn't meet our standards', making the overall sentiment negative."&lt;/p&gt;

&lt;p&gt;This is a polite rejection—the kind you see in professional contexts. Rule-based analysis couldn't weigh the "but" properly. The LLM understood the diplomatic language.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Example 3: "I love how they fixed one bug and introduced five more 👏"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule-based:&lt;/em&gt; Positive ✗&lt;br&gt;&lt;br&gt;
&lt;em&gt;LLM:&lt;/em&gt; Negative ✓&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Reasoning:&lt;/strong&gt; "Sarcastic praise about fixing one bug while creating more problems, indicated by the clap emoji used ironically."&lt;/p&gt;

&lt;p&gt;The clapping emoji can be genuine applause or sarcastic. The LLM reads the context and nails it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2oe04f3hywulpkk5e5f.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2oe04f3hywulpkk5e5f.PNG" alt="llm comparison" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeqkc0gct8g19p2q8epr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeqkc0gct8g19p2q8epr.PNG" alt="llms perfomance" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Architecture That Makes Sense&lt;/h2&gt;

&lt;p&gt;Here's how I designed Sentimetric to actually work in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For 80% of your text: Use rule-based analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast (milliseconds)&lt;/li&gt;
&lt;li&gt;Free&lt;/li&gt;
&lt;li&gt;No API calls&lt;/li&gt;
&lt;li&gt;Good enough for straightforward sentiment
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentimetric&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;analyze&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Great product, fast shipping!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Quick, free, accurate
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;For the 20% that matters: Use LLM analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles sarcasm, nuance, and complexity&lt;/li&gt;
&lt;li&gt;Provides reasoning&lt;/li&gt;
&lt;li&gt;Multiple affordable providers (DeepSeek, OpenAI, Claude, Gemini)&lt;/li&gt;
&lt;li&gt;Automatic fallback to cheaper models
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentimetric&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMAnalyzer&lt;/span&gt;

&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMAnalyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Oh great, another bug 🙄&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'negative'
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Full explanation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You get the speed and cost-efficiency of rule-based for bulk processing, and the intelligence of LLMs when you need it. Not an either/or choice—both, when appropriate.&lt;/p&gt;
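The routing decision itself is simple. Here's a hypothetical dispatcher sketching the rules-first, LLM-on-demand pattern — the `rule_based` and `llm_analyze` stand-ins and the `confidence` field are my illustrations, not Sentimetric's actual API:

```python
# Hypothetical rules-first router -- illustrative only; the real
# Sentimetric API may expose different names and fields.
def rule_based(text: str) -> tuple[str, float]:
    """Stand-in rule-based analyzer returning (category, confidence)."""
    if "🙄" in text or "but" in text.lower():
        return ("uncertain", 0.3)   # ambiguity cues: report low confidence
    return ("positive", 0.9) if "great" in text.lower() else ("neutral", 0.8)

def llm_analyze(text: str) -> str:
    """Stand-in for the expensive LLM call."""
    return "negative"  # pretend the LLM resolved the sarcasm

def analyze_hybrid(text: str, threshold: float = 0.6) -> str:
    category, confidence = rule_based(text)   # fast, free path
    if confidence >= threshold:
        return category                       # most traffic stops here
    return llm_analyze(text)                  # escalate the hard cases

print(analyze_hybrid("great product"))            # handled by rules alone
print(analyze_hybrid("Oh great, another bug 🙄"))  # escalated to the LLM
```

The design choice that matters is the threshold: it's the dial that trades API spend against accuracy on the ambiguous tail.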
&lt;h2&gt;The Path Forward&lt;/h2&gt;

&lt;p&gt;Sentiment analysis shouldn't think "This is insane! thank you!" is negative. It shouldn't miss obvious sarcasm. It shouldn't be stuck in 2010 while language evolves around it.&lt;/p&gt;

&lt;p&gt;Sentimetric is my answer to this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modern rule-based analysis&lt;/strong&gt; that actually understands today's language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless LLM integration&lt;/strong&gt; for the cases that need it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-conscious design&lt;/strong&gt; that won't bankrupt your API budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple API&lt;/strong&gt; that gets out of your way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn't to build the perfect sentiment analyzer—that's impossible. The goal is to give you the right tool for each job, make it dead simple to use, and keep improving as language evolves.&lt;/p&gt;
&lt;h2&gt;Try It Yourself&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sentimetric
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Quick analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentimetric&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;analyze&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is fire! 🔥&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 'positive'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LLM analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentimetric&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMAnalyzer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Oh great, another bug 🙄&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# 'negative'
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Full explanation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Compare methods:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentimetric&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compare_methods&lt;/span&gt;

&lt;span class="nf"&gt;compare_methods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is insane! thank you!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# See rule-based vs LLM side-by-side
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;I'm actively improving Sentimetric's rule-based engine with more modern patterns, better emoji handling, and smarter sarcasm detection. The library is open source, and I'd love your feedback, bug reports, and examples of where sentiment analysis has failed you.&lt;/p&gt;

&lt;p&gt;Because language keeps evolving. And our tools need to keep up.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/peter-abel/sentimetric" rel="noopener noreferrer"&gt;github.com/peter-abel/sentimetric&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Email:&lt;/strong&gt; &lt;a href="mailto:peterabel791@gmail.com"&gt;peterabel791@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's make sentiment analysis actually work for modern language.&lt;/p&gt;

&lt;p&gt;The data is clear: modern language needs modern tools. And when rules aren't enough, you need intelligence on demand.&lt;/p&gt;

</description>
      <category>sentimetric</category>
      <category>llm</category>
      <category>sentimentanalysis</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Everything You Need To Understand Prompt Engineering.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Wed, 11 Jun 2025 12:46:21 +0000</pubDate>
      <link>https://forem.com/peterabel/everything-you-need-to-understand-prompt-engineering-46dl</link>
      <guid>https://forem.com/peterabel/everything-you-need-to-understand-prompt-engineering-46dl</guid>
      <description>&lt;p&gt;In the 2015 comedy "Absolutely Anything," a hapless schoolteacher discovers he can reshape reality with mere words, but every wish backfires spectacularly because he can't quite say what he means. When he asks to be "attractive," he becomes a magnet for metallic objects. When he tries to eliminate noisy neighbors, they vanish entirely. The film's central tragedy isn't that the protagonist lacks power, he has infinite power but he lacks the precision to wield it effectively. Replace the cosmic aliens with transformer architectures and the reality-bending wishes with text prompts, and you have a perfect metaphor for our current relationship with large language models. We stand before systems of unprecedented capability, armed with vast knowledge and reasoning abilities, yet consistently frustrated by our inability to reliably extract what we actually want. The bottleneck isn't the AI's intelligence—it's our interface with that intelligence. Like Neil learning to be more careful with his wishes, we need to fundamentally understand how we communicate with these alien minds we've created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Learning to Communicate with Complex Systems: An Explanatory Framework&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We often approach prompt engineering like programming, expecting deterministic inputs to produce predictable outputs. This mechanistic view leads to frustration, inconsistent results, and a cottage industry of "prompt hacks" that work sometimes, for some people, on some models. But what if we're thinking about this all wrong?&lt;/p&gt;

&lt;p&gt;The reality is that while Large Language Models are deterministic at their core (the same weights at temperature zero produce the same output), they behave like complex systems in ways that make them challenging to communicate with effectively. Understanding this complexity can illuminate why our interactions with them unfold the way they do.&lt;/p&gt;

&lt;h2&gt;The Complexity Reality&lt;/h2&gt;

&lt;p&gt;Large Language Models are complex systems operating in high-dimensional spaces with billions of parameters, trained on vast swaths of human writing. When we send a prompt into this system, we're not executing a simple program. Instead, we're introducing input into a complex system and observing how it responds, much like speaking to a brilliant but alien intelligence and watching how it processes and interprets our words.&lt;/p&gt;

&lt;p&gt;This perspective helps us understand why the same prompt can produce different results, why slight variations in wording can lead to dramatically different outputs, and why what works brilliantly in one context might fail completely in another. We're not dealing with a simple machine—we're learning to communicate across a complex landscape of understanding.&lt;/p&gt;

&lt;h2&gt;The Three Pillars of Effective Communication&lt;/h2&gt;

&lt;p&gt;To better grasp how our prompts actually function within these complex systems, we can categorize them by the types of effects they create in the model's processing.&lt;/p&gt;

&lt;h3&gt;1. Perturbation Mapping: Understanding Your Tools&lt;/h3&gt;

&lt;p&gt;Different types of prompts don't just change what the model says—they fundamentally alter how it thinks about the problem space. Think of prompts as different kinds of forces acting on the system's internal dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framing Prompts&lt;/strong&gt; establish the context and perspective for reasoning. They don't just set a tone; they create entirely different cognitive frameworks where different types of thoughts become natural and accessible.&lt;/p&gt;

&lt;p&gt;Consider how these two approaches shape the same conversation:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"You are a skeptical scientist reviewing a new climate study. The researchers claim..."&lt;/em&gt; This naturally guides the model toward critical analysis, methodological scrutiny, and conservative interpretation, creating a foundation where questioning and verification feel appropriate.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"You are an enthusiastic venture capitalist hearing a startup pitch about climate technology. The entrepreneurs propose..."&lt;/em&gt; This generates a completely different mindset where opportunity recognition, market potential, and optimistic extrapolation become the natural flow of thinking.&lt;/p&gt;

&lt;p&gt;The same factual content about climate technology will be processed through entirely different cognitive frameworks, not because we've given different instructions about what to conclude, but because we've altered the fundamental approach to how the model navigates the concept space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constraint Prompts&lt;/strong&gt; function as boundary conditions, reducing dimensionality in controlled ways. They channel the model's vast generative potential into specific pathways, much like riverbanks directing a powerful current.&lt;/p&gt;

&lt;p&gt;Think about how these examples shape the response:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Explain quantum entanglement in exactly three sentences, using only analogies a child would understand"&lt;/em&gt; This creates both temporal boundaries (three sentences) and conceptual boundaries (child-friendly analogies), forcing the model to compress complex ideas into a constrained yet accessible space.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Describe the fall of Rome without mentioning any specific dates, focusing only on underlying social dynamics"&lt;/em&gt; Here we remove temporal anchors while creating conceptual boundaries around social causation, guiding the model toward pattern-based rather than chronological reasoning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Write a product review for this smartphone, but from the perspective of someone who lived in 1950"&lt;/em&gt; This creates fascinating anachronistic constraints that encourage creative bridging between different technological paradigms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploration Prompts&lt;/strong&gt; encourage the system to move away from default patterns and toward novel combinations. They increase creative possibilities and create space for unexpected connections.&lt;/p&gt;

&lt;p&gt;Consider these examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;"What if we completely inverted our assumption that economic growth is inherently good? How would society reorganize itself around economic degrowth as a positive goal?"&lt;/em&gt; — This pushes the model away from conventional economic reasoning toward contrarian exploration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;"Generate five wildly different approaches to solving traffic congestion, where each approach comes from a completely different field of study"&lt;/em&gt; — This creates cross-domain exploration, encouraging the model to draw connections between disparate knowledge areas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;"Imagine explaining democracy to an intelligent species that reproduces through collective consciousness rather than individual reproduction. What aspects would be incomprehensible to them?"&lt;/em&gt; — This forces perspective-taking that requires reconstructing fundamental assumptions about social organization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Convergence Prompts&lt;/strong&gt; guide diverse thoughts toward synthesis and integration. They take multiple ideas and draw them toward coherent patterns.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;"You've just heard three different experts explain the housing crisis from economic, social, and environmental perspectives. What common thread connects all their explanations, and what unified solution emerges from their intersection?"&lt;/em&gt; — This creates a synthesizing attractor that pulls different analytical frameworks toward integration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;"Take these five seemingly unrelated trends: remote work, cryptocurrency, climate anxiety, social media fatigue, and the maker movement. Weave them into a single narrative about where society is heading"&lt;/em&gt; — This forces the model to find deep connections between surface-level disparate phenomena.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight here is that we can map which prompt types reliably create which kinds of responses from the model, regardless of the specific content domain we're working in.&lt;/p&gt;
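One way to make such a map concrete is a small table of prompt-type templates that can be instantiated for any topic. The templates below are illustrative paraphrases of the examples above, not a canonical taxonomy:

```python
# Illustrative perturbation map: prompt type -> template applied to a topic.
PROMPT_TYPES = {
    "framing":     "You are a skeptical scientist reviewing work on {topic}.",
    "constraint":  "Explain {topic} in exactly three sentences, "
                   "using only analogies a child would understand.",
    "exploration": "What if we inverted our core assumptions about {topic}?",
    "convergence": "What common thread connects the expert perspectives on {topic}?",
}

def build_prompts(topic: str) -> dict[str, str]:
    """Instantiate every prompt type for one topic."""
    return {kind: tpl.format(topic=topic) for kind, tpl in PROMPT_TYPES.items()}

for kind, prompt in build_prompts("climate technology").items():
    print(f"{kind}: {prompt}")
```

Keeping the types separate from the topic is the point: the same four perturbations can be reused across domains, which is what lets you observe which type reliably produces which kind of response.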

&lt;h3&gt;2. Ensemble Dynamics: The Power of Prompt Bundles&lt;/h3&gt;

&lt;p&gt;Individual prompts produce inconsistent outputs because they represent single interactions with a complex system. But just as individual conversations with a brilliant person might vary while their overall expertise remains consistent, the collective behavior of prompt bundles can exhibit surprisingly reliable patterns.&lt;/p&gt;

&lt;p&gt;Think of this like conducting an orchestra where individual musicians might occasionally hit wrong notes, but the overall musical pattern emerges clearly from their collective performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bundle Architecture&lt;/strong&gt; involves designing prompt sets where different members serve complementary roles in exploring the solution space:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor prompts&lt;/strong&gt; establish baseline patterns and set the fundamental parameters of the exploration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Provide a straightforward analysis of renewable energy adoption rates based on current data"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Variation prompts&lt;/strong&gt; introduce controlled mutations that explore different dimensions of the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Analyze renewable energy adoption as if you were an oil company executive concerned about market share"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Examine renewable energy trends from the perspective of a small island nation vulnerable to climate change"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Assess renewable energy adoption through the lens of job market transformation"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Validation prompts&lt;/strong&gt; test consistency across different framings and identify robust insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"What aspects of renewable energy adoption would remain true regardless of political perspective?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Which renewable energy trends are most likely to continue even if current policies changed dramatically?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Integration prompts&lt;/strong&gt; synthesize insights across the bundle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Considering all these different perspectives on renewable energy, what synthesis emerges about the most critical factors driving adoption?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of judging individual outputs, you analyze the pattern of responses. Where do responses cluster? That reveals consistent themes—the ideas that remain stable across different approaches. What's the variance along different dimensions? That shows you where the system is most sensitive to different phrasings. Which concepts appear consistently across different framings? Those represent robust insights that transcend particular perspectives. What novel connections emerge across the response set? Those might represent breakthrough insights.&lt;/p&gt;

&lt;p&gt;You're not trying to engineer one perfect response; you're mapping the landscape of possible answers. You run multiple prompts knowing that the collective pattern will be more informative than any individual response.&lt;/p&gt;
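&lt;p&gt;To make the bundle idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: &lt;code&gt;query_model&lt;/code&gt; is a stub standing in for a real LLM API call, and the canned responses exist only so the pattern analysis has something to chew on.&lt;/p&gt;

```python
from collections import Counter

# Stub standing in for a real LLM call; swap in your provider's client.
def query_model(prompt: str) -> str:
    canned = {
        "anchor": "solar costs falling grid storage policy",
        "skeptic": "grid storage costs policy uncertainty",
        "island": "climate risk solar policy resilience",
    }
    return canned[prompt]

# A tiny "bundle": one anchor prompt plus two variation prompts.
bundle = ["anchor", "skeptic", "island"]
responses = [query_model(p) for p in bundle]

# Count how many responses mention each term; terms that recur across
# framings are the robust insights the bundle is designed to surface.
term_counts = Counter()
for r in responses:
    term_counts.update(set(r.split()))

robust = sorted(t for t, c in term_counts.items() if c >= 2)
print(robust)  # ['costs', 'grid', 'policy', 'solar', 'storage']
```

&lt;p&gt;In practice you would replace the keyword counting with embedding-based clustering, but the shape of the analysis is the same: look at the pattern across responses, not at any single one.&lt;/p&gt;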

&lt;h3&gt;
  
  
  3. Adaptive Iteration: Learning to Navigate the Communication Space
&lt;/h3&gt;

&lt;p&gt;This is where systematic learning meets communication improvement. Each cycle of interaction teaches us something about how the model responds to different types of prompts.&lt;/p&gt;

&lt;p&gt;The process flows naturally from observation to hypothesis to experimentation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response Assessment&lt;/strong&gt;: You analyze the current pattern of responses, looking for clusters, outliers, and gaps in the exploration space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern Recognition&lt;/strong&gt;: You identify which prompt modifications moved you toward or away from useful outputs, building an intuitive map of the model's communication preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Design&lt;/strong&gt;: You craft the next set based on your growing understanding of how different prompt types affect the system's responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation Testing&lt;/strong&gt;: You include prompts specifically designed to test whether apparent progress represents genuine improvement or mere coincidence.&lt;/p&gt;
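&lt;p&gt;The four steps above can be sketched as a simple loop. The &lt;code&gt;query_model&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; functions below are toy stand-ins (a real setup would call an LLM and use human or automatic assessment), but the assess-recognize-design-validate cycle is the same.&lt;/p&gt;

```python
# Toy stand-in: more detailed prompts yield more detailed answers here.
def query_model(prompt: str) -> str:
    return "detail " * len(prompt.split())

# Toy response assessment: longer, more specific answers score higher.
def score(response: str) -> float:
    return min(len(response.split()) / 10, 1.0)

history = []
prompt = "analyze adoption"
for step in range(3):
    response = query_model(prompt)             # run the current prompt
    history.append((prompt, score(response)))  # response assessment
    # pattern recognition + prompt design: this toy rule just adds specificity
    prompt += " with concrete data"

# validation testing: check that scores actually improved over the cycles
best_prompt, best_score = max(history, key=lambda h: h[1])
```

&lt;p&gt;The point of the sketch is the loop structure, not the toy scoring rule: each cycle records what was tried and how it fared, so the next prompt design is informed by evidence rather than guesswork.&lt;/p&gt;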

&lt;p&gt;This creates a learning dynamic where each interaction builds knowledge about communicating effectively with the model. You begin to recognize patterns: which types of prompts consistently improve output quality for different kinds of problems, what combinations work well together, how the model's focus shifts across different domains.&lt;/p&gt;

&lt;p&gt;It's like learning to communicate with a brilliant colleague from a different culture: at first the interactions seem unpredictable, but gradually you begin to recognize the underlying communication patterns and can predict which approaches will be most effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Framework Matters
&lt;/h2&gt;

&lt;p&gt;As LLMs become more powerful and more integrated into our workflows, we need better ways to understand our interactions with them. The old paradigm of treating prompts like programming instructions breaks down as models become more sophisticated and their internal dynamics become more complex.&lt;/p&gt;

&lt;p&gt;The complexity framework offers a perspective that scales with model sophistication. It acknowledges the fundamental unpredictability while providing systematic approaches to navigate it. It transforms the frustration of inconsistent outputs into a deeper appreciation for the sophisticated systems we're working with.&lt;/p&gt;

&lt;p&gt;Most importantly, it shifts us from trying to control AI systems to learning to collaborate with them, recognizing that we're dealing with alien forms of intelligence that require new forms of partnership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Human-AI Collaboration
&lt;/h2&gt;

&lt;p&gt;We're moving toward a world where AI systems are less like tools and more like alien intelligences: powerful, capable, but fundamentally different from human cognition. Success will require new forms of collaboration that respect both the capabilities and the strangeness of these systems.&lt;/p&gt;

&lt;p&gt;Understanding LLMs as complex systems is one step toward that future: a framework that embraces complexity, works with uncertainty, and transforms the apparent unpredictability of AI systems into opportunities for enhanced human capability.&lt;/p&gt;

&lt;p&gt;The question isn't whether you'll need to understand this kind of relationship with AI. The question is whether you'll develop that understanding through intentional learning or through trial and error, and how much time you'll save by choosing the former.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>nlp</category>
    </item>
    <item>
      <title>High Quality Data is All you Need!</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Tue, 02 Jan 2024 20:53:49 +0000</pubDate>
      <link>https://forem.com/peterabel/high-quality-data-is-all-you-need-3p41</link>
      <guid>https://forem.com/peterabel/high-quality-data-is-all-you-need-3p41</guid>
      <description>&lt;p&gt;There are numerous models available from commercial to open source all trained on a huge corpus of data that is available online. These models are referred to us "&lt;em&gt;base models&lt;/em&gt;" in the LLM community lingo, this is because they are &lt;strong&gt;not&lt;/strong&gt; trained to do a specific task, for this, they need to be fine tuned to make them viable for deployment on a production or a business setting so as to provide value to clients of a business. This "finetuning" process is ultimately where the business value resides and as we shall see high quality data is what you need to have an edge which is the subject of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data integrity.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data integrity plays a significant role in the evaluation and performance of Large Language Models (LLMs). Today we will focus on three main problems that appear in most of the tabular data I have worked with from businesses that process huge amounts of data. The first one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Mixed data types.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A mix of data types can significantly impact machine learning models, as it introduces complexity in feature engineering, preprocessing, and model interpretation. Common cases include numerical data mixed with text data, and date and time data mixed with text data. This usually arises from the use of different ETL processes, so a data cleaning step is needed to clear the hurdle; that step can be streamlined using the &lt;a href="https://docs.deepchecks.com/0.17/getting-started/welcome.html" rel="noopener noreferrer"&gt;deepchecks&lt;/a&gt; library as shown below. &lt;br&gt;
First install the deepchecks library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install deepchecks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;import dependencies and dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import dependencies
import numpy as np
import pandas as pd

from deepchecks.tabular.datasets.classification import adult

# Prepare functions to insert mixed data types

def insert_new_values_types(col: pd.Series, ratio_to_replace: float, values_list):
    col = col.to_numpy().astype(object)
    indices_to_replace = np.random.choice(range(len(col)), int(len(col) * ratio_to_replace), replace=False)
    new_values = np.random.choice(values_list, len(indices_to_replace))
    col[indices_to_replace] = new_values
    return col

def insert_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['a', 'b', 'c'])

def insert_numeric_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['1.0', '1', '10394.33'])

def insert_number_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, [66, 99.9])

# Load dataset and insert some data type mixing
adult_df, _ = adult.load_data(as_train_test=True, data_format='Dataframe')
adult_df['workclass'] = insert_numeric_string_types(adult_df['workclass'], ratio_to_replace=0.01)
adult_df['education'] = insert_number_types(adult_df['education'], ratio_to_replace=0.1)
adult_df['age'] = insert_string_types(adult_df['age'], ratio_to_replace=0.5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running a check.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import MixedDataTypes

adult_dataset = Dataset(adult_df, cat_features=['workclass', 'education'])
check = MixedDataTypes()
result = check.run(adult_dataset)
result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F432l7u6j3uhbd1outop4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F432l7u6j3uhbd1outop4.PNG" alt=" " width="800" height="171"&gt;&lt;/a&gt; &lt;br&gt;
The output above shows the percentages of mixed data types in the columns provided.&lt;/p&gt;
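&lt;p&gt;Once the check has flagged a column, a common next step (a plain pandas sketch, not part of the deepchecks guide above) is to coerce the column to its intended type and isolate the offending entries:&lt;/p&gt;

```python
import pandas as pd

# A column that should be numeric but has stray strings mixed in.
col = pd.Series([25, "a", 47, "1.0", 33])

numeric = pd.to_numeric(col, errors="coerce")  # non-numeric strings become NaN
bad_rows = col[numeric.isna()]                 # the truly non-numeric entries

print(bad_rows.tolist())  # ['a']
```

&lt;p&gt;Note that numeric strings like &lt;code&gt;"1.0"&lt;/code&gt; coerce cleanly, so only genuinely non-numeric values are surfaced for manual review.&lt;/p&gt;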

&lt;p&gt;&lt;strong&gt;2. Null data(NaN).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Null data, also known as missing data, refers to the absence of values in certain observations or variables within a dataset. Missing data can occur for various reasons, including data collection errors, non-response in surveys, or system failures. &lt;br&gt;
You will mostly encounter this in &lt;em&gt;&lt;strong&gt;clickstream data&lt;/strong&gt;&lt;/em&gt;, which captures user interactions on websites or applications and provides insights into user behavior, and in &lt;em&gt;&lt;strong&gt;customer support logs&lt;/strong&gt;&lt;/em&gt;, such as chat logs and support ticket histories. To check for this using Deepchecks, you can power up a Colab notebook and follow the guide steps below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note: you might need to run &lt;code&gt;pip install deepchecks&lt;/code&gt; if you are using a new notebook.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Below are the dependencies you will need.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from deepchecks.tabular.checks.data_integrity import PercentOfNulls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Read the csv data to a dataframe.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data is a sample of clickstream data that can be downloaded from &lt;a href="https://github.com/peter-abel/Model-Chain-Projects/blob/main/events-export-2795217-1678446726055.csv" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('/content/events-export-2795217-1678446726055.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Running a check.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = PercentOfNulls().run(df)
result.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;output&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ndv22i8mcif6jqx3bf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ndv22i8mcif6jqx3bf5.png" alt=" " width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output gives a very clear visual presentation of which columns have null values and by what percentage, allowing an engineer to decide on the course of action.&lt;/p&gt;
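&lt;p&gt;As a hedged illustration of one possible course of action (plain pandas, not part of the deepchecks guide): drop columns that are mostly null and impute the rest. The frame and column names below are invented for the example.&lt;/p&gt;

```python
import pandas as pd

# Toy clickstream-like frame; column names are made up for illustration.
df = pd.DataFrame({
    "page": ["home", "cart", None, "home"],
    "referrer": [None, None, None, "ads"],   # 75% null
})

null_pct = df.isna().mean()                            # per-column null fraction
df = df.drop(columns=null_pct[null_pct > 0.5].index)   # drop mostly-null columns
df["page"] = df["page"].fillna("unknown")              # impute the remainder
```

&lt;p&gt;The 0.5 threshold is a judgment call; the percentages reported by &lt;code&gt;PercentOfNulls&lt;/code&gt; are what inform where to draw that line.&lt;/p&gt;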

&lt;p&gt;&lt;strong&gt;Define a Condition.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A condition allows us to validate model and data quality, and lets us know whether some threshold is met. This then informs the course of action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;check = PercentOfNulls().add_condition_percent_of_nulls_not_greater_than()
result = check.run(df)
result.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50zosxipfpynnwvsocw2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50zosxipfpynnwvsocw2.PNG" alt=" " width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data duplicates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data duplicates refer to identical or nearly identical instances or observations within a dataset. In other words, duplicate data occurs when two or more records in a dataset share the same values across all or a significant portion of their features. These duplicates can manifest in various forms depending on the context of the data, and they may arise for different reasons.&lt;/p&gt;

&lt;p&gt;Duplicates occasionally lead to overfitting, where the model learns to perform well on the training set but fails to generalize to new, unseen data; the model may memorize the duplicated patterns instead of learning the underlying patterns of the data.&lt;br&gt;
Here is a way to detect data duplicates using the deepchecks library.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Import dependencies.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

import pandas as pd

from deepchecks.tabular.datasets.classification.phishing import load_data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Loading the data.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hl482ss5gou451j65ql.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hl482ss5gou451j65ql.PNG" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run the Check.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deepchecks.tabular.checks import DataDuplicates

DataDuplicates().run(phishing_dataset)

DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80fzpbu8898upk5yl4uv.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80fzpbu8898upk5yl4uv.PNG" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;br&gt;
This output indicates that 4.11% of the samples are duplicates. In a small dataset this might not amount to much, but when working with millions of entries, these duplicates could result in a significant overfitting problem.&lt;/p&gt;
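&lt;p&gt;Once the check has quantified the problem, the duplicates themselves can be removed with plain pandas (a sketch, not part of the deepchecks guide; the toy frame below only borrows the "entropy" and "numParams" column names for illustration):&lt;/p&gt;

```python
import pandas as pd

# Toy frame with one row repeated three times.
df = pd.DataFrame({
    "entropy":   [0.1, 0.1, 0.7, 0.1],
    "numParams": [3,   3,   9,   3],
})

deduped = df.drop_duplicates()      # keep the first copy of each duplicate row
ratio = 1 - len(deduped) / len(df)  # fraction of rows that were duplicates
```

&lt;p&gt;Computing the same ratio after deduplication is a quick way to confirm the cleanup before re-running the check.&lt;/p&gt;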

&lt;p&gt;&lt;strong&gt;Define a condition.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A condition is &lt;a href="https://docs.deepchecks.com/0.17/getting-started/welcome.html" rel="noopener noreferrer"&gt;deepchecks'&lt;/a&gt; way to validate model and data quality, and let you know if anything goes wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;check = DataDuplicates()
check.add_condition_ratio_less_or_equal(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdil7kutld3fgcobnmpe4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdil7kutld3fgcobnmpe4.PNG" alt=" " width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the realm of machine learning, the quality of training data is paramount to the success and reliability of models. The commitment to data integrity not only ensures the reliability of machine learning models but also contributes to the broader goal of responsible and impactful AI development.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good learning!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>data</category>
      <category>ai</category>
      <category>deepchecks</category>
    </item>
    <item>
      <title>AI alignment Problem.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Fri, 28 Jul 2023 13:59:05 +0000</pubDate>
      <link>https://forem.com/peterabel/ai-alignment-problem-4641</link>
      <guid>https://forem.com/peterabel/ai-alignment-problem-4641</guid>
      <description>&lt;p&gt;&lt;strong&gt;Giving AI rights.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficg3bssdwipnk57f710w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficg3bssdwipnk57f710w.png" alt="I, Robot. (2004)" width="471" height="440"&gt;&lt;/a&gt;                 &lt;strong&gt;I, Robot. (2004)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The alignment problem is about how to get AI on our side. With every technology, we have always had to find a way to make it align with our interests. Take cars: they have been a massive improvement in our mobility as a society. I heard someone say the 20th century was the car century. Economies and households all over the world based their entire livelihoods on cars. Yet aligning cars has never been easy; cars today are still involved in many fatal accidents, and while human judgment is the main culprit, we have also built seatbelts and a lot of design and technology improvements, and still, here we are. With every new technology before AI, success meant a good symbiosis between humans and the technology; now success comes from standing at a healthy distance and letting it do its thing. With every new technology, we move further and further from the action. This is the crux of the issue: how far do we stand from the black box?&lt;/p&gt;

&lt;p&gt;Viewing it this way invites a lot of fantastical ideas about an AGI that is either perfect and always great, or going to kill everyone. Sorry, even cars can’t live up to that standard.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The alignment problem is a tightrope walk: even if you end up losing your balance and falling, it would still be a momentous achievement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Along this journey, we can anticipate certain events; one that Hinton mentioned is AGI “suing for legal rights”. That is obviously way up in the tree for now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We need balance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The potential risks associated with AGI development cannot be overstated. AGI systems possess the capacity to surpass human-level intelligence and exhibit autonomous decision-making capabilities. Without careful regulation and safety measures, there is a real possibility of unintended consequences and severe negative outcomes. These could range from unintended harmful actions due to misaligned goals to the emergence of super intelligent systems that could surpass human control and impact society in ways that we cannot predict or manage effectively.&lt;/p&gt;

&lt;p&gt;On the other hand, it is equally important to avoid stagnation or unnecessarily restrictive regulations that hinder AGI development. AGI has the potential to bring about tremendous societal benefits, from advancements in healthcare and scientific research to improvements in transportation and automation. It holds the promise of addressing complex global challenges, such as climate change, poverty, and disease. However, this potential can only be realized if AGI is developed responsibly and in a manner that prioritizes safety and human well-being.&lt;/p&gt;

&lt;p&gt;Either way, a slow and deliberate approach to AGI development offers tactical advantages by prioritizing safety, ethical considerations, reliability, comprehensive understanding, societal integration, and iterative improvement. It allows us to navigate the complexities of AGI development more effectively and ensure that the technology is developed in a manner that aligns with human values, minimizes risks, and maximizes its potential for positive impact.&lt;/p&gt;

&lt;p&gt;One difference between cars and AGI is that the harm from cars is decentralized: they harm some people all the time, which gives us time to correct our mistakes, whereas AGI could cause pandemic-level catastrophes.&lt;/p&gt;

&lt;p&gt;It’s better to have an AGI that makes mistakes in small events but cannot cause major black swan events. It’s better to have an AGI that writes dumb lyrics and isn’t coming up with bioweapons than an AGI that cures cancer and old age next year but causes a World War Z-level genocide the year after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI has to work in small teams before it works in large societies.&lt;/strong&gt;&lt;br&gt;
The concept of AGI working in small teams before operating in larger societal contexts is a prudent approach that offers numerous benefits. By starting deployment in small teams, we can foster experimentation and optimize its ability to contribute effectively. Here are some key reasons why this strategy is advantageous:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Iterative Learning and Improvement:&lt;/em&gt;&lt;/strong&gt; Small teams provide a controlled environment where AI systems can be tested, refined, and improved iteratively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Risk Containment:&lt;/em&gt;&lt;/strong&gt; Deploying AI in small teams helps to contain potential risks and mitigate unintended consequences. In a controlled setting, any adverse effects or errors caused by AI systems are limited to the specific team or project, minimizing the potential impact on a larger scale.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Ethical Considerations and Value Alignment:&lt;/strong&gt;&lt;/em&gt; Small teams offer an opportunity to explore the ethical dimensions of AI and align its values with human values.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Efficient Resource Utilization:&lt;/strong&gt;&lt;/em&gt; Starting with small teams allows for efficient utilization of resources. AI development and deployment often require significant investments in terms of time, expertise, and infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGI can’t do time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another problem on that tightrope is that our laws necessarily allow for mistakes: you break the law first, then the consequences come later. We are not prepared for that when it comes to AGI. You would not be received kindly if you suggested that AGI should be allowed to break the law and then get punished afterward.&lt;/p&gt;

&lt;p&gt;First of all, our laws are based on human beings having remorse and the historical precedent of antisocial behavior is very old. AGI can’t go to jail, or we would have to tell it to feel like it’s in jail. These are questions for future lawyers to argue about, not us.&lt;/p&gt;

&lt;p&gt;For this reason, AGI has to have “human parents” for a long time. Almost all big tech companies have a version of their own model that they could scale up. These companies will have to be liable for their creations’ misadventures and spend a lot of resources and brain power on the problem of safety before the next 10 years or so, when practically every entity (business and person) will have an AI wizard whispering all sorts of incantations into their ears.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>alignment</category>
      <category>openai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Understanding Dense and Sparse Vectors; with an example.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Fri, 28 Jul 2023 13:29:05 +0000</pubDate>
      <link>https://forem.com/peterabel/understanding-dense-and-sparse-vectors-with-an-example-30pk</link>
      <guid>https://forem.com/peterabel/understanding-dense-and-sparse-vectors-with-an-example-30pk</guid>
      <description>&lt;p&gt;Vectors are used to represent quantities that have both magnitude and direction. They can be visualized as arrows in space. Now, let's dive into sparse and dense vectors.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;sparse vector&lt;/strong&gt; is one that has mostly zero or empty values. In other words, it has very few non-zero elements compared to its total size. Imagine a long list of numbers where most of the entries are zero. For example, consider a vector representing the presence or absence of words in a document. In a large document with a vast vocabulary, only a few words will be present, and the rest will be zeros. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;dense vector&lt;/strong&gt; is one that contains significant values in a high proportion of its elements. In a dense vector, most of the entries have non-zero values. Dense vectors can be thought of as vectors where every element carries meaningful information. For instance, consider a vector representing the intensity of different colors in an image. Each element of the vector corresponds to a specific color channel, and all the channels have non-zero values. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;To summarize:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sparse vectors&lt;/strong&gt; have very &lt;em&gt;few non-zero&lt;/em&gt; elements compared to their total size, with most of the entries being zero.&lt;br&gt;
&lt;strong&gt;Dense vectors&lt;/strong&gt; have a &lt;em&gt;high proportion of non-zero&lt;/em&gt; values, with meaningful information in most of their elements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both sparse and dense vectors have their uses in different contexts. Sparse vectors are often utilized in situations where the data being represented has a lot of empty or zero values, such as text data or high-dimensional data where most elements are expected to be zero. On the other hand, dense vectors are commonly employed when there is meaningful information in every element, such as image data or numerical data.&lt;/p&gt;
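&lt;p&gt;A quick way to see the difference is to build the same vector both ways. This is a plain-Python sketch: the sparse form stores only the non-zero entries, which is exactly the indices-and-values shape used later in this article.&lt;/p&gt;

```python
# One 10,000-dimensional vector with just two non-zero entries.
dim = 10_000
dense = [0.0] * dim      # dense form: every slot is stored
dense[42] = 1.0
dense[7311] = 3.5

# Sparse form: keep only (index, value) pairs for the non-zero slots.
sparse = {i: v for i, v in enumerate(dense) if v != 0.0}

print(len(dense), len(sparse))  # 10000 2
```

&lt;p&gt;Storing two pairs instead of 10,000 floats is the entire appeal of sparse representations for vocabulary-sized data.&lt;/p&gt;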

&lt;p&gt;&lt;strong&gt;Example with real data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will be using Pinecone as our vector database since it allows for both dense and sparse vectors.&lt;br&gt;
Pinecone uses dictionaries to insert data via its Python client. The required keys are &lt;strong&gt;id, dense values, metadata, and sparse values (which contain indices and values)&lt;/strong&gt;.&lt;/p&gt;
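&lt;p&gt;As a sketch of that record shape (the values below are invented, and the exact key names should be checked against the client version you use), a single hybrid record looks like this:&lt;/p&gt;

```python
# Illustrative record for a hybrid (dense + sparse) upsert; the id,
# numbers, and metadata below are made up.
record = {
    "id": "show-1",
    "values": [0.12, -0.04, 0.33],    # dense embedding (real ones are longer)
    "sparse_values": {
        "indices": [10, 16, 45],      # positions of the non-zero sparse weights
        "values": [0.5, 0.2, 0.5],
    },
    "metadata": {"name": "Breaking Bad", "rating": 9.5},
}
# index.upsert(vectors=[record])      # with a live Pinecone index
```

&lt;p&gt;The rest of the walkthrough builds lists of exactly these pieces from the dataset before combining them.&lt;/p&gt;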

&lt;p&gt;&lt;strong&gt;&lt;em&gt;About the Dataset.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The "Top 250 IMDb TV Shows" dataset comprises information on the highest-rated television shows according to IMDb ratings. This dataset contains 250 unique TV shows that have garnered critical acclaim and popularity among viewers. Each TV show is associated with essential details, including its name, release year, number of episodes, show type, IMDb rating, image source link, and a brief description. (source: &lt;a href="https://www.kaggle.com/datasets/khushipitroda/imdb-top-250-tv-shows" rel="noopener noreferrer"&gt;IMDB Top 250 TV Shows | Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple EDA to identify the field types to use.&lt;/li&gt;
&lt;li&gt;Process the IDs.&lt;/li&gt;
&lt;li&gt;Process the metadata.&lt;/li&gt;
&lt;li&gt;Get the dense vectors.&lt;/li&gt;
&lt;li&gt;Get the sparse vectors.&lt;/li&gt;
&lt;li&gt;Combine them into a single list.&lt;/li&gt;
&lt;li&gt;Discussion and Conclusion.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;1. Simple EDA to identify the field types to use.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dependencies&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install openai
!pip install tiktoken
!pip install langchain
!pip install pinecone-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;My embedding engine.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai_api_key = ''

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Reading the data&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv("/kaggle/input/imdb-top-250-tv-shows/IMDB.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name    Year    Episodes    Type    Rating  Image-src   Description Name-href
0   1. Breaking Bad 2008–2013 62 eps  TV-MA   9.5 https://m.media-amazon.com/images/M/MV5BYmQ4YW...   A chemistry teacher diagnosed with inoperable ...   https://www.imdb.com/title/tt0903747/?ref_=cht...
1   2. Planet Earth II  2016    6 eps   TV-G    9.5 https://m.media-amazon.com/images/M/MV5BMGZmYm...   David Attenborough returns with a new wildlife...   https://www.imdb.com/title/tt5491994/?ref_=cht...
2   3. Planet Earth 2006    11 eps  TV-PG   9.4 https://m.media-amazon.com/images/M/MV5BMzMyYj...   A documentary series on the wildlife found on ...   https://www.imdb.com/title/tt0795176/?ref_=cht...
3   4. Band of Brothers 2001    10 eps  TV-MA   9.4 https://m.media-amazon.com/images/M/MV5BMTI3OD...   The story of Easy Company of the U.S. Army 101...   https://www.imdb.com/title/tt0185906/?ref_=cht...
4   5. Chernobyl    2019    5 eps   TV-MA   9.4 https://m.media-amazon.com/images/M/MV5BNTdkN2...   In April 1986, an explosion at the Chernobyl n...   https://www.imdb.com/title/tt7366338/?ref_=cht...
5   6. The Wire 2002–2008 60 eps  TV-MA   9.3 https://m.media-amazon.com/images/M/MV5BNTllYz...   The Baltimore drug scene, as seen through the ...   https://www.imdb.com/title/tt0306414/?ref_=cht...
6   7. Avatar: The Last Airbender   2005–2008 62 eps  TV-Y7-FV    9.3 https://m.media-amazon.com/images/M/MV5BODc5YT...   In a war-torn world of elemental magic, a youn...   https://www.imdb.com/title/tt0417299/?ref_=cht...
7   8. Blue Planet II   2017    7 eps   TV-G    9.3 https://m.media-amazon.com/images/M/MV5BNDZiND...   David Attenborough returns to the world's ocea...   https://www.imdb.com/title/tt6769208/?ref_=cht...
8   9. The Sopranos 1999–2007 86 eps  TV-MA   9.2 https://m.media-amazon.com/images/M/MV5BZGJjYz...   New Jersey mob boss Tony Soprano deals with pe...   https://www.imdb.com/title/tt0141842/?ref_=cht...
9   10. Cosmos: A Spacetime Odyssey 2014    13 eps  TV-PG   9.3 https://m.media-amazon.com/images/M/MV5BZTk5OT...   An exploration of our discovery of the laws of...   https://www.imdb.com/title/tt2395695/?ref_=cht...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index(['Name', 'Year', 'Episodes', 'Type', 'Rating', 'Image-src',
       'Description', 'Name-href'],
      dtype='object')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;checking length.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;len(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;250
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;dropping empty records.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = data.dropna(subset=['Name', 'Year', 'Episodes', 'Type', 'Rating', 'Image-src',
       'Description', 'Name-href'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;checking length.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;len(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;245
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data["Description"][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to secure his family's future."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;We are going to use this Description column for our dense vectors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Process the IDs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Accessing the indices of the DataFrame&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indices = data.index

# Convert the RangeIndex to a Python list
indices_list = indices.tolist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The indices_list will represent the IDs of our records!&lt;/p&gt;
&lt;/blockquote&gt;
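&lt;p&gt;Pinecone expects vector IDs to be strings (and the upsert dictionary later in this post shows &lt;code&gt;'id': '0'&lt;/code&gt;), so it is worth casting each index value. A minimal sketch, using a toy two-row frame in place of the IMDb data:&lt;/p&gt;

```python
import pandas as pd

# Toy frame standing in for the IMDb data (names are illustrative)
df = pd.DataFrame({"Name": ["Breaking Bad", "Chernobyl"]})

# Pinecone expects string IDs, so cast each index value
ids = [str(i) for i in df.index.tolist()]
print(ids)  # ['0', '1']
```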

&lt;p&gt;&lt;strong&gt;3. Process the metadata.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The metadata holds the remaining non-categorical fields in our dataset: &lt;strong&gt;"Name", "Year", "Episodes", "Rating"&lt;/strong&gt; for this particular exercise.&lt;br&gt;
It is organized as a list of dictionaries, where the keys are field names and the values are the actual records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List to store dictionaries for each row
metadata_list = []

# Iterate over the DataFrame rows
for index, row in data.iterrows():
    # Extract the desired columns for the current row
    name = row['Name']
    year = row['Year']
    episodes = row['Episodes']
    rating = row['Rating']

    # Create a dictionary for the current row and append it to the dict_list
    metadata_list.append({"Name": name, "Year": year, "Episodes": episodes, "Rating": rating})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
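&lt;p&gt;As a side note, the &lt;code&gt;iterrows()&lt;/code&gt; loop above can be collapsed into a single pandas call; a sketch on a one-row stand-in for the IMDb frame (values are illustrative):&lt;/p&gt;

```python
import pandas as pd

# One-row stand-in for the IMDb frame (values are illustrative)
data = pd.DataFrame({
    "Name": ["1. Breaking Bad"],
    "Year": ["2008-2013"],
    "Episodes": ["62 eps"],
    "Rating": [9.5],
    "Type": ["TV-MA"],
})

# One dict per row, keyed by column name -- same result as the iterrows() loop
metadata_list = data[["Name", "Year", "Episodes", "Rating"]].to_dict(orient="records")
print(metadata_list[0])
```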



&lt;p&gt;&lt;strong&gt;4. Get the dense vectors.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract the descriptions from the DataFrame as a list
descriptions = data['Description'].tolist()

# Embed the list of descriptions
dense_vectors = embeddings.embed_documents(descriptions)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
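&lt;p&gt;&lt;code&gt;embed_documents&lt;/code&gt; sends all 245 descriptions in one call, which is fine at this scale. For larger datasets you may want to embed in batches; a hedged sketch, where &lt;code&gt;embed_fn&lt;/code&gt; is a placeholder for any embedder with the same contract as &lt;code&gt;embeddings.embed_documents&lt;/code&gt;:&lt;/p&gt;

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in chunks to avoid oversized API requests.

    `embed_fn` is any callable with the contract of
    OpenAIEmbeddings.embed_documents: list[str] -> list[list[float]].
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

# Demo with a fake embedder that maps each text to a 2-d vector
fake_embed = lambda batch: [[len(t), 0.0] for t in batch]
vecs = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vecs)  # [[1, 0.0], [2, 0.0], [3, 0.0]]
```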



&lt;p&gt;&lt;strong&gt;5. Get the sparse vectors.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Sparse values will be obtained from the Type column. Why?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['Type'].unique()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array(['TV-MA', 'TV-G', 'TV-PG', 'TV-Y7-FV', 'TV-14', 'TV-Y', 'PG-13',
       'TV-Y7', 'Not Rated', nan], dtype=object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This column holds categorical values, as shown above. In a traditional database these would be stored as a regular column, but in a vector database we can save space and computation by recording them as sparse values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Step 1: One-hot encode the 'Type' column
one_hot_encoded_df = pd.get_dummies(data['Type'])

# Step 2: Convert the one-hot encoded DataFrame to a list of lists (encodings for all records)
one_hot_encodings_list = one_hot_encoded_df.values.tolist()

# Step 3: Collect the column labels of the non-zero entry in each encoding
non_zero_indices_list = [one_hot_encoded_df.columns[encoding.nonzero()[0]].tolist() for encoding in one_hot_encoded_df.to_numpy()]
# Print the results
print("One-hot encodings for all records:")
print(one_hot_encodings_list[:5])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One-hot encodings for all records:
[[0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Collecting the sparse values in a list.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sparse_values = []

for encoding in one_hot_encodings_list:
    index = [i + 1 for i, value in enumerate(encoding) if value == 1]  # Find the indices of the non-zero values (1) and add 1 to each index
    float_encoding = [float(value) for value in encoding]  # Convert the values to floats
    sparse_values.append({
        'indices': index,
        'values': float_encoding
    })


print(sparse_values[:5])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{'indices': [5], 'values': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]}, {'indices': [4], 'values': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, {'indices': [6], 'values': [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]}, {'indices': [5], 'values': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]}, {'indices': [5], 'values': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
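&lt;p&gt;Note that the dictionaries above still carry every zero, so no space is actually saved yet. A true sparse representation keeps only the 0-based positions of the non-zero entries and their values, which is the shape Pinecone's &lt;code&gt;sparse_values&lt;/code&gt; field is designed for. A minimal sketch:&lt;/p&gt;

```python
def to_sparse(encoding):
    """Keep only the non-zero positions (0-based) and their values."""
    indices = [i for i, v in enumerate(encoding) if v != 0]
    values = [float(encoding[i]) for i in indices]
    return {"indices": indices, "values": values}

one_hot = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # e.g. the TV-MA row
print(to_sparse(one_hot))  # {'indices': [4], 'values': [1.0]}
```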



&lt;p&gt;&lt;strong&gt;6. Combine them into a single list.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal is to build, for each record, a dictionary that can be upserted into a Pinecone index with the following structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe5036ggaf6aqvb293ay.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe5036ggaf6aqvb293ay.JPG" alt="Pinecone screenshot" width="669" height="478"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Creating a list of dictionaries
vector_list = []
for i in range(len(indices_list)):
    data_dict = {
        'id': str(indices_list[i]),  # Pinecone expects string IDs
        'values': dense_vectors[i],
        'metadata': metadata_list[i],
        'sparse_values': sparse_values[i]
    }
    vector_list.append(data_dict)

vector_list[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'id': '0',
 'values': [0.013431914869978481,
  0.010376786649314377,
  -0.018131088179204113,
  -0.030511347095271996,
  -0.010323538281951407,
  0.028434657974148448,
  -0.01690637479853322,
  -0.006163504808691176,
  -0.04060192342076444,
  0.0028437985589655013,
  0.005318186161896771,
  0.03128344888769636,
  0.00782751838425978,
  0.018131088179204113,
  0.012939367239040359,
  -0.016786565739135895,
  0.03775313064457141,
  -0.02923338441591557,
  -0.012187233002300513,
  -0.024920262002902118,
  -0.0171859280286969
  ...],
 'metadata': {'Name': '1. Breaking Bad',
  'Year': '2008–2013',
  'Episodes': '62 eps',
  'Rating': 9.5},
 'sparse_values': {'indices': [5], 'values': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
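&lt;p&gt;When it comes to inserting a list like this, upserting in batches of around 100 vectors is a common pattern. A sketch of a batching helper; the commented-out lines assume the &lt;code&gt;pinecone-client&lt;/code&gt; API and an index you have already created, so treat them as illustrative:&lt;/p&gt;

```python
def chunks(items, size=100):
    """Yield successive `size`-length batches from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

batch_sizes = [len(b) for b in chunks(list(range(250)), size=100)]
print(batch_sizes)  # [100, 100, 50]

# With a live index (assumes an existing Pinecone index with 1536 dimensions):
# index = pinecone.Index("imdb-shows")
# for batch in chunks(vector_list, size=100):
#     index.upsert(vectors=batch)
```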



&lt;p&gt;&lt;strong&gt;7. Discussion and Conclusion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As shown above, sparse values are derived from categorical fields that would be redundant columns in tabular data. By storing indices alongside the vector values, we can save space in vector databases. The indices indicate the positions of the non-zero entries in the sparse vector, e.g.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;'sparse_values': {'indices': [5], 'values': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By using a sparse representation, you can save memory and computational resources, especially when dealing with large datasets that have a significant number of zero elements. Sparse representations are commonly used in various fields such as natural language processing (NLP), machine learning, and data compression, where data sparsity is prevalent. They allow for more efficient storage and manipulation of sparse data structures.&lt;/p&gt;

&lt;p&gt;Good coding!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>langchain</category>
      <category>nlp</category>
      <category>openai</category>
    </item>
    <item>
      <title>How to Price APIs for Projects Built on LLMs.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Sun, 16 Jul 2023 16:06:00 +0000</pubDate>
      <link>https://forem.com/peterabel/how-to-price-apis-for-projects-built-on-llms-235o</link>
      <guid>https://forem.com/peterabel/how-to-price-apis-for-projects-built-on-llms-235o</guid>
      <description>&lt;p&gt;Eventually, a lot of startups are going to be built on top of the base LLMS. We already have a lot of competing products that are going to be judged on how great they are by the consumer. Currently, the pricing doesn’t really factor in this choice because the large models are built to be good in a wide number of areas thus making them less efficient in particular tasks but can be used by very many people. This makes their prices dirt cheap.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Here is a chart of OpenAI's pricing for their models:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7izs6rjjzvaoceilppe1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7izs6rjjzvaoceilppe1.png" alt=" " width="752" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standout prices shown above are for the flagship model GPT-4 and the fine-tuning models. They are priced significantly higher than the other models, which suggests that they might offer more advanced features, customization options, or improved performance. However, further evaluation is required to determine the specific differentiating factors that justify their higher prices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing Your Project.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your project is in its testing phase, you can use the cheaper models, but once the project is in production you will need to use the flagship model, as it offers more reasoning capability and low latency.&lt;/p&gt;

&lt;p&gt;Here are some of the trade-offs to consider when pricing an API built on top of an LLM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low pricing&lt;/strong&gt;: Low pricing can attract more users, but it can also lead to lower revenue.&lt;br&gt;
&lt;strong&gt;High pricing&lt;/strong&gt;: High pricing can generate more revenue, but it can also deter some users from using the API.&lt;br&gt;
&lt;strong&gt;Freemium pricing&lt;/strong&gt;: Freemium pricing offers a basic version of the API for free, and then charges for premium features. This can be a good way to attract users and then upsell them on premium features.&lt;/p&gt;
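&lt;p&gt;Whatever pricing model you choose, it helps to know your unit economics first. A tiny cost sketch with &lt;em&gt;illustrative&lt;/em&gt; per-1K-token rates; check the provider's current price list before plugging in real numbers:&lt;/p&gt;

```python
def request_cost(prompt_tokens, completion_tokens, prompt_rate, completion_rate):
    """Cost of one API call, with rates quoted per 1K tokens (rates are illustrative)."""
    return (prompt_tokens / 1000) * prompt_rate + (completion_tokens / 1000) * completion_rate

# Hypothetical rates, not a real price list
cost = request_cost(1500, 500, prompt_rate=0.03, completion_rate=0.06)
print(round(cost, 4))  # 0.075
```

&lt;p&gt;Knowing the per-request cost lets you set a price floor: any per-call or per-seat price below your blended cost per request loses money.&lt;/p&gt;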

&lt;blockquote&gt;
&lt;p&gt;However, the pricing is ultimately based on the demand, being flexible and having the ability to track your results should yield you better pricing plans. For example, an API that is targeted at businesses will likely be more expensive than an API that is targeted at individuals.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Other factors to consider include&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The type of API:&lt;/strong&gt; Some APIs are more complex than others, and this can factor into the pricing. For example, an API that allows users to generate creative text formats, like poems, code, scripts, musical pieces, emails, and letters, will likely be more expensive than an API that simply provides factual information.&lt;br&gt;
&lt;strong&gt;The volume of traffic:&lt;/strong&gt; The amount of traffic that an API receives can also affect the pricing. If an API is very popular and receives a lot of requests, the pricing will likely be higher than an API that is less popular.&lt;br&gt;
&lt;strong&gt;The cost of development and maintenance:&lt;/strong&gt; The cost of developing and maintaining an API is another important factor to consider when pricing. If the API is complex and requires a lot of resources to maintain, the pricing will likely be higher.&lt;br&gt;
&lt;strong&gt;The competitive landscape:&lt;/strong&gt; The competitive landscape is another important consideration. If there are other APIs that offer similar functionality, the pricing will need to be competitive in order to attract users.&lt;/p&gt;

&lt;p&gt;By carefully considering these factors and continually assessing market dynamics, developers can determine appropriate pricing strategies that strike a balance between attracting users, generating revenue, and maintaining a competitive edge in the API landscape built on LLMs.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>api</category>
      <category>openai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Understanding Vector Metrics(Cosine similarity, Euclidean distance, Dot product).</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Sat, 24 Jun 2023 12:58:33 +0000</pubDate>
      <link>https://forem.com/peterabel/understanding-vector-metricscosine-similarity-euclidean-distance-dot-product-290l</link>
      <guid>https://forem.com/peterabel/understanding-vector-metricscosine-similarity-euclidean-distance-dot-product-290l</guid>
      <description>&lt;p&gt;&lt;strong&gt;Euclidean distance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Euclidean distance is a measure of the straight-line distance between two points in a plane or space. It calculates the geometric distance between two vectors by summing the squared differences between their corresponding elements and taking the square root of the result. In other words, &lt;em&gt;it measures the length of the line connecting two points in a multidimensional space&lt;/em&gt;. The Euclidean distance is commonly used in various applications, such as image similarity search, where the goal is to find the most similar images based on their features. When using the Euclidean distance metric, the most similar results are those with the lowest distance score.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Consider the triangle below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28d66ue0o7p1oilgt0yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28d66ue0o7p1oilgt0yz.png" alt="_Consider the triangle below_" width="562" height="432"&gt;&lt;/a&gt;&lt;br&gt;
Assuming the y-axis and x-axis are vectors, the hypotenuse is the Euclidean distance between the two vectors. &lt;/p&gt;

&lt;p&gt;The triangle is a very simple vector representation, if a more complex vector representation is plotted, the Euclidean distance intuition is still similar. The smaller the distance, the more we can infer similarity between the vectors.&lt;/p&gt;

&lt;p&gt;If we represent the hypotenuse as a vector from the origin, we get the image below:&lt;br&gt;
&lt;strong&gt;NOTE!&lt;/strong&gt;: the vectors in X and Y are complex vectors in hyperspace but let's represent them as shown below in the x and y axis, the hypotenuse(Euclidean distance) is the subject here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A representation of vectors with a large Euclidean distance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3v79rb2o5zppmajfeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3v79rb2o5zppmajfeg.png" alt="A representation of vectors with a large Euclidian distance." width="562" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Euclidean distance between the two vectors is large, so we can infer that vectors X and Y are not very similar, compared to the vectors below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A representation of vectors with a very small Euclidean distance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9dkz8t1c6sgdj4uiyzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9dkz8t1c6sgdj4uiyzk.png" alt="A representation of vectors with a very small Euclidian distance" width="562" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Euclidean distance can be useful in various scenarios, such as measuring the distance between two locations on a map, calculating the similarity between two images based on their pixel values, or determining the difference between two sets of data points in a scientific experiment.&lt;/p&gt;
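&lt;p&gt;In code, the definition above is a short function; a minimal sketch, using the classic 3-4-5 right triangle as the check:&lt;/p&gt;

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The 3-4-5 triangle: the hypotenuse (distance) is 5
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```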

&lt;p&gt;&lt;strong&gt;Cosine similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cosine similarity is a measure of similarity between two vectors in a high-dimensional space. It determines the cosine of the angle between the vectors, which represents their orientation or direction. Imagine you have two vectors (like arrows) pointing in different directions. Cosine similarity tells us how much these vectors align or point in the same direction.&lt;/p&gt;

&lt;p&gt;The advantage of using cosine similarity is that it provides a normalized score ranging from -1 to 1, where 1 indicates identical directions, 0 indicates orthogonality (no similarity), and -1 indicates completely opposite directions. Illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Orthogonality between the vectors.(0)&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x5o8whtff3oc2pdxm06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x5o8whtff3oc2pdxm06.png" alt=" orthogonality between the vectors." width="443" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Both vectors are on the x-axis.(1)&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fintw84jsj5xb102hhj4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fintw84jsj5xb102hhj4u.png" alt="Both vectors are on the x-axis" width="443" height="418"&gt;&lt;/a&gt;&lt;br&gt;
In a search with cosine similarity, these vectors are considered very similar by the search score. Over real data the vectors obviously won't be this close, but the score will still show which vectors are closest to the query vector.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Completely opposite directions.(-1)&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsom06i39nqrp1q0p67p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsom06i39nqrp1q0p67p3.png" alt="vectors with completely opposite directions" width="443" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cosine similarity can be useful in text analysis. Suppose you have two documents, and you want to find out how similar they are in terms of their word frequencies. By representing each document as a vector where each element represents the frequency of a specific word, you can calculate the cosine similarity between the two vectors to measure their similarity. This can be used for tasks like document clustering, plagiarism detection, or recommendation systems.&lt;/p&gt;
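&lt;p&gt;A minimal implementation, reproducing the three cases illustrated above:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))   # 0.0 (orthogonal)
print(cosine_similarity([2, 0], [5, 0]))   # 1.0 (same direction)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite directions)
```

&lt;p&gt;Because the dot product is divided by both vector lengths, the score depends only on direction, not magnitude; that is why [2, 0] and [5, 0] still score 1.0.&lt;/p&gt;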

&lt;p&gt;&lt;strong&gt;Dot product&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dot product is a way to measure how much two vectors "overlap" or are similar in terms of their directions. Imagine you have two vectors and you want to know how much they are aligned or pointing in the same direction. &lt;br&gt;
The dot product takes two vectors and returns a scalar value. It calculates the sum of the products of the corresponding elements in the vectors. A higher positive dot product indicates a closer alignment, while a negative dot product suggests misalignment or opposite directions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Consider the vector plots below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkl1n3bxz2xpzmagh4z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkl1n3bxz2xpzmagh4z9.png" alt="Vector plots" width="565" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a dot product operation on both vectors, we get:&lt;br&gt;
&lt;em&gt;Vectors and dot product&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8f89linectuz62nekcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8f89linectuz62nekcp.png" alt="Vectors and dot product" width="403" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A negative dot product indicating misalignment looks like the plot below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F560ncs7p6xsmhjpeyfaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F560ncs7p6xsmhjpeyfaw.png" alt="negative dot product vector" width="404" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dot product can be useful in various applications. For instance, in image processing, you can use the dot product to compare two image feature vectors and determine how similar they are. In machine learning, the dot product is used in algorithms like support vector machines (SVM) to classify data points into different categories based on their features.&lt;/p&gt;
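&lt;p&gt;And the dot product itself, as a short sketch:&lt;/p&gt;

```python
def dot_product(a, b):
    """Sum of element-wise products; the sign indicates alignment."""
    return sum(x * y for x, y in zip(a, b))

print(dot_product([1, 2], [3, 4]))   # 11 (positive: aligned)
print(dot_product([1, 0], [-2, 0]))  # -2 (negative: opposite directions)
```

&lt;p&gt;Unlike cosine similarity, the result is not normalized, so longer vectors produce larger scores even at the same angle.&lt;/p&gt;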

&lt;p&gt;&lt;em&gt;Summary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These three metrics provide different ways to measure similarity or dissimilarity between vectors or data points. &lt;em&gt;Euclidean distance&lt;/em&gt; measures &lt;strong&gt;geometric distance&lt;/strong&gt;, &lt;em&gt;cosine similarity&lt;/em&gt; measures &lt;strong&gt;directional similarity&lt;/strong&gt;, and &lt;em&gt;dot product&lt;/em&gt; measures &lt;strong&gt;alignment or overlap&lt;/strong&gt;. They have various applications in fields such as image processing, text analysis, recommendation systems, and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good learning!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>nlp</category>
      <category>openai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Using Vector Databases(Pinecone) with data(JSON,tabular).</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Fri, 23 Jun 2023 10:26:50 +0000</pubDate>
      <link>https://forem.com/peterabel/using-vector-databasespinecone-with-datajsontabular-2ik9</link>
      <guid>https://forem.com/peterabel/using-vector-databasespinecone-with-datajsontabular-2ik9</guid>
      <description>&lt;p&gt;If you are not already familiar with vector databases, they're simply specialized databases designed to efficiently store and query vector data. In vector databases, data is represented as high-dimensional vectors, where each vector represents a feature or attribute of the data.&lt;br&gt;
For this article, I will be using JSON data about different individuals; you can assume it's employee data from some company. Although the data you are working with may be different, similar processes should apply, especially if you are using Pinecone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concept.&lt;/strong&gt;&lt;br&gt;
When working with data files such as texts and PDFs that contain &lt;em&gt;flowing&lt;/em&gt; information, for example an article about baking cookies, the go-to strategy is to split the file into smaller chunks and embed those before storing them in a database.&lt;br&gt;
With our data, or similar data (employee data), the records are discrete: employee A has their own attributes, employee B has their own attributes, and so on.&lt;br&gt;
This is where vector databases differ from traditional databases: how, and by whom, the data will be used matters. We could still chunk and embed the data, but it isn't really necessary. Pinecone lets us attach metadata while inserting the data, which makes querying even easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upserting data to Pinecone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For an easier understanding, here is the link to the documentation: &lt;a href="https://docs.pinecone.io/docs/insert-data" rel="noopener noreferrer"&gt;pinecone&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data I'm using can be found in this GitHub repo with .json extension. &lt;a href="https://github.com/peter-abel/Model-Chain-Projects/tree/main" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt; &lt;br&gt;
And all the code can be found there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;if you are using a notebook, you can easily install all the dependencies&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install langchain
!pip install openai
!pip install pinecone-client
!pip install jq
!pip install tiktoken
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Importing them&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
import pinecone
from langchain.document_loaders import JSONLoader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Your API key and Pinecone environment go between the quotes&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai_api_key = ''

PINECONE_API_KEY = ''

PINECONE_API_ENV = ''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Loading the JSON data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pathlib import Path
from pprint import pprint


file_path='/kaggle/input/json-dataset-of-people/Customer data.json'
data = json.loads(Path(file_path).read_text())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To perform the operation below, you need a Pinecone account, which allows you to create an index. There is a waitlist for the free plan, but it usually takes only a day. For this project, set the metric to "cosine", a vector similarity measure you can learn more about here: &lt;a href="https://dev.to/peterabel/understanding-vector-metricscosine-similarity-euclidean-distance-dot-product-290l"&gt;Cosine Similarity&lt;/a&gt;. The other setting is the number of dimensions; since we are using OpenAI embeddings, it is set to 1536.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Initializing Pinecone&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pinecone.init(
    api_key=PINECONE_API_KEY,  
    environment=PINECONE_API_ENV  
)
index_name = "metadata-insert" # You use the index name you created in the Pinecone console.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you confirm that the data has been loaded, Pinecone's Python client lets you insert the data into the index you created. The expected format is a list of (&lt;em&gt;id, vector, metadata&lt;/em&gt;) tuples, with data types (&lt;em&gt;string, list, dictionary&lt;/em&gt;) as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], {"genre": "comedy", "year": 2020}),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], {"genre": "documentary", "year": 2019})
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are many ways to restructure your data to match the format above. For this project, the employee's name will be the id (even though our data has its own id field), the list will be the embedding vector of the name, and the dictionary will hold the remaining fields as key-value pairs ("Occupation": "Engineer").&lt;/p&gt;
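&lt;p&gt;As a minimal sketch of that restructuring, here is how the (id, vector, metadata) tuples could be assembled from a list of record dictionaries. The sample records and the &lt;em&gt;embed&lt;/em&gt; function are placeholders; in the notebook the vectors come from OpenAIEmbeddings.&lt;/p&gt;

```python
# Hypothetical sketch: turn record dicts into Pinecone (id, vector, metadata)
# tuples. embed() is a stand-in for a real embedding call such as
# OpenAIEmbeddings().embed_query(name).
def embed(text):
    return [0.1] * 8  # placeholder vector; real OpenAI embeddings have 1536 dims

records = [
    {"id": 5, "first_name": "Beverie", "Occupation": "Developer III"},
    {"id": 6, "first_name": "Alex", "Occupation": "Engineer"},
]

vectors = []
for record in records:
    name = record["first_name"]                                   # tuple id
    metadata = {k: v for k, v in record.items() if k != "first_name"}
    vectors.append((name, embed(name), metadata))

# vectors can now be passed to index.upsert(vectors)
print(vectors[0][0], vectors[0][2])
```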

&lt;p&gt;The entire restructuring and packaging is done and explained in the same &lt;a href="https://github.com/peter-abel/Model-Chain-Projects/blob/main/inserting-metadata.ipynb" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Instantiating the index.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index = pinecone.Index("metadata-insert")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Querying the data using metadata&lt;/strong&gt;&lt;br&gt;
 &lt;em&gt;the text is our prompt&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text = "Return anyone with id given by the metadata"
query_vector = embeddings.embed_query(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Checking for the metadata we can use in our queries&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(data_dict[0].keys())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;output&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dict_keys(['id', 'email', 'gender', 'ip_address', 'Location', 'Occupation', 'Ethnicity'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Running the Query&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The function index.query() takes in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A vector of your prompt, in our case, the variable query_vector&lt;/li&gt;
&lt;li&gt;A metadata filter. You can use any of the metadata fields above, but to be very specific we can filter on "id", since only one employee has a given id, which makes the result easy to confirm.&lt;/li&gt;
&lt;li&gt;A top_k value, the number of results you want returned. In our case it should return only one result, but if it were set to 2, 3, and so on, it would return that many results with the closest cosine similarity scores to your query vector.&lt;/li&gt;
&lt;li&gt;Setting the &lt;em&gt;&lt;strong&gt;include_metadata&lt;/strong&gt;&lt;/em&gt; parameter to True returns all the metadata that was stored with the entry.
As below:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result= index.query(
            vector=query_vector,
            filter={
                "id": 5 
            },
            top_k=1,
            include_metadata=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'matches': [{'id': 'Beverie Frandsen',
              'metadata': {'Ethnicity': 'Yakama',
                           'Location': 'Longwei',
                           'Occupation': 'Developer III',
                           'email': 'bfrandsen4@cargocollective.com',
                           'gender': 'Female',
                           'id': 5.0,
                           'ip_address': '235.124.253.241'},
              'score': 0.680275083,
              'values': []}],
 'namespace': ''}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can confirm against your original data that this is accurate.&lt;br&gt;
There are many other ways to query using metadata, depending on your use case; I will add them to the repo later.&lt;/p&gt;
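&lt;p&gt;For reference, Pinecone metadata filters use a MongoDB-style operator syntax ($eq, $in, $gte, $and, ...). The filters below are plain dictionaries you could pass as the &lt;em&gt;filter&lt;/em&gt; argument of index.query(); the field names come from our employee metadata.&lt;/p&gt;

```python
# Metadata filters are plain dictionaries built from comparison operators.
exact_match = {"Occupation": {"$eq": "Developer III"}}   # equality
one_of = {"Location": {"$in": ["Longwei", "Nairobi"]}}   # membership
id_range = {"id": {"$gte": 1, "$lte": 10}}               # numeric range

# Conditions can be combined with $and / $or.
combined = {"$and": [exact_match, id_range]}
print(combined)
```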

&lt;p&gt;Enjoy Learning!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>ai</category>
      <category>openai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>What Chunk Size and Chunk Overlap Should You Use?</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Sun, 11 Jun 2023 11:27:30 +0000</pubDate>
      <link>https://forem.com/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338</link>
      <guid>https://forem.com/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338</guid>
      <description>&lt;p&gt;If you have done any serious work involving text analysis, natural language processing, or machine learning, you will soon find that text splitting either makes your analysis very effective or leaves it worse than if you had never attempted it at all.  &lt;/p&gt;

&lt;p&gt;There are many applications and use cases for this task, but a common hurdle is how to actually do the splitting. Most libraries expose chunk size and chunk overlap parameters to control the process, and those two parameters are the subject of this article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk size&lt;/strong&gt; is the maximum number of characters that a chunk can contain.&lt;br&gt;
&lt;strong&gt;Chunk overlap&lt;/strong&gt; is the number of characters that should overlap between two adjacent chunks.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;chunk size&lt;/em&gt; and &lt;em&gt;chunk overlap&lt;/em&gt; parameters can be used to control the granularity of the text splitting. A smaller chunk size will result in more chunks, while a larger chunk size will result in fewer chunks. A larger chunk overlap will result in more chunks sharing common characters, while a smaller chunk overlap will result in fewer chunks sharing common characters.&lt;/p&gt;

&lt;p&gt;There are many different ways to split text. Some common methods include:&lt;br&gt;
&lt;em&gt;Character-based splitting:&lt;/em&gt; This method divides the text into chunks based on individual characters.&lt;br&gt;
&lt;em&gt;Word-based splitting:&lt;/em&gt; This method divides the text into chunks based on words.&lt;br&gt;
&lt;em&gt;Sentence-based splitting:&lt;/em&gt; This method divides the text into chunks based on sentences.&lt;/p&gt;
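&lt;p&gt;Here is a quick plain-Python illustration of the three strategies (the raw splitting only, before the chunk-size merging a library would add on top):&lt;/p&gt;

```python
import re

text = "Chunking matters. Small chunks keep detail. Large chunks keep context."

char_chunks = list(text)        # character-based: one chunk per character
word_chunks = text.split()      # word-based: split on whitespace
# sentence-based: grab runs of text ending in a period
sentence_chunks = [s.strip() for s in re.findall(r"[^.]+\.", text)]

print(len(char_chunks), len(word_chunks), len(sentence_chunks))
```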

&lt;p&gt;&lt;strong&gt;The Recursive Text Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Recursive Character Text Splitter is a module in the LangChain library that splits text recursively: it tries a list of separators (paragraphs, then lines, then words, then characters) until the resulting chunks are small enough.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "This is a piece of text."

splitter = RecursiveCharacterTextSplitter()

chunks = splitter.split_text(text)

for chunk in chunks:
    print(chunk)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This
is
a
piece
of
text.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The best way to choose the chunk size and chunk overlap parameters depends on the specific problem you are trying to solve. However, in general, it is a good idea to use a small chunk size for tasks that require a &lt;em&gt;fine-grained view&lt;/em&gt; of the text and a larger chunk size for tasks that require a more &lt;em&gt;holistic view&lt;/em&gt; of the text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained view&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Identifying individual words or characters&lt;/em&gt; can be useful for tasks such as spell-checking, grammar-checking, and text analysis.&lt;br&gt;
&lt;em&gt;Finding patterns in the text&lt;/em&gt; can be useful for tasks such as identifying spam, identifying plagiarism, and finding sentiment in the text.&lt;br&gt;
&lt;em&gt;Extracting keywords&lt;/em&gt; can be useful for tasks such as search engine optimization (SEO), topic modeling, and machine translation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fine-grained view
chunk_size = 1
chunk_overlap = 0

text = "This is a piece of text."

chunks = splitter.split_text(text, chunk_size, chunk_overlap)

for chunk in chunks:
    print(chunk)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This
is
a
piece
of
text.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Holistic view&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Understanding the overall meaning of the text&lt;/em&gt;: This can be useful for tasks such as machine translation, text summarization, and question answering.&lt;br&gt;
&lt;em&gt;Identifying the relationships between different parts of the text:&lt;/em&gt; This can be useful for tasks such as natural language inference, question answering, and machine translation.&lt;br&gt;
&lt;em&gt;Generating new text:&lt;/em&gt; This can be useful for tasks such as machine translation, text summarization, and creative writing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Holistic view
chunk_size = 10
chunk_overlap = 5

text = "This is a piece of text."

chunks = splitter.split_text(text, chunk_size, chunk_overlap)

for chunk in chunks:
    print(chunk)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is a
piece of text.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here are some additional tips for using the recursive text splitter module:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a consistent chunk size and chunk overlap throughout your code.&lt;/strong&gt; This helps keep your results consistent.&lt;br&gt;
&lt;strong&gt;Consider the nature of the text you are splitting.&lt;/strong&gt;&lt;br&gt;
 If the text is highly structured, such as code or HTML, you may want to use a larger chunk size. If the text is less structured, such as a novel or a news article, you may want to use a smaller chunk size.&lt;br&gt;
&lt;strong&gt;Experiment with different chunk sizes and chunk overlaps.&lt;/strong&gt; &lt;br&gt;
This will show you what works best for your specific problem.&lt;/p&gt;
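&lt;p&gt;To make the experiment concrete, here is a small stand-alone sketch of fixed-size character chunking, showing how the chunk count changes as the chunk size grows (the function is a simplified stand-in for a library splitter):&lt;/p&gt;

```python
def chunk(text, size, overlap):
    """Split text into fixed-size character chunks sharing `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "LangChain makes it easy to split, embed and query documents."  # 60 chars

for size in (10, 20, 40):
    pieces = chunk(text, size, overlap=5)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```

Smaller sizes produce many fine-grained pieces; larger sizes produce a few holistic ones.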

&lt;p&gt;Good coding!&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>nlp</category>
      <category>openai</category>
    </item>
    <item>
      <title>Introduction to LangChain.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Wed, 31 May 2023 05:19:53 +0000</pubDate>
      <link>https://forem.com/peterabel/introduction-to-langchain-17fg</link>
      <guid>https://forem.com/peterabel/introduction-to-langchain-17fg</guid>
      <description>&lt;h2&gt;
  
  
  What is LangChain?
&lt;/h2&gt;

&lt;p&gt;LangChain is a software development framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the benefits of using LangChain?
&lt;/h2&gt;

&lt;p&gt;The main benefits of using LangChain include:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ease of use&lt;/em&gt;: LangChain is very easy to use, even for developers who are not familiar with LLMs.&lt;br&gt;
&lt;em&gt;Flexibility&lt;/em&gt;: LangChain is very flexible, and can be used to create a wide variety of applications.&lt;br&gt;
&lt;em&gt;Scalability&lt;/em&gt;: LangChain is scalable, and can be used to create applications that can handle large amounts of data.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to get started with LangChain
&lt;/h2&gt;

&lt;p&gt;To get started with LangChain, you will need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the LangChain library.&lt;/li&gt;
&lt;li&gt;Create a new LangChain project.&lt;/li&gt;
&lt;li&gt;Add an LLM to your project.&lt;/li&gt;
&lt;li&gt;Write your application code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Here is an example of how to use LangChain to create a chatbot:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain chatbot.
chatbot = langchain.Chatbot()

# Add an LLM to the chatbot.
chatbot.add_model("gpt-3")

# Write the chatbot's code.
@chatbot.on_message
def handle_message(message):
  # Get the user's message.
  user_message = message.text

  # Respond to the user's message.
  chatbot.reply(user_message)

# Run the chatbot.
chatbot.run()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;LangChain can be used for a wide variety of applications, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document analysis and summarization: LangChain can be used to analyze documents and summarize their contents. This can be useful for tasks such as research, customer support, and marketing.&lt;/li&gt;
&lt;li&gt;Chatbots: LangChain can be used to create chatbots that can interact with users in natural language. This can be useful for tasks such as customer service, sales, and education.&lt;/li&gt;
&lt;li&gt;Code analysis: LangChain can be used to analyze code and identify potential errors. This can be useful for tasks such as software development, quality assurance, and security.&lt;/li&gt;
&lt;li&gt;Other applications: LangChain can also be used for a variety of other applications, such as translation, creative writing, and gaming.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Autonomous Agents
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to create autonomous agents that decide which actions to take and carry them out by calling tools and APIs. For example, an agent could look up information on the web, run calculations, or interact with external services to complete a task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Simulations
&lt;/h2&gt;

&lt;p&gt;LangChain can also be used to create simulations of autonomous agents. This can be useful for testing and evaluating new agent algorithms, or for training agents in a safe environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question Answering over Docs
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to answer questions about documents. For example, LangChain could be used to answer questions about a research paper, a product manual, or a legal document.&lt;br&gt;
&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain document analyzer.
analyzer = langchain.DocumentAnalyzer()

# Analyze the document.
analyzer.analyze("This is a document about the use of LangChain.")

# Get the summary of the document.
summary = analyzer.summary()

# Print the summary.
print(summary)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Querying Tabular Data
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to query tabular data. For example, LangChain could be used to query a database of customer records, a spreadsheet of financial data, or a table of scientific data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain tabular data query.
query = langchain.TabularDataQuery()

# Set the data source.
query.set_data_source("https://www.example.com/data.csv")

# Set the query.
query.set_query("SELECT * FROM data WHERE age &amp;gt; 18")

# Get the results of the query.
results = query.results()

# Print the results.
for result in results:
  print(result)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interacting with APIs
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to interact with APIs. For example, LangChain could be used to get weather data from an API, get stock quotes from an API, or get directions from an API.&lt;br&gt;
&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain API client.
client = langchain.APIClient()

# Set the API endpoint.
client.set_endpoint("https://www.example.com/api")

# Set the API key.
client.set_api_key("YOUR_API_KEY")

# Make a request.
response = client.get("/users")

# Print the response.
print(response)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extraction
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to extract information from text. For example, LangChain could be used to extract the names of people from a document, the dates of events from a document, or the prices of products from a document.&lt;br&gt;
&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain text extractor.
extractor = langchain.TextExtractor()

# Set the document.
extractor.set_document("This is a document about the use of LangChain.")

# Extract the names of people.
names = extractor.extract_names()

# Print the names.
for name in names:
  print(name)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summarization
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to summarize text. For example, LangChain could be used to summarize a research paper, a product manual, or a legal document.&lt;br&gt;
&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain text summarizer.
summarizer = langchain.TextSummarizer()

# Set the document.
summarizer.set_document("This is a document about the use of LangChain.")

# Get the summary.
summary = summarizer.summary()

# Print the summary.
print(summary)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;LangChain can be used to evaluate the performance of LLMs. For example, LangChain could be used to evaluate the accuracy of an LLM's answers to questions, the fluency of an LLM's generated text, or the correctness of an LLM's code.&lt;br&gt;
&lt;em&gt;example.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import langchain

# Create a new LangChain LLM evaluator.
evaluator = langchain.LLMEvaluator()

# Set the LLM.
evaluator.set_llm("gpt-3")

# Evaluate the LLM.
results = evaluator.evaluate()

# Print the results.
for result in results:
  print(result)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangChain is a powerful framework that can be used to create a wide variety of applications using LLMs. If you are looking for a way to simplify the development of LLM-powered applications, LangChain is a great option. To learn more and engage with the community, here is a link to the documentation: &lt;a href="https://python.langchain.com/en/latest/index.html" rel="noopener noreferrer"&gt;https://python.langchain.com/en/latest/index.html&lt;/a&gt; &lt;/p&gt;

</description>
      <category>langchain</category>
      <category>python</category>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Data Wrangling in Python: Tips and Tricks</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Wed, 01 Mar 2023 12:53:17 +0000</pubDate>
      <link>https://forem.com/peterabel/data-wrangling-in-python-tips-and-tricks-14f</link>
      <guid>https://forem.com/peterabel/data-wrangling-in-python-tips-and-tricks-14f</guid>
      <description>&lt;p&gt;Data wrangling, also known as data cleaning or data preprocessing, is an essential step in data analysis. It involves transforming raw data into a format suitable for analysis, which can involve tasks such as handling missing values, dealing with outliers, formatting data correctly, and more.&lt;br&gt;
In this article, we'll cover some common data wrangling tasks in Python and provide tips and tricks to help you perform these tasks efficiently and effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Missing Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Handling missing values is a crucial step in data wrangling. Missing data can significantly impact the accuracy and reliability of your analysis, so it's essential to handle them appropriately. Here's how you can handle missing values in Python:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check for missing values:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import pandas as pd
# Load data
data = pd.read_csv('data.csv')
# Check for missing values
print(data.isnull().sum())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Remove missing values:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Remove rows with missing values
data.dropna(inplace=True)
# ...or remove columns with missing values
data.dropna(axis=1, inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Impute missing values:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Impute missing numeric values with the column mean
data.fillna(data.mean(numeric_only=True), inplace=True)
# ...or with the column median
data.fillna(data.median(numeric_only=True), inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dealing with Outliers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Outliers are values that differ markedly from the rest of the dataset. A few extreme values can dominate summary statistics and model fits, so if they are not handled correctly they can distort your analysis. Here's how you can deal with outliers in Python:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check for outliers:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import seaborn as sns
# Load data
data = sns.load_dataset('tips')
# Check for outliers
sns.boxplot(x=data['total_bill'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Remove outliers:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Remove outliers with z-score
from scipy import stats
z_scores = stats.zscore(data['total_bill'])
abs_z_scores = abs(z_scores)
filtered_entries = (abs_z_scores &amp;lt; 3)
data = data[filtered_entries]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Transform outliers:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Transform outliers with log transformation
import numpy as np
data['total_bill'] = np.log(data['total_bill'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Formatting Data Correctly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data that is not formatted correctly can cause issues when analyzing the data. It's essential to ensure that all data is in the correct format and that the columns and rows are labeled correctly. Here's how you can format data correctly in Python:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Convert data types:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Convert data type to integer
data['age'] = data['age'].astype(int)
# Convert data type to datetime
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rename columns:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rename columns
data.rename(columns={'old_name': 'new_name'}, inplace=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Reorder columns:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Reorder columns
data = data[['column1', 'column2', 'column3']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Validating Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Validating data is an essential step to ensure that it is accurate and reliable. Failing to validate data can lead to incorrect results and conclusions. Here's how you can validate data in Python:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check for duplicates:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check for duplicates
print(data.duplicated().sum())
# Remove duplicates
data.drop_duplicates(inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Check for consistency:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Check that a categorical column only contains expected values
allowed_values = {"Male", "Female"}
unexpected = set(data['sex'].unique()) - allowed_values
if unexpected:
    print(f"Warning: column 'sex' has unexpected values: {unexpected}")
else:
    print("Column 'sex' has consistent values.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In conclusion, data wrangling is a crucial step in data analysis that involves cleaning, formatting, and validating data to ensure that it is accurate and reliable. By using Python, we can perform common data-wrangling tasks efficiently and effectively, including handling missing values, dealing with outliers, formatting data correctly, and validating data.&lt;/p&gt;

&lt;p&gt;By using the tips and tricks provided in this article, you can become a more proficient data wrangler, and ensure that your data analysis is accurate and reliable. Remember to always check your data for consistency, and to handle missing data and outliers appropriately. With these tools in your toolkit, you'll be well-equipped to tackle any data-wrangling challenges that come your way.&lt;br&gt;
Thank you for reading.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introduction to Python for Data Engineering.</title>
      <dc:creator>Abel Peter</dc:creator>
      <pubDate>Sun, 28 Aug 2022 16:56:08 +0000</pubDate>
      <link>https://forem.com/peterabel/introduction-to-python-for-data-engineering-15j6</link>
      <guid>https://forem.com/peterabel/introduction-to-python-for-data-engineering-15j6</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is data engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data engineering is the process of creating and maintaining data systems. This includes designing, building, testing, and deploying data pipelines. A data engineer uses software tools to clean, organize, prepare, analyze, visualize, and report on data. Data engineers work with databases, business intelligence systems, application programming interfaces (APIs), and machine learning algorithms to build solutions that help organizations make sense of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The role of Python in data engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python is a versatile language that can be used for a wide range of tasks, from data manipulation to data science, and it is particularly well suited to data engineering. There are several reasons for this. First, its rich ecosystem of modules and libraries makes it straightforward to build data pipelines. Second, it is easy to learn, with a readable, almost English-like syntax. Third, it is powerful enough for complex data engineering workloads. &lt;br&gt;
Among the most popular Python libraries for data engineering are Apache Beam, Luigi, and PySpark. &lt;br&gt;
Apache Beam provides a rich set of primitives for building both batch and streaming pipelines. Luigi helps you define and schedule multi-step workflows with dependencies between tasks. PySpark lets you process large datasets in a distributed manner on an Apache Spark cluster. &lt;br&gt;
Together, these libraries make it practical to build complex data pipelines, which is why Python is so frequently used for ETL (extract, transform, load) work. &lt;/p&gt;
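&lt;p&gt;To make the ETL idea concrete, here is a minimal sketch using pandas. The data is an in-memory CSV string purely for illustration; a real pipeline would extract from files, APIs, or a database, and the column names here are made up for the example.&lt;/p&gt;

```python
import io

import pandas as pd

# Extract: read raw CSV data (an in-memory string stands in for a real source)
raw = io.StringIO("order_id,region,amount\n1,east,100\n2,west,250\n3,east,75\n")
df = pd.read_csv(raw)

# Transform: drop small orders, then aggregate revenue per region
cleaned = df[df["amount"] >= 100]
summary = cleaned.groupby("region", as_index=False)["amount"].sum()

# Load: serialize the result (a real pipeline would write to a warehouse or database)
output = summary.to_csv(index=False)
print(output)
```

&lt;p&gt;Each stage maps onto one letter of ETL: &lt;code&gt;read_csv&lt;/code&gt; extracts, the filter and groupby transform, and &lt;code&gt;to_csv&lt;/code&gt; loads the result to its destination format.&lt;/p&gt;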

&lt;p&gt;Before starting with Python for data engineering, you need to set up your development environment. This includes installing Python and setting up your IDE (integrated development environment). &lt;/p&gt;

&lt;p&gt;Installing Python is easy: a distribution such as Anaconda or Miniconda is a good starting point. Once Python is installed, choose an IDE (integrated development environment) such as Visual Studio Code. &lt;br&gt;
Some of the core libraries to know include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pandas&lt;br&gt;
Pandas is a library for manipulating and processing dataframes. A dataframe is a tabular dataset, where each row represents a single observation and columns represent variables. Pandas provides a wide range of operations including read/write, filtering, grouping, aggregation, sorting, joining, reshaping, and exporting to various formats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NumPy&lt;br&gt;
NumPy is the fundamental package for numerical computing with Python. It provides fast N-dimensional arrays along with tools for linear algebra, random number generation, and vectorized mathematical functions. Much of the scientific Python ecosystem, including SciPy and pandas, is built on top of NumPy's arrays.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Matplotlib&lt;br&gt;
Matplotlib is a Python library for producing publication-quality plots and figures. It works both interactively (in notebooks and GUI windows) and in scripts, and it supports vector output formats, animation, and interactive figures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
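&lt;p&gt;A quick sketch of the pandas and NumPy operations mentioned above, using a tiny made-up dataset (the city names and temperatures are illustrative only):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# A small dataframe: each row is an observation, each column a variable
df = pd.DataFrame({
    "city": ["Nairobi", "Kampala", "Nairobi", "Kampala"],
    "temp": [24.0, 26.5, 23.5, 27.0],
})

# Filtering, grouping, and aggregation with pandas
warm = df[df["temp"] > 24.0]                     # rows above 24 degrees
avg_by_city = df.groupby("city")["temp"].mean()  # mean temperature per city

# NumPy: fast array math on the same column
temps = df["temp"].to_numpy()
normalized = (temps - temps.mean()) / temps.std()  # standard score per reading

print(avg_by_city)
print(normalized.round(2))
```

&lt;p&gt;The pandas calls cover the read/filter/group/aggregate operations from the list above, while the NumPy line shows the kind of whole-array arithmetic that would need an explicit loop in plain Python.&lt;/p&gt;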

&lt;p&gt;PySpark&lt;br&gt;
PySpark is the open-source Python API for Apache Spark. It gives Python programmers access to Spark's distributed processing engine, making it possible to build scalable big data applications.&lt;/p&gt;

&lt;p&gt;In conclusion, we covered the basics of getting started with data engineering in Python and how to set up your Python environment. Thank you for reading this article.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>codenewbie</category>
      <category>python</category>
    </item>
  </channel>
</rss>
