<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhijith Neil Abraham</title>
    <description>The latest articles on Forem by Abhijith Neil Abraham (@abhijithneilabraham).</description>
    <link>https://forem.com/abhijithneilabraham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F199036%2F10fad0f4-1c71-4f57-bac0-bc27ec3a18d1.jpeg</url>
      <title>Forem: Abhijith Neil Abraham</title>
      <link>https://forem.com/abhijithneilabraham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abhijithneilabraham"/>
    <language>en</language>
    <item>
      <title>How to make LLMs work on large amounts of data</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 17 Jan 2026 01:16:39 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/how-to-make-llms-work-on-large-amounts-of-data-1ol0</link>
      <guid>https://forem.com/abhijithneilabraham/how-to-make-llms-work-on-large-amounts-of-data-1ol0</guid>
      <description>&lt;p&gt;Text-to-SQL tools have largely dominated the market for applying intelligence to large amounts of data. With the advent of LLMs, however, the task is now shared by several other techniques, including RAG and coding/SQL agents.&lt;/p&gt;

&lt;p&gt;One major issue with these approaches is that LLMs cannot actually see the data; they only receive a rough abstraction of it, such as summaries, samples, schema descriptions, or partial slices generated by another system.&lt;/p&gt;

&lt;p&gt;So what happens when you have a huge number of rows that all need to pass through an LLM?&lt;/p&gt;

&lt;p&gt;Let's see how we can tackle this using Datatune!&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/vitalops" rel="noopener noreferrer"&gt;
        vitalops
      &lt;/a&gt; / &lt;a href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;
        datatune
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Agentic data transformation on infinite amounts of data
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🎵 Datatune&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/datatune/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/4b2c2924b8872981e7bc0748348c66489acd80671532772f48be651d06216a29/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6461746174756e652e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/vitalops/datatune/blob/main/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/eb10ca27ad15af81931aba820f94fd4e013dd86f02c064265ca9feecda9305d7/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f766974616c6f70732f6461746174756e65" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://pepy.tech/projects/datatune" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d7d480c2a2a5e3a8f03120765ae9e552c6d2bfeaab5fa4fde50172e97d866502/68747470733a2f2f7374617469632e706570792e746563682f62616467652f6461746174756e65" alt="PyPI Downloads"&gt;&lt;/a&gt;
&lt;a href="https://docs.datatune.ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b88e9ec96e33a755b4ec51fc3361b90847d784730ec04476b4a8c8ceaeff3355/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d646f63732e6461746174756e652e61692d626c7565" alt="Docs"&gt;&lt;/a&gt;
&lt;a href="https://discord.gg/3RKA5AryQX" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/44bbe45880e7ba449096cdcb9a9f8e0075973e6d7b7075d7d6cfd2bbe8861cbe/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d3732383964613f6c6f676f3d646973636f7264266c6f676f436f6c6f723d7768697465" alt="Discord"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Scalable Data Transformations with row-level intelligence.&lt;/p&gt;
&lt;p&gt;Datatune is not just another Text to SQL tool. With datatune, LLMs and Agents can have full access to infinite amount of data, and apply semantic intelligence in every record.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How It Works&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data" rel="nofollow noopener noreferrer"&gt;Click here to understand how Datatune works&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/vitalops/datatune/main/how%20it%20works.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fvitalops%2Fdatatune%2Fmain%2Fhow%2520it%2520works.png" alt="How it works"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install datatune&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;datatune&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datatune&lt;/span&gt;.&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;OpenAI&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;dask&lt;/span&gt;.&lt;span class="pl-s1"&gt;dataframe&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;dd&lt;/span&gt;
&lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;OpenAI&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"gpt-3.5-turbo"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_csv&lt;/span&gt;(&lt;span class="pl-s"&gt;"products.csv"&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Extract categories using natural language&lt;/span&gt;
&lt;span class="pl-s1"&gt;mapped&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;.&lt;span class="pl-c1"&gt;map&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Extract categories from the description and name of product."&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;output_fields&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Category"&lt;/span&gt;, &lt;span class="pl-s"&gt;"Subcategory"&lt;/span&gt;],
    &lt;span class="pl-s1"&gt;input_fields&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Description"&lt;/span&gt;, &lt;span class="pl-s"&gt;"Name"&lt;/span&gt;]
)(&lt;span class="pl-s1"&gt;llm&lt;/span&gt;, &lt;span class="pl-s1"&gt;df&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Filter with simple criteria&lt;/span&gt;
&lt;span class="pl-s1"&gt;filtered&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;.&lt;span class="pl-c1"&gt;filter&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Keep only electronics products"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;input_fields&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;The Context Length Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs keep getting larger in terms of context length capabilities. But even at the current pace, a model with a 100M-token context window is no match for the data in an average user's database.&lt;/p&gt;

&lt;p&gt;This means the data that needs to be transformed can be several orders of magnitude larger than an LLM's context length.&lt;/p&gt;

&lt;p&gt;Consider a mid-sized enterprise with the following fairly typical setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;10 million rows in a transactional table&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;20 columns per row&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average 50 characters per column (IDs, text, timestamps, codes)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s:&lt;/p&gt;

&lt;p&gt;10,000,000 rows × 20 columns × 50 characters&lt;br&gt;
= 10,000,000,000 characters&lt;/p&gt;

&lt;p&gt;Even with aggressive tokenization (≈ 4 characters per token):&lt;/p&gt;

&lt;p&gt;10,000,000,000 ÷ 4&lt;br&gt;
≈ 2.5 billion tokens&lt;/p&gt;

&lt;p&gt;Now compare this with an extremely optimistic LLM context window of 100 million tokens.&lt;/p&gt;

&lt;p&gt;That single table alone is 25× larger than the model’s entire context.&lt;/p&gt;
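
&lt;p&gt;The same back-of-the-envelope estimate, reproduced in a few lines of Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rows, cols, chars_per_col = 10_000_000, 20, 50
chars = rows * cols * chars_per_col   # 10,000,000,000 characters
tokens = chars / 4                    # roughly 4 characters per token
print(tokens)                         # 2.5 billion tokens
print(tokens / 100_000_000)           # 25x a 100M-token context window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;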

&lt;p&gt;&lt;strong&gt;Solving Large Scale Data processing using Datatune&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Datatune, users can give LLMs full access to the data with the help of batch processing.&lt;/p&gt;

&lt;p&gt;Each row of data is combined with the input prompt, the combined batches are sent to the LLM, and the process continues until every batch has been processed. Datatune uses Dask's parallel processing abilities to split the data into partitions and send batches to the LLM in parallel.&lt;/p&gt;
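
&lt;p&gt;Conceptually, the batching looks something like the following sketch, where &lt;code&gt;call_llm_batch&lt;/code&gt; is a made-up stand-in for the real LLM client that Datatune wires in for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import dask.dataframe as dd

def call_llm_batch(prompts):
    # hypothetical stand-in for the real LLM call Datatune performs
    return ["Electronics"] * len(prompts)

def categorize_partition(pdf):
    # each Dask partition arrives as a pandas DataFrame; its rows are
    # serialized into prompts and sent to the LLM as one batch
    prompts = [f"Extract the category of: {name} - {desc}"
               for name, desc in zip(pdf["Name"], pdf["Description"])]
    return pdf.assign(Category=call_llm_batch(prompts))

df = dd.read_csv("products.csv")       # lazily split into partitions
result = df.map_partitions(categorize_partition).compute()  # batches run in parallel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;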

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1k3dnzw5qdjbqsxbxim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1k3dnzw5qdjbqsxbxim.png" alt="Datatune Batch Processing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Data Transformation Operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are four first-order data transformation functions (also known as primitives): MAP, FILTER, EXPAND, and REDUCE.&lt;/p&gt;

&lt;p&gt;Datatune is built on top of these primitives, and each primitive can be driven by a natural language prompt.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, a MAP operation is performed using a prompt to derive the output fields &lt;code&gt;Category&lt;/code&gt; and &lt;code&gt;Subcategory&lt;/code&gt; from the input fields &lt;code&gt;Description&lt;/code&gt; and &lt;code&gt;Name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Datatune can also chain multiple transformations together.&lt;/p&gt;

&lt;p&gt;Here's another example where a MAP and a FILTER are used together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First, extract sentiment and keywords from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)

# Then, keep only negative reviews for further analysis (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Datatune Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datatune has Agents, which let users run prompts without having to know which primitives to use. Agents are also helpful when a query is complex and requires multiple transformation steps chained together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtmwbqrlwn0mcsxbtw6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtmwbqrlwn0mcsxbtw6t.png" alt="Datatune Agents"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's an example where the previous MAP and FILTER operations, which were chained together above, are solved with just a single prompt to an Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Agent also executes Python code along with the row-level primitives (Map, Filter, etc.). This is especially useful for prompts that don't require row-level intelligence (numerical columns, for example), as the Agent can use Datatune's code generation capabilities to work on the data directly.&lt;/p&gt;
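
&lt;p&gt;For instance, a purely numerical request like this hypothetical prompt needs no per-row LLM calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# no row-level intelligence needed here, so the Agent can generate and
# run dataframe code instead of sending every row to the LLM
result = agent.do("Compute the average price per Category", df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;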

&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datatune is designed to work with a wide variety of data sources, including DataFrames and databases. Through its Ibis integration, Datatune extends connectivity to databases such as DuckDB, Postgres, MySQL, and more.&lt;/p&gt;
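
&lt;p&gt;As a minimal sketch (with a made-up database file and table name), you could pull a table out of DuckDB via Ibis and hand it to Datatune as a Dask dataframe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import ibis
import dask.dataframe as dd

con = ibis.duckdb.connect("warehouse.ddb")    # made-up DuckDB file
products = con.table("products").execute()    # fetch as a pandas DataFrame
df = dd.from_pandas(products, npartitions=8)  # ready for dt.map / dt.filter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;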

&lt;p&gt;&lt;strong&gt;Contributing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're building Datatune in open source, and we would love your contributions!&lt;/p&gt;

&lt;p&gt;Check out the GitHub repository here: &lt;a href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;https://github.com/vitalops/datatune&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>datascience</category>
      <category>llm</category>
    </item>
    <item>
      <title>Redditflow - Find data from any timeline, from past to future, and feed your ML pipelines</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 28 May 2022 08:30:50 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/redditflow-find-data-from-any-timeline-from-past-to-future-and-feed-your-ml-pipelines-jnh</link>
      <guid>https://forem.com/abhijithneilabraham/redditflow-find-data-from-any-timeline-from-past-to-future-and-feed-your-ml-pipelines-jnh</guid>
      <description>&lt;p&gt;Finding data for your ML models can be cumbersome, and there are multiple resources you can collect it from. Depending on the data domain and task, you can find suitable data in many places, some of which involve social media. At &lt;a href="//www.nfflow.com"&gt;NFFlow&lt;/a&gt;, we ensure that data collection and ML model training are made simple for you; our mission is to simplify the whole process from data collection to ML model. You can even schedule cron jobs to collect data that will only appear in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USECASE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you want to train a model on text or image data, and you don't wanna go through all that Python jargon where you have to code a scraper and an ML model. That is where redditflow, a Reddit API from NFFlow, comes to your rescue!&lt;/p&gt;

&lt;p&gt;Let's break down the usage of the API, and how you're gonna benefit from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEXT API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Text API will help you scrape data from any timeline. All you need is a config file where you specify your topic of interest and the time period you want to scrape from. An ML-enabled classifier algorithm then helps you filter the data you scraped. Optionally, if you want a trained ML model as the output of the scraped data, you can specify that in the config.&lt;/p&gt;

&lt;p&gt;Here's a demonstrated example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = {
        "sort_by": "best",
         "subreddit_text_limit": 50,
        "total_limit": 200,
        "start_time": "27.03.2021 11:38:42",
        "end_time": "27.03.2022 11:38:42",
        "subreddit_search_term": "healthcare",
        "subreddit_object_type": "comment",
        "ml_pipeline": {""ml_pipeline":{"model_name":'distilbert-base-uncased','model_output_path':'healthcare_27.03.2021-27.03.2022_redditflow"}
    }
from redditflow import TextApi
TextApi(config) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As promised, we saved you from all the python jargon!&lt;/p&gt;

&lt;p&gt;We have uploaded a few sample models to huggingface hub using redditflow. &lt;a href="https://huggingface.co/NFflow"&gt;Check it out here!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Say you want to collect all images on a particular topic over a period of time, e.g. all images of cats posted to Reddit over a year. Here is how you can do it with a few lines of Python code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = {
        "sort_by": "best",
        "subreddit_image_limit": 3,
        "total_limit": 10,
         "start_time": "13.11.2021 09:38:42",
         "end_time": "15.11.2021 11:38:42",
         "subreddit_search_term": "cats",
         "subreddit_object_type": "comment",
         "client_id": "$CLIENT_ID", # get client id for praw
         "client_secret": $CLIENT_SECRET, #get client secret for praw
         }

from redditflow import ImageApi
ImageApi(config)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the API requires praw, a Python API for scraping Reddit, so you will need to provide a praw client id and secret.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contributions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, there's a lot we can do for the community through open source, and we welcome all contributions that help make the data science process simpler. Check out &lt;a href="https://github.com/nfflow/redditflow"&gt;https://github.com/nfflow/redditflow&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>reddit</category>
      <category>datascience</category>
    </item>
    <item>
      <title>TableQA - Query your tabular data with natural language</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 28 Nov 2020 17:33:21 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/tableqa-query-your-tabular-data-with-natural-language-39o</link>
      <guid>https://forem.com/abhijithneilabraham/tableqa-query-your-tabular-data-with-natural-language-39o</guid>
      <description>&lt;p&gt;Imagine you have a big database or dataframe of tabular data, but you don't know enough SQL, or don't know the values in the dataframe well enough, to write a good SQL query for the task. This is where you wish you could ask something in natural language and hope AI has you covered to find the results for you. TableQA is one such product that gets you the result you want.&lt;/p&gt;

&lt;p&gt;Suppose you are dealing with cancer death data, with columns named Year, Nationality, Gender, Age Group, Cancer site, and Death Count. You could use TableQA to simply ask&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"what is the maximum age of men having stomach cancer"&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 and get the result. As simple as that.&lt;/p&gt;

&lt;p&gt;TableQA uses huggingface transformers under the hood: an AI-based entity extractor generates key-value mappings between column names and values, and custom-trained classifiers determine the aggregate type, i.e. COUNT, MAX, MIN, SUM, or AVG. From there it builds the associated SQL query on its own with a rule-based approach. The rules are a real advantage: unlike other AI products that rely heavily on the dataset, you can modify the blocks of rules to get more accurate results.&lt;/p&gt;
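
&lt;p&gt;As a minimal sketch of the flow (following the Agent API from the project README, with made-up data and schema folders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tableqa.agent import Agent

# point the agent at a folder of CSVs and, optionally, their schemas
agent = Agent("data/", "schema/")

# get the generated SQL for a natural language question
sql = agent.get_query("what is the maximum age of men having stomach cancer")
print(sql)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;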

&lt;p&gt;The use of such rules is best explained with a schema.&lt;br&gt;
Suppose you want custom keywords in the natural language query to map to a particular column, or you need keywords for certain column values: you can add a schema containing this info to improve performance, as in the sketch below.&lt;/p&gt;
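
&lt;p&gt;For illustration, a hypothetical schema for the cancer death table might look like the following (written here as a Python dict that would be saved as JSON; the exact format is described in the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical schema: "keywords" tie natural language terms to a column
schema = {
    "name": "cancer_death",
    "columns": [
        {"name": "Gender", "keywords": ["men", "women", "male", "female"]},
        {"name": "Age Group", "keywords": ["age"]},
        {"name": "Cancer site", "keywords": ["cancer"]},
    ],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;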

&lt;p&gt;The output can be the result of the query itself, a bar or pie chart built from the result, or the generated SQL query, so that you can run it on your own database.&lt;/p&gt;

&lt;p&gt;The potential use cases and user base are widespread. You can use this on many databases, including MySQL, Postgres, SQLite, and even the same on Amazon RDS. You could even use this plugin inside your chatbots. Heck yeah!&lt;/p&gt;

&lt;p&gt;There are several more features tableQA supports that I'd like to show you. But enough talk, let's get coding! Feel free to check out the source code; there is also a colab example for you to try out!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/abhijithneilabraham/tableQA"&gt;Github-TableQA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1Bgd3L-839NVZiP3QqWfpkYIufQIm4Rar?usp=sharing"&gt;colab example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to mail me at &lt;a href="mailto:abhijithneilabrahampk@gmail.com"&gt;abhijithneilabrahampk@gmail.com&lt;/a&gt; for any questions.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Help!</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Mon, 22 Jul 2019 07:23:39 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/help-59d6</link>
      <guid>https://forem.com/abhijithneilabraham/help-59d6</guid>
      <description>&lt;p&gt;I need a standardised Git repository in C++ which contains a lot of string and char identifiers, and the identifiers should be named well (not single-character names for them).&lt;/p&gt;

</description>
      <category>cpp</category>
    </item>
  </channel>
</rss>
