<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhijith Neil Abraham</title>
    <description>The latest articles on Forem by Abhijith Neil Abraham (@abhijithneilabraham).</description>
    <link>https://forem.com/abhijithneilabraham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F199036%2F10fad0f4-1c71-4f57-bac0-bc27ec3a18d1.jpeg</url>
      <title>Forem: Abhijith Neil Abraham</title>
      <link>https://forem.com/abhijithneilabraham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abhijithneilabraham"/>
    <language>en</language>
    <item>
      <title>How to make LLMs work on large amounts of data</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 17 Jan 2026 01:16:39 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/how-to-make-llms-work-on-large-amounts-of-data-1ol0</link>
      <guid>https://forem.com/abhijithneilabraham/how-to-make-llms-work-on-large-amounts-of-data-1ol0</guid>
      <description>&lt;p&gt;Text-to-SQL tools have largely dominated the market for applying intelligence to large amounts of data. With the advent of LLMs, however, the task is now shared by several other techniques, including RAG and coding/SQL agents.&lt;/p&gt;

&lt;p&gt;One major issue with these approaches is that LLMs cannot actually see the data; they only receive a rough abstraction of it, such as summaries, samples, schema descriptions, or partial slices generated by another system.&lt;/p&gt;

&lt;p&gt;So what happens when you have a huge number of rows that all need to pass through an LLM?&lt;/p&gt;

&lt;p&gt;Let's see how we can tackle this using Datatune!&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/vitalops" rel="noopener noreferrer"&gt;
        vitalops
      &lt;/a&gt; / &lt;a href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;
        datatune
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Agentic data transformation on infinite amounts of data
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🎵 Datatune&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/datatune/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/4b2c2924b8872981e7bc0748348c66489acd80671532772f48be651d06216a29/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6461746174756e652e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://github.com/vitalops/datatune/blob/main/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/eb10ca27ad15af81931aba820f94fd4e013dd86f02c064265ca9feecda9305d7/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f766974616c6f70732f6461746174756e65" alt="License"&gt;&lt;/a&gt;
&lt;a href="https://pepy.tech/projects/datatune" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d7d480c2a2a5e3a8f03120765ae9e552c6d2bfeaab5fa4fde50172e97d866502/68747470733a2f2f7374617469632e706570792e746563682f62616467652f6461746174756e65" alt="PyPI Downloads"&gt;&lt;/a&gt;
&lt;a href="https://docs.datatune.ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b88e9ec96e33a755b4ec51fc3361b90847d784730ec04476b4a8c8ceaeff3355/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d646f63732e6461746174756e652e61692d626c7565" alt="Docs"&gt;&lt;/a&gt;
&lt;a href="https://discord.gg/3RKA5AryQX" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/44bbe45880e7ba449096cdcb9a9f8e0075973e6d7b7075d7d6cfd2bbe8861cbe/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d3732383964613f6c6f676f3d646973636f7264266c6f676f436f6c6f723d7768697465" alt="Discord"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Scalable Data Transformations with row-level intelligence.&lt;/p&gt;
&lt;p&gt;Datatune is not just another Text to SQL tool. With datatune, LLMs and Agents can have full access to infinite amount of data, and apply semantic intelligence in every record.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How It Works&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://blog.datatune.ai/how-to-make-llms-work-on-large-amounts-of-data" rel="nofollow noopener noreferrer"&gt;Click here to understand how Datatune works&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/vitalops/datatune/main/how%20it%20works.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fvitalops%2Fdatatune%2Fmain%2Fhow%2520it%2520works.png" alt="How it works"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install datatune&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;datatune&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datatune&lt;/span&gt;.&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;OpenAI&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;dask&lt;/span&gt;.&lt;span class="pl-s1"&gt;dataframe&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;dd&lt;/span&gt;
&lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;OpenAI&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"gpt-3.5-turbo"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_csv&lt;/span&gt;(&lt;span class="pl-s"&gt;"products.csv"&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Extract categories using natural language&lt;/span&gt;
&lt;span class="pl-s1"&gt;mapped&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;.&lt;span class="pl-c1"&gt;map&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Extract categories from the description and name of product."&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;output_fields&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Category"&lt;/span&gt;, &lt;span class="pl-s"&gt;"Subcategory"&lt;/span&gt;],
    &lt;span class="pl-s1"&gt;input_fields&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Description"&lt;/span&gt;, &lt;span class="pl-s"&gt;"Name"&lt;/span&gt;]
)(&lt;span class="pl-s1"&gt;llm&lt;/span&gt;, &lt;span class="pl-s1"&gt;df&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Filter with simple criteria&lt;/span&gt;
&lt;span class="pl-s1"&gt;filtered&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dt&lt;/span&gt;.&lt;span class="pl-c1"&gt;filter&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Keep only electronics products"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;input_fields&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;The Context Length Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs keep getting larger in terms of context length capabilities. But even at the current pace, a model with a 100M-token context window is no match for the data in an average user's database.&lt;/p&gt;

&lt;p&gt;This means the data that needs to be transformed can be several orders of magnitude larger than an LLM's context length.&lt;/p&gt;

&lt;p&gt;Consider a mid-sized enterprise with the following fairly typical setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;10 million rows in a transactional table&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;20 columns per row&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average 50 characters per column (IDs, text, timestamps, codes)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s:&lt;/p&gt;

&lt;p&gt;10,000,000 rows × 20 columns × 50 characters&lt;br&gt;
= 10,000,000,000 characters&lt;/p&gt;

&lt;p&gt;Even with aggressive tokenization (≈ 4 characters per token):&lt;/p&gt;

&lt;p&gt;10,000,000,000 ÷ 4&lt;br&gt;
≈ 2.5 billion tokens&lt;/p&gt;

&lt;p&gt;Now compare this with an extremely optimistic LLM context window of 100 million tokens.&lt;/p&gt;

&lt;p&gt;That single table alone is 25× larger than the model’s entire context.&lt;/p&gt;
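
&lt;p&gt;The same back-of-the-envelope estimate, reproduced in a few lines of Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rows, cols, chars_per_col = 10_000_000, 20, 50
chars = rows * cols * chars_per_col   # 10,000,000,000 characters
tokens = chars / 4                    # roughly 4 characters per token
print(tokens)                         # 2.5 billion tokens
print(tokens / 100_000_000)           # 25x a 100M-token context window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;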

&lt;p&gt;&lt;strong&gt;Solving Large Scale Data processing using Datatune&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Datatune, users can give LLMs full access to the data with the help of batch processing.&lt;/p&gt;

&lt;p&gt;Each row of data is combined with the input prompt, the combined batches are sent to the LLM, and the process continues until every batch has been processed. Datatune uses Dask's parallel processing abilities to split the data into partitions and send batches to the LLM in parallel.&lt;/p&gt;
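
&lt;p&gt;Conceptually, the batching looks something like the following sketch, where &lt;code&gt;call_llm_batch&lt;/code&gt; is a made-up stand-in for the real LLM client that Datatune wires in for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import dask.dataframe as dd

def call_llm_batch(prompts):
    # hypothetical stand-in for the real LLM call Datatune performs
    return ["Electronics"] * len(prompts)

def categorize_partition(pdf):
    # each Dask partition arrives as a pandas DataFrame; its rows are
    # serialized into prompts and sent to the LLM as one batch
    prompts = [f"Extract the category of: {name} - {desc}"
               for name, desc in zip(pdf["Name"], pdf["Description"])]
    return pdf.assign(Category=call_llm_batch(prompts))

df = dd.read_csv("products.csv")       # lazily split into partitions
result = df.map_partitions(categorize_partition).compute()  # batches run in parallel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;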

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1k3dnzw5qdjbqsxbxim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1k3dnzw5qdjbqsxbxim.png" alt="Datatune Batch Processing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Data Transformation Operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are four first-order data transformation functions (also known as primitives): MAP, FILTER, EXPAND, and REDUCE.&lt;/p&gt;

&lt;p&gt;Datatune is built on top of these primitives, and each primitive can be driven by a natural language prompt.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, a MAP operation is performed using a prompt to derive the output fields &lt;code&gt;Category&lt;/code&gt; and &lt;code&gt;Subcategory&lt;/code&gt; from the input fields &lt;code&gt;Description&lt;/code&gt; and &lt;code&gt;Name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Datatune can also chain multiple transformations together.&lt;/p&gt;

&lt;p&gt;Here's another example where a MAP and a FILTER are used together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First, extract sentiment and keywords from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)

# Then, keep only negative reviews for further analysis (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Datatune Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datatune has Agents, which let users run prompts without having to know which primitives to use. Agents are also helpful when a query is complex and requires multiple transformation steps chained together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtmwbqrlwn0mcsxbtw6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtmwbqrlwn0mcsxbtw6t.png" alt="Datatune Agents"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's an example where the previous MAP and FILTER operations, which were chained together above, are solved with just a single prompt to an Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Agent also executes Python code along with the row-level primitives (Map, Filter, etc.). This is especially useful for prompts that don't require row-level intelligence (numerical columns, for example), as the Agent can use Datatune's code generation capabilities to work on the data directly.&lt;/p&gt;
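
&lt;p&gt;For instance, a purely numerical request like this hypothetical prompt needs no per-row LLM calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# no row-level intelligence needed here, so the Agent can generate and
# run dataframe code instead of sending every row to the LLM
result = agent.do("Compute the average price per Category", df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;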

&lt;p&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datatune is designed to work with a wide variety of data sources, including DataFrames and databases. Through its Ibis integration, Datatune extends connectivity to databases such as DuckDB, Postgres, MySQL, and more.&lt;/p&gt;
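
&lt;p&gt;As a minimal sketch (with a made-up database file and table name), you could pull a table out of DuckDB via Ibis and hand it to Datatune as a Dask dataframe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import ibis
import dask.dataframe as dd

con = ibis.duckdb.connect("warehouse.ddb")    # made-up DuckDB file
products = con.table("products").execute()    # fetch as a pandas DataFrame
df = dd.from_pandas(products, npartitions=8)  # ready for dt.map / dt.filter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;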

&lt;p&gt;&lt;strong&gt;Contributing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're building Datatune in open source, and we would love your contributions!&lt;/p&gt;

&lt;p&gt;Check out the GitHub repository here: &lt;a href="https://github.com/vitalops/datatune" rel="noopener noreferrer"&gt;https://github.com/vitalops/datatune&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>datascience</category>
      <category>llm</category>
    </item>
    <item>
      <title>Redditflow - Find data from any timeline, from past to future, and feed your ML pipelines</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 28 May 2022 08:30:50 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/redditflow-find-data-from-any-timeline-from-past-to-future-and-feed-your-ml-pipelines-jnh</link>
      <guid>https://forem.com/abhijithneilabraham/redditflow-find-data-from-any-timeline-from-past-to-future-and-feed-your-ml-pipelines-jnh</guid>
      <description>&lt;p&gt;Finding data for your ML models can be cumbersome, and there are multiple resources you can collect it from. Depending on the data domain and task, you can find suitable data in many places, some of which involve social media. At &lt;a href="//www.nfflow.com"&gt;NFFlow&lt;/a&gt;, we ensure that data collection and ML model training are made simple for you; our mission is to simplify the whole process from data collection to ML model. You can even schedule cron jobs to collect data that will only appear in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USECASE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you want to train a model on text or image data, and you don't wanna go through all that Python jargon where you have to code a scraper and an ML model. That is where redditflow, a Reddit API from NFFlow, comes to your rescue!&lt;/p&gt;

&lt;p&gt;Let's break down the usage of the API, and how you're gonna benefit from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEXT API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Text API will help you scrape data from any timeline. All you need is a config file where you specify your topic of interest and the time period you want to scrape from. An ML-enabled classifier algorithm then helps you filter the data you scraped. Optionally, if you want a trained ML model as the output of the scraped data, you can specify that in the config.&lt;/p&gt;

&lt;p&gt;Here's a demonstrated example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = {
        "sort_by": "best",
         "subreddit_text_limit": 50,
        "total_limit": 200,
        "start_time": "27.03.2021 11:38:42",
        "end_time": "27.03.2022 11:38:42",
        "subreddit_search_term": "healthcare",
        "subreddit_object_type": "comment",
        "ml_pipeline": {""ml_pipeline":{"model_name":'distilbert-base-uncased','model_output_path':'healthcare_27.03.2021-27.03.2022_redditflow"}
    }
from redditflow import TextApi
TextApi(config) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As promised, we saved you from all the python jargon!&lt;/p&gt;

&lt;p&gt;We have uploaded a few sample models to huggingface hub using redditflow. &lt;a href="https://huggingface.co/NFflow"&gt;Check it out here!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Say you want to collect all images on a particular topic over a period of time, e.g. all images of cats posted to Reddit over a year. Here is how you can do it with a few lines of Python code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = {
        "sort_by": "best",
        "subreddit_image_limit": 3,
        "total_limit": 10,
         "start_time": "13.11.2021 09:38:42",
         "end_time": "15.11.2021 11:38:42",
         "subreddit_search_term": "cats",
         "subreddit_object_type": "comment",
         "client_id": "$CLIENT_ID", # get client id for praw
         "client_secret": $CLIENT_SECRET, #get client secret for praw
         }

from redditflow import ImageApi
ImageApi(config)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the API requires praw, a Python API for scraping Reddit, so you will need to provide a praw client id and secret.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contributions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, there's a lot we can do for the community through open source, and we welcome all contributions that help make the data science process simpler. Check out &lt;a href="https://github.com/nfflow/redditflow"&gt;https://github.com/nfflow/redditflow&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>reddit</category>
      <category>datascience</category>
    </item>
    <item>
      <title>TableQA - Query your tabular data with natural language</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Sat, 28 Nov 2020 17:33:21 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/tableqa-query-your-tabular-data-with-natural-language-39o</link>
      <guid>https://forem.com/abhijithneilabraham/tableqa-query-your-tabular-data-with-natural-language-39o</guid>
      <description>&lt;p&gt;Imagine you have a big database or dataframe of tabular data, but you don't know enough SQL, or don't know the values in the dataframe well enough, to write a good SQL query for the task. This is where you wish you could ask something in natural language and hope AI has you covered to find the results for you. TableQA is one such product that gets you the result you want.&lt;/p&gt;

&lt;p&gt;Suppose you are dealing with cancer death data, with columns named Year, Nationality, Gender, Age Group, Cancer site, and Death Count. You could use TableQA to simply ask&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"what is the maximum age of men having stomach cancer"&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 and get the result. As simple as that.&lt;/p&gt;

&lt;p&gt;TableQA uses huggingface transformers under the hood: an AI-based entity extractor generates key-value mappings between column names and values, and custom-trained classifiers determine the aggregate type, i.e. COUNT, MAX, MIN, SUM, or AVG. From there it builds the associated SQL query on its own with a rule-based approach. The rules are a real advantage: unlike other AI products that rely heavily on the dataset, you can modify the blocks of rules to get more accurate results.&lt;/p&gt;
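
&lt;p&gt;As a minimal sketch of the flow (following the Agent API from the project README, with made-up data and schema folders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tableqa.agent import Agent

# point the agent at a folder of CSVs and, optionally, their schemas
agent = Agent("data/", "schema/")

# get the generated SQL for a natural language question
sql = agent.get_query("what is the maximum age of men having stomach cancer")
print(sql)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;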

&lt;p&gt;The use of such rules is best explained with a schema.&lt;br&gt;
Suppose you want custom keywords in the natural language query to map to a particular column, or you need keywords for certain column values: you can add a schema containing this info to improve performance, as in the sketch below.&lt;/p&gt;
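
&lt;p&gt;For illustration, a hypothetical schema for the cancer death table might look like the following (written here as a Python dict that would be saved as JSON; the exact format is described in the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical schema: "keywords" tie natural language terms to a column
schema = {
    "name": "cancer_death",
    "columns": [
        {"name": "Gender", "keywords": ["men", "women", "male", "female"]},
        {"name": "Age Group", "keywords": ["age"]},
        {"name": "Cancer site", "keywords": ["cancer"]},
    ],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;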

&lt;p&gt;The output can be the result of the query itself, a bar or pie chart built from the result, or the generated SQL query, so that you can run it on your own database.&lt;/p&gt;

&lt;p&gt;The potential use cases and user base are widespread. You can use this on many databases, including MySQL, Postgres, SQLite, and even the same on Amazon RDS. You could even use this plugin inside your chatbots. Heck yeah!&lt;/p&gt;

&lt;p&gt;There are several more features tableQA supports that I'd like to show you. But enough talk, let's get coding! Feel free to check out the source code; there is also a colab example for you to try out!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/abhijithneilabraham/tableQA"&gt;Github-TableQA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1Bgd3L-839NVZiP3QqWfpkYIufQIm4Rar?usp=sharing"&gt;colab example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to mail me at &lt;a href="mailto:abhijithneilabrahampk@gmail.com"&gt;abhijithneilabrahampk@gmail.com&lt;/a&gt; for any questions.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Help!</title>
      <dc:creator>Abhijith Neil Abraham</dc:creator>
      <pubDate>Mon, 22 Jul 2019 07:23:39 +0000</pubDate>
      <link>https://forem.com/abhijithneilabraham/help-59d6</link>
      <guid>https://forem.com/abhijithneilabraham/help-59d6</guid>
      <description>&lt;p&gt;I need a standardised Git repository in C++ which contains a lot of string and char identifiers, and the identifiers should be named well (not single-character names for them).&lt;/p&gt;

</description>
      <category>cpp</category>
    </item>
  </channel>
</rss>
