Rahul @codingkite

Generative AI Jargons You Should Know

Ever Wondered How ChatGPT or Gemini Works?

We all have used AI tools like ChatGPT or Gemini, but have you ever wondered how these tools are able to generate such accurate responses to our queries?

In this blog, we'll get an overview of how AI models generate responses, and along the way we'll learn some of the jargon you often see floating around the internet.


AI Jargon You'll Learn in This Blog

Let's walk through how an AI model works while exploring these terms:

  1. LLM
  2. GPT (Generative Pre-trained Transformer)
  3. Transformer
  4. Tokens
  5. Encoder / Encoding
  6. Positional Encoding
  7. Decoder / Decoding
  8. Vectors
  9. Embedding
  10. Semantic Meaning
  11. Self-Attention
  12. Softmax
  13. Multi-Head Attention
  14. Temperature
  15. Knowledge Cutoff
  16. Tokenization
  17. Vocab Size

What is AI?

  • AI is basically an algorithm trained on data. After training, it generates output based on learned weights in response to a user query.

Transformer

  • An AI model is a mathematical structure that has learned patterns from data.
  • A Transformer is a type of deep learning architecture that's especially good at understanding sequences like text or audio.
  • It was introduced by Google researchers in 2017 in the paper "Attention Is All You Need".

GPT

  • GPT stands for Generative Pre-trained Transformer.
  • It's a pre-trained transformer that generates the next token based on the data it has seen.
  • Training GPT is expensive and time-consuming, so it's not retrained frequently.
  • Thus, GPT has a knowledge cutoff: it doesn't know anything that happened after its last training run.

Fun fact: ChatGPT combines GPT with an agent system in the background.


Transformer Model Architecture

Transformer Model Architecture Diagram

At first glance, this architecture might seem intimidating, but it becomes simple when explained step by step. Let's break it down:


Input Query & Tokenization

  • Input: the query provided by the user.
  • AI models like LLMs don't understand human language directly; they understand numbers.
  • So, we convert the input into numbers; this process is called Tokenization.
    • A token is a word or a piece of a word.
    • A token ID is its numeric representation.
  • Libraries like tiktoken (used by OpenAI) perform tokenization.

Vocabulary Size = the total number of unique tokens in the tokenizer's dictionary.


Tokenizer

  • A tokenizer is separate from the model.
  • It has a fixed vocabulary: a mapping of words (tokens) to numbers (token IDs).
  • The model doesn't "know" words; it only understands token IDs.
  • During inference, if a new word appears, the tokenizer either:
    • breaks it into known sub-tokens, or
    • uses an unknown-token placeholder.
  • So, vocab size = the number of unique tokens the model can recognize.

Tokenizer visualizer: tiktokenizer.vercel.app


Tokenizer Code Example

import tiktoken
from tiktoken._educational import SimpleBytePairEncoding, train_simple_encoding

# Load the o200k_base encoding (used by newer OpenAI models such as GPT-4o)
encoder = tiktoken.get_encoding("o200k_base")
print("encoder", encoder)
print("vocab size", encoder.n_vocab)

# Text -> token IDs
tokens = encoder.encode('Hello World, How are you?')
print("tokens", tokens)

# Token IDs -> text
decoded_text = encoder.decode(tokens)
print("decoded_text", decoded_text)

# Look up which encoding a given model uses
print(tiktoken.encoding_for_model('text-embedding-3-small'))

encoder_for_model = tiktoken.encoding_for_model("gpt-4o")
print("encoder_for_model", encoder_for_model)

# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()
print("enc", enc)

# Visualise how the GPT-4 encoder (cl100k_base) encodes text
another_encoder = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
print(another_encoder.encode("hello world aaaaaaaaaaaa"))

Embeddings (Vector & Positional)

Vector Embedding

  • A vector embedding is a numerical representation of tokens that captures semantic meaning.
  • Semantic meaning = the meaning of a word in a specific context.
    • "Reserve Bank" vs "bank of a river": same word, different meanings.

Imagine "cat" as a point in space: [0.2, 1.3, -0.5, ...]

  • These embeddings can be stored in vector databases like:
    • Pinecone
    • Chroma DB
    • Qdrant

Visualize embeddings here: TensorFlow Projector


Semantic Example

  • King → Queen implies Man → ? (Woman)
  • If "Queen" sits a few units away from "King" in a particular direction in vector space, then moving the same way from "Man" should land near the missing word, so the model can estimate it using simple vector math, as in the sketch below.

Embedding Visualization
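
To make the vector math concrete, here is a toy sketch with hand-picked 3-dimensional vectors (hypothetical values, not real embeddings) showing how "king - man + woman" lands closest to "queen" when directions are compared with cosine similarity:

import numpy as np

# Hand-picked toy vectors (made-up numbers, not real embeddings)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.5, 0.8, 0.3]),
    "woman": np.array([0.5, 0.2, 0.3]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means "pointing in exactly the same direction"
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King is to Queen as Man is to ?"  ->  king - man + woman
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w != "king"), key=lambda w: cosine(target, vectors[w]))
print(best)  # queen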


Positional Embedding

  • Tokens alone don't carry position info.
  • Two sentences with the same words but a different order mean very different things!
    • "The cat sat on the mat"
    • "The mat sat on the cat"

So, we use positional embeddings to give the model a sense of word order.

These are combined with the original embeddings to reflect each word's position in the sentence, as sketched below.
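
As a rough sketch (assuming the sinusoidal scheme from "Attention Is All You Need"; GPT-style models typically learn their position embeddings instead), position information can be added to the token embeddings like this:

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: every position gets a unique pattern of values
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # cosine on odd dimensions
    return pe

token_embeddings = np.random.randn(6, 8)                         # 6 tokens, d_model = 8
with_position = token_embeddings + positional_encoding(6, 8)     # position info added in
print(with_position.shape)                                       # (6, 8)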


Embedding Example with OpenAI

from openai import OpenAI
from dotenv import load_dotenv

# Reads OPENAI_API_KEY from a .env file
load_dotenv()
client = OpenAI()

text = "Eiffel tower is in Paris and is a famous landmark, it is 324 meters tall"

response = client.embeddings.create(
    input=text,
    model="text-embedding-ada-002"
)

print("vector embeddings", response.data[0].embedding)

Self-Attention Mechanism

In simple terms: tokens can talk to each other and update themselves!

  • This means that when two tokens interact, they can update their vector embeddings based on the sentence and with respect to each other.
  • Example: "river bank"
    • When these two tokens talk to each other, they update their embeddings based on the context.
    • So even if the word "bank" appears in both "river bank" and "ICICI bank", and both start with:
      • the same token,
      • the same original vector embedding,
      • and the same positional embedding,
    • the final vector embedding will differ because of how each interacts with the other tokens in the sentence.
  • Tokens update their embeddings based on all tokens in the sentence, not just one or two.
  • So what does self-attention do?
    • It allows tokens to adjust their embeddings based on the other tokens present in the input (see the sketch after this list).
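
Here is a minimal single-head self-attention sketch in NumPy; the projection matrices are random stand-ins for weights that a real model learns:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # how much each token attends to the others
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V                             # context-aware embeddings

d_model, d_head = 8, 4
X = np.random.randn(3, d_model)                    # 3 token embeddings, e.g. "the river bank"
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (3, 4)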

Multi-Head Attention: Seeing Things in Many Ways

Think of it like observing the sentence from multiple perspectives at once!

  • It helps the model focus on different aspects/perspectives of the tokens simultaneously.
  • At its heart, attention is about weighing the importance of different input tokens when focusing on a specific one.
  • Multi-head attention runs multiple attention operations in parallel (called "heads") and combines their results.
  • This allows the model to learn various types of relationships between tokens, all at the same time (see the sketch below).
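
As a rough, self-contained sketch of the same idea with multiple heads (again with random stand-in weights): each head runs its own scaled dot-product attention, and the head outputs are concatenated. Real models also apply a final learned linear projection, which is omitted here.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, n_heads, d_head):
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(n_heads):
        # Each head has its own (random, stand-in) projection matrices
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
        head_outputs.append(attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(head_outputs, axis=-1)   # combine the perspectives of all heads

X = np.random.randn(3, 8)                          # 3 tokens, d_model = 8
print(multi_head_attention(X, n_heads=2, d_head=4).shape)  # (3, 8)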

Feed Forward

It's a neural network that processes each token individually after attention is done.

  • The cycle of Multi-Head Attention followed by Feed Forward is repeated many times (layer after layer) to build a rich contextual result.
  • In GPT (and Transformers in general), after using attention to understand word relationships, the model sends the result through a Feed Forward Neural Network (FFN).

"Okay, I understand which words matter (thanks to attention)...

Now let me do some math on each word to extract more meaning!"


How It Works

  • The Feed Forward block is just a small, regular neural network applied to each token individually.

Here's what happens:

  1. Each token (word embedding) goes into a Linear layer (fully connected).
  2. Then through a ReLU or GELU activation (for non-linearity).
  3. Then again through another Linear layer.
  4. The output replaces the old embedding and continues through the Transformer stack.

Formula:

FFN(token) = Linear -> Activation -> Linear
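
A tiny NumPy sketch of that formula, applied to each token's vector independently (the weights here are random stand-ins; real models learn them and use much larger sizes):

import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(X, W1, b1, W2, b2):
    return gelu(X @ W1 + b1) @ W2 + b2             # Linear -> Activation -> Linear

d_model, d_ff = 8, 32                              # FFN usually expands, then projects back down
X = np.random.randn(3, d_model)                    # 3 token embeddings
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(feed_forward(X, W1, b1, W2, b2).shape)       # (3, 8): same shape out as in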

Why Is It Important?

  • Attention handles relationships between words.
  • Feed Forward handles processing each word by itself, like extracting deeper meanings and features.

Simple Analogy

Attention: "Hey 'cat', pay attention to 'mat' and 'sat'!"

Feed Forward: "Cool. Now let me upgrade 'cat' with better features based on that context."

Two Phases of a Model

  • Training Phase
  • Inference Phase (the "using" phase)

Training Phase

Let's break it down:

  • In the training phase, we match input and output.
  • We provide the input, the model gives us an output, we compare it with the actual (expected) output, calculate the loss, and then backpropagate it (backpropagation = "go back" and adjust the weights).

Example Flow:

  • Input: <start> my name is piyush
  • Actual Output: <start> my name is piyush <end> I am good
  • Model Output: <start> my name is piyush xsfd@e

Then:

  • We calculate the loss between the model's output and the actual output.
  • We send it back through the model to adjust the weights.
  • We repeat this until the model starts giving the expected output.

Goal: allow the model to update its weights using the training data.

This backpropagation and weight-update process requires a lot of compute power, hence the heavy GPU usage.
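
As a toy sketch of this loop (using PyTorch here as an assumption, with a deliberately tiny stand-in model rather than a real transformer), the predict, compare, backpropagate, and update cycle looks like this:

import torch
import torch.nn as nn

vocab_size, d_model = 100, 16
# Stand-in "model": an embedding layer plus a linear layer, not a real GPT
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical token IDs: targets are the inputs shifted by one (next-token prediction)
inputs = torch.tensor([2, 5, 9, 4])
targets = torch.tensor([5, 9, 4, 7])

for step in range(100):
    logits = model(inputs)             # the model's scores for the next token at each position
    loss = loss_fn(logits, targets)    # compare the model's output with the actual output
    optimizer.zero_grad()
    loss.backward()                    # backpropagation ("go back")
    optimizer.step()                   # update the weights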


Inference Phase: Using the Model

Time to use what we've trained!

  • We provide an input to the model:
    • token → vector embedding + positional embedding → multi-head attention

Technically, the model produces scores for many possible next tokens for a given input, not just a single answer.


Example:

  • Input: <start> how are you?<end>
  • Raw outputs (candidate next tokens): I, S, U
  • The final linear layer produces a score for each candidate token.
  • Example output (scores turned into probabilities):
    • I (95%)
    • S (4%)
    • U (1%)

The Softmax step converts the raw scores into probabilities, and the model then picks a token, usually the one with the highest probability.

Temperature Parameter:

  • Used together with Softmax to control randomness: the raw scores are divided by the temperature before Softmax turns them into probabilities.
  • Higher temperature → more randomness (might pick a less probable token).
  • Lower temperature → more deterministic (sticks to the highest-probability tokens); see the sketch below.
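
A small sketch with made-up scores for the three candidate tokens from the example above, showing how temperature changes the softmax probabilities:

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.array(logits) / temperature        # divide the raw scores by the temperature
    scaled = scaled - scaled.max()                 # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 1.5, 0.2]                           # hypothetical scores for "I", "S", "U"
for t in (0.2, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature: nearly all the probability lands on "I" (deterministic).
# High temperature: the probabilities spread out (more random sampling).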

Steps Breakdown

  • STEP 1:
    • Input: <start> how are you?<end>
    • Probabilities: I (95%), S (4%), U (1%)
    • Chosen: I
    • Output so far: <start> how are you?<end> I
  • STEP 2:
    • Input: <start> how are you?<end> I
    • Probabilities: _am (80%), few (5%), good (4%)
    • Chosen: _am
    • Output so far: <start> how are you?<end> I_am
  • STEP 3:
    • The process continues iteratively, one token at a time, until it reaches the <end> token or finishes generating (a minimal loop is sketched below).
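
The loop itself can be sketched like this; dummy_model and the token IDs are hypothetical placeholders standing in for a real LLM and tokenizer:

import random

def generate(model, prompt_tokens, end_token, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                                        # probabilities for the next token
        next_token = max(range(len(probs)), key=probs.__getitem__)   # greedy: pick the most probable
        tokens.append(next_token)
        if next_token == end_token:                                  # stop at the <end> token
            break
    return tokens

def dummy_model(tokens):
    # Stand-in for a real LLM: returns a random distribution over a vocab of 10 token IDs
    scores = [random.random() for _ in range(10)]
    total = sum(scores)
    return [s / total for s in scores]

print(generate(dummy_model, prompt_tokens=[1, 2, 3], end_token=0))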

Extra Info

  • The Transformer model was introduced by Google in the research paper "Attention Is All You Need".
  • It was originally created with translation (Google Translate) in mind: the idea was to understand and translate semantic meaning using NLP.
  • This means it was built with language understanding in mind.

GPT (by OpenAI) was based on the Transformer, but instead of translation, it was built for next-token prediction.

