<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tamás Bereczki</title>
    <description>The latest articles on Forem by Tamás Bereczki (@bereczki).</description>
    <link>https://forem.com/bereczki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F782799%2F4a7c35d6-6a48-4b43-9e82-97726b3f88a1.png</url>
      <title>Forem: Tamás Bereczki</title>
      <link>https://forem.com/bereczki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bereczki"/>
    <language>en</language>
    <item>
      <title>Spring AI: How to use Generative AI and apply RAG?</title>
      <dc:creator>Tamás Bereczki</dc:creator>
      <pubDate>Tue, 09 Sep 2025 13:20:51 +0000</pubDate>
      <link>https://forem.com/bereczki/spring-ai-how-to-use-generative-ai-and-applied-rag-2hje</link>
      <guid>https://forem.com/bereczki/spring-ai-how-to-use-generative-ai-and-applied-rag-2hje</guid>
      <description>&lt;h2&gt;
  
  
  Let’s dive into the world of AI: let’s investigate how Spring AI works, learn how to use an AI model programmatically, and generate some content with the RAG method.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Generative AI models are powerful, but their knowledge is limited to the data they were trained on. So, how can we make them intelligent about our own specific documents or data? This is where the Retrieval-Augmented Generation (RAG) pattern comes in. In this article, I’ll guide you step-by-step through building a pet project that does exactly that, using a practical, code-first approach.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you first want to learn what Artificial Intelligence means and how it works under the hood, read this article, thank you:&lt;br&gt;
&lt;a href="https://dev.to/bereczki/beyond-the-buzzwords-how-generative-ai-really-works-bac"&gt;https://dev.to/bereczki/beyond-the-buzzwords-how-generative-ai-really-works-bac&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k28k6fcziyrg1k62io0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k28k6fcziyrg1k62io0.webp" alt="RAG technique workflow diagram (source: https://docs.spring.io)" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Project idea
&lt;/h2&gt;

&lt;p&gt;A fairly common idea came to mind, a classic pet project at universities or at home: create a movie database service (like IMDb). In this case, however, I am going to focus on how I can enhance this movie database with AI services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project:&lt;/strong&gt; Customized media suggestion service&lt;br&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt;&lt;br&gt;
Create a system that can give users personal suggestions for media content, based on the topics they are interested in and the content they have previously watched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data: collect content (movie data) and store it in a vector database&lt;/li&gt;
&lt;li&gt;User profile: let’s assume there are registered users in the system, and the system collects feedback from users about watched movies: which ones they liked and which ones they did not.&lt;/li&gt;
&lt;li&gt;Applying RAG: when a user logs in or requests new suggestions, the system queries the liked content and uses it to find other, similar movies. The vector database’s similarity search feature is used here.&lt;/li&gt;
&lt;li&gt;The generative AI receives these suggestions and summarizes them in a personalized result.&lt;/li&gt;
&lt;li&gt;Fine-tuning: the generative AI can explain why the suggested content is relevant for the user and provide a short description of why we think they will like the suggested movie&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can see how RAG can be used for personalization&lt;/li&gt;
&lt;li&gt;We get the chance to learn how to vectorize content, store it, and perform similarity search.&lt;/li&gt;
&lt;/ul&gt;
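&lt;p&gt;&lt;em&gt;The requirements above can be sketched as a plain-Java flow with every step stubbed out. The stub names (findLikedMovies, similaritySearch, generateSummary) and the hardcoded data are illustrative only; in the real project these steps are backed by the ratings data, the vector store, and the chat model.&lt;/em&gt;&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;

// Sketch of the suggestion flow: liked movies -> similar movies -> personalized summary.
public class SuggestionFlowSketch {

    // Step 1: the system knows which movies the user liked (from their ratings).
    static List<String> findLikedMovies(String userId) {
        return List.of("The Godfather");
    }

    // Step 2: similarity search in the vector database (stubbed with a map).
    static List<String> similaritySearch(String movie) {
        Map<String, List<String>> similar = Map.of(
                "The Godfather", List.of("Goodfellas", "Casino"));
        return similar.getOrDefault(movie, List.of());
    }

    // Step 3: the generative model summarizes why the suggestions fit (stubbed with a template).
    static String generateSummary(String userId, List<String> suggestions) {
        return "Because you liked crime dramas, you may enjoy: "
                + String.join(", ", suggestions);
    }

    public static void main(String[] args) {
        List<String> liked = findLikedMovies("user-001");
        List<String> suggestions = similaritySearch(liked.get(0));
        System.out.println(generateSummary("user-001", suggestions));
    }
}
```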


&lt;h2&gt;
  
  
  Spring AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;br&gt;
The Spring Framework is a mature and well-known tool for Java developers to build web applications. Spring offers many different tools that developers can use to build solutions. A brand new one is Spring AI, which provides an abstraction layer that makes it easy to work with AI models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For more information, check Spring AI documentation: &lt;a href="https://docs.spring.io/spring-ai/reference/getting-started.html" rel="noopener noreferrer"&gt;https://docs.spring.io/spring-ai/reference/getting-started.html&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why did I choose Spring AI?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I usually start a new project by selecting the Spring Boot framework&lt;/li&gt;
&lt;li&gt;I would like to become familiar with new AI-related technologies as soon as possible&lt;/li&gt;
&lt;li&gt;Spring AI already has a released version, so its architecture should be reasonably stable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;Maven dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-tika-document-reader&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-rag&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These dependencies are required for embedding and RAG support; the core AI dependencies are pulled in transitively by the Spring AI Ollama dependency!&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollama server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;br&gt;
Ollama is a user-friendly, open-source tool designed to simplify running large language models (LLMs) locally on your computer. It enables you to download, run, and interact with these models without relying on cloud-based services.&lt;/p&gt;

&lt;p&gt;Among the most &lt;strong&gt;popular models&lt;/strong&gt;, you can find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deepseek-r1&lt;/li&gt;
&lt;li&gt;mistral&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How can it help me?&lt;/strong&gt;&lt;br&gt;
Ollama lets you use an LLM without subscribing to any cloud-based model (GPT from OpenAI, Gemini/Vertex AI by Google, etc.), because it downloads models from its central repository or from the Hugging Face repository to your local machine. Ollama then runs as a server that provides an API for operating these models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (on Linux)&lt;/strong&gt;&lt;br&gt;
(This Linux can be Ubuntu running inside WSL on Windows)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;span class="nv"&gt;$ &lt;/span&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download, install, and start the Ollama server.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By default, Spring pulls missing models at startup; however, this sometimes fails due to a timeout. In that case, pull the model manually with the &lt;code&gt;ollama pull &amp;lt;model&amp;gt;&lt;/code&gt; command.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Maven dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-starter-model-ollama&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.ai&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-ai-autoconfigure-model-ollama&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vector Database
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What is a Vector Database?&lt;/strong&gt;&lt;br&gt;
A vector database is a specialized database designed to store, manage, and query data represented as numerical vectors. These &lt;strong&gt;vectors&lt;/strong&gt; are mathematical representations of data objects (like &lt;em&gt;text&lt;/em&gt;, &lt;em&gt;images&lt;/em&gt;, or &lt;em&gt;audio&lt;/em&gt;) that &lt;strong&gt;capture their semantic meaning or characteristics&lt;/strong&gt;. Essentially, they allow computers to understand and compare data based on similarity rather than exact matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it needed?&lt;/strong&gt;&lt;br&gt;
Because the vector represents the semantic meaning of a piece of data, in our case a movie.&lt;/p&gt;

&lt;p&gt;Imagine a movie described in JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The Godfather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Crime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"actors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Marlon Brando"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Al Pacino"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During the embedding operation, a vector is created from this JSON data and stored in the vector database. Then, when you query with a plot like “&lt;em&gt;Mafia family&lt;/em&gt;”, the similarity search will most likely return the movie “&lt;em&gt;The Godfather&lt;/em&gt;”.&lt;/p&gt;
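&lt;p&gt;&lt;em&gt;To make this concrete, here is the math behind similarity search in plain Java: each document becomes an embedding vector, and “similar” means high cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings from mxbai-embed-large have 1024 dimensions.&lt;/em&gt;&lt;/p&gt;

```java
// Cosine similarity: 1.0 for identical directions, 0.0 for unrelated (orthogonal) ones.
public class CosineSimilarity {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] godfather  = {0.9, 0.8, 0.1}; // hypothetical "crime drama" embedding
        double[] mafiaQuery = {0.8, 0.9, 0.2}; // hypothetical "Mafia family" query embedding
        double[] cartoon    = {0.1, 0.0, 0.9}; // hypothetical unrelated movie embedding

        // The query vector points in nearly the same direction as "The Godfather",
        // so similarity search ranks it far above the unrelated movie.
        System.out.printf("query vs Godfather: %.3f%n", cosine(mafiaQuery, godfather));
        System.out.printf("query vs cartoon:   %.3f%n", cosine(mafiaQuery, cartoon));
    }
}
```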

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need a vector database to store each movie together with its vector&lt;/li&gt;
&lt;li&gt;It must provide a similarity search function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Which vector databases are available?&lt;/strong&gt;&lt;br&gt;
Several database vendors have added vector features to their database engines, and Spring AI supports many of them. Here is a short list, but check the &lt;a href="https://docs.spring.io/spring-ai/reference/api/vectordbs.html" rel="noopener noreferrer"&gt;Spring AI guide&lt;/a&gt; for the full list and related information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Cassandra&lt;/li&gt;
&lt;li&gt;Couchbase&lt;/li&gt;
&lt;li&gt;Elasticsearch&lt;/li&gt;
&lt;li&gt;MariaDB&lt;/li&gt;
&lt;li&gt;MongoDB Atlas&lt;/li&gt;
&lt;li&gt;OpenSearch&lt;/li&gt;
&lt;li&gt;Oracle Database&lt;/li&gt;
&lt;li&gt;Postgres&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;and so on…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;I chose Elasticsearch&lt;/strong&gt; because I have worked with it a lot before, and I did not want to dive deep into an unfamiliar database engine right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start Elasticsearch server with Docker:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; elasticsearch &lt;span class="nt"&gt;--net&lt;/span&gt; somenetwork &lt;span class="nt"&gt;-p&lt;/span&gt; 9200:9200 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-p&lt;/span&gt; 9300:9300 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"discovery.type=single-node"&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"xpack.security.enabled=false"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
docker.elastic.co/elasticsearch/elasticsearch:9.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This starts the Elasticsearch server and exposes it on port 9200. If you run into any issues, check the &lt;a href="https://hub.docker.com/_/elasticsearch" rel="noopener noreferrer"&gt;Docker description&lt;/a&gt;.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Configure the Elasticsearch vector store in the Spring application.yaml properties file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;elasticsearch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;uris&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:9200&lt;/span&gt;
  &lt;span class="na"&gt;ai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
    &lt;span class="na"&gt;vectorstore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;elasticsearch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;initialize-schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;index-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;movies&lt;/span&gt;
        &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="c1"&gt;# vector dimension which depends on selected embedding model, in case of 'mxbai-embed-large' is 1024&lt;/span&gt;
        &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cosine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Take away:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have a platform to build a solution for the defined project (Spring Framework✅)&lt;/li&gt;
&lt;li&gt;We have tool to manage AI (Spring AI ✅)&lt;/li&gt;
&lt;li&gt;We have LLM model to use for embedding and generating content (Ollama ✅)&lt;/li&gt;
&lt;li&gt;We have Vector Database to store movies and perform similarity search (Elasticsearch ✅)&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Selecting AI model
&lt;/h3&gt;

&lt;p&gt;The Spring AI framework has default configuration options for the embedding and generative features, which you can set in the application.yaml properties file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;embedding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mxbai-embed-large&lt;/span&gt;
      &lt;span class="na"&gt;chat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mistral&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is the difference between embedding and chat models?&lt;/strong&gt;&lt;br&gt;
This question goes back to the foundations of AI, so let’s first ask another question:&lt;br&gt;
&lt;strong&gt;What is the difference between embedding and content generation?&lt;/strong&gt;&lt;br&gt;
Embedding is the procedure of vectorizing a document; it is performed by an encoder type of AI model.&lt;br&gt;
Content generation is performed by another type of model, a decoder, which produces new output (text) from a prompt.&lt;br&gt;
(&lt;em&gt;This is an oversimplification; if it is not clear, please read the article referenced at the beginning of this post.&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;So that is the difference, and we can use the same or different models for input (vectorization) and for generation (the chat model).&lt;/p&gt;
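&lt;p&gt;&lt;em&gt;The contrast can be sketched in plain Java with two hypothetical stubs, making no assumptions about Spring AI’s real interfaces: an encoder maps text of any length to a fixed-length vector, while a decoder maps a prompt to newly generated text.&lt;/em&gt;&lt;/p&gt;

```java
// Toy contrast between the two model types used in this project.
public class ModelTypes {

    // Encoder (embedding model): text in, fixed-length vector out.
    // Stubbed with character statistics; real models output e.g. 1024 dimensions.
    static double[] embed(String text) {
        double[] vector = new double[4];
        for (char c : text.toCharArray()) {
            vector[c % vector.length] += 1.0;
        }
        return vector;
    }

    // Decoder (chat model): prompt in, generated text out. Stubbed with a template.
    static String chat(String prompt) {
        return "Generated answer for: " + prompt;
    }

    public static void main(String[] args) {
        // The vector length is the same regardless of the input text length.
        System.out.println(embed("The Godfather").length);
        System.out.println(chat("Suggest a crime drama"));
    }
}
```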

&lt;p&gt;&lt;strong&gt;What do we use the AI models for?&lt;/strong&gt;&lt;br&gt;
The AI model is used to vectorize movies, and to generate the text content (the response) for the user who requests movie suggestions. The latter works just like the well-known chat models (GPT, Gemini, etc.).&lt;/p&gt;


&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Let’s see what every developer wants to see: the code itself, and how all of this is resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I know the code quality is not the best! This project is just a proof of concept for learning, so clean code was not a high priority. Thanks for understanding!&lt;/p&gt;
&lt;h3&gt;
  
  
  #1 Create Test Data
&lt;/h3&gt;

&lt;p&gt;The first things we need at the beginning of the solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design a data layer that suits our solution&lt;/li&gt;
&lt;li&gt;Create Movie descriptions (id, title, year, genre, director, actors, plot)&lt;/li&gt;
&lt;li&gt;Create Users (id, username, name, email, age)&lt;/li&gt;
&lt;li&gt;Create movie ratings by users (id, userId, movieId, rating, comment, date)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create Java models for them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Ratings&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;RatedMovie&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="nd"&gt;@JsonIgnoreProperties&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignoreUnknown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;RatedMovie&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;movieId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;dateRated&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Movies&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Movie&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="nd"&gt;@JsonIgnoreProperties&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ignoreUnknown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Movie&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;director&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;actors&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate test data, I used another AI, the Junie agent provided in JetBrains IntelliJ IDEA. I asked it to create JSON files in the resources folder for movies, users, and ratings with the defined attributes. Junie created the test data successfully, step by step: it checked the defined model classes and used them to determine the required attributes, then asked for permission to write files into the resources folder and generated the test data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The Shawshank Redemption"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1994&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Drama"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"director"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Frank Darabont"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"actors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Tim Robbins"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Morgan Freeman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Bob Gunton"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"plot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The Godfather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1972&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Crime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Drama"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"director"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Francis Ford Coppola"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"actors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Marlon Brando"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Al Pacino"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"James Caan"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"plot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The Dark Knight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Crime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Drama"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"director"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Christopher Nolan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"actors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Christian Bale"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Heath Ledger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Aaron Eckhart"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"plot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie_buff_42"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Smith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"john.smith@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"New York, USA"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cinema_lover"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Emma Johnson"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"emma.j@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Los Angeles, USA"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ratings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rating-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"movieId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Absolutely brilliant film. The performances are outstanding and the story is deeply moving."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dateRated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-01-15"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rating-002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"movieId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Heath Ledger's Joker is one of the greatest performances in cinema history."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dateRated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-02-03"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rating-003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"movieId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movie-005"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mind-bending plot with amazing visuals. Nolan at his best."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dateRated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-01-22"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  #2 Vectorize Movies
&lt;/h3&gt;

&lt;p&gt;‣ &lt;strong&gt;Define a TextSplitter bean&lt;/strong&gt; implementation, which will be used during vectorization to split documents into tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RatingAiConfiguration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;TextSplitter&lt;/span&gt; &lt;span class="nf"&gt;textSplitter&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;‣ &lt;strong&gt;Create a service&lt;/strong&gt; that adds Movie documents to the Vector Database (MovieSuggestionService.java)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MovieSuggestionService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;VectorStore&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;MovieSuggestionService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;VectorStore&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;vectorStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VectorStore is a Spring AI interface; the implementation injected here is ElasticsearchVectorStore.&lt;/p&gt;
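
&lt;p&gt;A minimal configuration sketch for wiring the Elasticsearch vector store could look like the following. The property names are assumptions based on the Spring AI Elasticsearch starter and may differ between versions, so check the reference documentation for the version you use:&lt;/p&gt;

```properties
# application.properties (sketch; assumed property names)
spring.elasticsearch.uris=http://localhost:9200
# let Spring AI create the index and mapping on startup
spring.ai.vectorstore.elasticsearch.initialize-schema=true
spring.ai.vectorstore.elasticsearch.index-name=movies
# must match the embedding model's output dimension
spring.ai.vectorstore.elasticsearch.dimensions=1536
```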

&lt;p&gt;‣ &lt;strong&gt;Initialize the Vector Database with data&lt;/strong&gt; from resource files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;read the test movies,&lt;/li&gt;
&lt;li&gt;parse them into a Java model (Movies.class),&lt;/li&gt;
&lt;li&gt;use a JSON reader to create Documents (Document is the unit managed by Vector Databases),&lt;/li&gt;
&lt;li&gt;extend each document with the movie identifier and a randomized popularity index; during similarity search, this metadata can be used to filter documents,&lt;/li&gt;
&lt;li&gt;split each document into a sequence of tokens,&lt;/li&gt;
&lt;li&gt;store the test movies with their vector representations&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="nc"&gt;MovieSuggestionService&lt;/span&gt; &lt;span class="n"&gt;movieSuggestionService&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="nc"&gt;TextSplitter&lt;/span&gt; &lt;span class="n"&gt;textSplitter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;indexTestMovies&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;objectMapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;AiRagApplication&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getClassLoader&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getResourceAsStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"movies.json"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Movie&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;JsonReader&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JsonReader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayResource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsBytes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forEach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetadata&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"popularity"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RandomUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;randomInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetadata&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"movieId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
            &lt;span class="n"&gt;movieSuggestionService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;textSplitter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;});&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"done"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the movies index in Elasticsearch with content and embeddings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuuo0tlhl7k0vay4v0ho.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuuo0tlhl7k0vay4v0ho.webp" alt="Elasticsearch data after indexing document and vector" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;
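
&lt;p&gt;For reference, a stored entry looks roughly like this. This is a simplified sketch; the exact field names and the embedding dimension depend on the vector store implementation and the embedding model:&lt;/p&gt;

```json
{
  "content": "{\"id\":\"movie-003\",\"title\":\"The Dark Knight\", ...}",
  "metadata": {
    "movieId": "movie-003",
    "popularity": 4
  },
  "embedding": [0.0123, -0.0456, 0.0789, ...]
}
```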




&lt;h3&gt;
  
  
  #3 Implement similarity search for RAG
&lt;/h3&gt;

&lt;p&gt;Extend the MovieSuggestionService with a search function, which takes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the prompt from the ‘user’, which is the content to find similar documents for,&lt;/li&gt;
&lt;li&gt;a filter expression, if additional filtering on document metadata is needed,&lt;/li&gt;
&lt;li&gt;a SearchRequestOption, if a custom search configuration is needed, such as the similarity threshold for documents or the topK parameter, which limits the results to at most K documents&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MovieSuggestionService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;SearchRequestOption&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt; &lt;span class="n"&gt;similarityThreshold&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;SearchRequestOption&lt;/span&gt; &lt;span class="n"&gt;searchRequestOption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SearchRequestOption&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;DEFAULT_TOP_K&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Expression&lt;/span&gt; &lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;searchRequestOption&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Expression&lt;/span&gt; &lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SearchRequestOption&lt;/span&gt; &lt;span class="n"&gt;searchRequestOption&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;SearchRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;searchRequestBuilder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SearchRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;similarityThreshold&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchRequestOption&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;similarityThreshold&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchRequestOption&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;similarityThresholdAll&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Objects&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nonNull&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isBlank&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;searchRequestBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userPromptText&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Objects&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nonNull&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;searchRequestBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filterExpression&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchRequestBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SearchRequest&lt;/span&gt; &lt;span class="n"&gt;searchRequest&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Search request: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;searchRequest&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;similaritySearch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchRequest&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
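
&lt;p&gt;Under the hood, the similarityThreshold compares embedding vectors; for most embedding models the score is (a variant of) cosine similarity. The following is a self-contained illustration, independent of Spring AI, with made-up toy vectors:&lt;/p&gt;

```java
// Standalone illustration of the score behind similarity search:
// cosine similarity between two embedding vectors (toy, hand-made vectors).
public class CosineSimilarity {

    // cos(a, b) = (a . b) / (|a| * |b|), in [-1, 1]; higher means more similar
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] darkKnight = {0.9, 0.1, 0.3}; // made-up "crime drama" embedding
        double[] inception  = {0.8, 0.2, 0.4}; // close to it in embedding space
        double[] romCom     = {0.1, 0.9, 0.2}; // far from it in embedding space
        System.out.printf("dark knight vs inception: %.3f%n", cosine(darkKnight, inception));
        System.out.printf("dark knight vs rom-com:   %.3f%n", cosine(darkKnight, romCom));
    }
}
```

&lt;p&gt;With the 0.6 similarityThreshold used above, only the first pair would pass the filter in this toy example.&lt;/p&gt;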



&lt;p&gt;Once this is done, create SuggestionRestController.java, which will contain the endpoint definitions; for now, it only implements this similarity search call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RestController&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SuggestionRestController&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Autowired&lt;/span&gt;
    &lt;span class="nc"&gt;MovieSuggestionService&lt;/span&gt; &lt;span class="n"&gt;movieSuggestionService&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;findSimilarMovies&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;referenceMovie&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;movieSuggestionService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;search&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;JsonReader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ByteArrayResource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;referenceMovie&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getText&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExpressionType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GTE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"popularity"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  #4 Create Movie Suggestion endpoint and configure Chat Client (with RAG)
&lt;/h3&gt;

&lt;p&gt;Extend RatingAiConfiguration with a ChatClient bean. This client has a system prompt that defines what the ChatClient is for, so that it generates content accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RatingAiConfiguration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="nf"&gt;movieSuggestAi&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;defaultSystem&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="s"&gt;"You are a chat bot for movie suggestions. Use the provided movies suggest another "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                                &lt;span class="s"&gt;"ones to watch and write a interesting summary of the movie. You can append the provided "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                                &lt;span class="s"&gt;"movies with another ones which similar to them. Maximum 3 another movies you can suggest."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to SuggestionRestController, define the suggest endpoint, and implement it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RestController&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SuggestionRestController&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="nd"&gt;@Qualifier&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"movieSuggestAi"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nd"&gt;@Autowired&lt;/span&gt;
  &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="n"&gt;movieSuggestionGenAi&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nd"&gt;@GetMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/suggest"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;suggestMovies&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@RequestParam&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ratedMovie&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queryUserRatedMovies&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// finds rated movies by user&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;movieSuggestionGenAi&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Prompt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Give some movie suggestions to watch."&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;advisors&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advisorSpec&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;advisorSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;advisors&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movieSuggestionRag&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratedMovie&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;chatResponse&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResult&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOutput&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getText&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="cm"&gt;/**
  * Get movie documents which were rated by user (1-5). 
  * Do a similarity search by them to find movies similar to user liked.
  */&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;RetrievalAugmentationAdvisor&lt;/span&gt; &lt;span class="nf"&gt;movieSuggestionRag&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ratedMovie&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetrievalAugmentationAdvisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;documentRetriever&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
                      &lt;span class="n"&gt;ratedMovie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movieBytes&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
                                      &lt;span class="n"&gt;findSimilarMovies&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movieBytes&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                              &lt;span class="o"&gt;)&lt;/span&gt;
                              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
              &lt;span class="o"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it: we have an AI-based movie suggestion solution ready! ✨🎉&lt;/p&gt;




&lt;h3&gt;
  
  
  Testing the suggestion endpoint
&lt;/h3&gt;

&lt;p&gt;After starting the Spring application on localhost (the default port is 8080), you will be able to send requests to the movie suggestion endpoint we defined.&lt;/p&gt;

&lt;p&gt;Let’s take an existing user from the test data: user-001&lt;/p&gt;

&lt;p&gt;Send request (in Postman or cURL) to:&lt;br&gt;
&lt;em&gt;&lt;a href="http://localhost:8080/suggest?userId=user-001" rel="noopener noreferrer"&gt;http://localhost:8080/suggest?userId=user-001&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After a short wait, the chat model returns a response with suggestions based on the movies ‘user-001’ already liked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o4pdett0nnltuyyu569.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o4pdett0nnltuyyu569.webp" alt="Postman request/response of movie suggestion endpoint" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Thank you for your attention, I hope I managed to share something useful from my experience!&lt;/strong&gt;&lt;br&gt;
Give it a try yourself, and have a nice day. 😊👋&lt;/p&gt;

</description>
      <category>ai</category>
      <category>springboot</category>
      <category>rag</category>
      <category>java</category>
    </item>
    <item>
      <title>Beyond the Buzzwords: How Generative AI Really Works</title>
      <dc:creator>Tamás Bereczki</dc:creator>
      <pubDate>Tue, 09 Sep 2025 10:02:47 +0000</pubDate>
      <link>https://forem.com/bereczki/beyond-the-buzzwords-how-generative-ai-really-works-bac</link>
      <guid>https://forem.com/bereczki/beyond-the-buzzwords-how-generative-ai-really-works-bac</guid>
      <description>&lt;h2&gt;
  
  
  A deep dive into the core mechanics of modern LLMs, explaining the essential concepts that separate a casual user from a true practitioner
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM Architectures
&lt;/h3&gt;

&lt;p&gt;The Transformer’s core innovation is the attention mechanism, which allows the model to weigh the importance of different words in the input text when processing and generating language. This architecture is the backbone of most modern LLMs, including models like GPT and BERT. The Transformer is composed of two primary building blocks: Encoders and Decoders. Different models use these blocks in different combinations to achieve their specific capabilities.&lt;/p&gt;

&lt;p&gt;Before moving forward, let’s make sure the terms are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text/Document: The full sequence of words you are working with.&lt;/li&gt;
&lt;li&gt;Token: The smallest unit the model processes. After tokenization, a sentence is broken down into these pieces. A token is often a word (like “They”) or a sub-word (like “ing” in “running”).&lt;/li&gt;
&lt;li&gt;Embedding: A numerical vector that represents the semantic meaning of a token or a sequence of tokens.&lt;/li&gt;
&lt;/ul&gt;
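&lt;p&gt;To make the pipeline concrete, here is a toy sketch of tokenization (splitting on whitespace only; real tokenizers such as BPE split into sub-words):&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the text -> token step. Real tokenizers (e.g. BPE)
// split into sub-words; here we simply split on whitespace for clarity.
public class ToyTokenizer {
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("They sent me a pet"));
        // prints [They, sent, me, a, pet]
    }
}
```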

&lt;h4&gt;
  
  
  Encoders and Decoders
&lt;/h4&gt;

&lt;p&gt;AI model types have different capabilities, e.g. embedding, text generation, text-to-image generation, etc.&lt;br&gt;
The models of each type come in a variety of sizes (number of parameters).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Typically, Encoders are used for embedding, and Decoders are used for generation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Encoders
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What is Embedding?&lt;/strong&gt;&lt;br&gt;
The model converts a sequence of words into an embedding (a vector representation of the words).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhkmnvvgzd0zprro92nj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhkmnvvgzd0zprro92nj.webp" alt="Encoder model representation" width="649" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sequence of words is “They sent me a”. This sentence is tokenized into chunks (tokenization means a character sequence, typically a sentence, is broken into small pieces, mostly simple words).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Each token and the whole sentence will be embedded, a vector representation is created from them.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What are Vector Embeddings and why are they useful?&lt;/strong&gt;&lt;br&gt;
A vector embedding is a powerful concept where a word, sentence, or even an entire document is converted into a numerical representation — a list of numbers called a vector. This vector is designed to capture the rich semantic meaning and context of the original text.&lt;/p&gt;

&lt;p&gt;Imagine a vast, multi-dimensional space (often with hundreds of dimensions, such as 300 or more). In this space, every concept has a specific location, represented by its vector. The key principle is that &lt;strong&gt;semantically similar concepts will have vectors that are close to each other.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not that a single dimension represents a simple, human-readable trait like ‘kindness’. Instead, meaning is encoded in the vector’s &lt;em&gt;overall position and its relationships with other vectors&lt;/em&gt;. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The vectors for “&lt;em&gt;polite&lt;/em&gt;” and “&lt;em&gt;courteous&lt;/em&gt;” would be located very close together.&lt;/li&gt;
&lt;li&gt;The vectors for “&lt;em&gt;king&lt;/em&gt;” and “&lt;em&gt;queen&lt;/em&gt;” would also be near each other.&lt;/li&gt;
&lt;li&gt;Furthermore, the model learns complex relationships. The vector relationship between “&lt;em&gt;king&lt;/em&gt;” and “&lt;em&gt;queen&lt;/em&gt;” is very similar to the relationship between “&lt;em&gt;man&lt;/em&gt;” and “&lt;em&gt;woman&lt;/em&gt;”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The primary application of this is enabling &lt;strong&gt;similarity search&lt;/strong&gt;. By storing these embeddings in a specialized &lt;strong&gt;vector database&lt;/strong&gt;, we can find documents or pieces of text that are semantically similar to a user’s query. Instead of just matching keywords, a similarity search finds content that matches the &lt;em&gt;meaning&lt;/em&gt; and &lt;em&gt;intent&lt;/em&gt; behind the query, leading to much more relevant and intelligent search results.&lt;/p&gt;
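&lt;p&gt;A minimal sketch of the idea behind similarity search is cosine similarity between two vectors (the 3-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions):&lt;/p&gt;

```java
// Minimal sketch of similarity search over vector embeddings.
// The vectors are hypothetical 3-dimensional stand-ins for real embeddings.
public class SimilaritySearch {
    // Cosine similarity: close to 1.0 means "same direction" (similar meaning).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] polite    = {0.9, 0.1, 0.3};   // made-up embedding
        double[] courteous = {0.85, 0.15, 0.35};
        double[] alligator = {0.1, 0.9, 0.7};
        // "polite" is far closer to "courteous" than to "alligator".
        System.out.printf("%.3f%n", cosine(polite, courteous));
        System.out.printf("%.3f%n", cosine(polite, alligator));
    }
}
```

A vector database does essentially this comparison, but at scale and with indexes that avoid scanning every stored vector.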
&lt;h4&gt;
  
  
  Decoders
&lt;/h4&gt;

&lt;p&gt;These kinds of models take a sequence of words and output the next word, based on a probability distribution over the vocabulary that the model computes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is important to understand that the Decoder produces only a single token at a time!&lt;/strong&gt; We can invoke a decoder to generate as many new tokens as we want.&lt;br&gt;
In other words, to generate a sequence of new tokens, we first feed the decoder model an initial sequence of tokens (the prompt) and invoke it to produce the next token.&lt;/p&gt;
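&lt;p&gt;The token-by-token loop described above can be sketched like this (the &lt;code&gt;nextToken&lt;/code&gt; method is a hypothetical stand-in for a real decoder model):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the autoregressive loop: the decoder emits ONE token per call,
// and that token is appended to the context before the next call.
public class AutoregressiveLoop {
    static final String EOS = "<eos>";

    // Hypothetical stand-in for a real decoder model: a canned lookup table.
    static String nextToken(List<String> context) {
        Map<String, String> table = Map.of("a", "dog", "dog", EOS);
        return table.getOrDefault(context.get(context.size() - 1), EOS);
    }

    static List<String> generate(List<String> prompt, int maxNewTokens) {
        List<String> context = new ArrayList<>(prompt);
        for (int i = 0; i < maxNewTokens; i++) {
            String token = nextToken(context);
            if (token.equals(EOS)) break;   // stop at end-of-sequence
            context.add(token);             // feed the new token back in
        }
        return context;
    }

    public static void main(String[] args) {
        System.out.println(generate(List.of("They", "sent", "me", "a"), 10));
        // prints [They, sent, me, a, dog]
    }
}
```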
&lt;h4&gt;
  
  
  Encoders — Decoders
&lt;/h4&gt;

&lt;p&gt;These kinds of models encode a sequence of words and use the encoding to output the next word.&lt;/p&gt;

&lt;p&gt;Encoder-decoder models are typically used for sequence-to-sequence tasks, like translation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv359hvhcl37kv6tdm6ac.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv359hvhcl37kv6tdm6ac.webp" alt="Translation encoder-decoder model representation" width="720" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The translation workflow: we send the English tokens to the model; the encoder receives them and embeds the tokens and the whole sentence. The embeddings are then passed to the decoder. Notice the self-referential loop on the decoder: after generating a token, that token is passed back into the decoder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectures at a glance&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Encoders&lt;/th&gt;
&lt;th&gt;Decoders&lt;/th&gt;
&lt;th&gt;Encoder-decoder&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding text&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstractive QA&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extractive QA&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Translation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative writing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstractive Summarization&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extractive Summarization&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forecasting&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Prompting and Prompt Engineering
&lt;/h3&gt;
&lt;h4&gt;
  
  
  In-context Learning and Few-shot Prompting
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;In-context learning — conditioning (prompting) an LLM with instructions and/or demonstrations of the task it is meant to complete&lt;/li&gt;
&lt;li&gt;k-shot prompting — explicitly providing k examples of the intended task in the prompt&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freew9524k9nw9yjc33ox.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Freew9524k9nw9yjc33ox.webp" alt="three-shot promting (example of k-shot-prompting)" width="608" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see a k-shot prompting example where we tell the model to translate by providing some examples (in this case, three-shot).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt;&lt;br&gt;
Few-shot prompting is widely believed to improve results over 0-shot prompting.&lt;/p&gt;
&lt;/blockquote&gt;
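&lt;p&gt;Assembling a k-shot prompt is plain string concatenation; here is a minimal sketch (the instruction, examples, and arrow format are made up for illustration):&lt;/p&gt;

```java
import java.util.List;

// Sketch of building a k-shot prompt: the demonstrations are simply
// concatenated into the prompt text before the actual query (three-shot here).
public class FewShotPrompt {
    record Example(String input, String output) {}

    static String build(String instruction, List<Example> shots, String query) {
        StringBuilder sb = new StringBuilder(instruction).append("\n\n");
        for (Example e : shots) {
            sb.append(e.input()).append(" -> ").append(e.output()).append("\n");
        }
        // Leave the answer slot open for the model to complete.
        return sb.append(query).append(" -> ").toString();
    }

    public static void main(String[] args) {
        System.out.println(build(
            "Translate English to French:",
            List.of(new Example("cheese", "fromage"),
                    new Example("bread", "pain"),
                    new Example("milk", "lait")),
            "water"));
    }
}
```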


&lt;h4&gt;
  
  
  Advanced Prompting Strategies
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Chain-of-Thought&lt;/strong&gt; (CoT) — Prompt the LLM to emit intermediate reasoning steps&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can this work at all? Remember, when generating text the model works word by word, one word at a time. It doesn’t have a high-level plan for how to solve the problem.&lt;/strong&gt;&lt;br&gt;
This is precisely why Chain-of-Thought (CoT) is so effective. By explicitly instructing the model to “think step-by-step,” we force it to generate its reasoning process as part of the output. Each new word it generates is conditioned on the reasoning steps it has already written down. This creates a logical sequence that guides the model toward a more accurate conclusion. Instead of trying to jump straight to the answer — which is difficult for complex problems — the model externalizes its thought process, allowing it to break the problem down and build upon its own intermediate conclusions. It’s like a student showing their work on a math problem; writing down the steps helps avoid errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4hx6ha1fcjoobwn29mm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4hx6ha1fcjoobwn29mm.webp" alt="Example for Chain of Thought" width="720" height="134"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Least-to-most&lt;/strong&gt; — Prompt the LLM to decompose the problem and solve, easy-first&lt;/p&gt;

&lt;p&gt;This strategy builds on Chain-of-Thought by prompting the model to first break a complex problem into a series of simpler subproblems and then solve them in sequence. This is particularly useful for tasks where one step logically depends on the answer to a previous one. The key is to guide the model to tackle the easiest parts first, creating a foundation for solving the more difficult parts. This reduces the cognitive load and improves the chances of arriving at a correct final answer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Query:&lt;/strong&gt; “&lt;em&gt;If a car travels at 60 mph for 30 minutes and then gets stuck in traffic for 15 minutes before traveling at 40 mph for another 15 minutes, what is its average speed for the entire trip?&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least-to-Most Prompting:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Prompt 1 (Decomposition):&lt;/strong&gt; “&lt;em&gt;Break down the problem of calculating the car’s average speed into smaller steps.&lt;/em&gt;”&lt;br&gt;
&lt;strong&gt;Model’s Response:&lt;/strong&gt; “&lt;em&gt;1. Calculate the distance traveled in the first leg. 2. Calculate the distance traveled in the second leg. 3. Calculate the total distance. 4. Calculate the total time. 5. Divide total distance by total time.&lt;/em&gt;”&lt;br&gt;
&lt;strong&gt;Prompt 2 (Solving):&lt;/strong&gt; “&lt;em&gt;Great. Now solve each step.&lt;/em&gt;”&lt;br&gt;
&lt;strong&gt;Model solves sequentially, leading to the correct average speed.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;strong&gt;Step-back&lt;/strong&gt; — Prompt the LLM to identify high-level concepts pertinent to a specific task&lt;/p&gt;

&lt;p&gt;Step-Back prompting encourages the model to generalize and abstract away from the specific details of a question to consider the broader principles or concepts at play. Instead of getting bogged down by the specifics, the model is asked to “take a step back” and think about the fundamental knowledge required to answer the question. It then uses this high-level understanding to formulate a more robust and accurate answer. This is especially effective for complex reasoning tasks and for questions where the details might be misleading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6qyosfwtes6zav8c68j.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6qyosfwtes6zav8c68j.webp" alt="Example for Step-back prompting" width="720" height="165"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Issues with Prompting
&lt;/h3&gt;

&lt;p&gt;Prompting models, while a powerful tool, carries several inherent risks that are important to understand and manage. These risks range from security vulnerabilities to ethical dilemmas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Injection&lt;/strong&gt;&lt;br&gt;
Prompt injection is one of the most significant security risks. It is a form of attack where a malicious actor intentionally crafts a prompt to trick an AI model into ignoring its original instructions and developer guidelines. The model often cannot distinguish between developer instructions and user input, allowing an attacker to take control with a cleverly crafted command. For example, it can be manipulated to leak confidential data or generate misinformation and harmful content. This type of attack does not require advanced technical skills; it is carried out simply by deceiving the model using natural language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inherent Biases and Prejudices&lt;/strong&gt;&lt;br&gt;
Generative models operate based on the patterns present in their training data. If this data contains social, cultural, or other prejudices, the model will not only reproduce but can also amplify these biases. This can manifest, for instance, in the model consistently associating certain professions with a specific gender or reinforcing racial and cultural stereotypes. Such biased outputs can perpetuate discriminatory practices in areas like hiring or lending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Misinformation and “Hallucinations”&lt;/strong&gt;&lt;br&gt;
Generative AI models are prone to confidently stating falsehoods. This phenomenon is known as “hallucination.” Since models fundamentally predict the next most likely word based on statistical patterns, their responses do not necessarily have a factual basis. This can be particularly dangerous when users rely on AI-generated content for critical decisions, such as for financial or medical advice. Malicious actors can also intentionally use these models to create convincing fake news and disinformation campaigns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Security and Privacy Risks&lt;/strong&gt;&lt;br&gt;
When users write prompts, they might inadvertently include confidential or personally identifiable information (PII). There is a risk that the model could memorize this information and later reveal it in a response to another user. The risk is especially high with cloud-based or third-party AI services, where input data may not be handled properly, leading to privacy breaches and legal violations (such as GDPR).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating Malicious Content&lt;/strong&gt;&lt;br&gt;
Without proper safety restrictions, models can be used to create offensive, inappropriate, or illegal content. Attackers can generate malicious code or sophisticated phishing emails to target other users or systems, using the AI as a tool to scale their malicious activities.&lt;/p&gt;


&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;While prompting is powerful, it can be insufficient when you need an LLM to become a true expert in a specific domain or perform a highly specialized task. This is where &lt;strong&gt;training&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;As opposed to prompting (which gives the model context), training &lt;strong&gt;permanently changes the model’s internal parameters&lt;/strong&gt;. Think of it as teaching the model a new skill, not just giving it notes for a single test. At a high level, the process involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Giving the model an input.&lt;/li&gt;
&lt;li&gt;Letting it guess the corresponding output (e.g., a sentence completion or an answer).&lt;/li&gt;
&lt;li&gt;Comparing its guess to the “correct” answer and slightly adjusting its parameters so it does better next time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These adjustments change how the model “thinks” and hopefully improve its performance on your specific task. There are several ways to do this, ranging from massive undertakings to highly efficient tweaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continual Pre-training: Expanding the Knowledge Base&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; This technique continues the initial, broad training of an LLM, but using a large corpus of text from a new, specialized domain (e.g., legal documents, medical research, or your company’s internal wiki). You are still just asking the model to predict the next word, but on this new, focused data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters Modified:&lt;/strong&gt; All of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Required:&lt;/strong&gt; A large amount of &lt;strong&gt;unlabeled&lt;/strong&gt; domain-specific text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; When the model lacks fundamental knowledge about a specific field. You aren’t teaching it a task, you are teaching it a subject.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Full Fine-Tuning (FT): Teaching a Specific Skill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; This is the classic way to train a model for a specific task. You take a pre-trained model and train it further on a dataset of examples that show exactly what you want it to do (e.g., thousands of question-answer pairs for a customer service bot).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters Modified:&lt;/strong&gt; All of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Required:&lt;/strong&gt; A high-quality, &lt;strong&gt;labeled&lt;/strong&gt;, task-specific dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Achieving the highest possible performance on a specific task when you have a large budget and a good dataset. However, it is computationally expensive and risks “catastrophic forgetting,” where the model loses some of its general capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Parameter-Efficient Fine-Tuning (PEFT): The Smart Middle Ground&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To get the benefits of fine-tuning without the enormous cost, several PEFT methods have emerged. The core idea is to freeze the original LLM’s billions of parameters and only train a small number of new or specific ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method A: LoRA (Low-Rank Adaptation)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; LoRA is a popular PEFT method where small, trainable “adapter” layers are inserted into the model. The original model remains frozen, and only these tiny new layers are trained. It’s like adding specialized tuning knobs to a complex engine instead of rebuilding it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters Modified:&lt;/strong&gt; Only a tiny fraction of new, added parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Required:&lt;/strong&gt; Labeled, task-specific data (often less than full FT).&lt;/li&gt;
&lt;/ul&gt;
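&lt;p&gt;The core idea of LoRA can be sketched on tiny matrices: the frozen weight W is left untouched, and the effective weight becomes W + B·A, where only the small matrices A and B are trained (all values below are made up for illustration):&lt;/p&gt;

```java
import java.util.Arrays;

// Sketch of the LoRA idea on tiny matrices: W stays frozen; only the
// low-rank factors B (m x r) and A (r x n) are trained. With r = 1 they
// hold far fewer numbers than the m x n entries of W.
public class LoraSketch {
    static double[][] matmul(double[][] x, double[][] y) {
        int m = x.length, n = y[0].length, k = y.length;
        double[][] out = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int t = 0; t < k; t++)
                    out[i][j] += x[i][t] * y[t][j];
        return out;
    }

    static double[][] effectiveWeight(double[][] w, double[][] b, double[][] a) {
        double[][] delta = matmul(b, a);            // low-rank update B·A
        double[][] out = new double[w.length][w[0].length];
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[0].length; j++)
                out[i][j] = w[i][j] + delta[i][j];  // W itself is never changed
        return out;
    }

    public static void main(String[] args) {
        double[][] w = {{1, 0}, {0, 1}};   // frozen 2x2 weight
        double[][] b = {{0.5}, {0.5}};     // 2x1, trained
        double[][] a = {{0.1, 0.2}};       // 1x2, trained
        System.out.println(Arrays.deepToString(effectiveWeight(w, b, a)));
    }
}
```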

&lt;p&gt;&lt;strong&gt;Method B: Soft Prompting (or Prompt Tuning)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; This technique focuses on the input. It freezes the entire model and instead learns a special “soft prompt” — a sequence of numerical values that are prepended to your actual prompt. You can think of these as perfect, computer-generated keywords that are learned during training to steer the model toward the correct output for your task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters Modified:&lt;/strong&gt; A small number of new parameters that represent the soft prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Required:&lt;/strong&gt; Labeled, task-specific data.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Training Style&lt;/th&gt;
&lt;th&gt;Parameters Modified&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cont. Pre-training&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;Unlabeled&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Adapting to a new knowledge domain.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full Fine-Tuning&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;Labeled&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Max performance on a specific task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PEFT (e.g., LoRA)&lt;/td&gt;
&lt;td&gt;Few (new)&lt;/td&gt;
&lt;td&gt;Labeled&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Cost-effective task specialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft Prompting&lt;/td&gt;
&lt;td&gt;Few (new)&lt;/td&gt;
&lt;td&gt;Labeled&lt;/td&gt;
&lt;td&gt;Very Low&lt;/td&gt;
&lt;td&gt;Efficiently tuning for many tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Decoding
&lt;/h3&gt;

&lt;p&gt;Decoding is the process an LLM uses to select words from a probability distribution to generate text. After the model processes an input, it doesn’t know the “right” word; it only knows the probability of every word in its vocabulary being the next one.&lt;/p&gt;

&lt;p&gt;Let’s take an example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I wrote to the zoo to send me a pet. They sent me a _______”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model produces a probability distribution over its entire vocabulary, which might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“lion = 0.03”; “elephant = 0.02”; “dog = 0.45”; “cat = 0.4”; “panther = 0.05”; “aligator = 0.01”; …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key question is: “How do we pick a word from this list?” This choice happens iteratively:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model computes the probability distribution.&lt;/li&gt;
&lt;li&gt;A word is selected using a decoding strategy.&lt;/li&gt;
&lt;li&gt;The chosen word is appended to the input text.&lt;/li&gt;
&lt;li&gt;The process repeats until the model generates an end-of-sequence (EOS) token or reaches its maximum length.&lt;/li&gt;
&lt;/ol&gt;
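&lt;p&gt;The four steps above can be sketched as a simple loop. This is a conceptual sketch only: &lt;code&gt;next_token_distribution&lt;/code&gt; is a hypothetical stand-in for a real model, not an actual API.&lt;/p&gt;

```python
import random

# Hypothetical stand-in for a real model: given the text so far,
# return a probability distribution over the next word.
def next_token_distribution(text):
    if text.endswith("a"):
        return {"dog": 0.45, "cat": 0.40, "panther": 0.05, "lion": 0.03, "EOS": 0.07}
    return {"EOS": 0.99, "dog": 0.005, "cat": 0.005}

def generate(prompt, max_tokens=20):
    text = prompt
    for _ in range(max_tokens):
        dist = next_token_distribution(text)        # 1. compute the distribution
        words = list(dist)
        weights = list(dist.values())
        word = random.choices(words, weights=weights)[0]  # 2. decoding strategy (here: sampling)
        if word == "EOS":                           # 4. stop at the end-of-sequence token
            break
        text = text + " " + word                    # 3. append the chosen word and repeat
    return text

print(generate("They sent me a"))
```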

&lt;p&gt;There are two main families of decoding strategies:&lt;/p&gt;

&lt;h4&gt;
  
  
  1.) Greedy Decoding: The Direct Path
&lt;/h4&gt;

&lt;p&gt;This is the simplest and most direct strategy. At each step, we simply pick the word with the &lt;strong&gt;highest probability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the example above, it selects “dog” (probability 0.45).&lt;/p&gt;

&lt;p&gt;The next input becomes:&lt;br&gt;
“I wrote to the zoo to send me a pet. They sent me a dog ______”&lt;/p&gt;

&lt;p&gt;The model then generates a new distribution. Let’s say the highest probability is now for the End-of-Sequence token (EOS = 0.99). Greedy decoding selects it, and the generation stops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Distribution: 
“EOS = 0.99”; “elephant = 0.001”; “dog = 0.001”; “cat = 0.001”; “panther = 0.005”; “aligator = 0.01”; …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Fast, predictable, and produces the most “likely” output.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Can be repetitive and boring. It might miss a more creative or coherent sentence by always choosing the locally optimal word, without considering the global sentence structure.&lt;/p&gt;
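&lt;p&gt;In code, greedy decoding over the toy distribution is just an argmax. A minimal sketch, using the example numbers from this article:&lt;/p&gt;

```python
distribution = {"lion": 0.03, "elephant": 0.02, "dog": 0.45,
                "cat": 0.40, "panther": 0.05, "alligator": 0.01}

# Greedy decoding: at every step, take the single most probable word.
greedy_pick = max(distribution, key=distribution.get)
print(greedy_pick)  # dog
```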

&lt;h4&gt;
  
  
  2.) Sampling: Introducing Controlled Randomness
&lt;/h4&gt;

&lt;p&gt;To produce more creative and human-like text, we can introduce randomness. Instead of always picking the top word, we sample from the probability distribution. Several parameters control this process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature&lt;/strong&gt;&lt;br&gt;
Temperature is the most important parameter for controlling randomness. It is a value typically between &lt;strong&gt;0.0 and 2.0&lt;/strong&gt;. It “re-shapes” the probability distribution before sampling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low Temperature (e.g., 0.2):&lt;/strong&gt; This makes the distribution “peakier.” The probability of high-probability words (like “dog” and “cat”) gets boosted, while low-probability words are suppressed even further. As temperature approaches 0, it becomes identical to greedy decoding.&lt;br&gt;
&lt;strong&gt;Use when:&lt;/strong&gt; You want factual, grounded, and predictable answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Temperature (e.g., &amp;gt; 1.0):&lt;/strong&gt; This “flattens” the distribution, making the probabilities of words more uniform. Rare words have a higher chance of being selected.&lt;br&gt;
&lt;strong&gt;Use when:&lt;/strong&gt; You want creative, diverse, and sometimes surprising output, like for writing a story or brainstorming ideas.&lt;/li&gt;
&lt;/ul&gt;
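&lt;p&gt;A minimal sketch of how temperature re-shapes the toy distribution. This is an illustration, not a real model API: real models divide the raw logits by the temperature before the softmax, and here the logits are simply recovered as log-probabilities.&lt;/p&gt;

```python
import math

def apply_temperature(probs, temperature):
    # Rescale the (log-)probabilities by the temperature, then re-normalize.
    scaled = {w: math.log(p) / temperature for w, p in probs.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {w: math.exp(s) / total for w, s in scaled.items()}

dist = {"dog": 0.45, "cat": 0.40, "panther": 0.05, "lion": 0.03,
        "elephant": 0.02, "alligator": 0.01}

cold = apply_temperature(dist, 0.2)  # peakier: "dog" gets boosted
hot = apply_temperature(dist, 2.0)   # flatter: rare words gain probability
print(round(cold["dog"], 2), round(hot["dog"], 2))
```

&lt;p&gt;Running it shows “dog” climbing well above its original 0.45 at low temperature and dropping below it at high temperature.&lt;/p&gt;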

&lt;p&gt;&lt;strong&gt;Top-K and Top-P (Nucleus) Sampling&lt;/strong&gt;&lt;br&gt;
These methods are often used with temperature to further refine the word selection. They prevent the model from picking truly nonsensical words by first filtering the vocabulary list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top-K Sampling:&lt;/strong&gt; Consider only the K most likely words. For example, if K=3, you would only sample from “dog”, “cat”, and “panther”, ignoring all others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-P (Nucleus) Sampling:&lt;/strong&gt; A more dynamic approach. You choose the smallest set of top words whose cumulative probability is at least P. For example, if P=0.90, you would take “dog” (0.45) and “cat” (0.40), whose sum is 0.85. Since that’s less than 0.90, you’d add the next word, “panther” (0.05), bringing the total to 0.90. You then sample only from that small group.&lt;/li&gt;
&lt;/ul&gt;
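&lt;p&gt;Both filters can be sketched in a few lines. A toy illustration reproducing the “panther” example with this article’s numbers:&lt;/p&gt;

```python
def top_k(probs, k):
    # Keep only the k most probable words.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p(probs, p):
    # Keep the smallest set of top words whose cumulative probability reaches p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        total += prob
        if total >= p:
            break
    return kept

dist = {"dog": 0.45, "cat": 0.40, "panther": 0.05, "lion": 0.03,
        "elephant": 0.02, "alligator": 0.01}

print(sorted(top_k(dist, 3)))     # ['cat', 'dog', 'panther']
print(sorted(top_p(dist, 0.90)))  # ['cat', 'dog', 'panther']
```

&lt;p&gt;After filtering, sampling (usually with temperature) proceeds over the reduced set only.&lt;/p&gt;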

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt;&lt;br&gt;
Decoding is a balance between coherence and creativity. For factual applications, use &lt;strong&gt;greedy decoding&lt;/strong&gt; or &lt;strong&gt;sampling with a low temperature&lt;/strong&gt;. For creative tasks, &lt;strong&gt;increase the temperature&lt;/strong&gt; and use &lt;strong&gt;Top-K or Top-P&lt;/strong&gt; to guide the randomness and prevent nonsensical outputs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading this article. If you have any questions, feel free to reach out to me!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;More articles on LLM applications are coming soon; follow me to get notified.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
    </item>
  </channel>
</rss>
