Table of Contents
- Introduction
- Why chunking
- Why contextual chunking
- Why fixed-size chunking is insufficient
- How to implement contextual chunking
- Docling
- Google Gemini 2.5 Pro/Flash
- Comparison of Chunking Approaches
- Conclusion
Introduction
In many enterprise settings, documents are highly structured and consistently formatted, with information organised hierarchically. For example, there is usually a contents page, and related information is grouped into sections. Section numbers run in order and typically follow a convention such as "d.d.d Header".
When building RAG over thousands of such documents, chunking is necessary so that the retrieved chunks are of manageable length when passed to LLMs as context. Contextual chunking has significant advantages over simple fixed-size chunking in ensuring both the relevancy and the completeness of retrieval, and in an enterprise setting where precision and accuracy are critical, getting chunking right can even decide the fate of an organisation's digital transformation efforts.
This is a hands-on article with code that you can run as you read. You may also wish to download my Jupyter notebook directly here.
Why chunking
Given the advent of LLMs such as Google Gemini 2.5 Flash/Pro with a 1-million-token context window, chunking may not be necessary if you are working with only around 10 documents. In fact, Gemini can produce very accurate and comprehensive responses using such a brute-force method, and it is also easy to implement.
The following code demonstrates how we can use Google Gemini to understand a document. First, we set up the Google Gemini client.
from dotenv import load_dotenv
import os
# --- DEBUGGING STEP ---
# Print the current working directory to see where Python is looking.
print(f"Current working directory: {os.getcwd()}")
# Load environment variables from the .env file
load_dotenv()
# Get the API key from the environment variables
# The string "GENAI_API_KEY" must match the variable name in your .env file
api_key = os.getenv("GENAI_API_KEY")
# Check if the API key is loaded correctly
if not api_key:
    raise ValueError("No API key found. Please set the GENAI_API_KEY in your .env file.")
If the GENAI_API_KEY is present and loaded, we should see only the current working directory printed in the output:
Current working directory: /home/lewis/github/rag-strategies
Next, we can use the following code to load the document "Functional Specification Document: FIRDS -- Reference Data". I chose this document because it contains structures such as headers defining each section, and relatively complex content such as multi-page tables. For example, in Annex 1c: Reference Data Content and Consistency Validation Rules, there is a table that spans from page 183 to 187. We can ask Gemini the query "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules" to check whether it can retrieve the table in its entirety.
from google import genai
from google.genai import types
import pathlib
import httpx
client = genai.Client(api_key=api_key)
# Retrieve and encode the PDF bytes
file = ["firds_reference_data_functional_specifications_v2.10.pdf"]
current_dir = os.getcwd()
doc_path = os.path.join(current_dir, "resources", file[0])
filepath = pathlib.Path(doc_path)
prompt = "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(
            data=filepath.read_bytes(),
            mime_type='application/pdf',
        ),
        prompt,
    ],
)
print(response.text)
Even with Gemini 2.5 Flash, we can see that the model understands the query and retrieves the entire five-page table, fulfilling the requirements of both relevancy and completeness. The response is reproduced below:
Here is the entire table of **Annex 1c: Reference Data Content and Consistency Validation Rules** from page 185 of the document:
**TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES**
| Control executed by the system | Error code | Error Message | Concerned Fields |
| :----------------------------- | :--------- | :------------ | :--------------- |
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI-based validation matrix. | INS-101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS-102 | The following mandatory fields are not reported: “List of RTS23 number Id of missing field(s)”. | RTS field 3 vs all other RTS fields |
| Check that Non-Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS-103 | The following Non-Applicable fields are wrongly reported: “List of RTS23 number Id of N/A field(s)”. | All RTS fields |
| **The following checks are performed only in case checks above are passed.** | | | |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL OR is later or equal to the current date). | INS-105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS-107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity <br> RTS field 6 |
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL. | INS-108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: (ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| The value of the “Direct Underlying issuer” shall exist in the LEI reference table and comply with the following conditions: (ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
| Check the last digit of the ISIN code of the “instrument identification code” according to the algorithm of ISIN validation. | INS-114 | The ISIN code of the instrument identification code is invalid. | RTS field 1 |
| Check the last digit of the ISIN code of the “underlying instrument” should be valid according to the algorithm of ISIN validation. | INS-115 | The ISIN code of the underlying is invalid. | RTS field 26a, 26b, 26c |
| Check the last digit of the ISIN code of the Identifier of the “Index/Benchmark of a floating rate Bond” should be valid according to the algorithm of ISIN validation. | INS-116 | The ISIN code of the Index/Benchmark of a floating rate Bond is invalid. | RTS field 19 |
| The “Date of admission to trading or date of First trade” should a valid date and in a sensible range (no prior than 31-12-1899). | INS-117 | The “Date of admission to trading or date of First trade” is not a consistent date. | RTS field 11 |
| The Termination Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-118 | The Termination Date is not a consistent date. | RTS field 12 |
| The Termination Date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-119 | The Termination Date is earlier than the “Date of admission to trading or date of First trade”. | RTS field 11, 12 |
| The Maturity Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-120 | The Maturity Date is not a consistent date. | RTS field 15 |
| The Maturity Date should be equal to or later than “Date of admission to trading or date of First trade”. | INS-121 | The Maturity Date and Date of admission to trading or date of First trade are not consistent. | RTS field 11, 15 |
| The Expiry Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-122 | The Expiry Date is not a consistent date. | RTS field 24 |
| The Expiry date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-123 | The Expiry Date and The Date of admission to trading or date of First trade are not consistent. | RTS field 11, 24 |
| Field “Option Type” shall only contain value “PUTO” when the “Instrument Classification” refers to the following CFI Codes: OP\*\*\*\* (Put Options). | INS-124 | Invalid “PUTO” Option Type | RTS field 3, 30 |
| Field “Option Type” shall only contain value “CALL” when the “Instrument Classification” refers to the following CFI Codes: OC\*\*\*\*(Call Options). | INS-125 | Invalid “CALL” Option Type | RTS field 3, 30 |
| The termination date should be populated in case Maturity date/Expiry date is populated and is strictly earlier than the current reporting date. | INS-126 | The Termination date is not populated for an expired/matured instrument. N.B.: that check if failed generates a warning only. | RTS field 12, {15 or 24} |
| The termination date should be earlier or equal in case Expiry date/Maturity date is populated. | INS-127 | The Termination date and Expiry date/Maturity date are not consistent. N.B.: that check if failed generates a warning only. | RTS field 12, {15 or 24} |
| The field listed in Table 1 BRD 43. shall be consistent with the values provided by the Relevant competent Authority. | INS-128 | The following fields are not consistent with the one provided by RCA :<<Upcoming RCA>>, RCA\_MIC :<<MIC>>(<<MIC’s country>>): List of RTS23 number Id of missing field(s)”. N.B.: that check if failed generates a warning only. | RTS fields used for consistency checks as stated in Table 21 - RTS23 Fields table. |
| The currency of the Total issued nominal amount shall be the same as the currency of nominal value | INS-129 | The currency of the Total issued nominal amount is not the same as the currency of nominal value | RTS Field 14. Currency <br> RTS Field 16. |
| The ISIN-MIC combination, received for a cancellation record, should exists in FIRDS DB. | INS-130 | The ISIN-MIC combination, received from a cancellation record, doesn’t exists in FIRDS DB | RTS field 1,6 |
While simple to code and effective in the quality of its responses, the brute-force approach of loading the entire document as context is not a silver bullet. For enterprise use cases with thousands of documents, whose combined size is well above 1 million tokens, chunking remains a necessary evil.
In addition, the tens of thousands of tokens representing the document are consumed on every query as input tokens, adding a significant fixed cost to each call to the LLM.
Using Gemini 2.5 Pro as an example, input tokens are charged at about $1.25 USD per million. For a document of 30,000 tokens, this amounts to a baseline cost of $0.0375 USD per query. Had we used RAG with semantic similarity search instead, the total tokens of the top retrieved chunks would be far fewer than 30,000.
(Note: some providers offer cost savings by caching contexts such as documents, so that including them in the input prompt is much cheaper on subsequent queries.)
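As a quick sanity check, the arithmetic above can be sketched in a few lines. The price of $1.25 per million input tokens and the 30,000-token document are the assumptions from the paragraph above; the retrieved-chunks figure is a purely hypothetical illustration:

```python
# Assumed pricing and sizes from the discussion above (illustrative only).
PRICE_PER_MILLION_INPUT_TOKENS = 1.25  # USD, Gemini 2.5 Pro input tokens
DOC_TOKENS = 30_000                    # whole document passed as context
RAG_TOKENS = 3 * 800                   # hypothetical: top-3 chunks of ~800 tokens each

def input_cost_usd(tokens: int) -> float:
    """Cost of sending `tokens` input tokens at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"Brute force per query: ${input_cost_usd(DOC_TOKENS):.4f}")
print(f"RAG per query:         ${input_cost_usd(RAG_TOKENS):.4f}")
```

At 10,000 queries, that is $375 versus $30 in baseline input cost under these assumptions, before any caching discounts.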
Why contextual chunking
Documents have structures that may not be sufficiently captured by methods such as fixed-size chunking. For example, a typical document may have a header for each section, and because section lengths vary, we face either of the following two problems when deciding the optimal chunk size:
1) Relevancy: shorter sections may have unnecessary information included in their chunks;
2) Completeness: longer sections may be broken up into too many small chunks, resulting in incomplete retrieval.
By chunking according to the characteristics of each document, we can better ensure that the information retrieved for a query is both complete and relevant.
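As a minimal sketch of the idea, assuming documents whose sections begin with numbered headers in the "d.d.d Header" convention mentioned earlier (the regex and helper below are illustrative, not from any library):

```python
import re

# Matches numbered section headers such as "3.3.4 Update the Table"
# at the start of a line ("d.d.d Header" convention; illustrative only).
HEADER_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$", re.MULTILINE)

def split_by_sections(text: str) -> list[str]:
    """Split a document into one chunk per numbered section."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]

doc = """1.1 Scope
This section is short.
1.2 Validation Rules
A long section with a multi-page table stays in a single chunk,
however many lines it spans.
"""
for chunk in split_by_sections(doc):
    print(repr(chunk.splitlines()[0]))
```

Each chunk is exactly one section, however short or long, which is what gives us relevancy for short sections and completeness for long ones.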
Why fixed-size chunking is insufficient
While simple to implement, fixed-size chunking may not sufficiently capture hierarchical relationships, such as a multi-page table that falls under a single section. Using the same FIRDS document as an example, we can implement fixed-size chunking with the following code.
First, let us import the dependencies and use sentence-transformers/all-mpnet-base-v2 as the embedder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the reranker, and "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules" as the query.
import os
from fpdf import FPDF
# LangChain components
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Define constants
file = ["firds_reference_data_functional_specifications_v2.10.pdf"]
current_dir = os.getcwd()
PDF_PATH = os.path.join(current_dir, "resources", file[0])
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
USER_QUERY = "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
The following code does the heavy lifting, which involves:
- Loading the PDF and chunking it into fixed sizes of 800 characters with a chunk overlap of 100.
- Embedding each chunk.
- Storing the chunks in a vector database. We use FAISS in this example.
- Retrieving the top 10 relevant chunks.
- Reranking the retrieved chunks and returning the top 3.
# ==============================================================================
# STEP 1: LOAD AND CHUNK THE DOCUMENT
# ==============================================================================
print("\n--- Step 1: Loading and Chunking PDF ---")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,
)
chunks = text_splitter.split_documents(documents)
print(f"PDF loaded and split into {len(chunks)} chunks.")
# ==============================================================================
# STEP 2: EMBED THE CHUNKS
# ==============================================================================
print(f"\n--- Step 2: Embedding Chunks using '{EMBEDDING_MODEL_NAME}' ---")
# This will download the model from Hugging Face on its first run.
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print("Embedding model loaded.")
# ==============================================================================
# STEP 3: STORE IN A VECTOR DATABASE
# ==============================================================================
print("\n--- Step 3: Storing chunks in FAISS in-memory vector database ---")
# The from_documents method handles embedding and storing in one step.
vector_store = FAISS.from_documents(chunks, embeddings)
print("Chunks embedded and stored in FAISS.")
# ==============================================================================
# STEP 4: RETRIEVE RELEVANT CHUNKS
# ==============================================================================
print("\n--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---")
print(f"\nUser Query: \"{USER_QUERY}\"")
# Retrieve the top 10 most similar chunks
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
print("\n--- Top 10 Retrieved Chunks ---")
initial_results = base_retriever.get_relevant_documents(USER_QUERY)
print(f"\n--- Top 5 Initial Results (from vector search alone) ---")
for i, chunk in enumerate(initial_results[:5], 1):
    print(f"\n--- Initial Result {i} ---\n")
    print(chunk.page_content)
The output below shows the top 5 of the 10 chunks retrieved from FAISS. We can see that the table we are interested in, i.e. Annex 1c, is only ranked third (see "--- Initial Result 3 ---"):
--- Step 1: Loading and Chunking PDF ---
PDF loaded and split into 625 chunks.
--- Step 2: Embedding Chunks using 'sentence-transformers/all-mpnet-base-v2' ---
Embedding model loaded.
--- Step 3: Storing chunks in FAISS in-memory vector database ---
Chunks embedded and stored in FAISS.
--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---
User Query: "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
--- Top 10 Retrieved Chunks ---
--- Top 5 Initial Results (from vector search alone) ---
--- Initial Result 1 ---
ESMA REGULAR USE
33 / 216
Upcoming RCA The country of the Relevant Competent Authority of that instrument, as last
determined by the system for the upcoming publication.
Free-text fields
used for
consistency
checks
“Free-text fields used for consistency checks” fields in “RTS23 Fields table”
as listed in section 6.9 RTS23 Fields table.
Non-free-text
fields used for
consistency
checks
“Non-free-text fields used for consistency checks ” fields in “RTS23 Fields
table” as listed in section 6.9 RTS23 Fields table.
TABLE 6 - FIELDS OF REFERENCE FIELDS TABL E
Finally, as per section 3.3.10, the ESMA system updates, recursively based on the already existing
records, a new table called “Consistent Reference Data Table ” 4 as follows: for each record
--- Initial Result 2 ---
ESMA REGULAR USE
198 / 216
15 Annex 5 ISO reference data tables
15.1 Country reference data table
Field Name M/O Data field
description
Data field
Values
ISO Description Source
CountryCode M 2(a) ISO
3166
The 2-character ISO Country Code
identifier.
• data provider
• ESMA manual update
CountryName M 70(z) The ISO description of the country
name.
• data provider
• ESMA manual update
EEACountryFlag M TRUEFALSE
Indicator TRUE/FALSE Flag which indicates whether the
Country is EEA.
• ESMA manual update
• Default value is FALSE
ValidityStartDate M
Date
YYYYMMDD
Date at which the record becomes
valid Generated by the ESMA System
ValidityEndDate O
Date
YYYYMMDD
Date of which the records ends to be
valid Generated by the ESMA System
LastUpdatedDate M
--- Initial Result 3 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system
Error
code
Error Message
Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI-based validation matrix.
INS-101
The CFI code is not valid
against the CFI based
validation matrix.
RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory fields are reported
according to “CFI-based validations table”.
INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)”.
--- Initial Result 4 ---
ESMA REGULAR USE
35 / 216
In addition, the system shall have mechanisms in place to avoid that interfacing systems needing
access reference data during the 00:00 – 08:00 period retrieve inconsistent data due to ongoing updates
taking place during the post-processing phase. Proposals on the best approach will be expected from
the provider in charge of the technical specifications and development of the system5.
3.3.3 Perform Reference Data Content Validation
Goal The goal of this use case is for individual records within a received file
to be validated by ESMA.
Actors TV/SI (in the jurisdiction of a delegating NCA) - submits data
NCA (not delegating data collection in its jurisdiction) - submits data
The ESMA System – validates data
--- Initial Result 5 ---
as per contained in the Consistent Reference Data T able. In order to ensure security of the data
contained in the Consistent Reference data table, the public user will access a copy of that table, the
publication table, which is updated on daily basis during the publication process.
Ideally, we want our chunk to be returned as the top result, not buried behind less relevant chunks. We can use a reranker to improve the relevance ranking. The following code reranks the 10 retrieved chunks and returns the top 3.
# ==============================================================================
# STEP 5: RERANK RETRIEVED CHUNKS
# ==============================================================================
# The cross-encoder model will be downloaded on the first run.
# It takes the query and a list of documents and returns them, scored and re-ordered.
print(f"\n--- Initializing Reranker with '{RERANKER_MODEL_NAME}' ---")
model = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL_NAME)
reranker = CrossEncoderReranker(model=model, top_n=3)
# Create the full retrieval pipeline with the reranker
# The ContextualCompressionRetriever uses the base retriever to fetch documents
# and then the reranker to re-order them based on relevance.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
print("Reranking pipeline created.")
# Perform the final, reranked search
print("\n--- Performing search with reranking... ---")
reranked_chunks = compression_retriever.get_relevant_documents(USER_QUERY)
print("\n\n=========================================================")
print(f"--- Top 3 Reranked & Most Relevant Chunks ---")
print("=========================================================")
for i, chunk in enumerate(reranked_chunks, 1):
    print(f"\n--- Final Result {i} ---\n")
    print(chunk.page_content)
Based on the output of the top 3 reranked chunks, we can see that the chunk we are looking for is now ranked number 1, which is ideal (see "--- Final Result 1 ---"). However, while the correct chunk has been retrieved, the five-page table has been truncated partway through its third row, i.e. at "List of RTS23 number Id of missing field(s)". If we pass this as context to the LLM, it can only return a truncated table to the user instead of the full table as expected. In other words, relevancy is achieved, but completeness is not.
--- Performing search with reranking... ---
=========================================================
--- Top 3 Reranked & Most Relevant Chunks ---
=========================================================
--- Final Result 1 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system
Error
code
Error Message
Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI-based validation matrix.
INS-101
The CFI code is not valid
against the CFI based
validation matrix.
RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory fields are reported
according to “CFI-based validations table”.
INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)”.
--- Final Result 2 ---
address errors on previous submission.
Business
Rules
Table 33 - Reference Data Content and Consistency Validation Rules.
Assumptions N/A
3.3.4 Update the Received Reference Data Table
Goal
The goal of this use case is to update the Received Reference Data
Table according to a submitted record which passed the content
validation checks.
Actors The ESMA System.
Preconditions The ESMA System has performed the content validation on the submitted
record.
Trigger
The ESMA System has successfully validated the content of the submitted
record.
Postcondition The ESMA System has updated the Received Reference Data Table
according to the submitted record.
Normal Flow
(Referenced
records –
DATINS file
submission)
--- Final Result 3 ---
as per contained in the Consistent Reference Data T able. In order to ensure security of the data
contained in the Consistent Reference data table, the public user will access a copy of that table, the
publication table, which is updated on daily basis during the publication process.
How to implement contextual chunking
I will explore two methods of implementing contextual chunking:
1) Docling: According to the Docling Technical Report, it is "an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget".
2) Google Gemini 2.5 Pro/Flash: We can take advantage of Google Gemini's large context window to understand a single large document in its entirety, and use prompts to ask Gemini to generate metadata about the document which we can then use to chunk the document. While this method is relatively more expensive than specialised frameworks/models such as Docling, it is also more flexible, hence ideal for fast iteration and prototyping.
Docling
Docling offers a fully integrated solution that can parse, chunk, embed and ingest into a vector database.
We will focus on how well Docling can parse PDFs. Docling parses documents into a unified document representation called DoclingDocument, which captures content such as body text and headers, as well as layout information such as bounding boxes. The code to parse a PDF is below.
First, let us import the dependencies that we need for Docling.
# Import dependencies
import os
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, FormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
import time
from pathlib import Path
Next, let us list all the files we have in the resources folder; currently it contains only the single file firds_reference_data_functional_specifications_v2.10.pdf.
# List of files
current_dir = os.getcwd()
data_dir = os.path.join(current_dir, "resources")
files = os.listdir(data_dir)
We can then define the pipeline for Docling. Because our focus is on parsing tables, the key settings below enable Docling to parse tables properly.
# Pipeline configs
accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.accelerator_options = accelerator_options
Next, we shall set up the converter for Docling:
# Setup converter
converted = DocumentConverter(
allowed_formats=[InputFormat.PDF],
format_options={
InputFormat.PDF: FormatOption(
pipeline_cls=StandardPdfPipeline,
pipeline_options=pipeline_options,
backend=PyPdfiumDocumentBackend
)
}
)
Finally, we will parse the file from PDF into markdown. I am using an Nvidia RTX 4070 Super with 12GB VRAM, and it took about 76.35 seconds to parse the document. It may take longer if the pipeline is not GPU-enabled.
# Begin parsing
for file in files:
pdf_path = os.path.join(data_dir, file)
# Check if file exists
if not os.path.exists(pdf_path):
print(f"Error: File '{pdf_path}' does not exist.")
exit(1)
print(f"Parsing file '{pdf_path}'...")
start_time = time.time()
print("Converting PDF to text...")
conv_res = converted.convert(pdf_path)
print("Converting done.")
output_dir = Path("parsed")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_res.input.file.stem
# Save markdown
md_filename = output_dir / f"{doc_filename}.md"
conv_res.document.save_as_markdown(md_filename)
end_time = time.time() - start_time
print(f"Parsing done. Time elapsed: {end_time:.2f} seconds.")
If you are pulling my git repo, you may find the parsed markdown in the folder parsed. We can see that Docling is able to capture information from a complex technical document, such as headers, and represent structures such as tables accurately as markdown tables. However, Docling is not perfect in handling relatively more complex data such as multi-page tables, which are broken into multiple tables when they span across pages.
In addition, Docling also occasionally confuses page headers with section headers, which presents a significant problem because we will rely heavily on section headers as dividers for chunking. Using the example below for reference, only "## 9 Annex 1c: Reference Data Content and Consistency Validation Rules" should be regarded as a section header; "## ESMA REGULAR USE" should not be treated as one. As a result of mis-classifying "## ESMA REGULAR USE" as a section header, further post-processing is necessary before we can use section headers as the basis for chunking.
## 9 Annex 1c: Reference Data Content and Consistency Validation Rules
| Control executed by the system | Error code | Error Message | Concerned Fields |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI - based validation matrix. | INS - 101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS - 102 | The following mandatory fields are not reported: “ List of RTS23 number Id of missing field(s)” . | RTS field 3 vs all other RTS fields |
| Check that Non - Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS - 103 | The following Non Applicable fields are wrongly reported: “List of RTS23 number Id of N/A field(s)” . | All RTS fields |
| The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL | INS - 105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS - 107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity RTS field 6 |
<!-- image -->
## ESMA REGULAR USE
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL . | INS - 108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|---------------------------------------------------------------|--------------------|
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS - 109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS - 110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is | INS - 111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND | INS - 112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| "Pending transfer", "Pending archival}. The value of the “Direct Underlying issuer ” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND | INS - 113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
<!-- image -->
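One possible post-processing step is to strip known page headers that were mis-tagged as section headers before chunking. This is a sketch; the NOISE_HEADERS list is an assumption you would build per document (here seeded with the "ESMA REGULAR USE" page header from the example above):

```python
import re

# Page-header strings mis-tagged as section headers (assumed list for this document).
NOISE_HEADERS = {"ESMA REGULAR USE"}

def strip_noise_headers(markdown: str) -> str:
    """Remove markdown headings whose text matches a known page header."""
    kept = []
    for line in markdown.splitlines():
        match = re.match(r"^#+\s+(.*)$", line)
        if match and match.group(1).strip() in NOISE_HEADERS:
            continue  # drop the mis-classified page header
        kept.append(line)
    return "\n".join(kept)
```

After this pass, only genuine section headings such as "## 9 Annex 1c: ..." remain as chunk dividers.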
Google Gemini 2.5 Pro/Flash
We now revisit the use of Gemini by having it generate metadata about the document, which only needs to happen once. For example, we can use the following prompt to request Gemini to extract the section headers, and the page number representing the start of each section.
First, let us set up the environment with the API key:
from dotenv import load_dotenv
import os
# --- DEBUGGING STEP ---
# Print the current working directory to see where Python is looking.
print(f"Current working directory: {os.getcwd()}")
# Load environment variables from the .env file
load_dotenv()
# Get the API key from the environment variables
# The string "GENAI_API_KEY" must match the variable name in your .env file
api_key = os.getenv("GENAI_API_KEY")
# Check if the API key is loaded correctly
if not api_key:
raise ValueError("No API key found. Please set the GENAI_API_KEY in your .env file.")
Next, we create a few helper functions to read and extract the PDF:
def read_pdf_as_bytes(file_path):
try:
with open(file_path, "rb") as file:
pdf_bytes = file.read()
return pdf_bytes
except FileNotFoundError:
print(f"Error: File '{file_path}' not found.")
return None
def extract_text_from_pdf(pdf_path):
    # Returns a tuple of (raw PDF bytes, per-page extracted text).
    # The raw bytes are later passed to Gemini; the per-page text is used
    # when assembling the logical chunks.
    try:
        try:
            from PyPDF2 import PdfReader
        except ImportError:
            from pypdf import PdfReader
        page_text = []
        # Open and read PDF
        with open(pdf_path, "rb") as file:
            pdf_reader = PdfReader(file)
            for page_num, page in enumerate(pdf_reader.pages):
                text = page.extract_text()
                page_text.append({"page": page_num + 1, "text": text})
        return read_pdf_as_bytes(pdf_path), page_text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return None, []
The section below is key -- it contains the prompt for Gemini to extract the section headers and their corresponding page numbers. You may have to tweak this prompt slightly to suit each document's unique structure and header naming convention.
import json
from google import genai
from google.genai import types
def get_section_map_from_gemini(full_text):
print("Asking Gemini to identify the document structure...")
prompt = """
You are a technical document parser. Your task is to analyse the provided text from a PDF.
Identify all the file specification sections. A section typically starts with a pattern like "d.dd XXXXX", "d.d XXXXXX", "d XXXXXXX", or "d Annex dd: XXXXXXXX". These section headers are bolded.
Extract the following for each section found:
1. The full section title (e.g., '6.11 Rejection statistics table').
2. The page number where the section title appears.
Return the result as a JSON array of objects. Each object should have two keys: 'section_title' and 'start_page'.
Ensure the page number is an integer.
Example of a single JSON object in the array:
{
"section_title": "6.11 Rejection statistics table",
"start_page": 10
}
"""
client = genai.Client(api_key=api_key)
response = client.models.generate_content(
model="gemini-2.5-pro",
config={
'temperature': 0.0,
'response_mime_type': 'application/json'
},
contents=[
types.Part.from_bytes(
data=full_text,
mime_type='application/pdf'
),
prompt
]
)
try:
section_map = json.loads(response.text)
print(f"Gemini successfully identified {len(section_map)} sections.")
return section_map
except json.JSONDecodeError:
print("Error: Gemini did not return a valid JSON response.")
print(response.text)
return None
Next, the following helper function splits the PDF into logical chunks:
def create_logical_chunks(page_texts, section_map):
print("Creating logical chunks based on the section map...")
text_by_page = {p["page"]: p["text"] for p in page_texts}
chunks = []
sorted_sections = sorted(section_map, key=lambda x: x["start_page"])
for i, section in enumerate(sorted_sections):
start_page = section["start_page"]
section_title = section["section_title"]
end_page = None
if i + 1 < len(sorted_sections):
end_page = sorted_sections[i + 1]["start_page"]
if end_page is None or end_page < start_page:
end_page = len(page_texts)
chunk_text = ""
# we use end_page + 1 to overlap with one additional page, to handle the case where a single page has 2 sections
for page_num in range(start_page, end_page + 1):
if page_num in text_by_page:
chunk_text += text_by_page[page_num] + "\n"
# Clean up the chunk: find the start of the current section text
title_pos = chunk_text.find(section_title)
if title_pos != -1:
chunk_text = chunk_text[title_pos:]
# Create LangChain Document object
doc = Document(
page_content=chunk_text.strip(),
metadata={
"section_title": section_title,
"start_page": start_page,
"end_page": end_page
}
)
chunks.append(doc)
print(f"Created {len(chunks)} logical chunks.")
return chunks
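To see why the one-page overlap (end_page + 1) and the title-trim step matter, here is a minimal stand-alone illustration of the same logic using a toy two-page document and plain dicts instead of LangChain Document objects (the page contents and section titles are invented):

```python
page_texts = [
    {"page": 1, "text": "1 Intro\nSome intro text.\n2 Scope\nScope starts here."},
    {"page": 2, "text": "Scope continues."},
]
section_map = [
    {"section_title": "1 Intro", "start_page": 1},
    {"section_title": "2 Scope", "start_page": 1},  # two sections share page 1
]

text_by_page = {p["page"]: p["text"] for p in page_texts}
sections = sorted(section_map, key=lambda s: s["start_page"])
chunks = []
for i, section in enumerate(sections):
    start = section["start_page"]
    end = sections[i + 1]["start_page"] if i + 1 < len(sections) else len(page_texts)
    if end < start:
        end = len(page_texts)
    # Overlap one extra page so a section starting mid-page is not cut off.
    text = "\n".join(text_by_page[p] for p in range(start, end + 1) if p in text_by_page)
    pos = text.find(section["section_title"])
    if pos != -1:
        text = text[pos:]  # trim everything before this section's own title
    chunks.append({"title": section["section_title"], "text": text})
```

Without the overlap, the "2 Scope" chunk would end at page 1 and lose "Scope continues."; without the title-trim, it would begin with the tail of "1 Intro". The trade-off is some duplication: the "1 Intro" chunk still carries the start of "2 Scope".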
Finally, we can generate the metadata for the document, extracting the section headers and their corresponding page numbers:
# ==============================================================================
# STEP 1: LOAD AND CHUNK THE DOCUMENT
# ==============================================================================
print("\n--- Step 1: Loading and Chunking PDF ---")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
parsed_dir = "parsed"
os.makedirs(parsed_dir, exist_ok=True)
section_map_path = os.path.join(parsed_dir, "section_map.json")
full_doc_text, pages = extract_text_from_pdf(PDF_PATH)
if os.path.exists(section_map_path):
with open(section_map_path, "r") as f:
section_map = json.load(f)
print("Loaded existing section map.")
else:
section_map = get_section_map_from_gemini(full_doc_text)
if section_map:
with open(section_map_path, "w") as f:
json.dump(section_map, f, indent=2)
print("Saved section map to section_map.json.")
Using the prompt we provided above, Gemini generated the following metadata (relevant extract below). We can see that Gemini correctly identified "9 Annex 1c: Reference Data Content and Consistency Validation Rules" as the section header, and it did not confuse page headers with section headers, which is a significant improvement over Docling. If you pulled the GitHub repo, you may find this metadata in the parsed folder, in the file section_map.json.
{
"section_title": "8 Annex 1b: Format Validation Rules",
"start_page": 184
},
{
"section_title": "9 Annex 1c: Reference Data Content and Consistency Validation Rules",
"start_page": 185
},
{
"section_title": "10 Annex 1d: Non-working Days Content Validation Rules",
"start_page": 190
}
Given that Gemini is able to generate the metadata properly, we can now chunk the PDF according to the metadata, embed the chunks, and ingest the embeddings into a vector database. Note that from this point onwards, no LLM is required, hence the only significant cost involved is using Gemini to generate the metadata, which is a one-time cost.
# ==============================================================================
# STEP 1: CHUNK THE PDF
# ==============================================================================
chunks = create_logical_chunks(pages, section_map)
print(f"PDF loaded and split into {len(chunks)} chunks.")
# ==============================================================================
# STEP 2: EMBED THE CHUNKS
# ==============================================================================
print(f"\n--- Step 2: Embedding Chunks using '{EMBEDDING_MODEL_NAME}' ---")
# This will download the model from Hugging Face on its first run.
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print("Embedding model loaded.")
# ==============================================================================
# STEP 3: STORE IN A VECTOR DATABASE
# ==============================================================================
print("\n--- Step 3: Storing chunks in FAISS in-memory vector database ---")
# The from_documents method handles embedding and storing in one step.
vector_store = FAISS.from_documents(chunks, embeddings)
print("Chunks embedded and stored in FAISS.")
# ==============================================================================
# STEP 4: RETRIEVE RELEVANT CHUNKS
# ==============================================================================
print("\n--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---")
print(f"\nUser Query: \"{USER_QUERY}\"")
# Retrieve the top 10 most similar chunks
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
print("\n--- Top 10 Retrieved Chunks ---")
initial_results = base_retriever.get_relevant_documents(USER_QUERY)
print(f"\n--- Top 5 Initial Results (from vector search alone) ---")
for i, chunk in enumerate(initial_results[:5], 1):
print(f"\n--- Initial Result {i} ---\n")
print(chunk.page_content)
The top 5 matched chunks are below. We can see that the chunk we are looking for is under "--- Initial Result 5 ---", and unlike our previous approach with fixed-size chunking, this time we are able to retrieve the full table spanning all 5 pages.
Creating logical chunks based on the section map...
Created 171 logical chunks.
PDF loaded and split into 171 chunks.
--- Step 2: Embedding Chunks using 'sentence-transformers/all-mpnet-base-v2' ---
Embedding model loaded.
--- Step 3: Storing chunks in FAISS in-memory vector database ---
Chunks embedded and stored in FAISS.
--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---
User Query: "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
--- Top 10 Retrieved Chunks ---
--- Top 5 Initial Results (from vector search alone) ---
--- Initial Result 1 ---
ESMA REGULAR USE
198 / 216
15 Annex 5 ISO reference data tables
15.1 Country reference data table
Field Name M/O Data field
description Data field
Values ISO Description Source
CountryCode M 2(a) ISO
3166 The 2-character ISO Country Code
identifier. • data provider
• ESMA manual update
CountryName M 70(z) The ISO description of the country
name. • data provider
• ESMA manual update
EEACountryFlag M TRUEFALSE
Indicator TRUE/FALSE Flag which indicates whether the
Country is EEA. • ESMA manual update
• Default value is FALSE
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to be
valid Generated by the ESMA System
LastUpdatedDate M DateTime
YYYYMMDD
HH:MI:SS Date at which the record was last
updated Generated by the ESMA System
--- Initial Result 2 ---
15.4 List of valid CFI codes table
Field Name M/O Data field description Data field description
Values ISO Description Source
CFI code M 6(a)
10962 The CFI Code Updated manually by the
ESMA Business Administrator
ValidityStartDate M Date
YYYYMMDD Date at which the record
becomes valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records
ends to be valid Generated by the ESMA
System
TABLE 45 - LIST OF VALID CFI CODES TABLE
ESMA REGULAR USE
203 / 216
15.5 LEI reference data table
That table contains LEI records, including historical records, composed of all LEI attributes described in http://www.leiroc.org/publications/gls/lou_20140620.pdf . Only the fields
relevant for the COU files will be retained. In addition, for each LEI record, two technical attributes are to be appended (in order to manage history) :
Field Name M/O Data field description Data field description
Values ISO Description Source
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to
be valid Generated by the ESMA
System
TABLE 46 - TECHNICAL ATTRIBUTES OF LEI REFERENCE DATA TABL E
--- Initial Result 3 ---
6.1 Reporting Files Table
Field Name M/O Data field description Data field Values ISO Description Source
FileName [PK] M 5(a)_6(a)_5(a)_5(a) -6(n)_2(a) Name of a submitted file excluding
HUBEX/HUBDE timestamp ESMA System
ESMA reception
Date Time M YYYYMMDDHHMMSS The timestamp in the name of a
submitted file ESMA System
TABLE 13 - REPORTING FILES TABLE
ESMA REGULAR USE
163 / 216
6.2 NCA reference data table
Field Name M/O Data field
description Data field
Values ISO Description Source
Country Code M 2(a) ISO 3166 -
Country Code 3166 The 2-character ISO Country Code identifier. Updated by ESMA IT administrator
from registration process
AuthorityName M 30(x) The official name of the NCA Updated by ESMA business
administrator from registration process
Address M 250(z) The address of the NCA Updated by ESMA business
administrator from registration process
Generic
EmailAddress O The email address to be used for the RCA
change process es. Updated by ESMA business
administrator from registration process
Contact.
Name M 250(z) The name of the contact Updated by ESMA business
administrator from registration process
Contact.
EmailAddress M The email address of the contact Updated by ESMA business
administrator from registration process
Contact.
PhoneNumber M The phone Number of the contact Updated by ESMA business
administrator from registration process
Level of
delegation M 1(a) N/C/T N in case Non-delegating NCA
C in case NCA delegating data collection and
transparency calculations
T in case NCA delegating transparency
calculations but not data collection in their
jurisdiction Updated by ESMA business
administrator from registration process
Withdrawn flag M TRUEFALSE
Indicator Flag which indicates whether the NCA is
withdrawn from the system Updated by ESMA business
administrator from registration process
TABLE 14 - NCA REFERENCE DATA TABL E
--- Initial Result 4 ---
15.5 LEI reference data table
That table contains LEI records, including historical records, composed of all LEI attributes described in http://www.leiroc.org/publications/gls/lou_20140620.pdf . Only the fields
relevant for the COU files will be retained. In addition, for each LEI record, two technical attributes are to be appended (in order to manage history) :
Field Name M/O Data field description Data field description
Values ISO Description Source
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to
be valid Generated by the ESMA
System
TABLE 46 - TECHNICAL ATTRIBUTES OF LEI REFERENCE DATA TABL E
ESMA REGULAR USE
204 / 216
16 Annex 6 Scenarios of Instrument reference data reporting and distribution
The system shall ensure compliance with the following scenarios.
16.1 Modified instrument reported on time
--- Initial Result 5 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
ESMA REGULAR USE
184 / 216
The Strike Price Currency Code shall exist
as an active ISO 4217 Currency Code in the
currency reference data table (based on
records with ValidityEndDate is NULL . INS-108 The Strike Price
Currency Code is
incorrect. RTS field 32
The Notional Currency 1 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDat e is NULL or PreEuroFlag is
TRUE ). INS-109 The Notional Currency 1
Code is incorrect. RTS field 13
The Notional Currency 2 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-110 The Notional Currency 2
Code is incorrect. RTS field 42, 47
The Currency of nominal value shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-111 The Currency of nominal
value is incorrect. RTS field 1 6
The value of the “Issuer Identifier” shall exist
in the LEI reference table and comply with
the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-112 The LEI provided for
“Issuer Identifier” is
invalid. RTS field 5
The value of the “ Direct Underlying issuer ”
shall exist in the LEI reference table and
comply with the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-113 The LEI provided for
“Direct Underlying Issuer”
is invalid. RTS field 27a, 27b
ESMA REGULAR USE
185 / 216
Check the last digit of the ISIN code of the
“instrument identification code” according to
the algorithm of ISIN validation .19 INS-114 The ISIN code of the
instrument identification
code is invalid. RTS field 1
Check the last digit of the ISIN code of the
“underlying instrument” should be valid
according to the algorithm of ISIN
validation .20 INS-115 The ISIN code of the
underlying is invalid. RTS field 26a, 26b, 26c
Check the last digit of the ISIN code of the
Identifier of the “Index/Benchmark of a
floating rate Bond” should be valid according
to the algorithm of ISIN validation.21 INS-116 The ISIN code of the
Index/Benchmark of a
floating rate Bond is
invalid. RTS field 19
The “Date of admission to trading or date of
First trade” should a valid date and in a
sensible range (no prior than 31 -12-189922). INS-117 The “Date of admission
to trading or date of First
trade” is not a consistent
date. RTS field 11
The Termination Date should a valid date
and in a sensible range (no prior than 31 -12-
189923). INS-118 The Termination Date is
not a consistent date. RTS field 12
The Termination Date should be equal to or
later than the “Date of admission to trading
or date of First trade”. INS-119 The Termination Date is
earlier than the “Date of
admission to trading or
date of First trade”. RTS field 11, 12
The Maturity Date should a valid date and in
a sensible range (no prior than 31 -12-
189924). INS-120 The Maturity Date is not
a consistent date. RTS field 15
The Maturity Date should b e equal to or later
than “Date of admission to trading or date of
First trade”. INS-121 The Maturity Date and
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 15
19 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
20 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
21 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
22 The oldest ins trument traded according to RDS System database. That date must be configurable.
23 The oldest instrument traded according to RDS System database. That date must be configurable.
24 The oldest instrument traded according to RDS System database. That date m ust be configurable.
ESMA REGULAR USE
186 / 216
The Expiry Date should a valid date and in a
sensible range (no prior than 31 -12-189925). INS-122 The Expiry Date is not a
consistent date. RTS field 24
The Expiry date should be equal to or later
than the “Date of admission to trading or
date of First trade”. INS-123 The Expiry Date and The
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 24
Field “Option Type” shall only contain value
“PUTO” when the “Instrument Classification”
refers to the following CFI Codes: OP****
(Put Options). INS-124 Invalid “PUTO” Option
Type RTS field 3, 30
Field “Option Type” shall only contain value
“CALL” when the “Instrument Classification”
refers to the following CFI Codes: OC****
(Call Options). INS-125 Invalid “CALL” Option
Type RTS field 3, 30
The termination date should be populated in
case Maturity date/Expiry date is populated
and is strictly earlier than the current
reporting date. INS-126 The Termination date is
not populated for an
expired/matured
instrument.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The termination date should be earlier or
equal in case Expiry date/Maturity date is
populated.
INS-127 The Termination date
and Expiry date/Maturity
date are not consistent.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The field listed in Table 1 BRD 43. shall be
consistent with the values provided by the
Relevant competent Authority.26
INS-128 The following fields are
not consistent with the
one provided by RCA
:<<Upcoming RCA>> ,
RCA_ MIC
:<<MIC>> (<<MIC’s
country>> ): List of RTS23 RTS field s used for consistency
checks as stated in Table 21 -
RTS23 Fields table .
25 The oldest instrument traded according to RDS System database. That date must be configurable.
26 Generated during the consistency checks.
ESMA REGULAR USE
187 / 216
number Id of missing
field(s)”.
N.B.: that check if failed
generates a warning
only.
The currency of the Total issued nominal
amount shall be the same as the currency of
nominal value INS-129 The currency of the Total
issued nominal amount is
not the same as the
currency of nominal value RTS Field 14. Currency
RTS Field 16.
The ISIN-MIC combination, received for a
cancellation record , should exists in FIRDS
DB. INS-130 The ISIN-MIC
combination, received
from a cancellation
record , doesn’t exists in
FIRDS DB RTS field 1,6
TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES
ESMA REGULAR USE
188 / 216
10 Annex 1d: Non -working Days Content Validation
Rules
Control executed by the system Error code Error Message
If the non -working day is provided for a Market TV/SI
(NonWorkgDay/Id/ MktIdCd is populated ): the system
check s that the MIC exists in the Reporting Flow View
under “TV / SI MIC”, and that there exists a line in the
Reporting Flow View which maps this “TV / SI MIC”
with “Reporting Entity” documented in the
RptHdr/RptgNtty
If the non -working day is provided for an APA or CTP
(NonWorkgDay/Id/ Othr/Id is populated ): the system
check s that the identification code under Other/Id
exists in the Reporting Flow View under “Reporting
Entity” and is the same as the entity reported under
RptHdr/RptgNtty/Id/Othr
NWD -001 The TV/SI/APA/CTP identified under
NonWorkgDay/Id is not registered at
ESMA or is not consistent with the
reporting entity in the header.
In case the identification code of the record is a NCA27,
that code shall exist in the NCA reference data table in
the Registers system and must be equal to the
Reporting Entity identifier in the header of the XML file. NWD -002 The NCA identified by the “Trading
Venue identification code” field is not
registered at ESMA or is not equal to
the reporting entity in the header.
The Non -working Date of a record should be a valid
date. NWD -003 This date does not exist .
TABLE 34 - NON-WORKING DAYS CONTENT VALIDATION RULES28
11 Reminder Message code and description
Code Code description
RMD -001 No file has been submitted to ESMA on the day <<current reporting date>> or was
submitted after the cut -off time.
RMD -002 The instrument was not reported on the day <<current reporting date>> or was reported
after the cut -off time.
TABLE 35 - REMINDER MESSAGE CODE AND DESC RIPTION
27 Used in case the non -working day refers to an NCA
To improve the relevance of the retrieved chunks so that the expected chunk appears right at the top, we can now apply our reranker.
# ==============================================================================
# STEP 5: RERANK RETRIEVED CHUNKS
# ==============================================================================
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# The cross-encoder model will be downloaded on the first run.
# It takes the query and a list of documents and returns them, scored and re-ordered.
print(f"\n--- Initializing Reranker with '{RERANKER_MODEL_NAME}' ---")
model = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL_NAME)
reranker = CrossEncoderReranker(model=model, top_n=3)

# 5d. Create the full retrieval pipeline with the reranker
# The ContextualCompressionRetriever uses the base retriever to fetch documents
# and then the reranker to re-order them based on relevance.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
print("Reranking pipeline created.")

# 5e. Perform the final, reranked search
print("\n--- Performing search with reranking... ---")
reranked_chunks = compression_retriever.get_relevant_documents(USER_QUERY)

print("\n\n=========================================================")
print("--- Top 3 Reranked & Most Relevant Chunks ---")
print("=========================================================")
for i, chunk in enumerate(reranked_chunks, 1):
    print(f"\n--- Final Result {i} ---\n")
    print(chunk.page_content)
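Conceptually, the reranking step does just three things: score every (query, chunk) pair with a joint model, sort by score, and keep the top n. The following dependency-free sketch illustrates that pattern; the word-overlap scorer is a hypothetical stand-in for the real cross-encoder, used here only to make the mechanics visible.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, chunk) pair, sort descending, keep the best top_n."""
    scored = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return scored[:top_n]

chunks = [
    "Annex 1d: Non-working Days Content Validation Rules",
    "Annex 1c: Reference Data Content and Consistency Validation Rules",
    "Reminder Message code and description",
    "Annex 1b: Format Validation Rules",
]
query = "reference data content and consistency validation rules"
print(rerank(query, chunks, top_n=2))
```

A real cross-encoder replaces `overlap_score` with a transformer that reads the query and chunk together, which is why it ranks so much more accurately than the embedding similarity used for the initial retrieval.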
With the reranker applied, our expected chunk now appears as "--- Final Result 1 ---", and the full multi-page table is properly retrieved.
--- Initializing Reranker with 'cross-encoder/ms-marco-MiniLM-L-6-v2' ---
Reranking pipeline created.
--- Performing search with reranking... ---
=========================================================
--- Top 3 Reranked & Most Relevant Chunks ---
=========================================================
--- Final Result 1 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
ESMA REGULAR USE
184 / 216
The Strike Price Currency Code shall exist
as an active ISO 4217 Currency Code in the
currency reference data table (based on
records with ValidityEndDate is NULL . INS-108 The Strike Price
Currency Code is
incorrect. RTS field 32
The Notional Currency 1 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDat e is NULL or PreEuroFlag is
TRUE ). INS-109 The Notional Currency 1
Code is incorrect. RTS field 13
The Notional Currency 2 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-110 The Notional Currency 2
Code is incorrect. RTS field 42, 47
The Currency of nominal value shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-111 The Currency of nominal
value is incorrect. RTS field 1 6
The value of the “Issuer Identifier” shall exist
in the LEI reference table and comply with
the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-112 The LEI provided for
“Issuer Identifier” is
invalid. RTS field 5
The value of the “ Direct Underlying issuer ”
shall exist in the LEI reference table and
comply with the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-113 The LEI provided for
“Direct Underlying Issuer”
is invalid. RTS field 27a, 27b
ESMA REGULAR USE
185 / 216
Check the last digit of the ISIN code of the
“instrument identification code” according to
the algorithm of ISIN validation .19 INS-114 The ISIN code of the
instrument identification
code is invalid. RTS field 1
Check the last digit of the ISIN code of the
“underlying instrument” should be valid
according to the algorithm of ISIN
validation .20 INS-115 The ISIN code of the
underlying is invalid. RTS field 26a, 26b, 26c
Check the last digit of the ISIN code of the
Identifier of the “Index/Benchmark of a
floating rate Bond” should be valid according
to the algorithm of ISIN validation.21 INS-116 The ISIN code of the
Index/Benchmark of a
floating rate Bond is
invalid. RTS field 19
The “Date of admission to trading or date of
First trade” should a valid date and in a
sensible range (no prior than 31 -12-189922). INS-117 The “Date of admission
to trading or date of First
trade” is not a consistent
date. RTS field 11
The Termination Date should a valid date
and in a sensible range (no prior than 31 -12-
189923). INS-118 The Termination Date is
not a consistent date. RTS field 12
The Termination Date should be equal to or
later than the “Date of admission to trading
or date of First trade”. INS-119 The Termination Date is
earlier than the “Date of
admission to trading or
date of First trade”. RTS field 11, 12
The Maturity Date should a valid date and in
a sensible range (no prior than 31 -12-
189924). INS-120 The Maturity Date is not
a consistent date. RTS field 15
The Maturity Date should b e equal to or later
than “Date of admission to trading or date of
First trade”. INS-121 The Maturity Date and
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 15
19 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
20 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
21 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
22 The oldest ins trument traded according to RDS System database. That date must be configurable.
23 The oldest instrument traded according to RDS System database. That date must be configurable.
24 The oldest instrument traded according to RDS System database. That date m ust be configurable.
ESMA REGULAR USE
186 / 216
The Expiry Date should a valid date and in a
sensible range (no prior than 31 -12-189925). INS-122 The Expiry Date is not a
consistent date. RTS field 24
The Expiry date should be equal to or later
than the “Date of admission to trading or
date of First trade”. INS-123 The Expiry Date and The
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 24
Field “Option Type” shall only contain value
“PUTO” when the “Instrument Classification”
refers to the following CFI Codes: OP****
(Put Options). INS-124 Invalid “PUTO” Option
Type RTS field 3, 30
Field “Option Type” shall only contain value
“CALL” when the “Instrument Classification”
refers to the following CFI Codes: OC****
(Call Options). INS-125 Invalid “CALL” Option
Type RTS field 3, 30
The termination date should be populated in
case Maturity date/Expiry date is populated
and is strictly earlier than the current
reporting date. INS-126 The Termination date is
not populated for an
expired/matured
instrument.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The termination date should be earlier or
equal in case Expiry date/Maturity date is
populated.
INS-127 The Termination date
and Expiry date/Maturity
date are not consistent.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The field listed in Table 1 BRD 43. shall be
consistent with the values provided by the
Relevant competent Authority.26
INS-128 The following fields are
not consistent with the
one provided by RCA
:<<Upcoming RCA>> ,
RCA_ MIC
:<<MIC>> (<<MIC’s
country>> ): List of RTS23 RTS field s used for consistency
checks as stated in Table 21 -
RTS23 Fields table .
25 The oldest instrument traded according to RDS System database. That date must be configurable.
26 Generated during the consistency checks.
ESMA REGULAR USE
187 / 216
number Id of missing
field(s)”.
N.B.: that check if failed
generates a warning
only.
The currency of the Total issued nominal
amount shall be the same as the currency of
nominal value INS-129 The currency of the Total
issued nominal amount is
not the same as the
currency of nominal value RTS Field 14. Currency
RTS Field 16.
The ISIN-MIC combination, received for a
cancellation record , should exists in FIRDS
DB. INS-130 The ISIN-MIC
combination, received
from a cancellation
record , doesn’t exists in
FIRDS DB RTS field 1,6
TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES
ESMA REGULAR USE
188 / 216
10 Annex 1d: Non -working Days Content Validation
Rules
Control executed by the system Error code Error Message
If the non -working day is provided for a Market TV/SI
(NonWorkgDay/Id/ MktIdCd is populated ): the system
check s that the MIC exists in the Reporting Flow View
under “TV / SI MIC”, and that there exists a line in the
Reporting Flow View which maps this “TV / SI MIC”
with “Reporting Entity” documented in the
RptHdr/RptgNtty
If the non -working day is provided for an APA or CTP
(NonWorkgDay/Id/ Othr/Id is populated ): the system
check s that the identification code under Other/Id
exists in the Reporting Flow View under “Reporting
Entity” and is the same as the entity reported under
RptHdr/RptgNtty/Id/Othr
NWD -001 The TV/SI/APA/CTP identified under
NonWorkgDay/Id is not registered at
ESMA or is not consistent with the
reporting entity in the header.
In case the identification code of the record is a NCA27,
that code shall exist in the NCA reference data table in
the Registers system and must be equal to the
Reporting Entity identifier in the header of the XML file. NWD -002 The NCA identified by the “Trading
Venue identification code” field is not
registered at ESMA or is not equal to
the reporting entity in the header.
The Non -working Date of a record should be a valid
date. NWD -003 This date does not exist .
TABLE 34 - NON-WORKING DAYS CONTENT VALIDATION RULES28
11 Reminder Message code and description
Code Code description
RMD -001 No file has been submitted to ESMA on the day <<current reporting date>> or was
submitted after the cut -off time.
RMD -002 The instrument was not reported on the day <<current reporting date>> or was reported
after the cut -off time.
TABLE 35 - REMINDER MESSAGE CODE AND DESC RIPTION
27 Used in case the non -working day refers to an NCA
--- Final Result 2 ---
8 Annex 1b: Format Validation Rules
Initial data validation is done to confirm file sent by the Submitting Entity can be processed. This
includes whether the file can be uncompressed , conforms to expected XSD schema and common file
identifiers are valid.
Possible Errors encountered are:
Error
code Error Message Control
Feedback messages related to file validation
FIL-104 The ISO 20022 Message Identifier in the
BAH (*.xsd) is not valid. The ISO 20022 Message Identifier in the
BAH must refer to the latest schema
approved by ITMG.
FIL-105 The file structure does not correspond to the
XML schema: [result of XML validation]. Validate that the file sent fits to the
corresponding XML schema. For information
purposes, if there is an error in the validation,
the error message produced by the XML
parser is displayed in place of [result of XML
validation].
FIL-106 The Reporting Entity is not registered at
ESMA or the Submitting Entity shall not
submit this data . Validate the file as follows:
1) Extracts from Table 19 - Reporting Flow
view the Submitting entity identification
associated to the Reporting entity
identifier code in the Reporting header of
the submitted file.
2) Checks that the Submitting entity
identification extract ed in step 1 is equal
to the sender code of the submitted file .
FIL-107 File <Filename> has already been submitted
once. When a file is received, the system checks
whether it exists in the Reporting Files Table
as described in Table 13 - Reporting Files
table a record which filename is composed of
the same sender, filetype, recipient, Key1,
Key2 Year.
TABLE 32 - FORMAT VALIDATION RULES
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
--- Final Result 3 ---
ESMA REGULAR USE
35 / 216
In addition, the system shall have mechanisms in place to avoid that interfacing systems needing
access reference data during the 00:00 – 08:00 period retrieve inconsistent data due to ongoing updates
taking place during the post -processing phase. Proposals on the best approach will be expected from
the provider in charge of the technical specifications and development of the system5.
3.3.3 Perform Reference Data Content Validation
Goal The goal of this use case is for individual records within a received file
to be validated by ESMA .
Actors TV/SI (in the jurisdiction of a delegating NCA) - submits data
NCA (not delegating data collection in its jurisdiction) - submits data
The ESMA Sy stem – validates data
Preconditions ESMA has received and successfully validated the format of a received file
Trigger ESMA has received and successfully validated the format of a received file
Postcondition The ESMA System has extracted the subset of records which passed the
data content validations .
Normal Flow 1. The ESMA System validates each record against data all content
validation rules sequentially in the order as described in Table 33 -
Reference Data Content and Consistency Validation Rules .
2. Validation is successful finding no errors .
Alternate
Flow 1:
Preliminary
Content
validation
Errors 1a. The ESMA System validates each record against data content validation
rules sequentially in the order stated in Annex 1c. One of the following
check s INS-101, INS-102 and INS-103 fails. The ESMA System logs the
error , stops the validation of the record and runs again the validation
process on the next record .
2a. The ESMA System logs the erroneous records and the list of errors and
rejects the erroneous record.
Alternate
Flow 2:
Blocking
content
validation
Errors 1b. The ESMA System validates each record against data content validation
rules sequentially in the order as described in Table 33 - Reference Data
Content and Consistency Validation Rules .
Checks INS-101, INS-102, INS-103 are passed.
At least o ne of the following checks INS-104 to INS-125 or INS-129 to INS-
130 fails.
The ESMA System logs the error and continues the validation process of that
record until the last content check and runs again the validation process on
the next record.
5 As an example, the system may work on a temporar y copy of the CRDT during the post -processing phase, then lock and
commit the ch anges to the CRDT only once this post-processing phase is complete.
ESMA REGULAR USE
36 / 216
2c. The ESMA System logs the erroneous records and the list of errors and
rejects the erroneous record ,
Alternate
Flow 3:
Warning s 1b. The ESMA System validates each record against checks INS-126 and
INS-127 as described in Table 33 - Reference Data Content and Consistency
Validation Rules . Each time a check fails the ESMA System logs the error but
continues the validation process of that record until the last content check.
2b. The ESMA System logs the record and associated list of warning s.
Frequency Once each file is submitted by a submitting entity. Each entity is expected to
submit at least one file per day but can also make multiple submissions to
address errors on previous submission.
Business
Rules Table 33 - Reference Data Content and Consistency Validation Rules .
Assumptions N/A
3.3.4 Update the Received Reference Data Table
Goal The goal of this use case is to update the Received Reference Data
Table according to a submitted record which passed the content
validation checks.
Actors The ESMA System .
Preconditions The ESMA System has performed the content validation on the submitted
record.
Trigger The ESMA System has s uccessfully validated the content of the submitted
record.
Postcondition The ESMA System has updated the Received Reference Data Table
according to the submitted record.
Normal Flow
(Referenced
records –
DATINS file
submission) In case the system identifies the submitted record as “ReferenceRcd”
then
1. The ESMA System determines the HCRR of the submitted record by
calculating the hash value of the whole set of all RTS23 fields , using a
hash function with sufficient collision resistance to ensure that two
different versions of the RTS23 fields will not lead to the same hash
value for the same ISIN, MIC combination6.
2. The ESMA System checks whether it exists in the “Received reference
data table” a record having the same ISIN, MIC, HCRR and Latest
Received flag is TRUE.
3. In case no such record is found in step 2 , the ESMA System:
6 The choice of the hash function will be discussed during the system technical specifications
We can now use an LLM to interpret the retrieved chunks, combine them with the original user query, and generate a coherent response to the query based on those chunks.
from google import genai

client = genai.Client(api_key=api_key)

# Build the prompt from the original user query plus the reranked chunks as context
context = "\n\n".join(chunk.page_content for chunk in reranked_chunks)
prompt = (
    "Show me the entire table of Annex 1c: Reference Data Content and "
    "Consistency Validation Rules\n\n"
    "Context:\n" + context
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
)
print(response.text)
And we get the nicely formatted complete table of Annex 1c. Not only is the result accurate, Gemini 2.5 Flash also responded in a shorter time of 10s 665ms, compared with 14s 390ms for the brute-force method. This chunking approach is also much cheaper, since we pass only the top 3 reranked chunks to Gemini rather than the entire document.
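To see where the cost savings come from, here is a rough back-of-the-envelope estimate. The figures are illustrative assumptions, not measurements from this document: ~4 characters per token (a common heuristic for English text), ~3,000 characters per PDF page, and ~5,000 characters per retrieved chunk.

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Hypothetical sizes for illustration: a 216-page PDF vs three retrieved chunks.
full_document_chars = 216 * 3000      # assume ~3,000 characters per page
chunk_chars = [5200, 4100, 4800]      # assume ~5,000 characters per chunk

full_tokens = rough_token_count("x" * full_document_chars)
chunk_tokens = sum(rough_token_count("x" * n) for n in chunk_chars)

print(f"Full document: ~{full_tokens:,} tokens")
print(f"Top-3 chunks:  ~{chunk_tokens:,} tokens")
print(f"Reduction:     ~{full_tokens / chunk_tokens:.0f}x fewer input tokens")
```

Since LLM APIs bill per input token, an order-of-magnitude reduction in prompt size translates directly into an order-of-magnitude reduction in cost per query, on top of the latency improvement observed above.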
Here is the entire "TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES" from the provided text:
**TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES**
| Control executed by the system | Error code | Error Message | Concerned Fields |
| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :--------- | :----------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------- |
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI-based validation matrix. | INS-101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS-102 | The following mandatory fields are not reported: “List of RTS23 number Id of missing field(s)” . | RTS field 3 vs all other RTS fields |
| Check that Non-Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS-103 | The following Non-Applicable fields are wrongly reported: “ List of RTS23 number Id of N/A field(s)” . | All RTS fields |
| **The following checks are performed only in case checks above are passed.** | | | |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL OR is later or equal to the current date ). | INS-105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS-107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity RTS field 6 |
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL . | INS-108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| The value of the “Direct Underlying issuer” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
| Check the last digit of the ISIN code of the “instrument identification code” according to the algorithm of ISIN validation. | INS-114 | The ISIN code of the instrument identification code is invalid. | RTS field 1 |
| Check the last digit of the ISIN code of the “underlying instrument” should be valid according to the algorithm of ISIN validation. | INS-115 | The ISIN code of the underlying is invalid. | RTS field 26a, 26b, 26c |
| Check the last digit of the ISIN code of the Identifier of the “Index/Benchmark of a floating rate Bond” should be valid according to the algorithm of ISIN validation. | INS-116 | The ISIN code of the Index/Benchmark of a floating rate Bond is invalid. | RTS field 19 |
| The “Date of admission to trading or date of First trade” should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-117 | The “Date of admission to trading or date of First trade” is not a consistent date. | RTS field 11 |
| The Termination Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-118 | The Termination Date is not a consistent date. | RTS field 12 |
| The Termination Date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-119 | The Termination Date is earlier than the “Date of admission to trading or date of First trade”. | RTS field 11, 12 |
| The Maturity Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-120 | The Maturity Date is not a consistent date. | RTS field 15 |
| The Maturity Date should be equal to or later than “Date of admission to trading or date of First trade”. | INS-121 | The Maturity Date and Date of admission to trading or date of First trade are not consistent. | RTS field 11, 15 |
| The Expiry Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-122 | The Expiry Date is not a consistent date. | RTS field 24 |
| The Expiry date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-123 | The Expiry Date and The Date of admission to trading or date of First trade are not consistent. | RTS field 11, 24 |
| Field “Option Type” shall only contain value “PUTO” when the “Instrument Classification” refers to the following CFI Codes: OP**** (Put Options). | INS-124 | Invalid “PUTO” Option Type | RTS field 3, 30 |
| Field “Option Type” shall only contain value “CALL” when the “Instrument Classification” refers to the following CFI Codes: OC**** (Call Options). | INS-125 | Invalid “CALL” Option Type | RTS field 3, 30 |
| The termination date should be populated in case Maturity date/Expiry date is populated and is strictly earlier than the current reporting date. N.B.: that check if failed generates a warning only. | INS-126 | The Termination date is not populated for an expired/matured instrument. | RTS field 12, {15 or 24} |
| The termination date should be earlier or equal in case Expiry date/Maturity date is populated. N.B.: that check if failed generates a warning only. | INS-127 | The Termination date and Expiry date/Maturity date are not consistent. | RTS field 12, {15 or 24} |
| The field listed in Table 1 BRD 43. shall be consistent with the values provided by the Relevant competent Authority. | INS-128 | The following fields are not consistent with the one provided by RCA :<<Upcoming RCA>> , RCA_ MIC :<<MIC>> (<<MIC’s country>> ): List of RTS23 number Id of missing field(s)”. N.B.: that check if failed generates a warning only. | RTS fields used for consistency checks as stated in Table 21 - RTS23 Fields table . |
| The currency of the Total issued nominal amount shall be the same as the currency of nominal value | INS-129 | The currency of the Total issued nominal amount is not the same as the currency of nominal value | RTS Field 14. Currency RTS Field 16. |
| The ISIN-MIC combination, received for a cancellation record, should exist in FIRDS DB. | INS-130 | The ISIN-MIC combination, received from a cancellation record, does not exist in FIRDS DB | RTS field 1,6 |
Comparison of Chunking Approaches
Approach | Pros | Cons
---|---|---
1. Brute Force (Full Document as Context) | - Extremely High Quality: Achieves excellent relevancy and completeness by providing the full document to the LLM. - Simple to Implement: Requires minimal code; just load the document and query. | - Not Scalable: Infeasible for large document sets (thousands of files) that exceed the LLM's context window. - High Per-Query Cost: The entire document's token count is charged for every single query. - Potential Latency: Processing massive contexts can be slower.
2. Fixed-Size Chunking | - Very Simple to Implement: The most common and straightforward chunking method. - Fast and Computationally Cheap: Does not require complex analysis of the document. | - Structurally Unaware: Arbitrarily splits content, breaking apart tables, code blocks, and logical sections. - Poor Completeness: Fails to retrieve complete information (e.g., gets only the first page of a 5-page table). - Low Relevancy: Often includes irrelevant, adjacent text in chunks, requiring aggressive reranking.
3. Contextual Chunking via Docling | - Structure-Aware: Capable of identifying basic document structures like headers and tables. - Potentially Fast Parsing: Can leverage a GPU for quick processing. - LLM-Free Parsing: Avoids LLM costs for the initial document parsing step. | - Prone to Errors: Confuses page headers with section headers, which corrupts the chunking logic. - Incomplete on Complex Structures: Fails to merge multi-page tables, breaking them into separate, incomplete chunks. - Requires Post-Processing: The errors necessitate additional code to clean the parsed output before it can be used.
4. Contextual Chunking via Gemini Metadata | - Superior Structural Understanding: Accurately identifies true section headers and ignores noise like page headers. - Achieves Relevancy & Completeness: Enables retrieval of entire logical sections, like the full multi-page table. - Highly Flexible: Can be adapted to different document structures or extraction tasks (e.g., entity relationships) by simply changing the prompt. - Cost-Effective for RAG: The expensive LLM call is a one-time cost per document for metadata generation. Subsequent queries are cheap. | - Upfront Cost: Incurs a one-time LLM cost for every document processed to generate metadata. - Requires Prompt Engineering: The quality of the output depends on a well-crafted prompt, which may need tuning for different document types. - Dependency on LLM Provider: Creates a dependency on the specific LLM API (e.g., Google Gemini).
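To make the contrast between approaches 2 and 3/4 concrete, here is a minimal sketch in plain Python (the function names are illustrative, not from any library): a fixed-size splitter that ignores structure, next to a naive structure-aware splitter that starts a new chunk at every numbered section header of the "d.d.d Header" form described in the introduction.

```python
import re

def fixed_size_chunks(text, size=200, overlap=50):
    """Approach 2: split text into fixed-size character windows,
    ignoring any document structure."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def section_chunks(text):
    """A naive contextual splitter: start a new chunk at every
    numbered section header of the form 'd.d Title' / 'd.d.d Title'."""
    header = re.compile(r"^\d+(\.\d+)+\s+\S", re.MULTILINE)
    starts = [m.start() for m in header.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

doc = "1.1 Intro\nSome intro text.\n1.2 Rules\nA long table row...\n"
print(fixed_size_chunks(doc, size=20, overlap=5))
print(section_chunks(doc))
```

The fixed-size splitter can cut the table row in half mid-sentence, while the section-aware splitter keeps each numbered section whole, which is exactly the behaviour the comparison table above describes.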
Conclusion
We have shown that chunking for RAG is a necessary evil when dealing with large quantities of documents. When handling structured documents with complex features such as multi-page tables, contextual chunking improves the relevancy and completeness of retrieved information by taking these document-specific features into account.
We have also demonstrated two approaches to contextual chunking, one via Docling, the other via an LLM such as Gemini 2.5 Pro, and showed how each approach can be implemented.
With the latter approach, we can easily change the prompt to optimise for different document types and extract metadata of different forms with minimal effort. For example, when building knowledge graphs, we can ask Gemini to extract entity relationships in the general format entity--relationship--entity, all by simply tweaking the prompt. And given the large context window of 1 million tokens, by passing in an entire book, Gemini can find relationships not just within each paragraph or chapter, but across chapters. This can be explored further in future articles.
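As a minimal sketch of the entity-relationship idea (the prompt wording, function names, and sample reply below are all illustrative assumptions, not actual Gemini output), the triple-extraction step might look like this, with the actual model call omitted:

```python
def build_triple_prompt(document_text):
    """Build an illustrative prompt asking the model to emit one
    entity--relationship--entity triple per line."""
    return (
        "Extract entity relationships from the document below.\n"
        "Output one triple per line in the exact format:\n"
        "entity--relationship--entity\n\n"
        f"Document:\n{document_text}"
    )

def parse_triples(reply):
    """Parse the model's line-based reply into (head, relation, tail)
    tuples, skipping any line that does not match the expected format."""
    triples = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("--")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples

# A hypothetical model reply, for illustration only; in practice this
# would come from the Gemini API given build_triple_prompt(document).
sample_reply = """Gemini 2.5 Pro--developed by--Google
Docling--parses--PDF documents
(a malformed line like this is skipped)"""
print(parse_triples(sample_reply))
```

Because the whole book fits in the context window, a single such call can surface cross-chapter relationships that per-chunk extraction would miss.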