Table of Contents
- Introduction
- Why chunking
- Why contextual chunking
- Why fixed-size chunking is insufficient
- How to implement contextual chunking
- Docling
- Google Gemini 2.5 Pro/Flash
- Comparison of Chunking Approaches
- Conclusion
Introduction
In many enterprise settings, documents are highly structured and consistently formatted, with information organised hierarchically. For example, there is usually a contents page, and related information is grouped into sections. Section numbers run in order and typically follow a convention such as "d.d.d Header".
When building RAG over thousands of such documents, chunking is necessary so that the retrieved chunks are of manageable length when passed to LLMs as context. Contextual chunking has significant advantages over simple fixed-size chunking in ensuring both the relevancy and the completeness of retrieval, and in an enterprise setting where precision and accuracy are critical, getting chunking right can even decide the fate of an organisation's digital transformation efforts.
This is a hands-on article with code that you can run as you read. You may also wish to download my Jupyter notebook directly here.
Why chunking
Given the advent of LLMs such as Google Gemini 2.5 Flash/Pro with a 1-million-token context window, chunking may not be necessary if you are working with only around 10 documents. In fact, Gemini can produce very accurate and comprehensive responses using such a brute-force method, and it is also easy to implement.
The following code demonstrates how we can use Google Gemini to understand a document. First, we set up the Google Gemini client.
from dotenv import load_dotenv
import os
# --- DEBUGGING STEP ---
# Print the current working directory to see where Python is looking.
print(f"Current working directory: {os.getcwd()}")
# Load environment variables from the .env file
load_dotenv()
# Get the API key from the environment variables
# The string "GENAI_API_KEY" must match the variable name in your .env file
api_key = os.getenv("GENAI_API_KEY")
# Check if the API key is loaded correctly
if not api_key:
    raise ValueError("No API key found. Please set the GENAI_API_KEY in your .env file.")
If the GENAI_API_KEY is present and loaded, we should see only the current working directory printed in the output:
Current working directory: /home/lewis/github/rag-strategies
Next, we can use the following code to load the document "Functional Specification Document: FIRDS -- Reference Data". I chose this document because it contains structures such as headers defining each section, and relatively complex content such as multi-page tables. For example, in Annex 1c: Reference Data Content and Consistency Validation Rules, there is a table that spans from page 183 to 187. We can ask Gemini the query "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules" to check whether it can retrieve the table in its entirety.
from google import genai
from google.genai import types
import pathlib
import httpx
client = genai.Client(api_key=api_key)
# Retrieve and encode the PDF bytes
file = ["firds_reference_data_functional_specifications_v2.10.pdf"]
current_dir = os.getcwd()
doc_path = os.path.join(current_dir, "resources", file[0])
filepath = pathlib.Path(doc_path)
prompt = "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(
            data=filepath.read_bytes(),
            mime_type='application/pdf',
        ),
        prompt,
    ],
)
print(response.text)
Even with Gemini 2.5 Flash, we can see that the model understands the query and retrieves the entire five-page table, fulfilling the requirements of both relevancy and completeness. The response is reproduced below:
Here is the entire table of **Annex 1c: Reference Data Content and Consistency Validation Rules** from page 185 of the document:
**TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES**
| Control executed by the system | Error code | Error Message | Concerned Fields |
| :----------------------------- | :--------- | :------------ | :--------------- |
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI-based validation matrix. | INS-101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS-102 | The following mandatory fields are not reported: “List of RTS23 number Id of missing field(s)”. | RTS field 3 vs all other RTS fields |
| Check that Non-Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS-103 | The following Non-Applicable fields are wrongly reported: “List of RTS23 number Id of N/A field(s)”. | All RTS fields |
| **The following checks are performed only in case checks above are passed.** | | | |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL OR is later or equal to the current date). | INS-105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS-107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity <br> RTS field 6 |
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL. | INS-108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS-111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: (ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| The value of the “Direct Underlying issuer” shall exist in the LEI reference table and comply with the following conditions: (ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
| Check the last digit of the ISIN code of the “instrument identification code” according to the algorithm of ISIN validation. | INS-114 | The ISIN code of the instrument identification code is invalid. | RTS field 1 |
| Check the last digit of the ISIN code of the “underlying instrument” should be valid according to the algorithm of ISIN validation. | INS-115 | The ISIN code of the underlying is invalid. | RTS field 26a, 26b, 26c |
| Check the last digit of the ISIN code of the Identifier of the “Index/Benchmark of a floating rate Bond” should be valid according to the algorithm of ISIN validation. | INS-116 | The ISIN code of the Index/Benchmark of a floating rate Bond is invalid. | RTS field 19 |
| The “Date of admission to trading or date of First trade” should a valid date and in a sensible range (no prior than 31-12-1899). | INS-117 | The “Date of admission to trading or date of First trade” is not a consistent date. | RTS field 11 |
| The Termination Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-118 | The Termination Date is not a consistent date. | RTS field 12 |
| The Termination Date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-119 | The Termination Date is earlier than the “Date of admission to trading or date of First trade”. | RTS field 11, 12 |
| The Maturity Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-120 | The Maturity Date is not a consistent date. | RTS field 15 |
| The Maturity Date should be equal to or later than “Date of admission to trading or date of First trade”. | INS-121 | The Maturity Date and Date of admission to trading or date of First trade are not consistent. | RTS field 11, 15 |
| The Expiry Date should a valid date and in a sensible range (no prior than 31-12-1899). | INS-122 | The Expiry Date is not a consistent date. | RTS field 24 |
| The Expiry date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-123 | The Expiry Date and The Date of admission to trading or date of First trade are not consistent. | RTS field 11, 24 |
| Field “Option Type” shall only contain value “PUTO” when the “Instrument Classification” refers to the following CFI Codes: OP\*\*\*\* (Put Options). | INS-124 | Invalid “PUTO” Option Type | RTS field 3, 30 |
| Field “Option Type” shall only contain value “CALL” when the “Instrument Classification” refers to the following CFI Codes: OC\*\*\*\*(Call Options). | INS-125 | Invalid “CALL” Option Type | RTS field 3, 30 |
| The termination date should be populated in case Maturity date/Expiry date is populated and is strictly earlier than the current reporting date. | INS-126 | The Termination date is not populated for an expired/matured instrument. N.B.: that check if failed generates a warning only. | RTS field 12, {15 or 24} |
| The termination date should be earlier or equal in case Expiry date/Maturity date is populated. | INS-127 | The Termination date and Expiry date/Maturity date are not consistent. N.B.: that check if failed generates a warning only. | RTS field 12, {15 or 24} |
| The field listed in Table 1 BRD 43. shall be consistent with the values provided by the Relevant competent Authority. | INS-128 | The following fields are not consistent with the one provided by RCA :<<Upcoming RCA>>, RCA\_MIC :<<MIC>>(<<MIC’s country>>): List of RTS23 number Id of missing field(s)”. N.B.: that check if failed generates a warning only. | RTS fields used for consistency checks as stated in Table 21 - RTS23 Fields table. |
| The currency of the Total issued nominal amount shall be the same as the currency of nominal value | INS-129 | The currency of the Total issued nominal amount is not the same as the currency of nominal value | RTS Field 14. Currency <br> RTS Field 16. |
| The ISIN-MIC combination, received for a cancellation record, should exists in FIRDS DB. | INS-130 | The ISIN-MIC combination, received from a cancellation record, doesn’t exists in FIRDS DB | RTS field 1,6 |
While simple to code and effective in the quality of its responses, the brute-force approach of loading the entire document as context is not a silver bullet. For enterprise use cases with thousands of documents, whose combined size is well above 1 million tokens, chunking remains a necessary evil.
In addition, the tens of thousands of tokens representing the document are consumed on every query as input tokens, adding a significant fixed cost to each call to the LLM.
Using Gemini 2.5 Pro as an example, input tokens are charged at about $1.25 USD per million. For a document of 30,000 tokens, this amounts to a baseline cost of $0.0375 USD per query. Had we used RAG with semantic similarity search instead, the total tokens of the top retrieved chunks would be far fewer than 30,000.
(Note: some providers offer cost savings by caching contexts such as documents, so that including them in the input prompt is much cheaper on subsequent queries.)
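As a quick sanity check, the arithmetic above can be sketched in a few lines. The price of $1.25 per million input tokens and the 30,000-token document are the assumptions from the paragraph above; the retrieved-chunks figure is a purely hypothetical illustration:

```python
# Assumed pricing and sizes from the discussion above (illustrative only).
PRICE_PER_MILLION_INPUT_TOKENS = 1.25  # USD, Gemini 2.5 Pro input tokens
DOC_TOKENS = 30_000                    # whole document passed as context
RAG_TOKENS = 3 * 800                   # hypothetical: top-3 chunks of ~800 tokens each

def input_cost_usd(tokens: int) -> float:
    """Cost of sending `tokens` input tokens at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"Brute force per query: ${input_cost_usd(DOC_TOKENS):.4f}")
print(f"RAG per query:         ${input_cost_usd(RAG_TOKENS):.4f}")
```

At 10,000 queries, that is $375 versus $30 in baseline input cost under these assumptions, before any caching discounts.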
Why contextual chunking
Documents have structures that may not be sufficiently captured by methods such as fixed-size chunking. For example, a typical document may have a header for each section, and because section lengths vary, we face either of the following two problems when deciding the optimal chunk size:
1) Relevancy: shorter sections may have unnecessary information included in their chunks;
2) Completeness: longer sections may be broken up into too many small chunks, resulting in incomplete retrieval.
By chunking according to the characteristics of each document, we can better ensure that the information retrieved for a query is both complete and relevant.
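As a minimal sketch of the idea, assuming documents whose sections begin with numbered headers in the "d.d.d Header" convention mentioned earlier (the regex and helper below are illustrative, not from any library):

```python
import re

# Matches numbered section headers such as "3.3.4 Update the Table"
# at the start of a line ("d.d.d Header" convention; illustrative only).
HEADER_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$", re.MULTILINE)

def split_by_sections(text: str) -> list[str]:
    """Split a document into one chunk per numbered section."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]

doc = """1.1 Scope
This section is short.
1.2 Validation Rules
A long section with a multi-page table stays in a single chunk,
however many lines it spans.
"""
for chunk in split_by_sections(doc):
    print(repr(chunk.splitlines()[0]))
```

Each chunk is exactly one section, however short or long, which is what gives us relevancy for short sections and completeness for long ones.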
Why fixed-size chunking is insufficient
While simple to implement, fixed-size chunking may not sufficiently capture hierarchical relationships, such as a multi-page table that falls under a single section. Using the same FIRDS document as an example, we can implement fixed-size chunking with the following code.
First, let us import the dependencies and use sentence-transformers/all-mpnet-base-v2 as the embedder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the reranker, and "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules" as the query.
import os
from fpdf import FPDF
# LangChain components
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Define constants
file = ["firds_reference_data_functional_specifications_v2.10.pdf"]
current_dir = os.getcwd()
PDF_PATH = os.path.join(current_dir, "resources", file[0])
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
USER_QUERY = "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
The following code does the heavy lifting, which involves:
- Loading the PDF and chunking it into fixed sizes of 800 characters with a chunk overlap of 100.
- Embedding each chunk.
- Storing the chunks in a vector database. We use FAISS in this example.
- Retrieving the top 10 relevant chunks.
- Reranking the retrieved chunks and returning the top 3.
# ==============================================================================
# STEP 1: LOAD AND CHUNK THE DOCUMENT
# ==============================================================================
print("\n--- Step 1: Loading and Chunking PDF ---")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,
)
chunks = text_splitter.split_documents(documents)
print(f"PDF loaded and split into {len(chunks)} chunks.")
# ==============================================================================
# STEP 2: EMBED THE CHUNKS
# ==============================================================================
print(f"\n--- Step 2: Embedding Chunks using '{EMBEDDING_MODEL_NAME}' ---")
# This will download the model from Hugging Face on its first run.
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print("Embedding model loaded.")
# ==============================================================================
# STEP 3: STORE IN A VECTOR DATABASE
# ==============================================================================
print("\n--- Step 3: Storing chunks in FAISS in-memory vector database ---")
# The from_documents method handles embedding and storing in one step.
vector_store = FAISS.from_documents(chunks, embeddings)
print("Chunks embedded and stored in FAISS.")
# ==============================================================================
# STEP 4: RETRIEVE RELEVANT CHUNKS
# ==============================================================================
print("\n--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---")
print(f"\nUser Query: \"{USER_QUERY}\"")
# Retrieve the top 10 most similar chunks
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
print("\n--- Top 10 Retrieved Chunks ---")
initial_results = base_retriever.get_relevant_documents(USER_QUERY)
print(f"\n--- Top 5 Initial Results (from vector search alone) ---")
for i, chunk in enumerate(initial_results[:5], 1):
    print(f"\n--- Initial Result {i} ---\n")
    print(chunk.page_content)
The output below shows the top 5 of the 10 chunks retrieved from FAISS. We can see that the table we are interested in, i.e. Annex 1c, is only ranked third (see "--- Initial Result 3 ---"):
--- Step 1: Loading and Chunking PDF ---
PDF loaded and split into 625 chunks.
--- Step 2: Embedding Chunks using 'sentence-transformers/all-mpnet-base-v2' ---
Embedding model loaded.
--- Step 3: Storing chunks in FAISS in-memory vector database ---
Chunks embedded and stored in FAISS.
--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---
User Query: "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
--- Top 10 Retrieved Chunks ---
--- Top 5 Initial Results (from vector search alone) ---
--- Initial Result 1 ---
ESMA REGULAR USE
33 / 216
Upcoming RCA The country of the Relevant Competent Authority of that instrument, as last
determined by the system for the upcoming publication.
Free-text fields
used for
consistency
checks
“Free-text fields used for consistency checks” fields in “RTS23 Fields table”
as listed in section 6.9 RTS23 Fields table.
Non-free-text
fields used for
consistency
checks
“Non-free-text fields used for consistency checks ” fields in “RTS23 Fields
table” as listed in section 6.9 RTS23 Fields table.
TABLE 6 - FIELDS OF REFERENCE FIELDS TABL E
Finally, as per section 3.3.10, the ESMA system updates, recursively based on the already existing
records, a new table called “Consistent Reference Data Table ” 4 as follows: for each record
--- Initial Result 2 ---
ESMA REGULAR USE
198 / 216
15 Annex 5 ISO reference data tables
15.1 Country reference data table
Field Name M/O Data field
description
Data field
Values
ISO Description Source
CountryCode M 2(a) ISO
3166
The 2-character ISO Country Code
identifier.
• data provider
• ESMA manual update
CountryName M 70(z) The ISO description of the country
name.
• data provider
• ESMA manual update
EEACountryFlag M TRUEFALSE
Indicator TRUE/FALSE Flag which indicates whether the
Country is EEA.
• ESMA manual update
• Default value is FALSE
ValidityStartDate M
Date
YYYYMMDD
Date at which the record becomes
valid Generated by the ESMA System
ValidityEndDate O
Date
YYYYMMDD
Date of which the records ends to be
valid Generated by the ESMA System
LastUpdatedDate M
--- Initial Result 3 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system
Error
code
Error Message
Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI-based validation matrix.
INS-101
The CFI code is not valid
against the CFI based
validation matrix.
RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory fields are reported
according to “CFI-based validations table”.
INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)”.
--- Initial Result 4 ---
ESMA REGULAR USE
35 / 216
In addition, the system shall have mechanisms in place to avoid that interfacing systems needing
access reference data during the 00:00 – 08:00 period retrieve inconsistent data due to ongoing updates
taking place during the post-processing phase. Proposals on the best approach will be expected from
the provider in charge of the technical specifications and development of the system5.
3.3.3 Perform Reference Data Content Validation
Goal The goal of this use case is for individual records within a received file
to be validated by ESMA.
Actors TV/SI (in the jurisdiction of a delegating NCA) - submits data
NCA (not delegating data collection in its jurisdiction) - submits data
The ESMA System – validates data
--- Initial Result 5 ---
as per contained in the Consistent Reference Data T able. In order to ensure security of the data
contained in the Consistent Reference data table, the public user will access a copy of that table, the
publication table, which is updated on daily basis during the publication process.
Ideally, we want our chunk to be returned as the top result, not buried behind less relevant chunks. We can use a reranker to improve the relevance ranking. The following code reranks the 10 retrieved chunks and returns the top 3.
# ==============================================================================
# STEP 5: RERANK RETRIEVED CHUNKS
# ==============================================================================
# The cross-encoder model will be downloaded on the first run.
# It takes the query and a list of documents and returns them, scored and re-ordered.
print(f"\n--- Initializing Reranker with '{RERANKER_MODEL_NAME}' ---")
model = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL_NAME)
reranker = CrossEncoderReranker(model=model, top_n=3)
# Create the full retrieval pipeline with the reranker
# The ContextualCompressionRetriever uses the base retriever to fetch documents
# and then the reranker to re-order them based on relevance.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
print("Reranking pipeline created.")
# Perform the final, reranked search
print("\n--- Performing search with reranking... ---")
reranked_chunks = compression_retriever.get_relevant_documents(USER_QUERY)
print("\n\n=========================================================")
print(f"--- Top 3 Reranked & Most Relevant Chunks ---")
print("=========================================================")
for i, chunk in enumerate(reranked_chunks, 1):
    print(f"\n--- Final Result {i} ---\n")
    print(chunk.page_content)
Based on the output of the top 3 reranked chunks, we can see that the chunk we are looking for is now ranked number 1, which is ideal (see "--- Final Result 1 ---"). However, while the correct chunk has been retrieved, the five-page table has been truncated partway through its third row, i.e. at "List of RTS23 number Id of missing field(s)". If we pass this as context to the LLM, it can only return a truncated table to the user instead of the full table as expected. In other words, relevancy is achieved, but completeness is not.
--- Performing search with reranking... ---
=========================================================
--- Top 3 Reranked & Most Relevant Chunks ---
=========================================================
--- Final Result 1 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system
Error
code
Error Message
Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI-based validation matrix.
INS-101
The CFI code is not valid
against the CFI based
validation matrix.
RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory fields are reported
according to “CFI-based validations table”.
INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)”.
--- Final Result 2 ---
address errors on previous submission.
Business
Rules
Table 33 - Reference Data Content and Consistency Validation Rules.
Assumptions N/A
3.3.4 Update the Received Reference Data Table
Goal
The goal of this use case is to update the Received Reference Data
Table according to a submitted record which passed the content
validation checks.
Actors The ESMA System.
Preconditions The ESMA System has performed the content validation on the submitted
record.
Trigger
The ESMA System has successfully validated the content of the submitted
record.
Postcondition The ESMA System has updated the Received Reference Data Table
according to the submitted record.
Normal Flow
(Referenced
records –
DATINS file
submission)
--- Final Result 3 ---
as per contained in the Consistent Reference Data T able. In order to ensure security of the data
contained in the Consistent Reference data table, the public user will access a copy of that table, the
publication table, which is updated on daily basis during the publication process.
How to implement contextual chunking
I will explore two methods of implementing contextual chunking:
1) Docling: According to the Docling Technical Report, it is "an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget".
2) Google Gemini 2.5 Pro/Flash: We can take advantage of Google Gemini's large context window to understand a single large document in its entirety, and use prompts to ask Gemini to generate metadata about the document which we can then use to chunk the document. While this method is relatively more expensive than specialised frameworks/models such as Docling, it is also more flexible, hence ideal for fast iteration and prototyping.
Docling
Docling offers a fully integrated solution that can parse, chunk, embed and ingest into a vector database.
We will focus on how well Docling can parse PDFs. Docling parses documents into a unified document representation called DoclingDocument, which captures content such as body text and headers, as well as layout information such as bounding boxes. The code to parse a PDF is below.
First, let us import the dependencies that we need for Docling.
# Import dependencies
import os
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, FormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
import time
from pathlib import Path
Next, let us list all the files we have in the resources folder; currently it contains only the single file firds_reference_data_functional_specifications_v2.10.pdf.
# List of files
current_dir = os.getcwd()
data_dir = os.path.join(current_dir, "resources")
files = os.listdir(data_dir)
We can then define the pipeline for Docling. Because our focus is on parsing tables, the key settings below enable Docling to parse tables properly.
# Pipeline configs
accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.accelerator_options = accelerator_options
Next, we shall set up the converter for Docling:
# Setup converter
converted = DocumentConverter(
allowed_formats=[InputFormat.PDF],
format_options={
InputFormat.PDF: FormatOption(
pipeline_cls=StandardPdfPipeline,
pipeline_options=pipeline_options,
backend=PyPdfiumDocumentBackend
)
}
)
Finally, we will parse the file from PDF into markdown. I am using an Nvidia RTX 4070 Super with 12GB VRAM, and it took about 76.35 seconds to parse the document. It may take longer if the pipeline is not GPU-enabled.
# Begin parsing
for file in files:
pdf_path = os.path.join(data_dir, file)
# Check if file exists
if not os.path.exists(pdf_path):
print(f"Error: File '{pdf_path}' does not exist.")
exit(1)
print(f"Parsing file '{pdf_path}'...")
start_time = time.time()
print("Converting PDF to text...")
conv_res = converted.convert(pdf_path)
print("Converting done.")
output_dir = Path("parsed")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_res.input.file.stem
# Save markdown
md_filename = output_dir / f"{doc_filename}.md"
conv_res.document.save_as_markdown(md_filename)
end_time = time.time() - start_time
print(f"Parsing done. Time elapsed: {end_time:.2f} seconds.")
If you are pulling my git repo, you may find the parsed markdown in the folder parsed. We can see that Docling is able to capture information from a complex technical document, such as headers, and represent structures such as tables accurately as markdown tables. However, Docling is not perfect in handling relatively more complex data such as multi-page tables, which are broken into multiple tables when they span across pages.
In addition, Docling also occasionally confuses page headers with section headers, which presents a significant problem because we will rely heavily on section headers as dividers for chunking. Using the example below for reference, only "## 9 Annex 1c: Reference Data Content and Consistency Validation Rules" should be regarded as a section header; "## ESMA REGULAR USE" should not be treated as one. As a result of mis-classifying "## ESMA REGULAR USE" as a section header, further post-processing is necessary before we can use section headers as the basis for chunking.
## 9 Annex 1c: Reference Data Content and Consistency Validation Rules
| Control executed by the system | Error code | Error Message | Concerned Fields |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI - based validation matrix. | INS - 101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS - 102 | The following mandatory fields are not reported: “ List of RTS23 number Id of missing field(s)” . | RTS field 3 vs all other RTS fields |
| Check that Non - Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS - 103 | The following Non Applicable fields are wrongly reported: “List of RTS23 number Id of N/A field(s)” . | All RTS fields |
| The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. | The following checks are performed only in case checks above are passed. |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL | INS - 105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS - 107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity RTS field 6 |
<!-- image -->
## ESMA REGULAR USE
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL . | INS - 108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|---------------------------------------------------------------|--------------------|
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS - 109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE). | INS - 110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is | INS - 111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND | INS - 112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| "Pending transfer", "Pending archival}. The value of the “Direct Underlying issuer ” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND | INS - 113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
<!-- image -->
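One possible post-processing step is to strip known page headers that were mis-tagged as section headers before chunking. This is a sketch; the NOISE_HEADERS list is an assumption you would build per document (here seeded with the "ESMA REGULAR USE" page header from the example above):

```python
import re

# Page-header strings mis-tagged as section headers (assumed list for this document).
NOISE_HEADERS = {"ESMA REGULAR USE"}

def strip_noise_headers(markdown: str) -> str:
    """Remove markdown headings whose text matches a known page header."""
    kept = []
    for line in markdown.splitlines():
        match = re.match(r"^#+\s+(.*)$", line)
        if match and match.group(1).strip() in NOISE_HEADERS:
            continue  # drop the mis-classified page header
        kept.append(line)
    return "\n".join(kept)
```

After this pass, only genuine section headings such as "## 9 Annex 1c: ..." remain as chunk dividers.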
Google Gemini 2.5 Pro/Flash
We now revisit the use of Gemini by having it generate metadata about the document, which only needs to happen once. For example, we can use the following prompt to request Gemini to extract the section headers, and the page number representing the start of each section.
First, let us set up the environment with the API key:
from dotenv import load_dotenv
import os
# --- DEBUGGING STEP ---
# Print the current working directory to see where Python is looking.
print(f"Current working directory: {os.getcwd()}")
# Load environment variables from the .env file
load_dotenv()
# Get the API key from the environment variables
# The string "GENAI_API_KEY" must match the variable name in your .env file
api_key = os.getenv("GENAI_API_KEY")
# Check if the API key is loaded correctly
if not api_key:
raise ValueError("No API key found. Please set the GENAI_API_KEY in your .env file.")
Next, we create a few helper functions to read and extract the PDF:
def read_pdf_as_bytes(file_path):
try:
with open(file_path, "rb") as file:
pdf_bytes = file.read()
return pdf_bytes
except FileNotFoundError:
print(f"Error: File '{file_path}' not found.")
return None
def extract_text_from_pdf(pdf_path):
    # Returns a tuple of (raw PDF bytes, per-page extracted text).
    # The raw bytes are later passed to Gemini; the per-page text is used
    # when assembling the logical chunks.
    try:
        try:
            from PyPDF2 import PdfReader
        except ImportError:
            from pypdf import PdfReader
        page_text = []
        # Open and read PDF
        with open(pdf_path, "rb") as file:
            pdf_reader = PdfReader(file)
            for page_num, page in enumerate(pdf_reader.pages):
                text = page.extract_text()
                page_text.append({"page": page_num + 1, "text": text})
        return read_pdf_as_bytes(pdf_path), page_text
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return None, []
The section below is key -- it contains the prompt for Gemini to extract the section headers and their corresponding page numbers. You may have to tweak this prompt slightly to suit each document's unique structure and header naming convention.
import json
from google import genai
from google.genai import types
def get_section_map_from_gemini(full_text):
print("Asking Gemini to identify the document structure...")
prompt = """
You are a technical document parser. Your task is to analyse the provided text from a PDF.
Identify all the file specification sections. A section typically starts with a pattern like "d.dd XXXXX", "d.d XXXXXX", "d XXXXXXX", or "d Annex dd: XXXXXXXX". These section headers are bolded.
Extract the following for each section found:
1. The full section title (e.g., '6.11 Rejection statistics table').
2. The page number where the section title appears.
Return the result as a JSON array of objects. Each object should have two keys: 'section_title' and 'start_page'.
Ensure the page number is an integer.
Example of a single JSON object in the array:
{
"section_title": "6.11 Rejection statistics table",
"start_page": 10
}
"""
client = genai.Client(api_key=api_key)
response = client.models.generate_content(
model="gemini-2.5-pro",
config={
'temperature': 0.0,
'response_mime_type': 'application/json'
},
contents=[
types.Part.from_bytes(
data=full_text,
mime_type='application/pdf'
),
prompt
]
)
try:
section_map = json.loads(response.text)
print(f"Gemini successfully identified {len(section_map)} sections.")
return section_map
except json.JSONDecodeError:
print("Error: Gemini did not return a valid JSON response.")
print(response.text)
return None
Next, the following helper function splits the PDF into logical chunks:
def create_logical_chunks(page_texts, section_map):
print("Creating logical chunks based on the section map...")
text_by_page = {p["page"]: p["text"] for p in page_texts}
chunks = []
sorted_sections = sorted(section_map, key=lambda x: x["start_page"])
for i, section in enumerate(sorted_sections):
start_page = section["start_page"]
section_title = section["section_title"]
end_page = None
if i + 1 < len(sorted_sections):
end_page = sorted_sections[i + 1]["start_page"]
if end_page is None or end_page < start_page:
end_page = len(page_texts)
chunk_text = ""
# we use end_page + 1 to overlap with one additional page, to handle the case where a single page has 2 sections
for page_num in range(start_page, end_page + 1):
if page_num in text_by_page:
chunk_text += text_by_page[page_num] + "\n"
# Clean up the chunk: find the start of the current section text
title_pos = chunk_text.find(section_title)
if title_pos != -1:
chunk_text = chunk_text[title_pos:]
# Create LangChain Document object
doc = Document(
page_content=chunk_text.strip(),
metadata={
"section_title": section_title,
"start_page": start_page,
"end_page": end_page
}
)
chunks.append(doc)
print(f"Created {len(chunks)} logical chunks.")
return chunks
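To see why the one-page overlap (end_page + 1) and the title-trim step matter, here is a minimal stand-alone illustration of the same logic using a toy two-page document and plain dicts instead of LangChain Document objects (the page contents and section titles are invented):

```python
page_texts = [
    {"page": 1, "text": "1 Intro\nSome intro text.\n2 Scope\nScope starts here."},
    {"page": 2, "text": "Scope continues."},
]
section_map = [
    {"section_title": "1 Intro", "start_page": 1},
    {"section_title": "2 Scope", "start_page": 1},  # two sections share page 1
]

text_by_page = {p["page"]: p["text"] for p in page_texts}
sections = sorted(section_map, key=lambda s: s["start_page"])
chunks = []
for i, section in enumerate(sections):
    start = section["start_page"]
    end = sections[i + 1]["start_page"] if i + 1 < len(sections) else len(page_texts)
    if end < start:
        end = len(page_texts)
    # Overlap one extra page so a section starting mid-page is not cut off.
    text = "\n".join(text_by_page[p] for p in range(start, end + 1) if p in text_by_page)
    pos = text.find(section["section_title"])
    if pos != -1:
        text = text[pos:]  # trim everything before this section's own title
    chunks.append({"title": section["section_title"], "text": text})
```

Without the overlap, the "2 Scope" chunk would end at page 1 and lose "Scope continues."; without the title-trim, it would begin with the tail of "1 Intro". The trade-off is some duplication: the "1 Intro" chunk still carries the start of "2 Scope".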
Finally, we can generate the metadata for the document, extracting the section headers and their corresponding page numbers:
# ==============================================================================
# STEP 1: LOAD AND CHUNK THE DOCUMENT
# ==============================================================================
print("\n--- Step 1: Loading and Chunking PDF ---")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
parsed_dir = "parsed"
os.makedirs(parsed_dir, exist_ok=True)
section_map_path = os.path.join(parsed_dir, "section_map.json")
full_doc_text, pages = extract_text_from_pdf(PDF_PATH)
if os.path.exists(section_map_path):
with open(section_map_path, "r") as f:
section_map = json.load(f)
print("Loaded existing section map.")
else:
section_map = get_section_map_from_gemini(full_doc_text)
if section_map:
with open(section_map_path, "w") as f:
json.dump(section_map, f, indent=2)
print("Saved section map to section_map.json.")
Using the prompt we provided above, Gemini generated the following metadata (relevant extract below). We can see that Gemini correctly identified "9 Annex 1c: Reference Data Content and Consistency Validation Rules" as the section header, and it did not confuse page headers with section headers, which is a significant improvement over Docling. If you pulled the GitHub repo, you may find this metadata in the parsed folder, in the file section_map.json.
{
"section_title": "8 Annex 1b: Format Validation Rules",
"start_page": 184
},
{
"section_title": "9 Annex 1c: Reference Data Content and Consistency Validation Rules",
"start_page": 185
},
{
"section_title": "10 Annex 1d: Non-working Days Content Validation Rules",
"start_page": 190
}
Given that Gemini is able to generate the metadata properly, we can now chunk the PDF according to the metadata, embed the chunks, and ingest the embeddings into a vector database. Note that from this point onwards, no LLM is required, hence the only significant cost involved is using Gemini to generate the metadata, which is a one-time cost.
# ==============================================================================
# STEP 1: CHUNK THE PDF
# ==============================================================================
chunks = create_logical_chunks(pages, section_map)
print(f"PDF loaded and split into {len(chunks)} chunks.")
# ==============================================================================
# STEP 2: EMBED THE CHUNKS
# ==============================================================================
print(f"\n--- Step 2: Embedding Chunks using '{EMBEDDING_MODEL_NAME}' ---")
# This will download the model from Hugging Face on its first run.
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print("Embedding model loaded.")
# ==============================================================================
# STEP 3: STORE IN A VECTOR DATABASE
# ==============================================================================
print("\n--- Step 3: Storing chunks in FAISS in-memory vector database ---")
# The from_documents method handles embedding and storing in one step.
vector_store = FAISS.from_documents(chunks, embeddings)
print("Chunks embedded and stored in FAISS.")
# ==============================================================================
# STEP 4: RETRIEVE RELEVANT CHUNKS
# ==============================================================================
print("\n--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---")
print(f"\nUser Query: \"{USER_QUERY}\"")
# Retrieve the top 10 most similar chunks
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
print("\n--- Top 10 Retrieved Chunks ---")
initial_results = base_retriever.get_relevant_documents(USER_QUERY)
print(f"\n--- Top 5 Initial Results (from vector search alone) ---")
for i, chunk in enumerate(initial_results[:5], 1):
print(f"\n--- Initial Result {i} ---\n")
print(chunk.page_content)
The top 5 matched chunks are below. We can see that the chunk we are looking for is under "--- Initial Result 5 ---", and unlike our previous approach with fixed-size chunking, this time we are able to retrieve the full table spanning all 5 pages.
Creating logical chunks based on the section map...
Created 171 logical chunks.
PDF loaded and split into 171 chunks.
--- Step 2: Embedding Chunks using 'sentence-transformers/all-mpnet-base-v2' ---
Embedding model loaded.
--- Step 3: Storing chunks in FAISS in-memory vector database ---
Chunks embedded and stored in FAISS.
--- Step 4: Retrieving Top 10 Chunks via Similarity Search ---
User Query: "Show me the entire table of Annex 1c: Reference Data Content and Consistency Validation Rules"
--- Top 10 Retrieved Chunks ---
--- Top 5 Initial Results (from vector search alone) ---
--- Initial Result 1 ---
ESMA REGULAR USE
198 / 216
15 Annex 5 ISO reference data tables
15.1 Country reference data table
Field Name M/O Data field
description Data field
Values ISO Description Source
CountryCode M 2(a) ISO
3166 The 2-character ISO Country Code
identifier. • data provider
• ESMA manual update
CountryName M 70(z) The ISO description of the country
name. • data provider
• ESMA manual update
EEACountryFlag M TRUEFALSE
Indicator TRUE/FALSE Flag which indicates whether the
Country is EEA. • ESMA manual update
• Default value is FALSE
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to be
valid Generated by the ESMA System
LastUpdatedDate M DateTime
YYYYMMDD
HH:MI:SS Date at which the record was last
updated Generated by the ESMA System
--- Initial Result 2 ---
15.4 List of valid CFI codes table
Field Name M/O Data field description Data field description
Values ISO Description Source
CFI code M 6(a)
10962 The CFI Code Updated manually by the
ESMA Business Administrator
ValidityStartDate M Date
YYYYMMDD Date at which the record
becomes valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records
ends to be valid Generated by the ESMA
System
TABLE 45 - LIST OF VALID CFI CODES TABLE
ESMA REGULAR USE
203 / 216
15.5 LEI reference data table
That table contains LEI records, including historical records, composed of all LEI attributes described in http://www.leiroc.org/publications/gls/lou_20140620.pdf . Only the fields
relevant for the COU files will be retained. In addition, for each LEI record, two technical attributes are to be appended (in order to manage history) :
Field Name M/O Data field description Data field description
Values ISO Description Source
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to
be valid Generated by the ESMA
System
TABLE 46 - TECHNICAL ATTRIBUTES OF LEI REFERENCE DATA TABL E
--- Initial Result 3 ---
6.1 Reporting Files Table
Field Name M/O Data field description Data field Values ISO Description Source
FileName [PK] M 5(a)_6(a)_5(a)_5(a) -6(n)_2(a) Name of a submitted file excluding
HUBEX/HUBDE timestamp ESMA System
ESMA reception
Date Time M YYYYMMDDHHMMSS The timestamp in the name of a
submitted file ESMA System
TABLE 13 - REPORTING FILES TABLE
ESMA REGULAR USE
163 / 216
6.2 NCA reference data table
Field Name M/O Data field
description Data field
Values ISO Description Source
Country Code M 2(a) ISO 3166 -
Country Code 3166 The 2-character ISO Country Code identifier. Updated by ESMA IT administrator
from registration process
AuthorityName M 30(x) The official name of the NCA Updated by ESMA business
administrator from registration process
Address M 250(z) The address of the NCA Updated by ESMA business
administrator from registration process
Generic
EmailAddress O The email address to be used for the RCA
change process es. Updated by ESMA business
administrator from registration process
Contact.
Name M 250(z) The name of the contact Updated by ESMA business
administrator from registration process
Contact.
EmailAddress M The email address of the contact Updated by ESMA business
administrator from registration process
Contact.
PhoneNumber M The phone Number of the contact Updated by ESMA business
administrator from registration process
Level of
delegation M 1(a) N/C/T N in case Non-delegating NCA
C in case NCA delegating data collection and
transparency calculations
T in case NCA delegating transparency
calculations but not data collection in their
jurisdiction Updated by ESMA business
administrator from registration process
Withdrawn flag M TRUEFALSE
Indicator Flag which indicates whether the NCA is
withdrawn from the system Updated by ESMA business
administrator from registration process
TABLE 14 - NCA REFERENCE DATA TABL E
--- Initial Result 4 ---
15.5 LEI reference data table
That table contains LEI records, including historical records, composed of all LEI attributes described in http://www.leiroc.org/publications/gls/lou_20140620.pdf . Only the fields
relevant for the COU files will be retained. In addition, for each LEI record, two technical attributes are to be appended (in order to manage history) :
Field Name M/O Data field description Data field description
Values ISO Description Source
ValidityStartDate M Date
YYYYMMDD Date at which the record becomes
valid Generated by the ESMA
System
ValidityEndDate O Date
YYYYMMDD Date of which the records ends to
be valid Generated by the ESMA
System
TABLE 46 - TECHNICAL ATTRIBUTES OF LEI REFERENCE DATA TABL E
ESMA REGULAR USE
204 / 216
16 Annex 6 Scenarios of Instrument reference data reporting and distribution
The system shall ensure compliance with the following scenarios.
16.1 Modified instrument reported on time
--- Initial Result 5 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
ESMA REGULAR USE
184 / 216
The Strike Price Currency Code shall exist
as an active ISO 4217 Currency Code in the
currency reference data table (based on
records with ValidityEndDate is NULL . INS-108 The Strike Price
Currency Code is
incorrect. RTS field 32
The Notional Currency 1 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDat e is NULL or PreEuroFlag is
TRUE ). INS-109 The Notional Currency 1
Code is incorrect. RTS field 13
The Notional Currency 2 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-110 The Notional Currency 2
Code is incorrect. RTS field 42, 47
The Currency of nominal value shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-111 The Currency of nominal
value is incorrect. RTS field 1 6
The value of the “Issuer Identifier” shall exist
in the LEI reference table and comply with
the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-112 The LEI provided for
“Issuer Identifier” is
invalid. RTS field 5
The value of the “ Direct Underlying issuer ”
shall exist in the LEI reference table and
comply with the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-113 The LEI provided for
“Direct Underlying Issuer”
is invalid. RTS field 27a, 27b
ESMA REGULAR USE
185 / 216
Check the last digit of the ISIN code of the
“instrument identification code” according to
the algorithm of ISIN validation .19 INS-114 The ISIN code of the
instrument identification
code is invalid. RTS field 1
Check the last digit of the ISIN code of the
“underlying instrument” should be valid
according to the algorithm of ISIN
validation .20 INS-115 The ISIN code of the
underlying is invalid. RTS field 26a, 26b, 26c
Check the last digit of the ISIN code of the
Identifier of the “Index/Benchmark of a
floating rate Bond” should be valid according
to the algorithm of ISIN validation.21 INS-116 The ISIN code of the
Index/Benchmark of a
floating rate Bond is
invalid. RTS field 19
The “Date of admission to trading or date of
First trade” should a valid date and in a
sensible range (no prior than 31 -12-189922). INS-117 The “Date of admission
to trading or date of First
trade” is not a consistent
date. RTS field 11
The Termination Date should a valid date
and in a sensible range (no prior than 31 -12-
189923). INS-118 The Termination Date is
not a consistent date. RTS field 12
The Termination Date should be equal to or
later than the “Date of admission to trading
or date of First trade”. INS-119 The Termination Date is
earlier than the “Date of
admission to trading or
date of First trade”. RTS field 11, 12
The Maturity Date should a valid date and in
a sensible range (no prior than 31 -12-
189924). INS-120 The Maturity Date is not
a consistent date. RTS field 15
The Maturity Date should b e equal to or later
than “Date of admission to trading or date of
First trade”. INS-121 The Maturity Date and
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 15
19 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
20 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
21 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
22 The oldest ins trument traded according to RDS System database. That date must be configurable.
23 The oldest instrument traded according to RDS System database. That date must be configurable.
24 The oldest instrument traded according to RDS System database. That date m ust be configurable.
ESMA REGULAR USE
186 / 216
The Expiry Date should a valid date and in a
sensible range (no prior than 31 -12-189925). INS-122 The Expiry Date is not a
consistent date. RTS field 24
The Expiry date should be equal to or later
than the “Date of admission to trading or
date of First trade”. INS-123 The Expiry Date and The
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 24
Field “Option Type” shall only contain value
“PUTO” when the “Instrument Classification”
refers to the following CFI Codes: OP****
(Put Options). INS-124 Invalid “PUTO” Option
Type RTS field 3, 30
Field “Option Type” shall only contain value
“CALL” when the “Instrument Classification”
refers to the following CFI Codes: OC****
(Call Options). INS-125 Invalid “CALL” Option
Type RTS field 3, 30
The termination date should be populated in
case Maturity date/Expiry date is populated
and is strictly earlier than the current
reporting date. INS-126 The Termination date is
not populated for an
expired/matured
instrument.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The termination date should be earlier or
equal in case Expiry date/Maturity date is
populated.
INS-127 The Termination date
and Expiry date/Maturity
date are not consistent.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The field listed in Table 1 BRD 43. shall be
consistent with the values provided by the
Relevant competent Authority.26
INS-128 The following fields are
not consistent with the
one provided by RCA
:<<Upcoming RCA>> ,
RCA_ MIC
:<<MIC>> (<<MIC’s
country>> ): List of RTS23 RTS field s used for consistency
checks as stated in Table 21 -
RTS23 Fields table .
25 The oldest instrument traded according to RDS System database. That date must be configurable.
26 Generated during the consistency checks.
ESMA REGULAR USE
187 / 216
number Id of missing
field(s)”.
N.B.: that check if failed
generates a warning
only.
The currency of the Total issued nominal
amount shall be the same as the currency of
nominal value INS-129 The currency of the Total
issued nominal amount is
not the same as the
currency of nominal value RTS Field 14. Currency
RTS Field 16.
The ISIN-MIC combination, received for a
cancellation record , should exists in FIRDS
DB. INS-130 The ISIN-MIC
combination, received
from a cancellation
record , doesn’t exists in
FIRDS DB RTS field 1,6
TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES
ESMA REGULAR USE
188 / 216
10 Annex 1d: Non -working Days Content Validation
Rules
Control executed by the system Error code Error Message
If the non -working day is provided for a Market TV/SI
(NonWorkgDay/Id/ MktIdCd is populated ): the system
check s that the MIC exists in the Reporting Flow View
under “TV / SI MIC”, and that there exists a line in the
Reporting Flow View which maps this “TV / SI MIC”
with “Reporting Entity” documented in the
RptHdr/RptgNtty
If the non -working day is provided for an APA or CTP
(NonWorkgDay/Id/ Othr/Id is populated ): the system
check s that the identification code under Other/Id
exists in the Reporting Flow View under “Reporting
Entity” and is the same as the entity reported under
RptHdr/RptgNtty/Id/Othr
NWD -001 The TV/SI/APA/CTP identified under
NonWorkgDay/Id is not registered at
ESMA or is not consistent with the
reporting entity in the header.
In case the identification code of the record is a NCA27,
that code shall exist in the NCA reference data table in
the Registers system and must be equal to the
Reporting Entity identifier in the header of the XML file. NWD -002 The NCA identified by the “Trading
Venue identification code” field is not
registered at ESMA or is not equal to
the reporting entity in the header.
The Non -working Date of a record should be a valid
date. NWD -003 This date does not exist .
TABLE 34 - NON-WORKING DAYS CONTENT VALIDATION RULES28
11 Reminder Message code and description
Code Code description
RMD -001 No file has been submitted to ESMA on the day <<current reporting date>> or was
submitted after the cut -off time.
RMD -002 The instrument was not reported on the day <<current reporting date>> or was reported
after the cut -off time.
TABLE 35 - REMINDER MESSAGE CODE AND DESC RIPTION
27 Used in case the non -working day refers to an NCA
To improve the relevance of the retrieved chunks so that the expected chunk appears right at the top, we can now apply our reranker.
# ==============================================================================
# STEP 5: RERANK RETRIEVED CHUNKS
# ==============================================================================
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# The cross-encoder model will be downloaded on the first run.
# It takes the query and a list of documents and returns them, scored and re-ordered.
print(f"\n--- Initializing Reranker with '{RERANKER_MODEL_NAME}' ---")
model = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL_NAME)
reranker = CrossEncoderReranker(model=model, top_n=3)

# 5d. Create the full retrieval pipeline with the reranker
# The ContextualCompressionRetriever uses the base retriever to fetch documents
# and then the reranker to re-order them based on relevance.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
print("Reranking pipeline created.")

# 5e. Perform the final, reranked search
print("\n--- Performing search with reranking... ---")
reranked_chunks = compression_retriever.get_relevant_documents(USER_QUERY)

print("\n\n=========================================================")
print("--- Top 3 Reranked & Most Relevant Chunks ---")
print("=========================================================")
for i, chunk in enumerate(reranked_chunks, 1):
    print(f"\n--- Final Result {i} ---\n")
    print(chunk.page_content)
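Conceptually, the reranking step does just three things: score every (query, chunk) pair with a joint model, sort by score, and keep the top n. The following dependency-free sketch illustrates that pattern; the word-overlap scorer is a hypothetical stand-in for the real cross-encoder, used here only to make the mechanics visible.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, chunk) pair, sort descending, keep the best top_n."""
    scored = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return scored[:top_n]

chunks = [
    "Annex 1d: Non-working Days Content Validation Rules",
    "Annex 1c: Reference Data Content and Consistency Validation Rules",
    "Reminder Message code and description",
    "Annex 1b: Format Validation Rules",
]
query = "reference data content and consistency validation rules"
print(rerank(query, chunks, top_n=2))
```

A real cross-encoder replaces `overlap_score` with a transformer that reads the query and chunk together, which is why it ranks so much more accurately than the embedding similarity used for the initial retrieval.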
With the reranker applied, our expected chunk now appears as "--- Final Result 1 ---", and the full multi-page table is properly retrieved.
--- Initializing Reranker with 'cross-encoder/ms-marco-MiniLM-L-6-v2' ---
Reranking pipeline created.
--- Performing search with reranking... ---
=========================================================
--- Top 3 Reranked & Most Relevant Chunks ---
=========================================================
--- Final Result 1 ---
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
ESMA REGULAR USE
184 / 216
The Strike Price Currency Code shall exist
as an active ISO 4217 Currency Code in the
currency reference data table (based on
records with ValidityEndDate is NULL . INS-108 The Strike Price
Currency Code is
incorrect. RTS field 32
The Notional Currency 1 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDat e is NULL or PreEuroFlag is
TRUE ). INS-109 The Notional Currency 1
Code is incorrect. RTS field 13
The Notional Currency 2 Code shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-110 The Notional Currency 2
Code is incorrect. RTS field 42, 47
The Currency of nominal value shall exist as
an ISO 4217 Currency Code in the currency
reference table (based on records which
ValidityEndDate is NULL or PreEuroFlag is
TRUE ). INS-111 The Currency of nominal
value is incorrect. RTS field 1 6
The value of the “Issuer Identifier” shall exist
in the LEI reference table and comply with
the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-112 The LEI provided for
“Issuer Identifier” is
invalid. RTS field 5
The value of the “ Direct Underlying issuer ”
shall exist in the LEI reference table and
comply with the following conditions:
(
ValidityEndDate is NULL
OR
date of termination of the respective record is
between any period specified by
ValidityStartDate and ValidityEnddate in LEI
reference table for this LEI
)
AND
register status in {“Issued", "Lapsed",
"Pending transfer", "Pending archival}. INS-113 The LEI provided for
“Direct Underlying Issuer”
is invalid. RTS field 27a, 27b
ESMA REGULAR USE
185 / 216
Check the last digit of the ISIN code of the
“instrument identification code” according to
the algorithm of ISIN validation .19 INS-114 The ISIN code of the
instrument identification
code is invalid. RTS field 1
Check the last digit of the ISIN code of the
“underlying instrument” should be valid
according to the algorithm of ISIN
validation .20 INS-115 The ISIN code of the
underlying is invalid. RTS field 26a, 26b, 26c
Check the last digit of the ISIN code of the
Identifier of the “Index/Benchmark of a
floating rate Bond” should be valid according
to the algorithm of ISIN validation.21 INS-116 The ISIN code of the
Index/Benchmark of a
floating rate Bond is
invalid. RTS field 19
The “Date of admission to trading or date of
First trade” should a valid date and in a
sensible range (no prior than 31 -12-189922). INS-117 The “Date of admission
to trading or date of First
trade” is not a consistent
date. RTS field 11
The Termination Date should a valid date
and in a sensible range (no prior than 31 -12-
189923). INS-118 The Termination Date is
not a consistent date. RTS field 12
The Termination Date should be equal to or
later than the “Date of admission to trading
or date of First trade”. INS-119 The Termination Date is
earlier than the “Date of
admission to trading or
date of First trade”. RTS field 11, 12
The Maturity Date should a valid date and in
a sensible range (no prior than 31 -12-
189924). INS-120 The Maturity Date is not
a consistent date. RTS field 15
The Maturity Date should b e equal to or later
than “Date of admission to trading or date of
First trade”. INS-121 The Maturity Date and
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 15
19 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
20 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
21 See Formula for computing modulus 10 "Double -Add-Double" check digit as per ISO 6166 specifications .
22 The oldest ins trument traded according to RDS System database. That date must be configurable.
23 The oldest instrument traded according to RDS System database. That date must be configurable.
24 The oldest instrument traded according to RDS System database. That date m ust be configurable.
ESMA REGULAR USE
186 / 216
The Expiry Date should a valid date and in a
sensible range (no prior than 31 -12-189925). INS-122 The Expiry Date is not a
consistent date. RTS field 24
The Expiry date should be equal to or later
than the “Date of admission to trading or
date of First trade”. INS-123 The Expiry Date and The
Date of admission to
trading or date of First
trade are not consistent. RTS field 11, 24
Field “Option Type” shall only contain value
“PUTO” when the “Instrument Classification”
refers to the following CFI Codes: OP****
(Put Options). INS-124 Invalid “PUTO” Option
Type RTS field 3, 30
Field “Option Type” shall only contain value
“CALL” when the “Instrument Classification”
refers to the following CFI Codes: OC****
(Call Options). INS-125 Invalid “CALL” Option
Type RTS field 3, 30
The termination date should be populated in
case Maturity date/Expiry date is populated
and is strictly earlier than the current
reporting date. INS-126 The Termination date is
not populated for an
expired/matured
instrument.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The termination date should be earlier or
equal in case Expiry date/Maturity date is
populated.
INS-127 The Termination date
and Expiry date/Maturity
date are not consistent.
N.B.: tha t check if failed
generates a warning
only. RTS field 12, {15 or 24}
The field listed in Table 1 BRD 43. shall be
consistent with the values provided by the
Relevant competent Authority.26
INS-128 The following fields are
not consistent with the
one provided by RCA
:<<Upcoming RCA>> ,
RCA_ MIC
:<<MIC>> (<<MIC’s
country>> ): List of RTS23 RTS field s used for consistency
checks as stated in Table 21 -
RTS23 Fields table .
25 The oldest instrument traded according to RDS System database. That date must be configurable.
26 Generated during the consistency checks.
ESMA REGULAR USE
187 / 216
number Id of missing
field(s)”.
N.B.: that check if failed
generates a warning
only.
The currency of the Total issued nominal
amount shall be the same as the currency of
nominal value INS-129 The currency of the Total
issued nominal amount is
not the same as the
currency of nominal value RTS Field 14. Currency
RTS Field 16.
The ISIN-MIC combination, received for a
cancellation record , should exists in FIRDS
DB. INS-130 The ISIN-MIC
combination, received
from a cancellation
record , doesn’t exists in
FIRDS DB RTS field 1,6
TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES
ESMA REGULAR USE
188 / 216
10 Annex 1d: Non -working Days Content Validation
Rules
Control executed by the system Error code Error Message
If the non -working day is provided for a Market TV/SI
(NonWorkgDay/Id/ MktIdCd is populated ): the system
check s that the MIC exists in the Reporting Flow View
under “TV / SI MIC”, and that there exists a line in the
Reporting Flow View which maps this “TV / SI MIC”
with “Reporting Entity” documented in the
RptHdr/RptgNtty
If the non -working day is provided for an APA or CTP
(NonWorkgDay/Id/ Othr/Id is populated ): the system
check s that the identification code under Other/Id
exists in the Reporting Flow View under “Reporting
Entity” and is the same as the entity reported under
RptHdr/RptgNtty/Id/Othr
NWD -001 The TV/SI/APA/CTP identified under
NonWorkgDay/Id is not registered at
ESMA or is not consistent with the
reporting entity in the header.
In case the identification code of the record is a NCA27,
that code shall exist in the NCA reference data table in
the Registers system and must be equal to the
Reporting Entity identifier in the header of the XML file. NWD -002 The NCA identified by the “Trading
Venue identification code” field is not
registered at ESMA or is not equal to
the reporting entity in the header.
The Non -working Date of a record should be a valid
date. NWD -003 This date does not exist .
TABLE 34 - NON-WORKING DAYS CONTENT VALIDATION RULES28
11 Reminder Message code and description
Code Code description
RMD -001 No file has been submitted to ESMA on the day <<current reporting date>> or was
submitted after the cut -off time.
RMD -002 The instrument was not reported on the day <<current reporting date>> or was reported
after the cut -off time.
TABLE 35 - REMINDER MESSAGE CODE AND DESC RIPTION
27 Used in case the non -working day refers to an NCA
--- Final Result 2 ---
8 Annex 1b: Format Validation Rules
Initial data validation is done to confirm file sent by the Submitting Entity can be processed. This
includes whether the file can be uncompressed , conforms to expected XSD schema and common file
identifiers are valid.
Possible Errors encountered are:
Error
code Error Message Control
Feedback messages related to file validation
FIL-104 The ISO 20022 Message Identifier in the
BAH (*.xsd) is not valid. The ISO 20022 Message Identifier in the
BAH must refer to the latest schema
approved by ITMG.
FIL-105 The file structure does not correspond to the
XML schema: [result of XML validation]. Validate that the file sent fits to the
corresponding XML schema. For information
purposes, if there is an error in the validation,
the error message produced by the XML
parser is displayed in place of [result of XML
validation].
FIL-106 The Reporting Entity is not registered at
ESMA or the Submitting Entity shall not
submit this data . Validate the file as follows:
1) Extracts from Table 19 - Reporting Flow
view the Submitting entity identification
associated to the Reporting entity
identifier code in the Reporting header of
the submitted file.
2) Checks that the Submitting entity
identification extract ed in step 1 is equal
to the sender code of the submitted file .
FIL-107 File <Filename> has already been submitted
once. When a file is received, the system checks
whether it exists in the Reporting Files Table
as described in Table 13 - Reporting Files
table a record which filename is composed of
the same sender, filetype, recipient, Key1,
Key2 Year.
TABLE 32 - FORMAT VALIDATION RULES
ESMA REGULAR USE
183 / 216
9 Annex 1c: Reference Data Content and
Consistency Validation Rules
Control executed by the system Error
code Error Message Concerned
Fields
The value of “Instrument Classification” shall
be a valid ISO 10962 code and shall be
covered by at least one of the CFI constructs
in the CFI -based validation matrix. INS-101
The CFI code is not valid
against the CFI based
validation matrix. RTS field 3 against the list of
valid CFI codes table and
against the list of CFI Construct
(Primary Key) in the CFI based
validation table
Check that Mandatory field s are reported
according to “CFI-based validations table”. INS-102 The following mandatory
fields are not reported:
“List of RTS23 number Id
of missing field(s)” . RTS field 3 vs a ll other RTS
fields
Check that Non-Applicable fields (N/A) are
not reported according to “CFI-based
validations table”. INS-103 The following Non-
Applicable fields are
wrongly reported: “ List of
RTS23 number Id of N/A
field(s)” . All RTS fields
The following checks are performed only in case checks above are passed.
Check that that a record (ISIN, MIC) is not
reported twice in the same file. INS-104 The following records are
reported twice in the
same file. RTS field 1,6
The MIC identifier in the
TradingVenueRelatedAttributes block shall
exist in the Trading venue mapping view
which satisfies the following conditions:
ValidityStartDate is prior or equal to the
current date and (ValidityEndDate is NULL
OR is later or equal to the current date ). INS-105 The Trading Venue field
contains an invalid MIC
code. RTS field 6
The Reporting entity identification associated
to the MIC [field 6] in Reporting Flow view
(TV / SI MIC) is equal to the Reporting Entity
identifier in the header of the XML file. INS-107 “Trading Venue” field is
not registered at ESMA
or is not reported by the
right reporting entity. Reporting Entity
RTS field 6
--- Final Result 3 ---
ESMA REGULAR USE
35 / 216
In addition, the system shall have mechanisms in place to avoid that interfacing systems needing
access reference data during the 00:00 – 08:00 period retrieve inconsistent data due to ongoing updates
taking place during the post -processing phase. Proposals on the best approach will be expected from
the provider in charge of the technical specifications and development of the system5.
3.3.3 Perform Reference Data Content Validation
Goal The goal of this use case is for individual records within a received file
to be validated by ESMA .
Actors TV/SI (in the jurisdiction of a delegating NCA) - submits data
NCA (not delegating data collection in its jurisdiction) - submits data
The ESMA Sy stem – validates data
Preconditions ESMA has received and successfully validated the format of a received file
Trigger ESMA has received and successfully validated the format of a received file
Postcondition The ESMA System has extracted the subset of records which passed the
data content validations .
Normal Flow 1. The ESMA System validates each record against data all content
validation rules sequentially in the order as described in Table 33 -
Reference Data Content and Consistency Validation Rules .
2. Validation is successful finding no errors .
Alternate
Flow 1:
Preliminary
Content
validation
Errors 1a. The ESMA System validates each record against data content validation
rules sequentially in the order stated in Annex 1c. One of the following
check s INS-101, INS-102 and INS-103 fails. The ESMA System logs the
error , stops the validation of the record and runs again the validation
process on the next record .
2a. The ESMA System logs the erroneous records and the list of errors and
rejects the erroneous record.
Alternate
Flow 2:
Blocking
content
validation
Errors 1b. The ESMA System validates each record against data content validation
rules sequentially in the order as described in Table 33 - Reference Data
Content and Consistency Validation Rules .
Checks INS-101, INS-102, INS-103 are passed.
At least o ne of the following checks INS-104 to INS-125 or INS-129 to INS-
130 fails.
The ESMA System logs the error and continues the validation process of that
record until the last content check and runs again the validation process on
the next record.
5 As an example, the system may work on a temporar y copy of the CRDT during the post -processing phase, then lock and
commit the ch anges to the CRDT only once this post-processing phase is complete.
ESMA REGULAR USE
36 / 216
2c. The ESMA System logs the erroneous records and the list of errors and
rejects the erroneous record ,
Alternate
Flow 3:
Warning s 1b. The ESMA System validates each record against checks INS-126 and
INS-127 as described in Table 33 - Reference Data Content and Consistency
Validation Rules . Each time a check fails the ESMA System logs the error but
continues the validation process of that record until the last content check.
2b. The ESMA System logs the record and associated list of warning s.
Frequency Once each file is submitted by a submitting entity. Each entity is expected to
submit at least one file per day but can also make multiple submissions to
address errors on previous submission.
Business
Rules Table 33 - Reference Data Content and Consistency Validation Rules .
Assumptions N/A
3.3.4 Update the Received Reference Data Table
Goal The goal of this use case is to update the Received Reference Data
Table according to a submitted record which passed the content
validation checks.
Actors The ESMA System .
Preconditions The ESMA System has performed the content validation on the submitted
record.
Trigger The ESMA System has s uccessfully validated the content of the submitted
record.
Postcondition The ESMA System has updated the Received Reference Data Table
according to the submitted record.
Normal Flow
(Referenced
records –
DATINS file
submission) In case the system identifies the submitted record as “ReferenceRcd”
then
1. The ESMA System determines the HCRR of the submitted record by
calculating the hash value of the whole set of all RTS23 fields , using a
hash function with sufficient collision resistance to ensure that two
different versions of the RTS23 fields will not lead to the same hash
value for the same ISIN, MIC combination6.
2. The ESMA System checks whether it exists in the “Received reference
data table” a record having the same ISIN, MIC, HCRR and Latest
Received flag is TRUE.
3. In case no such record is found in step 2 , the ESMA System:
6 The choice of the hash function will be discussed during the system technical specifications
We can now use an LLM to interpret the retrieved chunks, combine them with the original user query, and generate a coherent response to the query based on those chunks.
from google import genai

client = genai.Client(api_key=api_key)

# Build the prompt from the original user query plus the reranked chunks as context
context = "\n\n".join(chunk.page_content for chunk in reranked_chunks)
prompt = (
    "Show me the entire table of Annex 1c: Reference Data Content and "
    "Consistency Validation Rules\n\n"
    "Context:\n" + context
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
)
print(response.text)
And we get the nicely formatted complete table of Annex 1c. Not only is the result accurate, Gemini 2.5 Flash also responded in a shorter time of 10s 665ms, compared with 14s 390ms for the brute-force method. This chunking approach is also much cheaper, since we pass only the top 3 reranked chunks to Gemini rather than the entire document.
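To see where the cost savings come from, here is a rough back-of-the-envelope estimate. The figures are illustrative assumptions, not measurements from this document: ~4 characters per token (a common heuristic for English text), ~3,000 characters per PDF page, and ~5,000 characters per retrieved chunk.

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Hypothetical sizes for illustration: a 216-page PDF vs three retrieved chunks.
full_document_chars = 216 * 3000      # assume ~3,000 characters per page
chunk_chars = [5200, 4100, 4800]      # assume ~5,000 characters per chunk

full_tokens = rough_token_count("x" * full_document_chars)
chunk_tokens = sum(rough_token_count("x" * n) for n in chunk_chars)

print(f"Full document: ~{full_tokens:,} tokens")
print(f"Top-3 chunks:  ~{chunk_tokens:,} tokens")
print(f"Reduction:     ~{full_tokens / chunk_tokens:.0f}x fewer input tokens")
```

Since LLM APIs bill per input token, an order-of-magnitude reduction in prompt size translates directly into an order-of-magnitude reduction in cost per query, on top of the latency improvement observed above.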
Here is the entire "TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES" from the provided text:
**TABLE 33 - REFERENCE DATA CONTENT AND CONSISTENCY VALIDATION RULES**
| Control executed by the system | Error code | Error Message | Concerned Fields |
| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :--------- | :----------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------- |
| The value of “Instrument Classification” shall be a valid ISO 10962 code and shall be covered by at least one of the CFI constructs in the CFI-based validation matrix. | INS-101 | The CFI code is not valid against the CFI based validation matrix. | RTS field 3 against the list of valid CFI codes table and against the list of CFI Construct (Primary Key) in the CFI based validation table |
| Check that Mandatory fields are reported according to “CFI-based validations table”. | INS-102 | The following mandatory fields are not reported: “List of RTS23 number Id of missing field(s)” . | RTS field 3 vs all other RTS fields |
| Check that Non-Applicable fields (N/A) are not reported according to “CFI-based validations table”. | INS-103 | The following Non-Applicable fields are wrongly reported: “ List of RTS23 number Id of N/A field(s)” . | All RTS fields |
| **The following checks are performed only in case checks above are passed.** | | | |
| Check that that a record (ISIN, MIC) is not reported twice in the same file. | INS-104 | The following records are reported twice in the same file. | RTS field 1,6 |
| The MIC identifier in the TradingVenueRelatedAttributes block shall exist in the Trading venue mapping view which satisfies the following conditions: ValidityStartDate is prior or equal to the current date and (ValidityEndDate is NULL OR is later or equal to the current date ). | INS-105 | The Trading Venue field contains an invalid MIC code. | RTS field 6 |
| The Reporting entity identification associated to the MIC [field 6] in Reporting Flow view (TV / SI MIC) is equal to the Reporting Entity identifier in the header of the XML file. | INS-107 | “Trading Venue” field is not registered at ESMA or is not reported by the right reporting entity. | Reporting Entity RTS field 6 |
| The Strike Price Currency Code shall exist as an active ISO 4217 Currency Code in the currency reference data table (based on records with ValidityEndDate is NULL . | INS-108 | The Strike Price Currency Code is incorrect. | RTS field 32 |
| The Notional Currency 1 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-109 | The Notional Currency 1 Code is incorrect. | RTS field 13 |
| The Notional Currency 2 Code shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-110 | The Notional Currency 2 Code is incorrect. | RTS field 42, 47 |
| The Currency of nominal value shall exist as an ISO 4217 Currency Code in the currency reference table (based on records which ValidityEndDate is NULL or PreEuroFlag is TRUE ). | INS-111 | The Currency of nominal value is incorrect. | RTS field 16 |
| The value of the “Issuer Identifier” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-112 | The LEI provided for “Issuer Identifier” is invalid. | RTS field 5 |
| The value of the “Direct Underlying issuer” shall exist in the LEI reference table and comply with the following conditions: ( ValidityEndDate is NULL OR date of termination of the respective record is between any period specified by ValidityStartDate and ValidityEnddate in LEI reference table for this LEI ) AND register status in {“Issued", "Lapsed", "Pending transfer", "Pending archival}. | INS-113 | The LEI provided for “Direct Underlying Issuer” is invalid. | RTS field 27a, 27b |
| Check the last digit of the ISIN code of the “instrument identification code” according to the algorithm of ISIN validation. | INS-114 | The ISIN code of the instrument identification code is invalid. | RTS field 1 |
| Check the last digit of the ISIN code of the “underlying instrument” should be valid according to the algorithm of ISIN validation. | INS-115 | The ISIN code of the underlying is invalid. | RTS field 26a, 26b, 26c |
| Check the last digit of the ISIN code of the Identifier of the “Index/Benchmark of a floating rate Bond” should be valid according to the algorithm of ISIN validation. | INS-116 | The ISIN code of the Index/Benchmark of a floating rate Bond is invalid. | RTS field 19 |
| The “Date of admission to trading or date of First trade” should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-117 | The “Date of admission to trading or date of First trade” is not a consistent date. | RTS field 11 |
| The Termination Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-118 | The Termination Date is not a consistent date. | RTS field 12 |
| The Termination Date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-119 | The Termination Date is earlier than the “Date of admission to trading or date of First trade”. | RTS field 11, 12 |
| The Maturity Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-120 | The Maturity Date is not a consistent date. | RTS field 15 |
| The Maturity Date should be equal to or later than “Date of admission to trading or date of First trade”. | INS-121 | The Maturity Date and Date of admission to trading or date of First trade are not consistent. | RTS field 11, 15 |
| The Expiry Date should be a valid date and in a sensible range (no earlier than 31-12-1899). | INS-122 | The Expiry Date is not a consistent date. | RTS field 24 |
| The Expiry date should be equal to or later than the “Date of admission to trading or date of First trade”. | INS-123 | The Expiry Date and The Date of admission to trading or date of First trade are not consistent. | RTS field 11, 24 |
| Field “Option Type” shall only contain value “PUTO” when the “Instrument Classification” refers to the following CFI Codes: OP**** (Put Options). | INS-124 | Invalid “PUTO” Option Type | RTS field 3, 30 |
| Field “Option Type” shall only contain value “CALL” when the “Instrument Classification” refers to the following CFI Codes: OC**** (Call Options). | INS-125 | Invalid “CALL” Option Type | RTS field 3, 30 |
| The termination date should be populated in case Maturity date/Expiry date is populated and is strictly earlier than the current reporting date. N.B.: that check if failed generates a warning only. | INS-126 | The Termination date is not populated for an expired/matured instrument. | RTS field 12, {15 or 24} |
| The termination date should be earlier or equal in case Expiry date/Maturity date is populated. N.B.: that check if failed generates a warning only. | INS-127 | The Termination date and Expiry date/Maturity date are not consistent. | RTS field 12, {15 or 24} |
| The field listed in Table 1 BRD 43. shall be consistent with the values provided by the Relevant competent Authority. | INS-128 | The following fields are not consistent with the one provided by RCA :<<Upcoming RCA>> , RCA_ MIC :<<MIC>> (<<MIC’s country>> ): List of RTS23 number Id of missing field(s)”. N.B.: that check if failed generates a warning only. | RTS fields used for consistency checks as stated in Table 21 - RTS23 Fields table . |
| The currency of the Total issued nominal amount shall be the same as the currency of nominal value | INS-129 | The currency of the Total issued nominal amount is not the same as the currency of nominal value | RTS Field 14. Currency RTS Field 16. |
| The ISIN-MIC combination, received for a cancellation record, should exist in FIRDS DB. | INS-130 | The ISIN-MIC combination, received from a cancellation record, does not exist in FIRDS DB | RTS field 1,6 |
Comparison of Chunking Approaches
Approach | Pros | Cons
---|---|---
1. Brute Force (Full Document as Context) | - Extremely High Quality: Achieves excellent relevancy and completeness by providing the full document to the LLM. - Simple to Implement: Requires minimal code; just load the document and query. | - Not Scalable: Infeasible for large document sets (thousands of files) that exceed the LLM's context window. - High Per-Query Cost: The entire document's token count is charged for every single query. - Potential Latency: Processing massive contexts can be slower.
2. Fixed-Size Chunking | - Very Simple to Implement: The most common and straightforward chunking method. - Fast and Computationally Cheap: Does not require complex analysis of the document. | - Structurally Unaware: Arbitrarily splits content, breaking apart tables, code blocks, and logical sections. - Poor Completeness: Fails to retrieve complete information (e.g., gets only the first page of a 5-page table). - Low Relevancy: Often includes irrelevant, adjacent text in chunks, requiring aggressive reranking.
3. Contextual Chunking via Docling | - Structure-Aware: Capable of identifying basic document structures like headers and tables. - Potentially Fast Parsing: Can leverage a GPU for quick processing. - LLM-Free Parsing: Avoids LLM costs for the initial document parsing step. | - Prone to Errors: Confuses page headers with section headers, which corrupts the chunking logic. - Incomplete on Complex Structures: Fails to merge multi-page tables, breaking them into separate, incomplete chunks. - Requires Post-Processing: The errors necessitate additional code to clean the parsed output before it can be used.
4. Contextual Chunking via Gemini Metadata | - Superior Structural Understanding: Accurately identifies true section headers and ignores noise like page headers. - Achieves Relevancy & Completeness: Enables retrieval of entire logical sections, like the full multi-page table. - Highly Flexible: Can be adapted to different document structures or extraction tasks (e.g., entity relationships) by simply changing the prompt. - Cost-Effective for RAG: The expensive LLM call is a one-time cost per document for metadata generation. Subsequent queries are cheap. | - Upfront Cost: Incurs a one-time LLM cost for every document processed to generate metadata. - Requires Prompt Engineering: The quality of the output depends on a well-crafted prompt, which may need tuning for different document types. - Dependency on LLM Provider: Creates a dependency on the specific LLM API (e.g., Google Gemini).
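To make the contrast between approaches 2 and 3/4 concrete, here is a minimal sketch in plain Python (the function names are illustrative, not from any library): a fixed-size splitter that ignores structure, next to a naive structure-aware splitter that starts a new chunk at every numbered section header of the "d.d.d Header" form described in the introduction.

```python
import re

def fixed_size_chunks(text, size=200, overlap=50):
    """Approach 2: split text into fixed-size character windows,
    ignoring any document structure."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def section_chunks(text):
    """A naive contextual splitter: start a new chunk at every
    numbered section header of the form 'd.d Title' / 'd.d.d Title'."""
    header = re.compile(r"^\d+(\.\d+)+\s+\S", re.MULTILINE)
    starts = [m.start() for m in header.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

doc = "1.1 Intro\nSome intro text.\n1.2 Rules\nA long table row...\n"
print(fixed_size_chunks(doc, size=20, overlap=5))
print(section_chunks(doc))
```

The fixed-size splitter can cut the table row in half mid-sentence, while the section-aware splitter keeps each numbered section whole, which is exactly the behaviour the comparison table above describes.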
Conclusion
We have shown that chunking for RAG is a necessary evil when dealing with large quantities of documents. When handling structured documents with complex features such as multi-page tables, contextual chunking improves the relevancy and completeness of retrieved information by taking these document-specific features into account.
We have also demonstrated two approaches to contextual chunking, one via Docling, the other via an LLM such as Gemini 2.5 Pro, and showed how each approach can be implemented.
With the latter approach, we can easily change the prompt to optimise for different document types and extract metadata of different forms with minimal effort. For example, when building knowledge graphs, we can ask Gemini to extract entity relationships in the general format entity--relationship--entity, all by simply tweaking the prompt. And given the large context window of 1 million tokens, by passing in an entire book, Gemini can find relationships not just within each paragraph or chapter, but across chapters. This can be explored further in future articles.
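As a minimal sketch of the entity-relationship idea (the prompt wording, function names, and sample reply below are all illustrative assumptions, not actual Gemini output), the triple-extraction step might look like this, with the actual model call omitted:

```python
def build_triple_prompt(document_text):
    """Build an illustrative prompt asking the model to emit one
    entity--relationship--entity triple per line."""
    return (
        "Extract entity relationships from the document below.\n"
        "Output one triple per line in the exact format:\n"
        "entity--relationship--entity\n\n"
        f"Document:\n{document_text}"
    )

def parse_triples(reply):
    """Parse the model's line-based reply into (head, relation, tail)
    tuples, skipping any line that does not match the expected format."""
    triples = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("--")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples

# A hypothetical model reply, for illustration only; in practice this
# would come from the Gemini API given build_triple_prompt(document).
sample_reply = """Gemini 2.5 Pro--developed by--Google
Docling--parses--PDF documents
(a malformed line like this is skipped)"""
print(parse_triples(sample_reply))
```

Because the whole book fits in the context window, a single such call can surface cross-chapter relationships that per-chunk extraction would miss.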