<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Arseny Zinchenko</title>
    <description>The latest articles on Forem by Arseny Zinchenko (@setevoy).</description>
    <link>https://forem.com/setevoy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F33441%2F407b2263-6b99-44d0-8630-be9cbd51a255.jpg</url>
      <title>Forem: Arseny Zinchenko</title>
      <link>https://forem.com/setevoy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/setevoy"/>
    <language>en</language>
    <item>
      <title>AWS: Monitoring AWS OpenSearch Service Cluster with CloudWatch</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Wed, 31 Dec 2025 10:00:00 +0000</pubDate>
      <link>https://forem.com/setevoy/aws-monitoring-aws-opensearch-service-cluster-with-cloudwatch-385o</link>
      <guid>https://forem.com/setevoy/aws-monitoring-aws-opensearch-service-cluster-with-cloudwatch-385o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s continue our journey with AWS OpenSearch Service.&lt;/p&gt;

&lt;p&gt;What we have is a small AWS OpenSearch Service cluster with three data nodes, used as a vector store for AWS Bedrock Knowledge Bases.&lt;/p&gt;

&lt;p&gt;Previous parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/" rel="noopener noreferrer"&gt;AWS: Introduction to OpenSearch Service as a vector store&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rtfm.co.ua/en/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization/" rel="noopener noreferrer"&gt;AWS: Creating an OpenSearch Service cluster and configuring authentication and authorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rtfm.co.ua/terraform-stvorennya-aws-opensearch-service-cluster-ta-yuzeriv/" rel="noopener noreferrer"&gt;Terraform: creating an AWS OpenSearch Service cluster and users&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We already had our first production incident :-)&lt;/p&gt;

&lt;p&gt;We launched a search without filters, and our &lt;code&gt;t3.small.search&lt;/code&gt; died due to CPU overload.&lt;/p&gt;

&lt;p&gt;So let’s take a look at what we have in terms of monitoring all this happiness.&lt;/p&gt;

&lt;h3&gt;Contents&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch metrics&lt;/li&gt;
&lt;li&gt;Memory monitoring&lt;/li&gt;
&lt;li&gt;kNN Memory usage&lt;/li&gt;
&lt;li&gt;JVM Memory usage&lt;/li&gt;
&lt;li&gt;Collecting metrics to VictoriaMetrics&lt;/li&gt;
&lt;li&gt;Creating a Grafana dashboard&lt;/li&gt;
&lt;li&gt;VictoriaMetrics/Prometheus sum(), avg() and max()&lt;/li&gt;
&lt;li&gt;Cluster status&lt;/li&gt;
&lt;li&gt;Nodes status&lt;/li&gt;
&lt;li&gt;CPUUtilization: Stats&lt;/li&gt;
&lt;li&gt;CPUUtilization: Graph&lt;/li&gt;
&lt;li&gt;JVMMemoryPressure: Graph&lt;/li&gt;
&lt;li&gt;JVMGCYoungCollectionCount and JVMGCOldCollectionCount&lt;/li&gt;
&lt;li&gt;KNNHitCount vs KNNMissCount&lt;/li&gt;
&lt;li&gt;Final result&lt;/li&gt;
&lt;li&gt;t3.small.search vs t3.medium.search on graphs&lt;/li&gt;
&lt;li&gt;Creating Alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For now, let’s do something basic with just CloudWatch metrics, although there are several options for monitoring OpenSearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch metrics from OpenSearchService itself — data on CPU, memory, and JVM, which we can collect in VictoriaMetrics and generate alerts or use in the Grafana dashboard, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html" rel="noopener noreferrer"&gt;Monitoring OpenSearch cluster metrics with Amazon CloudWatch&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CloudWatch Events generated by OpenSearch Service — see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/monitoring-events.html#monitoring-events-throughput-throttle" rel="noopener noreferrer"&gt;Monitoring OpenSearch Service events with Amazon EventBridge&lt;/a&gt; — can be sent via SNS to Opsgenie, and from there to Slack.&lt;/li&gt;
&lt;li&gt;Logs in CloudWatch Logs — we can collect them in VictoriaLogs and generate some metrics and alerts, but I didn’t see anything interesting in the logs during our production incident, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createdomain-configure-slow-logs.html" rel="noopener noreferrer"&gt;Monitoring OpenSearch logs with Amazon CloudWatch Logs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.opensearch.org/latest/observing-your-data/alerting/monitors/" rel="noopener noreferrer"&gt;Monitors&lt;/a&gt; of OpenSearch itself — capable of anomaly detection and custom alerting, there is even a Terraform resource &lt;a href="https://registry.terraform.io/providers/opensearch-project/opensearch/latest/docs/resources/monitor" rel="noopener noreferrer"&gt;&lt;code&gt;opensearch_monitor&lt;/code&gt;&lt;/a&gt;, see also &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/alerting.html" rel="noopener noreferrer"&gt;Configuring alerts in Amazon OpenSearch Service&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;There is also the Prometheus Exporter Plugin, which opens an endpoint for collecting metrics from Prometheus/VictoriaMetrics (but it cannot be added to AWS OpenSearch Managed, although support promises that there is a feature request — maybe it will be added someday).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;CloudWatch metrics&lt;/h3&gt;

&lt;p&gt;There are quite a few metrics; the ones of interest to us are those relevant to our setup, which has no dedicated master or coordinator nodes and no UltraWarm or cold instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ClusterStatus&lt;/code&gt;: &lt;code&gt;green&lt;/code&gt;/&lt;code&gt;yellow&lt;/code&gt;/&lt;code&gt;red&lt;/code&gt; - the main indicator of cluster health, based on the state of the data shards&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Shards&lt;/code&gt;: &lt;code&gt;active&lt;/code&gt;/&lt;code&gt;unassigned&lt;/code&gt;/&lt;code&gt;delayedUnassigned&lt;/code&gt;/&lt;code&gt;activePrimary&lt;/code&gt;/&lt;code&gt;initializing&lt;/code&gt;/&lt;code&gt;relocating&lt;/code&gt; - more detailed information on shard status, but only as a total count, without a breakdown by index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Nodes&lt;/code&gt;: the number of nodes in the cluster - knowing how many there should be, we can alert when one goes down&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SearchableDocuments&lt;/code&gt;: not that it's particularly interesting to us, but it might be useful later on to see what's going on in the indexes in general.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CPUUtilization&lt;/code&gt;: the percentage of CPU usage across all nodes, and this is a must-have&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FreeStorageSpace&lt;/code&gt;: also useful to monitor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ClusterIndexWritesBlocked&lt;/code&gt;: Is everything OK with index writes?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;JVMMemoryPressure&lt;/code&gt; and &lt;code&gt;OldGenJVMMemoryPressure&lt;/code&gt;: percentage of JVM heap memory usage - we'll dig into JVM monitoring separately later, because it's a whole other headache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AutomatedSnapshotFailure&lt;/code&gt;: probably good to know if the backup fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CPUCreditBalance&lt;/code&gt;: useful for us because we are on t3 instances (but we don't have it in CloudWatch)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2xx&lt;/code&gt;, &lt;code&gt;3xx&lt;/code&gt;, &lt;code&gt;4xx&lt;/code&gt;, &lt;code&gt;5xx&lt;/code&gt;: data on HTTP requests and errors; I only collect &lt;code&gt;5xx&lt;/code&gt; for alerts here&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ThroughputThrottle&lt;/code&gt; and &lt;code&gt;IopsThrottle&lt;/code&gt;: we encountered disk access issues in RDS, so it is worth monitoring here as well, see &lt;a href="https://rtfm.co.ua/en/postgresql-aws-rds-performance-and-monitoring-2/" rel="noopener noreferrer"&gt;PostgreSQL: AWS RDS Performance and monitoring&lt;/a&gt;; here you may need to look at the metrics from &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html#managedomains-cloudwatchmetrics-master-ebs-metrics" rel="noopener noreferrer"&gt;EBS volume metrics&lt;/a&gt;, but for a start, you can simply alert on the Throttle metrics in general&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HighSwapUsage&lt;/code&gt;: similar to the previous metrics - we once had a problem with RDS, so it's better to monitor this as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;EBS volume metrics&lt;/strong&gt; — these are basically standard EBS metrics, as for EC2 or RDS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ReadLatency&lt;/code&gt; and &lt;code&gt;WriteLatency&lt;/code&gt;: read/write delays

&lt;ul&gt;
&lt;li&gt;sometimes there are spikes, so it's worth adding a graph for them&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ReadThroughput&lt;/code&gt; and &lt;code&gt;WriteThroughput&lt;/code&gt;: the overall disk load, so to speak&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;DiskQueueDepth&lt;/code&gt;: I/O operations queue

&lt;ul&gt;
&lt;li&gt;is empty in CloudWatch (for now?), so we’ll skip it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt;: number of read/write operations per second&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance metrics&lt;/strong&gt; — metrics for each OpenSearch instance (not the EC2 server, but OpenSearch itself) on each node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FetchLatency&lt;/code&gt; and &lt;code&gt;FetchRate&lt;/code&gt;: how quickly we get data from shards (but I couldn't find it in CloudWatch either)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ThreadCount&lt;/code&gt;: the number of operating system threads created by the JVM (Garbage Collector threads, search threads, write/index threads, etc.)

&lt;ul&gt;
&lt;li&gt;the value is stable in CloudWatch, but for now, we can add it to Grafana for the overall picture and see if there is anything interesting there&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ShardReactivateCount&lt;/code&gt;: how often shards are transferred from cold/inactive states to active ones, which requires operating system resources, CPU, and memory; Well... maybe we should check if it has any significance for us at all.

&lt;ul&gt;
&lt;li&gt;But there is nothing in CloudWatch either — “&lt;em&gt;did not match any metrics&lt;/em&gt;”&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ConcurrentSearchRate&lt;/code&gt; and &lt;code&gt;ConcurrentSearchLatency&lt;/code&gt;: the number and speed of concurrent search requests - interesting if many parallel requests hang for a long time

&lt;ul&gt;
&lt;li&gt;but for us (yet?), these values are constantly at zero, so we skip them&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;SearchRate&lt;/code&gt;: number of search queries per minute, useful for the overall picture&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;SearchLatency&lt;/code&gt;: search query execution speed, probably very useful, you can even set up an alert&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;IndexingRate&lt;/code&gt; and &lt;code&gt;IndexingLatency&lt;/code&gt;: similar, but for indexing new documents&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;SysMemoryUtilization&lt;/code&gt;: percentage of memory usage on the data node, but this does not give a complete picture; you need to look at the JVM memory.&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;JVMGCYoungCollectionCount&lt;/code&gt; and &lt;code&gt;JVMGCOldCollectionCount&lt;/code&gt;: the number of Garbage Collector runs, useful in conjunction with JVM memory data, which we will discuss in more detail later.&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;SearchTaskCancelled&lt;/code&gt; and &lt;code&gt;SearchShardTaskCancelled&lt;/code&gt;: bad news :-) if tasks are canceled, something is clearly wrong (either the user interrupted the request, or there was an HTTP connection reset, or timeouts, or cluster load)

&lt;ul&gt;
&lt;li&gt;but we always have zeroes, even when the cluster went down, so I don’t see the point in collecting these metrics yet&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ThreadpoolIndexQueue&lt;/code&gt; and &lt;code&gt;ThreadpoolSearchQueue&lt;/code&gt;: the number of indexing and search tasks waiting in the queue; when there are too many of them, we get &lt;code&gt;ThreadpoolIndexRejected&lt;/code&gt; and &lt;code&gt;ThreadpoolSearchRejected&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ThreadpoolIndexQueue&lt;/code&gt; is not available in CloudWatch at all, and &lt;code&gt;ThreadpoolSearchQueue&lt;/code&gt; is there, but it's also constantly at zero, so we're skipping it for now&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ThreadpoolIndexRejected&lt;/code&gt; and &lt;code&gt;ThreadpoolSearchRejected&lt;/code&gt;: covered just above

&lt;ul&gt;
&lt;li&gt;in CloudWatch, the picture is similar — &lt;code&gt;ThreadpoolIndexRejected&lt;/code&gt; is not present at all, &lt;code&gt;ThreadpoolSearchRejected&lt;/code&gt; is zero&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ThreadpoolIndexThreads&lt;/code&gt; and &lt;code&gt;ThreadpoolSearchThreads&lt;/code&gt;: the maximum number of operating system threads for indexing and searching; if all are busy, requests will go to &lt;code&gt;ThreadpoolIndexQueue&lt;/code&gt;/&lt;code&gt;ThreadpoolSearchQueue&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch has several types of pools for threads — search, index, write, etc., and each pool has a threads indicator (how many are allocated), see &lt;a href="https://opster.com/guides/opensearch/opensearch-basics/threadpool/" rel="noopener noreferrer"&gt;OpenSearch Threadpool&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The Node Stats API (&lt;code&gt;GET _nodes/stats/thread_pool&lt;/code&gt;) has an active threads metric, but I don't see it in CloudWatch.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ThreadpoolIndexThreads&lt;/code&gt; is not available in CloudWatch at all, and &lt;code&gt;ThreadpoolSearchThreads&lt;/code&gt; is static, so I think we can skip monitoring them for now.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;PrimaryWriteRejected&lt;/code&gt;: rejected write operations on primary shards due to issues in the write or index thread pools, or due to load on the data node

&lt;ul&gt;
&lt;li&gt;CloudWatch is empty for now, but we will add collection and alerts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;ReplicaWriteRejected&lt;/code&gt;: rejected write operations on replica shards - the document was added to the primary, but could not be written to the replica

&lt;ul&gt;
&lt;li&gt;CloudWatch is empty for now, but we will add collection and alerts&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
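&lt;p&gt;The thread pool figures above can also be checked directly via the Node Stats API mentioned earlier; a quick sketch (the domain endpoint and credentials below are placeholders):&lt;/p&gt;

```shell
# Query per-node thread pool stats (queue, rejected, threads) directly
# from OpenSearch; replace the endpoint and credentials with your own
curl -s -u 'admin:PASSWORD' \
  'https://vpc-my-domain.us-east-1.es.amazonaws.com/_nodes/stats/thread_pool?pretty'
```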

&lt;p&gt;&lt;strong&gt;k-NN metrics&lt;/strong&gt;  — useful for us because we have a vector store with k-NN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;KNNCacheCapacityReached&lt;/code&gt;: when the cache is full (see below)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KNNEvictionCount&lt;/code&gt;: how often data is removed from the cache - a sign that there is not enough memory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt;: off-heap memory usage for the vector graph itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KNNGraphQueryErrors&lt;/code&gt;: number of errors when searching in vectors

&lt;ul&gt;
&lt;li&gt;these are empty in CloudWatch for now, but we will add collection and an alert&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;KNNGraphQueryRequests&lt;/code&gt;: total number of queries to k-NN graphs&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;KNNHitCount&lt;/code&gt; and &lt;code&gt;KNNMissCount&lt;/code&gt;: how many results were returned from the cache, and how many had to be read from the disk&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;KNNTotalLoadTime&lt;/code&gt;: the speed of loading from disk into the cache (with large graphs or a loaded EBS volume, this time will increase)&lt;/li&gt;

&lt;/ul&gt;
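&lt;p&gt;A useful derived value from the last two counters is the cache hit rate; a minimal sketch of the arithmetic, with illustrative numbers (not from our cluster):&lt;/p&gt;

```shell
# k-NN cache hit rate from KNNHitCount and KNNMissCount samples
hits=9500
misses=500
hit_rate=$(( hits * 100 / (hits + misses) ))
echo "k-NN cache hit rate: ${hit_rate}%"
```

&lt;p&gt;A steadily falling hit rate together with a growing &lt;code&gt;KNNEvictionCount&lt;/code&gt; is the signal that the k-NN cache no longer fits the graphs.&lt;/p&gt;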

&lt;h3&gt;Memory monitoring&lt;/h3&gt;

&lt;p&gt;Let’s think about how we can monitor the main indicators, starting with memory, because, well, this is Java.&lt;/p&gt;

&lt;p&gt;What do we have about memory metrics?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SysMemoryUtilization&lt;/code&gt;: percentage of memory usage on the server (data node) in general&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;JVMMemoryPressure&lt;/code&gt;: total percentage of JVM Heap usage; JVM Heap is allocated by default to 50% of the server's memory, but no more than 32 gigabytes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OldGenJVMMemoryPressure&lt;/code&gt;: see below&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt;: this was discussed in the first post - &lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/" rel="noopener noreferrer"&gt;AWS: introduction to OpenSearch Service as a vector store&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CloudWatch also has a metric called &lt;code&gt;KNNGraphMemoryUsagePercentage&lt;/code&gt;, but it is not included in the documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;kNN Memory usage&lt;/h3&gt;

&lt;p&gt;First, a brief overview of k-NN memory.&lt;/p&gt;

&lt;p&gt;So, on EC2, memory is allocated for the JVM Heap (50% of what is available on the server), separately for the off-heap area of the OpenSearch vector store, where it keeps graphs and cache (see &lt;a href="https://docs.opensearch.org/1.0/search-plugins/knn/approximate-knn/" rel="noopener noreferrer"&gt;Approximate k-NN search&lt;/a&gt;), plus the operating system itself and its file cache.&lt;/p&gt;

&lt;p&gt;We don’t have a metric like “&lt;em&gt;KNNGraphMemoryAvailable&lt;/em&gt;,” but with &lt;code&gt;KNNGraphMemoryUsagePercentage&lt;/code&gt; and &lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt;, we can calculate it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt;: we currently have 662 megabytes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KNNGraphMemoryUsagePercentage&lt;/code&gt;: 60%&lt;/li&gt;
&lt;/ul&gt;
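&lt;p&gt;These two together give us the total and the remaining free amount; a minimal sketch of the calculation:&lt;/p&gt;

```shell
# Derive total and free k-NN graph memory from the two metrics above
usage_mb=662    # KNNGraphMemoryUsage, in megabytes
usage_pct=60    # KNNGraphMemoryUsagePercentage
total_mb=$(( usage_mb * 100 / usage_pct ))
free_mb=$(( total_mb - usage_mb ))
echo "total=${total_mb} MB, free=${free_mb} MB"
```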

&lt;p&gt;This means that 1 gigabyte is allocated outside the JVM Heap memory for k-NN graphs (this is on &lt;code&gt;t3.medium.search&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;From the documentation &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html" rel="noopener noreferrer"&gt;k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;OpenSearch Service uses half of an instance’s RAM for the Java heap (up to a heap size of 32 GiB). By default, k-NN uses up to 50% of the remaining half&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Knowing that we currently have &lt;a href="https://instances.vantage.sh/aws/opensearch/t3.medium.search?currency=USD" rel="noopener noreferrer"&gt;&lt;code&gt;t3.medium.search&lt;/code&gt;&lt;/a&gt;, which provides 4 gigabytes of memory - 2 GB goes to the JVM Heap, and 1 gigabyte goes to the k-NN graph.&lt;/p&gt;
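&lt;p&gt;The same rule can be written down as a quick sanity check for any instance size (a sketch, assuming the default 50%/50% split from the documentation quoted above):&lt;/p&gt;

```shell
# JVM heap: half the RAM, capped at 32 GiB; k-NN: up to 50% of the rest
ram_gb=4    # t3.medium.search
heap_gb=$(( ram_gb / 2 ))
if [ "$heap_gb" -gt 32 ]; then heap_gb=32; fi
knn_gb=$(( (ram_gb - heap_gb) / 2 ))
echo "heap=${heap_gb} GiB, knn=${knn_gb} GiB"
```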

&lt;p&gt;The main part of KNNGraphMemory is used by the k-NN cache, i.e., the part of the system's RAM where OpenSearch keeps HNSW graphs from vector indexes so that they do not have to be read from disk each time (see &lt;a href="https://docs.opensearch.org/latest/vector-search/api/knn/#k-nn-clear-cache" rel="noopener noreferrer"&gt;k-NN clear cache&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Therefore, it is useful to have graphs for EBS IOPS and k-NN cache usage.&lt;/p&gt;

&lt;h3&gt;JVM Memory usage&lt;/h3&gt;

&lt;p&gt;Okay, let’s review what’s going on in Java in general. See &lt;a href="https://sematext.com/glossary/jvm-heap/" rel="noopener noreferrer"&gt;What Is Java Heap Memory?&lt;/a&gt;, &lt;a href="https://opster.com/guides/opensearch/opensearch-basics/opensearch-heap-size-usage-and-jvm-garbage-collection/" rel="noopener noreferrer"&gt;OpenSearch Heap Size Usage and JVM Garbage Collection&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/blogs/big-data/understanding-the-jvmmemorypressure-metric-changes-in-amazon-opensearch-service/" rel="noopener noreferrer"&gt;Understanding the JVMMemoryPressure metric changes in Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To put it simply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stack Memory:&lt;/strong&gt; in addition to the JVM Heap, we have a Stack, which is allocated to each thread, where it keeps its variables, references, and startup parameters

&lt;ul&gt;
&lt;li&gt;set via &lt;code&gt;-Xss&lt;/code&gt;, default value from 256 kilobytes to 1 megabyte, see &lt;a href="https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/thread_basics.html" rel="noopener noreferrer"&gt;Understanding Threads and Locks&lt;/a&gt; (I couldn't find how to view it in OpenSearch Service)&lt;/li&gt;
&lt;li&gt;if we have many threads, a lot of memory will go to their stacks&lt;/li&gt;
&lt;li&gt;cleared when the thread dies&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;
&lt;strong&gt;Heap Space:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;used to allocate memory that is available to all threads&lt;/li&gt;
&lt;li&gt;managed by the Garbage Collector (GC)&lt;/li&gt;
&lt;li&gt;in the context of OpenSearch, the search and indexing caches live here&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In Heap memory, we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Young Generation&lt;/strong&gt;: fresh data, all new objects; data from here is either deleted completely or promoted to the Old Generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old Generation&lt;/strong&gt;: the OpenSearch process code itself, caches, Lucene index structures, large arrays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;code&gt;OldGenJVMMemoryPressure&lt;/code&gt; is full, it means that the Garbage Collector cannot clean it up because there are references to the data, and then we have a problem - because there is no space in the Heap for new data, and the JVM may crash with an OutOfMemoryError.&lt;/p&gt;

&lt;p&gt;In general, “heap pressure” is when there is little free memory in Young Gen and Old Gen, and there is nowhere to place new data to respond to clients.&lt;/p&gt;

&lt;p&gt;This leads to frequent Garbage Collector runs, which take up time and system resources — instead of processing requests from clients.&lt;/p&gt;

&lt;p&gt;As a result, latency increases, indexing of new documents slows down, or we get &lt;code&gt;ClusterIndexWritesBlocked&lt;/code&gt;, which OpenSearch uses to avoid a Java &lt;strong&gt;OutOfMemoryError&lt;/strong&gt;: during indexing, OpenSearch first writes data to the Heap and then flushes it to disk.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://sematext.com/blog/jvm-metrics/" rel="noopener noreferrer"&gt;Key JVM Metrics to Monitor for Peak Java Application Performance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, to get a picture of memory usage, we monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SysMemoryUtilization&lt;/code&gt; - for an overall picture of the EC2 status

&lt;ul&gt;
&lt;li&gt;in our case, it will be consistently around 90%, but that’s OK&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;JVMMemoryPressure&lt;/code&gt; - for an overall picture of the JVM

&lt;ul&gt;
&lt;li&gt;should be cleared regularly by the Garbage Collector (GC)&lt;/li&gt;
&lt;li&gt;if it is constantly above 80–90%, there are problems with running GC&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;OldGenJVMMemoryPressure&lt;/code&gt; - for Old Generation Heap data

&lt;ul&gt;
&lt;li&gt;should be at 30–40%; if it is higher and is not being cleared, then there are problems either with the code or with GC&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt; - in our case, this is necessary for the overall picture&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It is worth adding alerts for &lt;code&gt;HighSwapUsage&lt;/code&gt; - we already had active swapping when we launched on &lt;code&gt;t3.small.search&lt;/code&gt;, and this is an indication that there is not enough memory.&lt;/p&gt;
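&lt;p&gt;As a sketch of what such alerts might look like in a VMAlert/Prometheus rule group, assuming the CloudWatch Exporter metric names shown later in this post (thresholds are illustrative and should be tuned to your cluster):&lt;/p&gt;

```yaml
groups:
  - name: aws-opensearch
    rules:
      - alert: OpenSearchJVMMemoryPressureHigh
        # the worst node in the domain is above 80% heap usage for 15 minutes
        expr: max(aws_es_jvmmemory_pressure_average) by (domain_name) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap pressure on {{ $labels.domain_name }}"
```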

&lt;h3&gt;Collecting metrics to VictoriaMetrics&lt;/h3&gt;

&lt;p&gt;So, how do you choose metrics?&lt;/p&gt;

&lt;p&gt;First, we look for them in CloudWatch Metrics and see if the metric exists at all and if it returns any interesting data.&lt;/p&gt;

&lt;p&gt;For example, &lt;code&gt;SysMemoryUtilization&lt;/code&gt; is there and does return data.&lt;/p&gt;

&lt;p&gt;Here we had a spike on &lt;code&gt;t3.small.search&lt;/code&gt;, after which the cluster crashed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmet4thyahram7qlg250.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmet4thyahram7qlg250.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the &lt;code&gt;HighSwapUsage&lt;/code&gt; metric, which was another reason to move to &lt;code&gt;t3.medium.search&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrux00i2ubtizcfc0o0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrux00i2ubtizcfc0o0k.png" width="608" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ClusterStatus&lt;/code&gt; is present:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c1od192dkmm5ojq1igv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c1od192dkmm5ojq1igv.png" width="608" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Shards&lt;/code&gt; metrics exist, but they are reported across all indexes together, with no way to filter by a specific index:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifo88tza9ji3kprmvu7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifo88tza9ji3kprmvu7v.png" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is also important to note that collecting metrics from CloudWatch also costs money for API requests, so it is not advisable to collect everything indiscriminately.&lt;/p&gt;

&lt;p&gt;In general, we use YACE (Yet Another CloudWatch Exporter) to collect metrics from CloudWatch, but it does not support OpenSearch Managed cluster, see &lt;a href="https://github.com/prometheus-community/yet-another-cloudwatch-exporter?tab=readme-ov-file#features" rel="noopener noreferrer"&gt;Features&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Therefore, we will use a standard exporter — CloudWatch Exporter.&lt;/p&gt;

&lt;p&gt;We deploy it from the Helm monitoring chart (see &lt;a href="https://rtfm.co.ua/en/victoriametrics-deploying-a-kubernetes-monitoring-stack/" rel="noopener noreferrer"&gt;VictoriaMetrics: creating a Kubernetes monitoring stack with your own Helm chart&lt;/a&gt;), add a new config to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

prometheus-cloudwatch-exporter:
  enabled: true
  serviceAccount:
    name: "cloudwatch-sa"
    annotations:
      eks.amazonaws.com/sts-regional-endpoints: "true"
  serviceMonitor:
    enabled: true
  config: |-
    region: us-east-1
    metrics:

    - aws_namespace: AWS/ES
      aws_metric_name: KNNGraphMemoryUsage
      aws_dimensions: [ClientId, DomainName, NodeId]
      aws_statistics: [Average]

    - aws_namespace: AWS/ES
      aws_metric_name: SysMemoryUtilization
      aws_dimensions: [ClientId, DomainName, NodeId]
      aws_statistics: [Average]

    - aws_namespace: AWS/ES
      aws_metric_name: JVMMemoryPressure
      aws_dimensions: [ClientId, DomainName, NodeId]
      aws_statistics: [Average]

    - aws_namespace: AWS/ES
      aws_metric_name: OldGenJVMMemoryPressure
      aws_dimensions: [ClientId, DomainName, NodeId]
      aws_statistics: [Average]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that different metrics may have different &lt;code&gt;Dimensions&lt;/code&gt; - check them in CloudWatch:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ntrfwvf28gni62jeqfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ntrfwvf28gni62jeqfb.png" width="573" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploy, check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1rqd5g5uvguewwi3mn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1rqd5g5uvguewwi3mn3.png" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And even the numbers turned out to be as we &lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/#Number_of_vectors" rel="noopener noreferrer"&gt;calculated in the first post&lt;/a&gt; — we have ~130,000 documents in the production index, according to the formula &lt;code&gt;num_vectors * 1.1 * (4*1024 + 8*16)&lt;/code&gt;, which equals 604032000 bytes, or 604.032 megabytes.&lt;/p&gt;
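&lt;p&gt;The arithmetic can be re-checked in one line:&lt;/p&gt;

```shell
# Vector memory estimate: num_vectors * 1.1 * (4*1024 + 8*16) bytes
num_vectors=130000
bytes=$(awk -v n="$num_vectors" 'BEGIN { printf "%d", n * 1.1 * (4*1024 + 8*16) }')
echo "${bytes} bytes"
```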

&lt;p&gt;And on the graph we have 662,261 kilobytes — that’s 662 megabytes, but across all indexes combined.&lt;/p&gt;

&lt;p&gt;Now we have metrics in VictoriaMetrics: &lt;code&gt;aws_es_knngraph_memory_usage_average&lt;/code&gt;, &lt;code&gt;aws_es_sys_memory_utilization_average&lt;/code&gt;, &lt;code&gt;aws_es_jvmmemory_pressure_average&lt;/code&gt;, &lt;code&gt;aws_es_old_gen_jvmmemory_pressure_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Add the rest in the same way.&lt;/p&gt;

&lt;p&gt;To find out what the metrics are called in VictoriaMetrics/Prometheus, open a port to the CloudWatch Exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk port-forward svc/atlas-victoriametrics-prometheus-cloudwatch-exporter 9106
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And search for metrics with &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:9106/metrics | grep aws_es
# HELP aws_es_cluster_status_green_maximum CloudWatch metric AWS/ES ClusterStatus.green Dimensions: [ClientId, DomainName] Statistic: Maximum Unit: Count
# TYPE aws_es_cluster_status_green_maximum gauge
aws_es_cluster_status_green_maximum{job="aws_es",instance="",domain_name="atlas-kb-prod-cluster",client_id="492***148",} 1.0 1758014700000
# HELP aws_es_cluster_status_yellow_maximum CloudWatch metric AWS/ES ClusterStatus.yellow Dimensions: [ClientId, DomainName] Statistic: Maximum Unit: Count
# TYPE aws_es_cluster_status_yellow_maximum gauge
aws_es_cluster_status_yellow_maximum{job="aws_es",instance="",domain_name="atlas-kb-prod-cluster",client_id="492***148",} 0.0 1758014700000
# HELP aws_es_cluster_status_red_maximum CloudWatch metric AWS/ES ClusterStatus.red Dimensions: [ClientId, DomainName] Statistic: Maximum Unit: Count
# TYPE aws_es_cluster_status_red_maximum gauge
aws_es_cluster_status_red_maximum{job="aws_es",instance="",domain_name="atlas-kb-prod-cluster",client_id="492***148",} 0.0 1758014700000
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
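&lt;p&gt;If you prefer not to eyeball the full output, the unique metric names can be pulled from the &lt;code&gt;# HELP&lt;/code&gt; lines of the exposition format; a small sketch, with sample lines shortened from the output above:&lt;/p&gt;

```python
sample = """\
# HELP aws_es_cluster_status_green_maximum CloudWatch metric AWS/ES ClusterStatus.green
# TYPE aws_es_cluster_status_green_maximum gauge
aws_es_cluster_status_green_maximum{domain_name="atlas-kb-prod-cluster"} 1.0
# HELP aws_es_cluster_status_red_maximum CloudWatch metric AWS/ES ClusterStatus.red
# TYPE aws_es_cluster_status_red_maximum gauge
aws_es_cluster_status_red_maximum{domain_name="atlas-kb-prod-cluster"} 0.0
"""

def metric_names(exposition: str) -> list[str]:
    """Collect unique metric names from Prometheus exposition-format text."""
    names = []
    for line in exposition.splitlines():
        if line.startswith("# HELP "):
            name = line.split()[2]
            if name not in names:
                names.append(name)
    return names

print(metric_names(sample))
```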



&lt;h3&gt;
  
  
  Creating a Grafana dashboard
&lt;/h3&gt;

&lt;p&gt;OK, we have metrics from CloudWatch — that’s enough for now.&lt;/p&gt;

&lt;p&gt;Let’s think about what we want to see in Grafana.&lt;/p&gt;

&lt;p&gt;The general idea is to create a kind of dashboard overview, where all the key data for the cluster will be displayed on a single board.&lt;/p&gt;

&lt;p&gt;What metrics are currently available, and how can we use them in Grafana? I wrote them down here so as not to get confused, because there are quite a few of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cluster_status_green_maximum&lt;/code&gt;, &lt;code&gt;aws_es_cluster_status_yellow_maximum&lt;/code&gt;, &lt;code&gt;aws_es_cluster_status_red_maximum&lt;/code&gt;: you can create a single Stats panel&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_nodes_maximum&lt;/code&gt;: also some kind of stats panel - we know how many there should be, and we'll mark it red when there are fewer Data Nodes than there should be.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_searchable_documents_maximum&lt;/code&gt;: just for fun, we will show the number of documents in all indexes together in a graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cpuutilization_average&lt;/code&gt;: one graph per node, and some Stats with general information and different colors&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_free_storage_space_maximum&lt;/code&gt;: just Stats&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cluster_index_writes_blocked_maximum&lt;/code&gt;: did not add to Grafana, only alert&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_jvmmemory_pressure_average&lt;/code&gt;: graph and stats&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_old_gen_jvmmemory_pressure_average&lt;/code&gt;: somewhere nearby, also graph + Stats&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_automated_snapshot_failure_maximum&lt;/code&gt;: this is just for alerting&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_5xx_maximum&lt;/code&gt;: both graph and Stats&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_iops_throttle_maximum&lt;/code&gt;: graph to see in comparison with other data such as CPU/Mem usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_throughput_throttle_maximum&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_high_swap_usage_maximum&lt;/code&gt;: both graph and Stats - graph, to see in comparison with CPU/disks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_read_latency_average&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_write_latency_average&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_read_throughput_average&lt;/code&gt;: I didn't add it because there are too many graphs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_write_throughput_average&lt;/code&gt;: I didn't add it because there are too many graphs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_read_iops_average&lt;/code&gt;: a graph that is useful for understanding how the k-NN cache works - if there is not enough of it (and we tested on &lt;code&gt;t3.small.search&lt;/code&gt; with 2 gigabytes of total memory), then there will be a lot of reading from the disk.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_write_iops_average&lt;/code&gt;: similarly&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_thread_count_average&lt;/code&gt;: I didn't add it because it's pretty static and I didn't see any particularly useful information in it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_search_rate_average&lt;/code&gt;: also just a graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_search_latency&lt;/code&gt;: similarly, somewhere nearby&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_sys_memory_utilization_average&lt;/code&gt;: it will constantly be around 90%, so sooner or later I will remove it from Grafana, but I added it to alerts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_jvmgcyoung_collection_count_average&lt;/code&gt;: graph showing how often it is called&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_jvmgcold_collection_count_average&lt;/code&gt;: graph showing how often it is called&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_primary_write_rejected_average&lt;/code&gt;: graph, but I haven't added it yet because there are too many graphs - only alerts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_replica_write_rejected_average&lt;/code&gt;: graph, but I haven't added it yet because there are too many graphs - only alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;k-NN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knncache_capacity_reached_maximum&lt;/code&gt;: only for warning alerts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knneviction_count_average&lt;/code&gt;: did not add, although it may be interesting&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knngraph_memory_usage_average&lt;/code&gt;: did not add&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knngraph_memory_usage_percentage_maximum&lt;/code&gt;: graph instead of &lt;code&gt;aws_es_knngraph_memory_usage_average&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knngraph_query_errors_maximum&lt;/code&gt;: alert only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knngraph_query_requests_sum&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knnhit_count_maximum&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knnmiss_count_maximum&lt;/code&gt;: graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_knntotal_load_time_sum&lt;/code&gt;: it would be nice to have a graph, but there is no space on the board&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  VictoriaMetrics/Prometheus &lt;code&gt;sum()&lt;/code&gt;, &lt;code&gt;avg()&lt;/code&gt; and &lt;code&gt;max()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;First, let’s recall what functions we have for data aggregation.&lt;/p&gt;

&lt;p&gt;With CloudWatch for OpenSearch, we will receive two main types: &lt;em&gt;counter&lt;/em&gt; and &lt;em&gt;gauge&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:9106/metrics | grep cpuutil
# HELP aws_es_cpuutilization_average CloudWatch metric AWS/ES CPUUtilization Dimensions: [ClientId, DomainName, NodeId] Statistic: Average Unit: Percent
# TYPE aws_es_cpuutilization_average gauge
aws_es_cpuutilization_average{job="aws_es",instance="",domain_name="atlas-kb-prod-cluster",node_id="BzX51PLwSRCJ7GrbgB4VyA",client_id="492***148",} 10.0 1758099600000
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference between them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;counter&lt;/strong&gt;: the value can only increase (or reset to zero)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gauge&lt;/strong&gt;: the value can both increase and decrease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here we have &lt;code&gt;TYPE aws_es_cpuutilization_average gauge&lt;/code&gt;, because CPU usage can both increase and decrease.&lt;/p&gt;

&lt;p&gt;See the excellent documentation from &lt;a href="https://victoriametrics.com/blog/prometheus-monitoring-metrics-counters-gauges-histogram-summaries/" rel="noopener noreferrer"&gt;VictoriaMetrics — Prometheus Metrics Explained: Counters, Gauges, Histograms &amp;amp; Summaries&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How can we use it in graphs?&lt;/p&gt;

&lt;p&gt;If we just look at the values, we have a set of labels here, each forming its own time series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cpuutilization_average{node_id="BzX51PLwSRCJ7GrbgB4VyA"}&lt;/code&gt; == 9&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cpuutilization_average{node_id="IIEcajw5SfmWCXe_AZMIpA"}&lt;/code&gt; == 28&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws_es_cpuutilization_average{node_id="lrsnwK1CQgumpiXfhGq06g"}&lt;/code&gt; == 8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10fhzk5u2n24wkkcq4l5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10fhzk5u2n24wkkcq4l5.png" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;sum()&lt;/code&gt; without a label, we simply get the sum of all values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wkw22l0k7r83q6xolwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wkw22l0k7r83q6xolwk.png" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we do &lt;code&gt;sum by (node_id)&lt;/code&gt;, we get a value for each time series, which coincides with the plain query without &lt;code&gt;sum by ()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mq8jlrmsnsmpgc2kkfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mq8jlrmsnsmpgc2kkfo.png" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(&lt;em&gt;the values change as I write and run the queries&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;max()&lt;/code&gt; without filters, we simply get the maximum value across all the returned time series:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxsblkjlag0fw31n0xjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxsblkjlag0fw31n0xjw.png" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And with &lt;code&gt;avg()&lt;/code&gt; - the average value of all values, i.e., the sum of all values divided by the number of time series:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4t5yuvmkol9vhjzce92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4t5yuvmkol9vhjzce92.png" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s calculate it ourselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(41+46+12)/3
33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually, the reason I decided to write about this separately is because even with &lt;code&gt;sum()&lt;/code&gt; and &lt;code&gt;by (node_id)&lt;/code&gt;, you can sometimes get the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgutjlwd4xvf6dgal3odz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgutjlwd4xvf6dgal3odz.png" width="614" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although without &lt;code&gt;sum()&lt;/code&gt; there are none:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tylj9dw5qpdsoneb059.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tylj9dw5qpdsoneb059.png" width="614" height="774"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And they happened because the CloudWatch Exporter Pod was being recreated at that moment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjrr6ov208hcdsamp65f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjrr6ov208hcdsamp65f.png" width="614" height="731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And at that moment, we were receiving data from the old pod and the new one.&lt;/p&gt;

&lt;p&gt;Therefore, the options here are to use either &lt;code&gt;max()&lt;/code&gt; or just &lt;code&gt;avg()&lt;/code&gt;. Although &lt;code&gt;max()&lt;/code&gt; is probably better, because we are interested in the "worst" indicators.&lt;/p&gt;
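&lt;p&gt;The effect is easy to reproduce by hand: if during a restart both the old and the new exporter Pods briefly report the same node, &lt;code&gt;sum()&lt;/code&gt; adds the duplicates together, while &lt;code&gt;max()&lt;/code&gt; collapses them. A toy sketch with made-up values:&lt;/p&gt;

```python
# Two samples for the same node_id, reported by the old and the new
# exporter Pod during a restart (values are made up for illustration).
samples = [
    {"node_id": "node-a", "pod": "exporter-old", "value": 10.0},
    {"node_id": "node-a", "pod": "exporter-new", "value": 12.0},
]

# sum by (node_id): duplicates are added together -> a fake spike
total = sum(s["value"] for s in samples)

# max by (node_id): the duplicate is collapsed -> a sane value
peak = max(s["value"] for s in samples)

print(total, peak)  # 22.0 12.0
```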

&lt;p&gt;Okay, now that we’ve figured that out, let’s get started on the dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster status
&lt;/h3&gt;

&lt;p&gt;Here, I would like to see all three values — Green, Yellow, and Red — on a single Stats panel.&lt;/p&gt;

&lt;p&gt;But since we don’t have if/else in Grafana, let’s make a workaround.&lt;/p&gt;

&lt;p&gt;We collect all three metrics and multiply the result of each by 1, 2, or 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(aws_es_cluster_status_green_maximum) by (domain_name) * 1 +
sum(aws_es_cluster_status_yellow_maximum) by (domain_name) * 2 +
sum(aws_es_cluster_status_red_maximum) by (domain_name) * 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc81mf8o3pkmwcc4hrq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc81mf8o3pkmwcc4hrq0.png" width="787" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accordingly, if &lt;code&gt;aws_es_cluster_status_green_maximum&lt;/code&gt; == 1, then 1 * 1 == 1, while &lt;code&gt;aws_es_cluster_status_yellow_maximum&lt;/code&gt; == 0 and &lt;code&gt;aws_es_cluster_status_red_maximum&lt;/code&gt; == 0, so their terms return 0.&lt;/p&gt;

&lt;p&gt;And if &lt;code&gt;aws_es_cluster_status_green_maximum&lt;/code&gt; becomes 0, but &lt;code&gt;aws_es_cluster_status_red_maximum&lt;/code&gt; is 1, then 1 * 3 equals 3, and based on the value 3, we will change the indicator in the Stats panel.&lt;/p&gt;
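&lt;p&gt;The same arithmetic as a quick sketch, with 1 for Green, 2 for Yellow and 3 for Red:&lt;/p&gt;

```python
def cluster_status_score(green: int, yellow: int, red: int) -> int:
    """Mimics the Grafana expression: green*1 + yellow*2 + red*3.
    Exactly one of the three flags is 1 at any time."""
    return green * 1 + yellow * 2 + red * 3

print(cluster_status_score(1, 0, 0))  # 1 -> Green
print(cluster_status_score(0, 1, 0))  # 2 -> Yellow
print(cluster_status_score(0, 0, 1))  # 3 -> Red
```

The Value mappings in the Stats panel then translate 1/2/3 into text and colors.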

&lt;p&gt;And add Value mappings with text and colors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdutc1ycrgk5ejczp2s71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdutc1ycrgk5ejczp2s71.png" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get the following result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F298%2F0%2AqPS7pp_v22WMI59M.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F298%2F0%2AqPS7pp_v22WMI59M.png" width="298" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes status
&lt;/h3&gt;

&lt;p&gt;It’s simple here — we know the required number, and we get the current one from &lt;code&gt;aws_es_nodes_maximum&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(aws_es_nodes_maximum) by (domain_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And again, using Value mappings, we set the values and colors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F942%2F0%2Au0JXfDp8RVT58PJ9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F942%2F0%2Au0JXfDp8RVT58PJ9.png" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case we ever increase the number of nodes and forget to update the value for “OK” here, we add a third status, ERR:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F295%2F0%2AQAPAAXBwrGQiX9nF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F295%2F0%2AQAPAAXBwrGQiX9nF.png" width="295" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CPUUtilization: Stats
&lt;/h3&gt;

&lt;p&gt;Here, we will make a summary panel with the Gauge visualization type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(aws_es_cpuutilization_average) by (domain_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set Text size and Unit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ahl6JhgYzwq82g6Zv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ahl6JhgYzwq82g6Zv.png" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And Thresholds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F315%2F0%2A_VkW9lFBRDpHdy7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F315%2F0%2A_VkW9lFBRDpHdy7c.png" width="315" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Description ChatGPT generates pretty well — useful for developers and for us in six months, or we can just take the description from &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The percentage of CPU usage for data nodes in the cluster. Maximum shows the node with the highest CPU usage. Average represents all nodes in the cluster.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Add the rest of the stats:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A2-ABqeGgk62ZpgzK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A2-ABqeGgk62ZpgzK.png" width="800" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CPUUtilization: Graph
&lt;/h3&gt;

&lt;p&gt;Here we will display a graph for the CPU of each node — the average over 5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(avg_over_time(aws_es_cpuutilization_average[5m])) by (node_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is another example of how &lt;code&gt;sum()&lt;/code&gt; created spikes that did not actually exist:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F755%2F0%2Ad3d4p1Jgf4DcvUxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F755%2F0%2Ad3d4p1Jgf4DcvUxi.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, we do &lt;code&gt;max()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Set Gradient mode == Opacity, and Unit == percent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2AbBspuCXjuIRW-lzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2AbBspuCXjuIRW-lzr.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set Color scheme and Thresholds, enable Show thresholds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2AaHWTo0i2iquBAPWB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2AaHWTo0i2iquBAPWB.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Data links, you can set a link to the DataNode Health page in the AWS Console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://us-east-1.console.aws.amazon.com/aos/home?region=us-east-1#opensearch/domains/atlas-kb-prod-cluster/data_Node/${__field.labels.node_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All available fields — Ctrl+Space:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F895%2F0%2ABhUGuRh-PWrpR3y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F895%2F0%2ABhUGuRh-PWrpR3y5.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actions seem to have appeared fairly recently. I haven’t used them yet, but they look interesting: you can trigger something directly from a panel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F721%2F0%2Aze1aDaM1wMfEsK_8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F721%2F0%2Aze1aDaM1wMfEsK_8.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  JVMMemoryPressure: Graph
&lt;/h3&gt;

&lt;p&gt;Here, we are interested in seeing whether memory usage “sticks” and how often the Garbage Collector is launched.&lt;/p&gt;

&lt;p&gt;The query is simple  —  you can do &lt;code&gt;max by (node_id)&lt;/code&gt;, but I just made a general picture for the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(aws_es_jvmmemory_pressure_average)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the graph is similar to the previous one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AhDfHdXlfuZFf1uTk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AhDfHdXlfuZFf1uTk.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Description, add the explanation “when to worry”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Represents the percentage of JVM heap in use (young + old generation).&lt;br&gt;&lt;br&gt;
Values below 75% are normal. Sustained pressure above 80% indicates frequent GC and potential performance degradation.&lt;br&gt;&lt;br&gt;
Values consistently &amp;gt; 85–90% mean heap exhaustion risk and may trigger ClusterIndexWritesBlocked — investigate immediately.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
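&lt;p&gt;If you ever want the same bands in a script, the thresholds from this description can be expressed as a small helper; the 75/85 cut-offs here are my reading of the description above, not values from any AWS API:&lt;/p&gt;

```python
def jvm_pressure_level(pressure_pct: float) -> str:
    """Classify JVMMemoryPressure per the bands in the panel description."""
    if pressure_pct >= 85:
        return "critical"  # heap exhaustion risk, writes may get blocked
    if pressure_pct >= 75:
        return "warning"   # frequent GC, possible performance degradation
    return "normal"

print(jvm_pressure_level(60))  # normal
print(jvm_pressure_level(80))  # warning
print(jvm_pressure_level(92))  # critical
```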

&lt;h3&gt;
  
  
  JVMGCYoungCollectionCount and JVMGCOldCollectionCount
&lt;/h3&gt;

&lt;p&gt;A very useful graph to see how often Garbage Collects are triggered.&lt;/p&gt;

&lt;p&gt;In the query, we will use &lt;code&gt;increase()&lt;/code&gt; over &lt;code&gt;[1m]&lt;/code&gt; to see how the value has changed within a minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(increase(aws_es_jvmgcyoung_collection_count_average[1m])) by (domain_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for Old Gen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(increase(aws_es_jvmgcold_collection_count_average[1m])) by (domain_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unit — ops/sec, Decimals set to 0 to have only integer values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APZEI5qtvVhCrt6Cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APZEI5qtvVhCrt6Cm.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  KNNHitCount vs KNNMissCount
&lt;/h3&gt;

&lt;p&gt;Here, we will convert the data to per-second rates with &lt;code&gt;rate()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(aws_es_knnhit_count_average[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for Cache Miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(aws_es_knnmiss_count_average[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unit ops/s, colors can be set via Overrides:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEF1XXyg0ZtE4myo1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEF1XXyg0ZtE4myo1.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The statistics here, by the way, are very mediocre — there are consistently a lot of cache misses, but we haven’t figured out why yet.&lt;/p&gt;
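&lt;p&gt;A derived number worth watching here is the cache hit ratio, hits divided by hits plus misses; a sketch computing it from the two per-second rates, with purely illustrative values:&lt;/p&gt;

```python
def knn_cache_hit_ratio(hits_per_s: float, misses_per_s: float) -> float:
    """Share of k-NN graph lookups served from the cache (0.0 to 1.0)."""
    total = hits_per_s + misses_per_s
    return hits_per_s / total if total else 0.0

# Illustrative values: 30 hits/s vs 10 misses/s -> 75% hit ratio
print(knn_cache_hit_ratio(30.0, 10.0))  # 0.75
```

In PromQL this would be the hit rate divided by the sum of the hit and miss rates, shown as Percent (0.0-1.0) in Grafana.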

&lt;h3&gt;
  
  
  Final result
&lt;/h3&gt;

&lt;p&gt;We collect all the graphs and get something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A-4Ld9JO_sTGhf8_o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A-4Ld9JO_sTGhf8_o.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;t3.small.search&lt;/code&gt; vs &lt;code&gt;t3.medium.search&lt;/code&gt; on graphs
&lt;/h3&gt;

&lt;p&gt;And here’s an example of how a lack of resources, primarily memory, is reflected in the graphs: we had &lt;code&gt;t3.medium.search&lt;/code&gt;, then we switched back to &lt;code&gt;t3.small.search&lt;/code&gt; to see how it would affect performance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;t3.small.search&lt;/code&gt; is only 2 gigabytes of memory and 2 CPU cores.&lt;/p&gt;

&lt;p&gt;Of these 2 gigabytes of memory, 1 gigabyte was allocated to the JVM Heap, 500 megabytes to the k-NN cache, and about 500 megabytes remained for other processes.&lt;/p&gt;

&lt;p&gt;Well, the results are quite expected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F771%2F0%2AaT6RWS0ZdNDE8_9X.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F771%2F0%2AaT6RWS0ZdNDE8_9X.png" width="771" height="953"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Garbage Collectors started running constantly because it was necessary to clean up the memory that was lacking.&lt;/li&gt;
&lt;li&gt;Read IOPS increased because data was constantly being loaded from the disk to the JVM Heap Young and k-NN.&lt;/li&gt;
&lt;li&gt;Search Latency increased because not all data was in the cache, and I/O operations from the disk were pending.&lt;/li&gt;
&lt;li&gt;and CPU utilization jumped — because the CPU was loaded with Garbage Collectors and reading from the disk&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating Alerts
&lt;/h3&gt;

&lt;p&gt;You can also check out the recommendations from &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html" rel="noopener noreferrer"&gt;AWS — Recommended CloudWatch alarms for Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenSearch ClusterStatus Yellow and OpenSearch ClusterStatus Red: here, the alert simply fires if the value is greater than 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: OpenSearch ClusterStatus Yellow
        expr: sum(aws_es_cluster_status_yellow_maximum) by (domain_name, node_id) &amp;gt; 0
        for: 1s
        labels:
          severity: warning
          component: backend
          environment: prod
        annotations:
          summary: 'OpenSearch ClusterStatus Yellow status detected'
          description: |-
            The primary shards for all indexes are allocated to nodes in the cluster, but replica shards for at least one index are not
            *OpenSearch Domain*: `{{ "{{" }} $labels.domain_name }}`
          grafana_opensearch_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/b2d2dabd-a6b4-4a8a-b795-270b3e200a2e/aws-opensearch-cluster-cloudwatch'

      - alert: OpenSearch ClusterStatus Red
        expr: sum(aws_es_cluster_status_red_maximum) by (domain_name, node_id) &amp;gt; 0
        for: 1s
        labels:
          severity: critical
          component: backend
          environment: prod
        annotations:
          summary: 'OpenSearch ClusterStatus RED status detected!'
          description: |-
            The primary and replica shards for at least one index are not allocated to nodes in the cluster
            *OpenSearch Domain*: `{{ "{{" }} $labels.domain_name }}`
          grafana_opensearch_overview_url: 'https://{{ .Values.monitoring.root_url }}/d/b2d2dabd-a6b4-4a8a-b795-270b3e200a2e/aws-opensearch-cluster-cloudwatch'
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Through &lt;code&gt;labels&lt;/code&gt;, we have implemented alert routing in Opsgenie to the necessary Slack channels, and the annotation &lt;code&gt;grafana_opensearch_overview_url&lt;/code&gt; is used to add a link to Grafana in a Slack message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F380%2F0%2A3tdmffpqyMNR-vjW.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F380%2F0%2A3tdmffpqyMNR-vjW.png" width="380" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenSearch CPUHigh — if more than 20% for 10 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: OpenSearch CPUHigh
        expr: sum(aws_es_cpuutilization_average) by (domain_name, node_id) &amp;gt; 20
        for: 10m
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenSearch Data Node down — if the node is down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: OpenSearch Data Node down
        expr: sum(aws_es_nodes_maximum) by (domain_name) &amp;lt; 3
        for: 1s
        labels:
          severity: critical
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;aws_es_free_storage_space_maximum&lt;/code&gt; - we don't need it yet.&lt;/p&gt;
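If disk space ever becomes a concern, a possible alert on it could look like the sketch below. The threshold is an example value, not a tested one; the CloudWatch &lt;code&gt;FreeStorageSpace&lt;/code&gt; metric is reported in megabytes:

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: OpenSearch FreeStorageSpace Low
        # take the worst (lowest) value across nodes; 10240 MB = 10 GiB left
        expr: min(aws_es_free_storage_space_maximum) by (domain_name) &amp;lt; 10240
        for: 5m
        labels:
          severity: warning
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;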

&lt;p&gt;OpenSearch Blocking Write — alert us if write blocks have started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: OpenSearch Blocking Write
        expr: sum(aws_es_cluster_index_writes_blocked_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: critical
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the rest of the alerts I’ve added so far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: OpenSearch AutomatedSnapshotFailure 
        expr: sum(aws_es_automated_snapshot_failure_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: critical
...
      - alert: OpenSearch 5xx Errors 
        expr: sum(aws_es_5xx_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: critical
...
      - alert: OpenSearch IopsThrottled
        expr: sum(aws_es_iops_throttle_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: warning
...
      - alert: OpenSearch ThroughputThrottled
        expr: sum(aws_es_throughput_throttle_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: warning
...
      - alert: OpenSearch SysMemoryUtilization High Warning
        expr: avg(aws_es_sys_memory_utilization_average) by (domain_name) &amp;gt;= 95
        for: 5m
        labels:
          severity: warning
...
      - alert: OpenSearch PrimaryWriteRejected High
        expr: sum(aws_es_primary_write_rejected_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: critical
...
      - alert: OpenSearch KNNGraphQueryErrors High
        expr: sum(aws_es_knngraph_query_errors_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: critical
...
      - alert: OpenSearch KNNCacheCapacityReached
        expr: sum(aws_es_knncache_capacity_reached_maximum) by (domain_name) &amp;gt;= 1
        for: 1s
        labels:
          severity: warning
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we use it, we’ll see what else we can add.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/aws-monitoring-aws-opensearch-service-cluster-with-cloudwatch/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Terraform: creating an AWS OpenSearch Service cluster and users</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Tue, 30 Dec 2025 10:00:00 +0000</pubDate>
      <link>https://forem.com/aws-heroes/terraform-creating-an-aws-opensearch-service-cluster-and-users-4786</link>
      <guid>https://forem.com/aws-heroes/terraform-creating-an-aws-opensearch-service-cluster-and-users-4786</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" width="480" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first part, we covered the basics of AWS OpenSearch Service in general and the types of instances for Data Nodes — &lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/" rel="noopener noreferrer"&gt;AWS: Getting Started with OpenSearch Service as a Vector Store&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the second part, we covered access, &lt;a href="https://rtfm.co.ua/en/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization/" rel="noopener noreferrer"&gt;AWS: Creating an OpenSearch Service Cluster and Configuring Authentication and Authorization&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let’s write Terraform code to create a cluster, users, and indexes.&lt;/p&gt;

&lt;p&gt;We will create the cluster in VPC and use the internal user database for authentication.&lt;/p&gt;

&lt;p&gt;But it turns out you can’t use a VPC here. Because, surprise: AWS Bedrock requires the OpenSearch Managed Cluster to be public, not in a VPC.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The OpenSearch Managed Cluster you provided is not supported because it is VPC protected. Your cluster must be behind a public network.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wrote to AWS technical support, and they said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;However, there is an ongoing product feature request (PFR) to have Bedrock KnowledgeBases support provisioned Open Search clusters in VPC.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And they suggest using Amazon OpenSearch Serverless, which we are actually running away from because the prices are ridiculous.&lt;/p&gt;

&lt;p&gt;The second problem arose when I started writing the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/bedrockagent_knowledge_base" rel="noopener noreferrer"&gt;&lt;code&gt;bedrockagent_knowledge_base&lt;/code&gt;&lt;/a&gt; resource: it does not support &lt;code&gt;storage_configuration&lt;/code&gt; with type &lt;code&gt;OPENSEARCH_MANAGED&lt;/code&gt;, only Serverless.&lt;/p&gt;

&lt;p&gt;But &lt;a href="https://github.com/hashicorp/terraform-provider-aws/pull/44060" rel="noopener noreferrer"&gt;Pull Request for this already exists&lt;/a&gt;, maybe someday they will approve it.&lt;br&gt;
&lt;em&gt;(UPD: this was already merged)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, we will create an OpenSearch Managed Service cluster with three indexes: Dev/Staging/Prod.&lt;/p&gt;

&lt;p&gt;The cluster will have three small data nodes, and each index will have 1 primary shard and 1 replica. The project is small: the data in our Production index on AWS OpenSearch Serverless, from which we want to migrate to AWS OpenSearch Service, is currently only 2 GiB and is unlikely to grow significantly in the future.&lt;/p&gt;

&lt;p&gt;It would be good to create the cluster in our own Terraform module to make it easier to create some test environments, as I did for AWS EKS, but there isn’t much time for that right now, so we’ll just use tf files with a separate &lt;code&gt;prod.tfvars&lt;/code&gt; for variables.&lt;/p&gt;

&lt;p&gt;Maybe later I’ll write separately about transferring it to our own module, because it’s really convenient.&lt;/p&gt;

&lt;p&gt;In the next part, we’ll talk about monitoring, because our Production has already crashed once :-)&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Terraform files structure&lt;/li&gt;
&lt;li&gt;Project planning&lt;/li&gt;
&lt;li&gt;Creating a cluster&lt;/li&gt;
&lt;li&gt;Custom endpoint configuration&lt;/li&gt;
&lt;li&gt;Terraform Outputs&lt;/li&gt;
&lt;li&gt;Creating OpenSearch Users&lt;/li&gt;
&lt;li&gt;Error: elastic: Error 403 (Forbidden)&lt;/li&gt;
&lt;li&gt;Creating Internal Users&lt;/li&gt;
&lt;li&gt;Internal database users&lt;/li&gt;
&lt;li&gt;Adding IAM Users&lt;/li&gt;
&lt;li&gt;Creating AWS Bedrock IAM Roles and OpenSearch Role mappings&lt;/li&gt;
&lt;li&gt;Creating OpenSearch indexes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Terraform files structure
&lt;/h3&gt;

&lt;p&gt;The initial file and directory structure of the project is as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tree .
.
├── README.md
└── terraform
    ├── Makefile
    ├── backend.tf
    ├── data.tf
    ├── envs
    │   └── prod
    │       └── prod.tfvars
    ├── locals.tf
    ├── outputs.tf
    ├── providers.tf
    ├── variables.tf
    └── versions.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the &lt;code&gt;providers.tf&lt;/code&gt; - provider settings, currently only AWS, and through it we set the default tags:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      component   = var.component
      created-by  = "terraform"
      environment = var.environment
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the &lt;code&gt;data.tf&lt;/code&gt;, we collect the AWS Account ID, Availability Zones, VPC, and the private subnets in which we will eventually create the cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_caller_identity" "current" {}

data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_vpc" "eks_vpc" {
  id = var.vpc_id
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }

  tags = {
    subnet-type = "private"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;File &lt;code&gt;variables.tf&lt;/code&gt; with our default variables, then we will add new ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "aws_region" {
  type = string
}

variable "project_name" {
  description = "A project name to be used in resources"
  type        = string
}

variable "component" {
  description = "A team using this project (backend, web, ios, data, devops)"
  type        = string
}

variable "environment" {
  description = "Dev/Prod, will be used in AWS resources Name tag, and resources names"
  type        = string
}

variable "vpc_id" {
  type        = string
  description = "A VPC ID to be used to create OpenSearch cluster and its Nodes"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We pass the variable values through a separate &lt;code&gt;prod.tfvars&lt;/code&gt; file; then, if necessary, we can create a new environment with a file like &lt;code&gt;envs/test/test.tfvars&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_region   = "us-east-1"
project_name = "atlas-kb"
component    = "backend"
environment  = "prod"
vpc_id       = "vpc-0fbaffe234c0d81ea"
dns_zone     = "prod.example.co"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
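&lt;p&gt;If we later add a test environment, its tfvars file would follow the same structure — all values below are hypothetical placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# envs/test/test.tfvars
aws_region   = "us-east-1"
project_name = "atlas-kb"
component    = "backend"
environment  = "test"
vpc_id       = "vpc-00000000000000000"
dns_zone     = "test.example.co"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;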

&lt;p&gt;In the &lt;code&gt;Makefile&lt;/code&gt;, we simplify our local life:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### PROD
init-prod:
  terraform init -reconfigure -backend-config="key=prod/atlas-knowledge-base-prod.tfstate"

plan-prod:
  terraform plan -var-file=envs/prod/prod.tfvars

apply-prod:
  terraform apply -var-file=envs/prod/prod.tfvars

# destroy-prod:
#   terraform destroy -var-file=envs/prod/prod.tfvars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What files will be next?&lt;/p&gt;

&lt;p&gt;We will also have AWS Bedrock, which will need access configured through its IAM Role. I will not cover Bedrock here, because it is a separate topic. And since Terraform does not yet support &lt;code&gt;OPENSEARCH_MANAGED&lt;/code&gt;, we created it manually and will later run &lt;a href="https://rtfm.co.ua/en/terraform-using-import-and-some-hiden-pitfalls/" rel="noopener noreferrer"&gt;terraform import&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will create indexes, users for our Backend API, and Bedrock IAM Role mappings in OpenSearch’s internal database through Terraform OpenSearch Provider to simplify OpenSearch Dashboards access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project planning
&lt;/h3&gt;

&lt;p&gt;We can create a cluster from the Terraform resource &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/opensearch_domain" rel="noopener noreferrer"&gt;&lt;code&gt;aws_opensearch_domain&lt;/code&gt;&lt;/a&gt;, or we can use ready-made modules, such as the &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/opensearch/aws/latest" rel="noopener noreferrer"&gt;opensearch&lt;/a&gt; from &lt;a href="https://www.linkedin.com/in/antonbabenko/" rel="noopener noreferrer"&gt;@Anton Babenko&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s take Anton’s module, because I use his modules a lot, and everything works great.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a cluster
&lt;/h3&gt;

&lt;p&gt;Examples — &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-opensearch/tree/master/examples" rel="noopener noreferrer"&gt;terraform-aws-opensearch/tree/master/examples&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Add a variable with cluster parameters to the &lt;code&gt;variables.tf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

variable "cluser_options" {
  description = "A map of options to configure the OpenSearch cluster"
  type = object({
    instance_type                = string
    instance_count               = number
    volume_size                  = number
    volume_type                  = string
    engine_version               = string
    auto_software_update_enabled = bool
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And a value in &lt;code&gt;prod.tfvars&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

cluser_options = {
  instance_type                = "t3.small.search"
  instance_count               = 3
  volume_size                  = 50
  volume_type                  = "gp3"
  engine_version               = "OpenSearch_2.19"
  auto_software_update_enabled = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;t3.small.search&lt;/code&gt; instances are the most minimal and sufficient for us at this time, although there are limitations for &lt;code&gt;t3&lt;/code&gt;, such as the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/auto-tune.html" rel="noopener noreferrer"&gt;AWS OpenSearch Auto-tune&lt;/a&gt; feature not being supported.&lt;/p&gt;

&lt;p&gt;In general, &lt;code&gt;t3&lt;/code&gt; is not intended for production use cases. See also &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html" rel="noopener noreferrer"&gt;Operational best practices for Amazon OpenSearch Service&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-instance-types.html#latest-gen" rel="noopener noreferrer"&gt;Current generation instance types&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;Amazon OpenSearch Service quotas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I set the version here to 2.19, but 3.1 was added just a few days ago — see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version" rel="noopener noreferrer"&gt;Supported versions of Elasticsearch and OpenSearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We take three nodes so that the cluster can select a cluster manager node if one node fails, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-multiaz.html" rel="noopener noreferrer"&gt;Dedicated master node distribution&lt;/a&gt;, &lt;a href="https://www.instaclustr.com/blog/learning-opensearch-from-scratch-part-2-digging-deeper/" rel="noopener noreferrer"&gt;Learning OpenSearch from scratch, part 2: Digging deeper&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/blogs/big-data/enhance-stability-with-dedicated-cluster-manager-nodes-using-amazon-opensearch-service/" rel="noopener noreferrer"&gt;Enhance stability with dedicated cluster manager nodes using Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Contents of the &lt;code&gt;locals.tf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  # 'atlas-kb-prod'
  env_name = "${var.project_name}-${var.environment}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most of the &lt;code&gt;locals&lt;/code&gt; will live right here, but some that are very "local" to a particular piece of code will be placed in the corresponding resource files.&lt;/p&gt;

&lt;p&gt;Add the file &lt;code&gt;opensearch_users.tf&lt;/code&gt; - for now, there is only a root user here, and the password is stored in AWS Parameter Store (instead of AWS Secrets Manager - "that's just how it happened historically"):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### ROOT

# generate root password
# waiting for write-only: https://github.com/hashicorp/terraform-provider-aws/pull/43621
# then will update it with the ephemeral type
resource "random_password" "os_master_password" {
  length  = 16
  special = true
}

# store the root password in AWS Parameter Store
resource "aws_ssm_parameter" "os_master_password" {
  name        = "/${var.environment}/${local.env_name}-root-password"
  description = "OpenSearch cluster master password"
  type        = "SecureString"
  value       = random_password.os_master_password.result
  overwrite   = true
  tier        = "Standard"

  lifecycle {
    ignore_changes = [value] # to prevent diff every time password is regenerated
  }
}

data "aws_ssm_parameter" "os_master_password" {
  name            = "/${var.environment}/${local.env_name}-root-password"
  with_decryption = true

  depends_on = [aws_ssm_parameter.os_master_password]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
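&lt;p&gt;After the apply, the generated password can be read back with the AWS CLI. For our &lt;code&gt;prod&lt;/code&gt; environment, the parameter name resolves to &lt;code&gt;/prod/atlas-kb-prod-root-password&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws ssm get-parameter \
    --name "/prod/atlas-kb-prod-root-password" \
    --with-decryption \
    --query 'Parameter.Value' \
    --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;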

&lt;p&gt;Let’s write the &lt;code&gt;opensearch_cluster.tf&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;I left the config for VPC here for future reference and just as an example, although it will not be possible to transfer an already created cluster to VPC — you will have to create a new one, see &lt;strong&gt;Limitations&lt;/strong&gt; in the documentation &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html#vpc-limitations" rel="noopener noreferrer"&gt;Launching your Amazon OpenSearch Service domains within a VPC&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "opensearch" {
  source  = "terraform-aws-modules/opensearch/aws"
  version = "~&amp;gt; 2.0.0"

  # enable Fine-grained access control
  # by using the internal user database, we'll simplify access to the Dashboards
  # for backend API Kubernetes Pods, will use Kubernetes Secrets with username:password from AWS Parameter Store
  advanced_security_options = {
    enabled                        = true
    anonymous_auth_enabled         = false
    internal_user_database_enabled = true

    master_user_options = {
      master_user_name     = "os_root"
      master_user_password = data.aws_ssm_parameter.os_master_password.value
    }
  }

  # can't be used with t3 instances
  auto_tune_options = {
    desired_state = "DISABLED"
  }

  # have three data nodes - t3.small.search nodes in two AZs
  # will use 3 indexes - dev/stage/prod with 1 shard and 1 replica each
  cluster_config = {
    instance_count           = var.cluser_options.instance_count
    dedicated_master_enabled = false
    instance_type            = var.cluser_options.instance_type

    # put both data-nodes in different AZs
    zone_awareness_config = {
      availability_zone_count = 2
    }

    zone_awareness_enabled = true
  }

  # the cluster's name
  # 'atlas-kb-prod'
  domain_name = "${local.env_name}-cluster"

  # 50 GiB for each Data Node
  ebs_options = {
    ebs_enabled = true
    volume_type = var.cluser_options.volume_type
    volume_size = var.cluser_options.volume_size
  }

  encrypt_at_rest = {
    enabled = true
  }

  # latest for today:
  # https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version
  engine_version = var.cluser_options.engine_version

  # enable CloudWatch logs for Index and Search slow logs
  # TODO: collect to VictoriaLogs or Loki, and create metrics and alerts
  log_publishing_options = [
    { log_type = "INDEX_SLOW_LOGS" },
    { log_type = "SEARCH_SLOW_LOGS" },
  ]

  ip_address_type = "ipv4"

  node_to_node_encryption = {
    enabled = true
  }

  # allow minor version updates automatically
  # will be performed during off-peak windows
  software_update_options = {
    auto_software_update_enabled = var.cluser_options.auto_software_update_enabled
  }

  # DO NOT use 'atlas-vpc-ops' VPC and its private subnets
  # &amp;gt; "The OpenSearch Managed Cluster you provided is not supported because it is VPC protected. Your cluster must be behind a public network."
  # vpc_options = {
  #   subnet_ids = data.aws_subnets.private.ids
  # }

  # # VPC endpoint to access from Kubernetes Pods
  # vpc_endpoints = {
  #   one = {
  #     subnet_ids = data.aws_subnets.private.ids
  #   }
  # }

  # Security Group rules to allow access from the VPC only
  # security_group_rules = {
  #   ingress_443 = {
  #     type        = "ingress"
  #     description = "HTTPS access from VPC"
  #     from_port   = 443
  #     to_port     = 443
  #     ip_protocol = "tcp"
  #     cidr_ipv4   = data.aws_vpc.ops_vpc.cidr_block
  #   }
  # }

  # Access policy
  # necessary to allow access for AWS user to the Dashboards
  access_policy_statements = [
    {
      effect = "Allow"

      principals = [{
        type        = "*"
        identifiers = ["*"]
      }]

      actions = ["es:*"]
    }
  ]

  # 'atlas-kb-ops-os-cluster'
  tags = {
    Name = "${var.project_name}-${var.environment}-os-cluster"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Basically, everything is described in the comments, but in short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enable &lt;a href="https://rtfm.co.ua/en/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization/#Fine-grained_access_control" rel="noopener noreferrer"&gt;fine-grained access control&lt;/a&gt; and a local user database&lt;/li&gt;
&lt;li&gt;three data nodes, each with 50 gigabytes of disk space, in different Availability Zones&lt;/li&gt;
&lt;li&gt;enable logs in CloudWatch&lt;/li&gt;
&lt;li&gt;create a cluster in private subnets&lt;/li&gt;
&lt;li&gt;allow access for everyone in the Domain Access Policy&lt;/li&gt;
&lt;li&gt;well, that’s it for now… we can’t use Security Groups because we’re not in a VPC, and we can’t create an IP-based policy either, because we don’t know Bedrock’s CIDR ranges&lt;/li&gt;
&lt;li&gt;alternatively, in the &lt;code&gt;principals.identifiers&lt;/code&gt; we could limit access to our IAM Users plus the Bedrock IAM Role&lt;/li&gt;
&lt;/ul&gt;
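&lt;p&gt;If we do decide to tighten the Domain Access Policy later, one option is to list specific IAM principals instead of &lt;code&gt;"*"&lt;/code&gt; — a sketch, where the user and role names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  access_policy_statements = [
    {
      effect = "Allow"

      principals = [{
        type = "AWS"
        identifiers = [
          # hypothetical names - replace with real IAM Users and the Bedrock IAM Role
          "arn:aws:iam::${data.aws_caller_identity.current.account_id}:user/devops-user",
          "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/bedrock-kb-role",
        ]
      }]

      actions = ["es:*"]
    }
  ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;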

&lt;p&gt;Run the cluster creation and go have some tea, as the process takes around 20 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom endpoint configuration
&lt;/h3&gt;

&lt;p&gt;After creating the cluster, check access to the Dashboards. If everything is OK, add a custom endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: Custom endpoints have their own quirks: in the Terraform OpenSearch Provider, you need to use the custom endpoint URL, but in the AWS Bedrock Knowledge Base, you need to use the default cluster URL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To do this, we need to create a certificate in AWS Certificate Manager, and add a new record in Route53.&lt;/p&gt;

&lt;p&gt;I expected a possible chicken-and-egg problem here, because the Custom Endpoint settings depend on AWS ACM and a record in AWS Route53, while the Route53 record depends on the cluster, since it points to the cluster’s endpoint.&lt;/p&gt;

&lt;p&gt;But no, if you create a new cluster with the settings described below, everything is created correctly: first, the certificate in AWS ACM, then the cluster with Custom Endpoint, then the record in Route53 with CNAME to the cluster default URL.&lt;/p&gt;

&lt;p&gt;Add a new &lt;code&gt;local&lt;/code&gt; - &lt;code&gt;os_custom_domain_name&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  # 'atlas-kb-prod'
  env_name = "${var.project_name}-${var.environment}"
  # 'opensearch.prod.example.co'
  os_custom_domain_name = "opensearch.${var.dns_zone}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add the Route53 zone data retrieval to the &lt;code&gt;data.tf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

data "aws_route53_zone" "zone" {
  name = var.dns_zone
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add certificate creation and Route53 entry to the &lt;code&gt;opensearch_cluster.tf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TLS for the Custom Domain
module "prod_opensearch_acm" {
  source  = "terraform-aws-modules/acm/aws"
  version = "~&amp;gt; 6.0"

  # 'opensearch.example.co'
  domain_name = local.os_custom_domain_name
  zone_id     = data.aws_route53_zone.zone.zone_id

  validation_method   = "DNS"
  wait_for_validation = true

  tags = {
    Name = local.os_custom_domain_name
  }
}

resource "aws_route53_record" "opensearch_domain_endpoint" {
  zone_id = data.aws_route53_zone.zone.zone_id
  name    = local.os_custom_domain_name
  type    = "CNAME"
  ttl     = 300
  records = [module.opensearch.domain_endpoint]
}

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And in the &lt;code&gt;module "opensearch"&lt;/code&gt;, add the custom endpoint settings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  domain_endpoint_options = {
    custom_endpoint_certificate_arn = module.prod_opensearch_acm.acm_certificate_arn
    custom_endpoint_enabled         = true
    custom_endpoint                 = local.os_custom_domain_name
    tls_security_policy             = "Policy-Min-TLS-1-2-2019-07"
  }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;terraform init&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt;, check the settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ymu1102tbgsou4eqeu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ymu1102tbgsou4eqeu5.png" width="538" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And check access to the Dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform Outputs
&lt;/h3&gt;

&lt;p&gt;Let’s add some outputs.&lt;/p&gt;

&lt;p&gt;For now, just for ourselves, but later we may use them in imports from other projects, see &lt;a href="https://rtfm.co.ua/en/terraform-terraform_remote_state-getting-outputs-from-other-state-files/" rel="noopener noreferrer"&gt;Terraform: terraform_remote_state — getting outputs from other state files&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "vpc_id" {
  value = var.vpc_id
}

output "cluster_arn" {
  value = module.opensearch.domain_arn
}

output "opensearch_domain_endpoint_cluster" {
  value = "https://${module.opensearch.domain_endpoint}"
}

output "opensearch_domain_endpoint_custom" {
  value = "https://${local.os_custom_domain_name}"
}

output "opensearch_root_username" {
  value = "os_root"
}

output "opensearch_root_user_password_secret_name" {
  value = "/${var.environment}/${local.env_name}-root-password"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
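&lt;p&gt;As a sketch of how another project might consume these outputs via &lt;code&gt;terraform_remote_state&lt;/code&gt; (the bucket and key here are hypothetical, assuming an S3 backend):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "terraform_remote_state" "opensearch" {
  backend = "s3"
  config = {
    bucket = "example-tfstate-bucket"
    key    = "opensearch/terraform.tfstate"
    region = "us-east-1"
  }
}

# then reference, for example:
# data.terraform_remote_state.opensearch.outputs.opensearch_domain_endpoint_custom
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;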

&lt;h3&gt;
  
  
  Creating OpenSearch Users
&lt;/h3&gt;

&lt;p&gt;All that’s left now are users and indexes.&lt;/p&gt;

&lt;p&gt;We will have two types of users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;regular users from the OpenSearch internal database — for our Backend API in Kubernetes (actually, we later switched to IAM Roles, which are mapped to the Backend via &lt;a href="https://rtfm.co.ua/aws-eks-pod-identities-zamina-irsa-sproshhuyemo-menedzhment-iam-dostupiv/" rel="noopener noreferrer"&gt;EKS Pod Identities&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;and users (IAM Role) for Bedrock — there will be three Knowledge Bases, each with its own IAM Role, for which we will need to add an OpenSearch Role and map it to IAM roles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s start with regular users.&lt;/p&gt;

&lt;p&gt;Add a provider, in my case it is in the &lt;code&gt;versions.tf&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {

  required_version = "~&amp;gt; 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&amp;gt; 6.0"
    }
    opensearch = {
      source  = "opensearch-project/opensearch"
      version = "~&amp;gt; 2.3"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the &lt;code&gt;providers.tf&lt;/code&gt; file, describe access to the cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

provider "opensearch" {
  url         = "https://${local.os_custom_domain_name}"
  username    = "os_root"
  password    = data.aws_ssm_parameter.os_master_password.value
  healthcheck = false
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Error: elastic: Error 403 (Forbidden)
&lt;/h3&gt;

&lt;p&gt;Here is an important point about the &lt;code&gt;url&lt;/code&gt; value in the provider configuration. I wrote about it above, and now I will show you how it looks.&lt;/p&gt;

&lt;p&gt;At first, I set the &lt;code&gt;provider.url&lt;/code&gt; to the module's output, i.e. &lt;code&gt;module.opensearch.domain_endpoint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Because of this, I got a 403 error when I tried to create users:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
opensearch_user.os_kraken_dev_user: Creating...
opensearch_role.os_kraken_dev_role: Creating...
╷
│ Error: elastic: Error 403 (Forbidden)
│
│ with opensearch_user.os_kraken_dev_user,
│ on opensearch_users.tf line 23, in resource "opensearch_user" "os_kraken_dev_user":
│ 23: resource "opensearch_user" "os_kraken_dev_user" {
│
╵
╷
│ Error: elastic: Error 403 (Forbidden)
│
│ with opensearch_role.os_kraken_dev_role,
│ on opensearch_users.tf line 30, in resource "opensearch_role" "os_kraken_dev_role":
│ 30: resource "opensearch_role" "os_kraken_dev_role" {
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So, set the URL as the FQDN we configured for the Custom Endpoint, something like &lt;code&gt;url = "https://opensearch.example.com"&lt;/code&gt; - and everything works well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Internal Users
&lt;/h3&gt;

&lt;p&gt;Now for the users themselves.&lt;/p&gt;

&lt;p&gt;There will be three of them — &lt;em&gt;dev&lt;/em&gt;, &lt;em&gt;staging&lt;/em&gt;, &lt;em&gt;prod&lt;/em&gt;, each with access to the corresponding index.&lt;/p&gt;

&lt;p&gt;Here we will use &lt;a href="https://registry.terraform.io/providers/opensearch-project/opensearch/latest/docs/resources/user" rel="noopener noreferrer"&gt;&lt;code&gt;opensearch_user&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If the cluster is created in VPC, a VPN connection is required so that the provider can connect to the cluster.&lt;/p&gt;

&lt;p&gt;Add &lt;a href="https://rtfm.co.ua/en/terraform-introduction-to-data-types-primitives-and-complex/#list" rel="noopener noreferrer"&gt;list()&lt;/a&gt; to the &lt;code&gt;variables.tf&lt;/code&gt; with a list of environments:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

variable "app_environments" {
  type        = list(string)
  description = "The Application's environments, to be used to create Dev/Staging/Prod DynamoDB tables, etc"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And the value in &lt;code&gt;prod.tfvars&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

app_environments = [
  "dev",
  "staging",
  "prod"
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Internal database users
&lt;/h3&gt;

&lt;p&gt;At first, I planned to just use local users, and that option stayed in this post, so let it be. Next, I will show how we did it in the end, with IAM Users and IAM Roles.&lt;/p&gt;

&lt;p&gt;In the file &lt;code&gt;opensearch_users.tf&lt;/code&gt;, add three passwords, three users, and three roles to which we map users in loops - each role with access to its own index:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

### KRAKEN

resource "random_password" "os_kraken_password" {
  for_each = toset(var.app_environments)
  length   = 16
  special  = true
}

# store the user passwords in AWS Parameter Store
resource "aws_ssm_parameter" "os_kraken_password" {
  for_each = toset(var.app_environments)

  name        = "/${var.environment}/${local.env_name}-kraken-${each.key}-password"
  description = "OpenSearch cluster Backend ${each.key} password"
  type        = "SecureString"
  value       = random_password.os_kraken_password[each.key].result
  overwrite   = true
  tier        = "Standard"

  lifecycle {
    ignore_changes = [value] # to prevent diff every time password is regenerated
  }
}

# Create a user
resource "opensearch_user" "os_kraken_user" {
  for_each = toset(var.app_environments)

  username    = "os_kraken_${each.key}"
  password    = random_password.os_kraken_password[each.key].result
  description = "Backend EKS ${each.key} user"

  depends_on = [module.opensearch]
}

# And a full user, role and role mapping example:
resource "opensearch_role" "os_kraken_role" {
  for_each = toset(var.app_environments)

  role_name   = "os_kraken_${each.key}_role"
  description = "Backend EKS ${each.key} role"

  cluster_permissions = [
    "indices:data/read/msearch",
    "indices:data/write/bulk*",
    "indices:data/read/mget*"
  ]
  index_permissions {
    index_patterns  = ["kraken-kb-index-${each.key}"]
    allowed_actions = ["*"]
  }

  depends_on = [module.opensearch]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In &lt;code&gt;cluster_permissions&lt;/code&gt;, we add permissions that are required for both the index level and the cluster level, because Bedrock did not work without them, see &lt;a href="https://docs.opensearch.org/latest/security/access-control/permissions/#cluster-wide-index-permissions" rel="noopener noreferrer"&gt;Cluster wide index permissions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Deploy and check in Dashboards:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffavv379fckl9srhrhrho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffavv379fckl9srhrhrho.png" width="525" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding IAM Users
&lt;/h3&gt;

&lt;p&gt;The idea here is the same, except that instead of regular users with login:password authentication, IAM Users and Roles are used.&lt;/p&gt;

&lt;p&gt;More on the role for Bedrock later, but for now, let’s add user mapping.&lt;/p&gt;

&lt;p&gt;What we need to do is take a list of our Backend team users, give them an IAM Policy with access to OpenSearch, and then add mapping to a local role in the OpenSearch internal users database.&lt;/p&gt;

&lt;p&gt;For now, we can use the local role &lt;code&gt;all_access&lt;/code&gt;, although it would be better to write our own later. See &lt;a href="https://docs.opensearch.org/latest/security/access-control/users-roles/#predefined-roles" rel="noopener noreferrer"&gt;Predefined roles&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html#fgac-master-user" rel="noopener noreferrer"&gt;About the master user&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Add a new variable to the &lt;code&gt;variables.tf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

variable "backend_team_users_arns" {
  type = list(string)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Its value in the &lt;code&gt;prod.tfvars&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

backend_team_users_arns = [
  "arn:aws:iam::492***148:user/arseny",
  "arn:aws:iam::492***148:user/misha",
  "arn:aws:iam::492***148:user/oleksii",
  "arn:aws:iam::492***148:user/vladimir",
  "os_root"
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, we had to include the &lt;code&gt;os_root&lt;/code&gt; user in the list, because otherwise the mapping would remove it from the role.&lt;/p&gt;

&lt;p&gt;So, it’s better to make normal roles — but for MVP, it’s okay.&lt;/p&gt;
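&lt;p&gt;For example, a dedicated role for the team could look something like this (a hypothetical sketch using the built-in &lt;code&gt;cluster_composite_ops_ro&lt;/code&gt; and &lt;code&gt;read&lt;/code&gt; action groups, not the configuration we deployed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "opensearch_role" "backend_team_role" {
  role_name   = "backend_team_role"
  description = "Read-only access for the Backend team"

  # read-only cluster-level operations
  cluster_permissions = ["cluster_composite_ops_ro"]

  index_permissions {
    index_patterns  = ["kraken-kb-index-*"]
    allowed_actions = ["read"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;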

&lt;p&gt;And we add the mapping of these IAM Users to the role &lt;code&gt;all_access&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

### BACKEND TEAM

resource "opensearch_roles_mapping" "all_access_mapping" {
  role_name = "all_access"

  users = var.backend_team_users_arns
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Deploy, check the &lt;code&gt;all_access&lt;/code&gt; role:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo93gtrfzg4tqnuflzov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo93gtrfzg4tqnuflzov.png" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: ChatGPT stubbornly insisted on adding IAM Users to Backend Roles, but no, and this is clearly stated in the documentation: you need to add them to Users, see&lt;/em&gt; &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html#fgac-more-masters" rel="noopener noreferrer"&gt;&lt;em&gt;Additional master users&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And for all the IAM Users we need to add an IAM policy with access.&lt;/p&gt;

&lt;p&gt;Again, for MVP, we can simply take the AWS managed policy &lt;a href="https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonOpenSearchServiceFullAccess.html" rel="noopener noreferrer"&gt;&lt;code&gt;AmazonOpenSearchServiceFullAccess&lt;/code&gt;&lt;/a&gt;, which is connected to the IAM Group:&lt;/p&gt;
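&lt;p&gt;If that IAM Group is also managed in Terraform, the attachment is a one-liner (the group name here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_group_policy_attachment" "backend_team_opensearch" {
  group      = "backend-team"
  policy_arn = "arn:aws:iam::aws:policy/AmazonOpenSearchServiceFullAccess"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;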

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F396ey0maakyzb45dbbt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F396ey0maakyzb45dbbt1.png" width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating AWS Bedrock IAM Roles and OpenSearch Role mappings
&lt;/h3&gt;

&lt;p&gt;We already have Bedrock, now just need to create new IAM Roles and map them to OpenSearch Roles.&lt;/p&gt;

&lt;p&gt;Add the &lt;code&gt;iam.tf&lt;/code&gt; file - describe the IAM Role and IAM Policy (Identity-based Policy for access to OpenSearch), also in a loop for each of the &lt;code&gt;var.app_environments&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### MAIN ROLE FOR KNOWLEDGE BASE

# grants permissions for AWS Bedrock to interact with other AWS services
resource "aws_iam_role" "knowledge_base_role" {
  for_each = toset(var.app_environments)
  name     = "${var.project_name}-role-${each.key}-managed"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "bedrock.amazonaws.com"
        }
        Condition = {
          StringEquals = {
            "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          }
          ArnLike = {
            # restricts the role to be assumed only by Bedrock knowledge base in the specified region
            "aws:SourceArn" = "arn:aws:bedrock:${var.aws_region}:${data.aws_caller_identity.current.account_id}:knowledge-base/*"
          }
        }
      }
    ]
  })
}

# IAM policy for Knowledge Base to access OpenSearch Managed
resource "aws_iam_policy" "knowledge_base_opensearch_policy" {
  for_each = toset(var.app_environments)
  name     = "${var.project_name}-kb-opensearch-policy-${each.key}-managed"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "es:*",
        ]
        Resource = [
          module.opensearch.domain_arn,
          "${module.opensearch.domain_arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "knowledge_base_opensearch" {
  for_each   = toset(var.app_environments)
  role       = aws_iam_role.knowledge_base_role[each.key].name
  policy_arn = aws_iam_policy.knowledge_base_opensearch_policy[each.key].arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, in the &lt;code&gt;opensearch_users.tf&lt;/code&gt;, let's create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;opensearch_role&lt;/code&gt;: with &lt;code&gt;cluster_permissions&lt;/code&gt; and &lt;code&gt;index_permissions&lt;/code&gt; for each index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;locals&lt;/code&gt; with all the IAM Roles we created above&lt;/li&gt;
&lt;li&gt;and an &lt;code&gt;opensearch_roles_mapping&lt;/code&gt; for each &lt;code&gt;opensearch_role.os_bedrock_roles&lt;/code&gt;, where we add the IAM Role ARNs to each &lt;code&gt;opensearch_role&lt;/code&gt; via &lt;code&gt;backend_roles&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It looks something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

#### BEDROCK

resource "opensearch_role" "os_bedrock_roles" {
  for_each    = toset(var.app_environments)
  role_name   = "os_bedrock_${each.key}_role"
  description = "Backend Bedrock KB ${each.key} role"

  cluster_permissions = [
    "indices:data/read/msearch",
    "indices:data/write/bulk*",
    "indices:data/read/mget*"
  ]

  index_permissions {
    index_patterns  = ["kraken-kb-index-${each.key}"]
    allowed_actions = ["*"]
  }

  depends_on = [module.opensearch]
}

# 'aws_iam_role' is defined in iam.tf
locals {
  knowledge_base_role_arns = {
    for env, role in aws_iam_role.knowledge_base_role :
    env =&amp;gt; role.arn
  }
}

resource "opensearch_roles_mapping" "os_bedrock_role_mappings" {
  for_each  = toset(var.app_environments)
  role_name = opensearch_role.os_bedrock_roles[each.key].role_name

  backend_roles = [
    local.knowledge_base_role_arns[each.key]
  ]

  depends_on = [module.opensearch]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Actually, this is where I encountered Bedrock access errors, which forced me to add &lt;code&gt;cluster_permissions&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The knowledge base storage configuration provided is invalid… Request failed: [security_exception] no permissions for [indices:data/read/msearch] and User [name=arn:aws:iam::492***148:role/kraken-kb-role-dev, backend_roles=[arn:aws:iam::492***148:role/kraken-kb-role-dev], requestedTenant=null]&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Deploy, check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn4op9v348u81ex5906s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn4op9v348u81ex5906s.png" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating OpenSearch indexes
&lt;/h3&gt;

&lt;p&gt;The provider already exists, so we’ll take the  &lt;a href="https://registry.terraform.io/providers/opensearch-project/opensearch/latest/docs/resources/index" rel="noopener noreferrer"&gt;&lt;code&gt;opensearch_index&lt;/code&gt;&lt;/a&gt; resource.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;locals&lt;/code&gt;, we write the index template - I just took it from the developers' old configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  # 'atlas-kb-prod'
  env_name = "${var.project_name}-${var.environment}"
  # 'opensearch.prod.example.co'
  os_custom_domain_name = "opensearch.${var.dns_zone}"

  # index mappings
  os_index_mappings = &amp;lt;&amp;lt;-EOF
    {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 8192,
                  "type": "keyword"
                }
              },
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "bedrock-knowledge-base-default-vector": {
          "type": "knn_vector",
          "dimension": 1024,
          "method": {
            "name": "hnsw",
            "engine": "faiss",
            "parameters": {
              "m": 16,
              "ef_construction": 512
            },
            "space_type": "l2"
          }
        },
        "AMAZON_BEDROCK_METADATA": {
          "type": "text",
          "index": false
        },
        "AMAZON_BEDROCK_TEXT_CHUNK": {
          "type": "text",
          "index": true
        }
      }
    }
EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create a file named &lt;code&gt;opensearch_indexes.tf&lt;/code&gt; and add the indexes themselves - here, I decided not to use a loop, but to describe separate Dev/Staging/Prod resources directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dev Index
resource "opensearch_index" "kb_vector_index_dev" {
  name = "kraken-kb-index-dev"

  # enable approximate nearest neighbor search by setting index_knn to true
  index_knn                      = true
  index_knn_algo_param_ef_search = "512"
  number_of_shards               = "1"
  number_of_replicas             = "1"
  mappings                       = local.os_index_mappings

  # When new documents are ingested into the Knowledge Base,
  # OpenSearch automatically creates field mappings for new metadata fields under
  # AMAZON_BEDROCK_METADATA. Since these fields are created outside of TF resource definitions,
  # TF detects them as configuration drift and attempts to recreate the index to match its
  # known state.
  #
  # This lifecycle rule prevents unnecessary index recreation by ignoring mapping changes
  # that occur after initial deployment.
  lifecycle {
    ignore_changes = [mappings]
  }
}

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Deploy, check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwobi88yyk8u53m8ixe7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwobi88yyk8u53m8ixe7.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s basically it.&lt;/p&gt;

&lt;p&gt;Bedrock is already connected, everything is working.&lt;/p&gt;

&lt;p&gt;But it took a little bit of effort.&lt;/p&gt;

&lt;p&gt;And I’m sure it won’t be the last time :-)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/terraform-creating-an-aws-opensearch-service-cluster-and-users/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Terraform: using Ephemeral Resources and Write-Only Attributes</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Mon, 29 Dec 2025 10:00:00 +0000</pubDate>
      <link>https://forem.com/setevoy/terraform-using-ephemeral-resources-and-write-only-attributes-56lg</link>
      <guid>https://forem.com/setevoy/terraform-using-ephemeral-resources-and-write-only-attributes-56lg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" width="480" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ephemeral resources and write-only arguments appeared in Terraform a long time ago, back in version 1.10, but there was no opportunity to write about them in detail.&lt;/p&gt;

&lt;p&gt;The main idea behind them is not to leave “traces” in the state file, which is especially useful for passwords or tokens, because the data only exists during the execution of Terraform itself in its memory.&lt;/p&gt;

&lt;p&gt;However, there are certain limitations to their use — we’ll look at those later, but first, let’s see everything in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Example without ephemeral values and write-only arguments&lt;/li&gt;
&lt;li&gt;Using Write-Only Attributes&lt;/li&gt;
&lt;li&gt;Using Ephemeral Resources&lt;/li&gt;
&lt;li&gt;The “This output value is not declared as returning an ephemeral value” error&lt;/li&gt;
&lt;li&gt;The “Ephemeral outputs are not allowed in context of a root module” error&lt;/li&gt;
&lt;li&gt;Using values from Ephemeral resources&lt;/li&gt;
&lt;li&gt;Using Ephemeral Outputs&lt;/li&gt;
&lt;li&gt;Useful links&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example without ephemeral values and write-only arguments
&lt;/h3&gt;

&lt;p&gt;Let’s start with the old scheme, without using ephemeral resources and write-only arguments  —  we will create a random password, the resource &lt;code&gt;aws_secretsmanager_secret&lt;/code&gt;, store this password in it, and get it from data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      component = "devops"
      created-by = "terraform"
      environment = "test"
    }
  }
}

### RESOURCES ###

# generate a random password
resource "random_password" "test_random_password" {
   length = 8
   special = false
}

# create an AWS Secret resource
resource "aws_secretsmanager_secret" "test_aws_secret" {
  name = "db_password"
  description = "database password"
  recovery_window_in_days = 0
}

# create an AWS Secret value
resource "aws_secretsmanager_secret_version" "test_aws_secret_version" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
  secret_string = random_password.test_random_password.result
}

### DATA SOURCES ###

# retrieve the AWS Secret value
data "aws_secretsmanager_secret_version" "test_aws_secret_data" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id

  depends_on = [aws_secretsmanager_secret_version.test_aws_secret_version]
}

### OUTPUTS ###

# get the random password value
output "test_random_password" {
  value = random_password.test_random_password.result
  sensitive = true
}

# get the AWS Secret value
output "test_aws_secret" {
  value = data.aws_secretsmanager_secret_version.test_aws_secret_data.secret_string
  sensitive = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;resource "random_password"&lt;/code&gt;: generate the password itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resource "aws_secretsmanager_secret"&lt;/code&gt;: create a new entry in AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resource "aws_secretsmanager_secret_version"&lt;/code&gt;: write the value from resource "random_password" to this Secret&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data "aws_secretsmanager_secret_version"&lt;/code&gt;: get the value from AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output "test_random_password"&lt;/code&gt;: output the value from resource "random_password"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output "test_aws_secret"&lt;/code&gt;: output the value obtained from AWS Secrets Manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Execute &lt;code&gt;terraform init&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

test_aws_secret = &amp;lt;sensitive&amp;gt;
test_random_password = &amp;lt;sensitive&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks OK: thanks to &lt;code&gt;sensitive = true&lt;/code&gt;, nothing is displayed in the &lt;code&gt;outputs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But the password is in the state file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat terraform.tfstate
{
  ...
  "outputs": {
    "test_aws_secret": {
      "value": "1atcZYGR",
      "type": "string",
      "sensitive": true
    },
    "test_random_password": {
      "value": "1atcZYGR",
      "type": "string",
      "sensitive": true
    }
  },
...
  "resources": [
    {
      "mode": "data",
      "type": "aws_secretsmanager_secret_version",
      "name": "test_aws_secret_data",
      ...
            "secret_string": "1atcZYGR",
...
    {
      "mode": "managed",
      "type": "aws_secretsmanager_secret_version",
      "name": "test_aws_secret_version",
      ...
            "secret_string": "1atcZYGR",
...
    {
      "mode": "managed",
      "type": "random_password",
      "name": "test_random_password",
      ...
            "result": "1atcZYGR",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s start hiding this data from the state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Write-Only Attributes
&lt;/h3&gt;

&lt;p&gt;Resource attributes with the suffix &lt;code&gt;_wo&lt;/code&gt; are "write-only": Terraform keeps them in memory during an operation, but does not persist them to the state or plan files.&lt;/p&gt;

&lt;p&gt;However, not all resources support these attributes. For example, in AWS RDS you can pass a password via the &lt;code&gt;password_wo&lt;/code&gt; attribute of the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/db_instance#password_wo-1" rel="noopener noreferrer"&gt;&lt;code&gt;aws_db_instance&lt;/code&gt;&lt;/a&gt; resource, but &lt;code&gt;aws_opensearch_domain&lt;/code&gt; and its &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/opensearch_domain#master_user_password-1" rel="noopener noreferrer"&gt;&lt;code&gt;master_user_password&lt;/code&gt;&lt;/a&gt; attribute, used to create a root user in the internal user database, does not support them yet.&lt;/p&gt;
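&lt;p&gt;For illustration, a minimal sketch of what a write-only password could look like for RDS; the resource name, sizing, and the &lt;code&gt;var.db_master_password&lt;/code&gt; variable here are hypothetical, not part of this article’s code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical sketch: an RDS master password passed as a write-only argument
resource "aws_db_instance" "example" {
  identifier = "example-db"
  engine = "postgres"
  instance_class = "db.t3.micro"
  allocated_storage = 10
  username = "postgres"

  # write-only: used during apply, never written to the state file
  password_wo = var.db_master_password
  password_wo_version = 1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;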

&lt;p&gt;Official documentation — &lt;a href="https://developer.hashicorp.com/terraform/language/resources/ephemeral/write-only" rel="noopener noreferrer"&gt;Use write-only arguments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws_secretsmanager_secret_version&lt;/code&gt; also supports write-only attributes: &lt;code&gt;secret_string_wo&lt;/code&gt; instead of &lt;code&gt;secret_string&lt;/code&gt;, together with its companion attribute &lt;code&gt;secret_string_wo_version&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;secret_string_wo_version&lt;/code&gt; is mandatory when using &lt;code&gt;secret_string_wo&lt;/code&gt;: since Terraform does not store the password, it would not otherwise know when to update it. Instead, we set a version number that we increment each time we want to update the password.&lt;/p&gt;

&lt;p&gt;Edit the code and change the only &lt;code&gt;resource "aws_secretsmanager_secret_version"&lt;/code&gt;: set &lt;code&gt;secret_string_wo&lt;/code&gt; and &lt;code&gt;secret_string_wo_version&lt;/code&gt;, leaving the rest unchanged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
# create an AWS Secret value
resource "aws_secretsmanager_secret_version" "test_aws_secret_version" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
  #secret_string = random_password.test_random_password.result
  secret_string_wo = random_password.test_random_password.result
  secret_string_wo_version = 1
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;, and check the state now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat terraform.tfstate
{
  ...
  "outputs": {
    "test_aws_secret": {
      "value": "1atcZYGR",
      "type": "string",
      "sensitive": true
    },
    "test_random_password": {
      "value": "1atcZYGR",
      "type": "string",
      "sensitive": true
    }
  },
...
  "resources": [
    {
      "mode": "data",
      "type": "aws_secretsmanager_secret_version",
      "name": "test_aws_secret_data",
      ...
            "secret_string": "1atcZYGR",
...
    {
      "mode": "managed",
      "type": "aws_secretsmanager_secret_version",
      "name": "test_aws_secret_version",
      ...
            "secret_string": "",
            "secret_string_wo": null,
            "secret_string_wo_version": 1,

...
    {
      "mode": "managed",
      "type": "random_password",
      "name": "test_random_password",
      ...
            "result": "1atcZYGR",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have &lt;code&gt;managed.aws_secretsmanager_secret_version.test_aws_secret_version&lt;/code&gt; with no values for &lt;code&gt;secret_string&lt;/code&gt; and &lt;code&gt;secret_string_wo&lt;/code&gt;.&lt;/p&gt;
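&lt;p&gt;Later, when we want to rotate this password, we only bump the version number; a sketch of what such a rotation would look like (the new value is re-sent to AWS only because the version changed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: rotating the secret later - bump the version to make Terraform re-send the value
resource "aws_secretsmanager_secret_version" "test_aws_secret_version" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
  secret_string_wo = random_password.test_random_password.result
  #secret_string_wo_version = 1
  secret_string_wo_version = 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;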

&lt;h3&gt;
  
  
  Using Ephemeral Resources
&lt;/h3&gt;

&lt;p&gt;The idea behind “ephemeral” resources is the same as with write-only arguments — these resources only exist in Terraform’s memory during the execution of &lt;code&gt;terraform apply&lt;/code&gt; and are not stored in the state file.&lt;/p&gt;

&lt;p&gt;However, such resources can only be referenced in a limited set of places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in write-only arguments&lt;/li&gt;
&lt;li&gt;in other ephemeral resources&lt;/li&gt;
&lt;li&gt;in &lt;code&gt;locals&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;in ephemeral variables&lt;/li&gt;
&lt;li&gt;in providers, provisioners, and connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Documentation — &lt;a href="https://developer.hashicorp.com/terraform/language/ephemeral" rel="noopener noreferrer"&gt;Ephemeral block reference&lt;/a&gt;.&lt;/p&gt;
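&lt;p&gt;For example, an ephemeral input variable (a sketch; the variable name here is illustrative) is declared with &lt;code&gt;ephemeral = true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: an ephemeral input variable - accepted during plan/apply, never persisted
variable "db_password" {
  type = string
  ephemeral = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Such a variable can be passed via &lt;code&gt;TF_VAR_db_password&lt;/code&gt; or &lt;code&gt;-var&lt;/code&gt;, and referenced in the same limited set of places as other ephemeral values.&lt;/p&gt;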

&lt;p&gt;Let’s edit our code: change &lt;code&gt;resource "random_password"&lt;/code&gt; to &lt;code&gt;ephemeral "random_password"&lt;/code&gt;, keep &lt;code&gt;resource "aws_secretsmanager_secret_version"&lt;/code&gt; (it will write the password to AWS Secrets Manager, but will not store the value in state), and add a new block, &lt;code&gt;ephemeral "aws_secretsmanager_secret_version"&lt;/code&gt;, through which we will read this password back into Terraform.&lt;/p&gt;

&lt;p&gt;At the same time, in &lt;code&gt;secret_string_wo&lt;/code&gt; and in &lt;code&gt;output "test_random_password"&lt;/code&gt; we now refer to the password through &lt;em&gt;ephemeral&lt;/em&gt;: &lt;code&gt;ephemeral.random_password.test_random_password.result&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And in &lt;code&gt;output "test_aws_secret"&lt;/code&gt; we also use &lt;code&gt;ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;data "aws_secretsmanager_secret_version"&lt;/code&gt; data source can be removed, because we now get the password from the &lt;code&gt;ephemeral "aws_secretsmanager_secret_version"&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

### RESOURCES ###

# generate a random password
ephemeral "random_password" "test_random_password" {
   length = 8
   special = false
}

# create an AWS Secret resource
resource "aws_secretsmanager_secret" "test_aws_secret" {
  name = "db_password"
  description = "database password"
  recovery_window_in_days = 0
}

# create an AWS Secret value
resource "aws_secretsmanager_secret_version" "test_aws_secret_version" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
  #secret_string = random_password.test_random_password.result
  secret_string_wo = ephemeral.random_password.test_random_password.result
  secret_string_wo_version = 1
}

### DATA SOURCES ###

# Retrieve the password from Secrets Manager (ephemeral)
ephemeral "aws_secretsmanager_secret_version" "test_aws_secret_version_ephemeral" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
}

# retrieve the AWS Secret value
# data "aws_secretsmanager_secret_version" "test_aws_secret_data" {
# secret_id = aws_secretsmanager_secret.test_aws_secret.id

# depends_on = [aws_secretsmanager_secret_version.test_aws_secret_version]
# }

### OUTPUTS ###

# get the random password value
output test_random_password {
  value = ephemeral.random_password.test_random_password.result
  sensitive = true
}

# get the AWS Secret value
output "test_aws_secret" {
  value = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  sensitive = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The “This output value is not declared as returning an ephemeral value” error
&lt;/h3&gt;

&lt;p&gt;Execute &lt;code&gt;terraform apply&lt;/code&gt; and catch the first error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
│ Error: Ephemeral value not allowed
│ 
│ on main.tf line 53, in output "test_random_password":
│ 53: value = ephemeral.random_password.test_random_password.result
│ 
│ This output value is not declared as returning an ephemeral value, so it cannot be set to a result derived from an ephemeral value.
╵
╷
│ Error: Ephemeral value not allowed
│ 
│ on main.tf line 59, in output "test_aws_secret":
│ 59: value = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
│ 
│ This output value is not declared as returning an ephemeral value, so it cannot be set to a result derived from an ephemeral value.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even if we add the parameter &lt;code&gt;ephemeral = true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
### OUTPUTS ###

# get the random password value
output test_random_password {
  value = ephemeral.random_password.test_random_password.result
  sensitive = true
  ephemeral = true
}

# get the AWS Secret value
output "test_aws_secret" {
  value = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  sensitive = true
  ephemeral = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It still won’t work.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Ephemeral outputs are not allowed in context of a root module” error
&lt;/h3&gt;

&lt;p&gt;Now the error will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
╷
│ Error: Ephemeral output not allowed
│ 
│ on main.tf line 52:
│ 52: output test_random_password {
│ 
│ Ephemeral outputs are not allowed in context of a root module
╵
╷
│ Error: Ephemeral output not allowed
│ 
│ on main.tf line 59:
│ 59: output "test_aws_secret" {
│ 
│ Ephemeral outputs are not allowed in context of a root module
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ephemeral outputs can only be used inside modules; we’ll see how later.&lt;/p&gt;

&lt;p&gt;OK, for now let’s just remove the &lt;code&gt;outputs&lt;/code&gt;, and now &lt;code&gt;terraform apply&lt;/code&gt; runs without any problems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform apply
...
random_password.test_random_password: Refreshing state... [id=none]
ephemeral.random_password.test_random_password: Opening...
ephemeral.random_password.test_random_password: Opening complete after 0s
...
ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Opening...
...
ephemeral.random_password.test_random_password: Closing...
ephemeral.random_password.test_random_password: Closing complete after 0s
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that for ephemeral resources, Terraform now performs &lt;em&gt;Opening&lt;/em&gt; and &lt;em&gt;Closing&lt;/em&gt; operations instead of &lt;em&gt;Reading&lt;/em&gt; and &lt;em&gt;Refreshing state&lt;/em&gt;. That is, it simply creates an object in memory, reads the resource into it, and then “closes” and removes it from memory.&lt;/p&gt;

&lt;p&gt;Let’s check the state file now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
    {
      "mode": "managed",
      "type": "aws_secretsmanager_secret_version",
      "name": "test_aws_secret_version",
      ...
            "secret_string": "",
            "secret_string_wo": null,
            "secret_string_wo_version": 1,
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;ephemeral "random_password"&lt;/code&gt; and &lt;code&gt;ephemeral "aws_secretsmanager_secret_version"&lt;/code&gt; resources are not in the state at all,&lt;/li&gt;
&lt;li&gt;and &lt;code&gt;managed.aws_secretsmanager_secret_version.test_aws_secret_version&lt;/code&gt; still has an empty &lt;code&gt;secret_string_wo&lt;/code&gt; field, because we made it write-only earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OK, but how do we use the password now, since we removed &lt;code&gt;data "aws_secretsmanager_secret_version"&lt;/code&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Values from Ephemeral Resources
&lt;/h3&gt;

&lt;p&gt;We have already seen an example of referencing Ephemeral resources above when we did &lt;code&gt;secret_string_wo = ephemeral.random_password.test_random_password.result&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Similarly, we can use &lt;code&gt;ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As mentioned above, we cannot do this everywhere, but it is allowed in &lt;code&gt;providers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To verify this, let’s run PostgreSQL with our password (we’ll take it directly from AWS Console &amp;gt; AWS Secrets Manager):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxncpjtuiw9b4e3ryo4en.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxncpjtuiw9b4e3ryo4en.png" width="676" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Launch a container, passing it the variable &lt;code&gt;POSTGRES_PASSWORD="1atcZYGR"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run --rm --name some-postgres -e POSTGRES_PASSWORD="1atcZYGR" -p 5432:5432 postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the provider to our code and use it to connect to the container, where we will create a test database.&lt;/p&gt;

&lt;p&gt;In the provider’s &lt;code&gt;password&lt;/code&gt; field we will use a value from the &lt;code&gt;ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

### PostgreSQL Configuration

terraform {
  required_providers {
    postgresql = {
      source = "cyrilgdn/postgresql"
      version = "~&amp;gt; 1.20"
    }
  }
}

provider "postgresql" {
  host = "localhost"
  port = 5432
  username = "postgres"
  password = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  sslmode = "disable"
}

resource "postgresql_database" "demo_db" {
  name = "demo_db"
  template = "template0"
  connection_limit = -1
  allow_connections = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform init&lt;/code&gt; and &lt;code&gt;terraform apply&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform init &amp;amp;&amp;amp; terraform apply
...
ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Opening...
ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Opening complete after 1s
postgresql_database.demo_db: Creating...
postgresql_database.demo_db: Creation complete after 0s [id=demo_db]
ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Closing...
ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Closing complete after 0s

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export PGPASSWORD="1atcZYGR"
$ psql -h localhost -U postgres -c "\l"
                                                  List of databases
   Name  |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 demo_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           |
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the same way, we could use an ephemeral resource via locals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
locals {
  db_password_local = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
}

provider "postgresql" {
  host = "localhost"
  port = 5432
  username = "postgres"
  password = local.db_password_local
  #password = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  sslmode = "disable"
}

resource "postgresql_database" "demo_db" {
  name = "demo_db_via_local"
  template = "template0"
  connection_limit = -1
  allow_connections = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform apply
...
  # postgresql_database.demo_db will be updated in-place
  ~ resource "postgresql_database" "demo_db" {
        id = "demo_db"
      ~ name = "demo_db" -&amp;gt; "demo_db_via_local"
        # (10 unchanged attributes hidden)
    }
...
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the state file, the password is not visible anywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat terraform.tfstate | grep 1atcZYGR | echo $?
127
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Ephemeral Outputs
&lt;/h3&gt;

&lt;p&gt;Above, we tried to use &lt;code&gt;output "test_aws_secret"&lt;/code&gt; with &lt;code&gt;ephemeral = true&lt;/code&gt;, but got the "&lt;strong&gt;&lt;em&gt;Ephemeral outputs are not allowed in context of a root module&lt;/em&gt;&lt;/strong&gt;" error.&lt;/p&gt;

&lt;p&gt;Let’s try using it in our own module.&lt;/p&gt;

&lt;p&gt;Documentation — &lt;a href="https://developer.hashicorp.com/terraform/language/values/outputs#ephemeral-avoid-storing-values-in-state-or-plan-files" rel="noopener noreferrer"&gt;ephemeral — Avoid storing values in state or plan files&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create a module &lt;code&gt;modules/secret_ephemeral&lt;/code&gt;, in which we will generate a password, save it in AWS Secrets Manager, and add an Ephemeral Output.&lt;/p&gt;

&lt;p&gt;And in the root module, we will use the &lt;code&gt;outputs&lt;/code&gt; of this module to get the value from &lt;code&gt;ephemeral "aws_secretsmanager_secret_version"&lt;/code&gt;, as we did above.&lt;/p&gt;

&lt;p&gt;Let’s write the file &lt;code&gt;modules/secret_ephemeral/secret.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### RESOURCES ###

# generate a random password
ephemeral "random_password" "test_random_password" {
   length = 8
   special = false
}

# create an AWS Secret resource
resource "aws_secretsmanager_secret" "test_aws_secret" {
  name = "db_password_via_module"
  description = "database password"
  recovery_window_in_days = 0
}

# create an AWS Secret value
resource "aws_secretsmanager_secret_version" "test_aws_secret_version" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
  #secret_string = random_password.test_random_password.result
  secret_string_wo = ephemeral.random_password.test_random_password.result
  secret_string_wo_version = 1
}

# Retrieve the password from Secrets Manager (ephemeral)
ephemeral "aws_secretsmanager_secret_version" "test_aws_secret_version_ephemeral" {
  secret_id = aws_secretsmanager_secret.test_aws_secret.id
}

output "password_ephemeral" {
  value = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  ephemeral = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the main file &lt;code&gt;main.tf&lt;/code&gt;, remove everything related to the password, add a module call, and in &lt;code&gt;locals&lt;/code&gt; use its &lt;code&gt;output&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...

### PostgreSQL Configuration

terraform {
  required_providers {
    postgresql = {
      source = "cyrilgdn/postgresql"
      version = "~&amp;gt; 1.20"
    }
  }
}

module "secret_ephemeral" {
  source = "./modules/secret_ephemeral"
}

locals {
  db_password_local = module.secret_ephemeral.password_ephemeral
}

provider "postgresql" {
  host = "localhost"
  port = 5432
  username = "postgres"
  password = local.db_password_local
  #password = ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral.secret_string
  sslmode = "disable"
}

resource "postgresql_database" "demo_db" {
  name = "demo_db_via"
  template = "template0"
  connection_limit = -1
  allow_connections = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, you need to create the password: run &lt;code&gt;terraform apply&lt;/code&gt; without &lt;code&gt;resource "postgresql_database"&lt;/code&gt;, and update the container launch with the new password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run --rm --name some-postgres -e POSTGRES_PASSWORD="PHsfzcIx" -p 5432:5432 postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our provider uses a password from the Ephemeral Output of the &lt;code&gt;modules/secret_ephemeral&lt;/code&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module.secret_ephemeral.ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Opening...
module.secret_ephemeral.ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Opening complete after 1s
postgresql_database.demo_db: Creating...
postgresql_database.demo_db: Creation complete after 0s [id=demo_db_via]
module.secret_ephemeral.ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Closing...
module.secret_ephemeral.ephemeral.aws_secretsmanager_secret_version.test_aws_secret_version_ephemeral: Closing complete after 0s

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the state, we still don’t have a password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat terraform.tfstate | grep PHsfzcIx | echo $?
127
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s basically it.&lt;/p&gt;

&lt;p&gt;It’s a shame that &lt;code&gt;aws_opensearch_domain&lt;/code&gt; doesn't support write-only. I wanted to use it for the root password :-(&lt;/p&gt;

&lt;p&gt;But there is already an issue on GitHub, &lt;a href="https://github.com/hashicorp/terraform-provider-aws/issues/43509" rel="noopener noreferrer"&gt;Support ephemeral “write-only” argument for aws_opensearch_domain&lt;/a&gt;, and even a comment saying “&lt;em&gt;I have started working on this issue, and will submit a PR shortly&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://github.com/hashicorp/terraform-provider-aws/pull/43621/files#diff-68e0a3f9a3d665361b3e6ddaa494ffb5164cbc9ec97e2b9b14350a2d7e6e7e47" rel="noopener noreferrer"&gt;in the pull request itself&lt;/a&gt;, you can even see how it’s implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://schimizu.com/securely-storing-credentials-in-terraform-with-ephemeral-blocks-and-write-only-attributes-6867826b9ef7" rel="noopener noreferrer"&gt;Securely storing credentials in Terraform with ‘Ephemeral Blocks’ and ‘Write-Only’ attributes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scalr.com/learning-center/understanding-ephemerality-in-terraform/" rel="noopener noreferrer"&gt;Understanding ephemerality in Terraform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thomasthornton.cloud/2025/04/24/ensuring-terraform-state-security-with-ephemeral-values-and-write-only-outputs/" rel="noopener noreferrer"&gt;Ensuring Terraform State Security with Ephemeral Values and Write-Only Outputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@amareswer/terraform-1-10-secure-secrets-with-ephemeral-values-a5041800ff58" rel="noopener noreferrer"&gt;Terraform 1.10: Secure Secrets with Ephemeral Values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/terraform-using-ephemeral-resources-and-write-only-attributes/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>terraform</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>Terraform: AWS EKS Terraform module update from version 20.x to version 21.x</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Thu, 18 Sep 2025 10:02:45 +0000</pubDate>
      <link>https://forem.com/setevoy/terraform-aws-eks-terraform-module-update-from-version-20x-to-version-21x-52im</link>
      <guid>https://forem.com/setevoy/terraform-aws-eks-terraform-module-update-from-version-20x-to-version-21x-52im</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdcgbe0zxi1aeku3sf31.png" width="480" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS EKS Terraform module version &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v21.0.0" rel="noopener noreferrer"&gt;v21.0.0&lt;/a&gt; added support for the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-6-upgrade" rel="noopener noreferrer"&gt;AWS Provider Version 6&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Documentation — &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/UPGRADE-21.0.md" rel="noopener noreferrer"&gt;here&amp;gt;&amp;gt;&amp;gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main changes in the AWS EKS module are the replacement of IRSA with EKS Pod Identity for the Karpenter sub-module:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Native support for IAM roles for service accounts (IRSA) has been removed; EKS Pod Identity is now enabled by default&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Also, “&lt;em&gt;The &lt;code&gt;aws-auth&lt;/code&gt; sub-module has been removed&lt;/em&gt;”, but I personally removed it a long time ago.&lt;/p&gt;

&lt;p&gt;Some variables have also been renamed.&lt;/p&gt;

&lt;p&gt;I wrote about upgrading from version 19 to 20 in &lt;a href="https://rtfm.co.ua/en/terraform-eks-and-karpenter-upgrade-the-module-version-from-19-21-to-20-0/" rel="noopener noreferrer"&gt;Terraform: EKS and Karpenter — upgrade module version from 19.21 to 20.0&lt;/a&gt;, and this time we will follow the same path — change the module versions and see what breaks.&lt;/p&gt;

&lt;p&gt;I have a separate “Testing” environment for this, which I first roll out with the current versions of modules/providers, then update the code, deploy the upgrade, and when everything is fixed, I upgrade EKS Production (because we have one cluster on dev/staging/prod).&lt;/p&gt;

&lt;p&gt;In Karpenter’s own Helm chart, there seem to be no significant changes, although &lt;a href="https://github.com/aws/karpenter-provider-aws/releases/tag/v1.6.0" rel="noopener noreferrer"&gt;version 1.6&lt;/a&gt; has already been released. You can update it at the same time, but that’s for another time.&lt;/p&gt;

&lt;p&gt;Overall, the upgrade went smoothly, but there were two issues that required some debugging: a problem with the EC2 metadata for AWS Load Balancer Controller during the upgrade, and a problem with EKS Add-ons when creating a new cluster with AWS EKS Terraform module v21.x.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrade AWS EKS Terraform module
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Upgrade AWS Provider Version 6
&lt;/h3&gt;

&lt;p&gt;First, change the AWS Provider version, finally: the open pull requests from &lt;a href="https://rtfm.co.ua/en/renovate-github-and-helm-charts-versions-management/" rel="noopener noreferrer"&gt;Renovate&lt;/a&gt; were getting annoying, and I couldn’t close them.&lt;/p&gt;

&lt;p&gt;It’s simple — just change the version to 6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "~&amp;gt; 6.0"
    }
  }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;a href="https://developer.hashicorp.com/terraform/language/expressions/version-constraints" rel="noopener noreferrer"&gt;pessimistic constraint&lt;/a&gt; operator to allow upgrades of all minor versions.&lt;/p&gt;

&lt;p&gt;This will be considered both by Renovate, and when executing &lt;code&gt;terraform init -upgrade&lt;/code&gt;.&lt;/p&gt;
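&lt;p&gt;To recap how the "&lt;code&gt;~&amp;gt;&lt;/code&gt;" operator behaves (the version numbers here are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# "~&amp;gt;" only allows the rightmost version component to increase:
#   version = "~&amp;gt; 6.0"    # means &amp;gt;= 6.0.0, &amp;lt; 7.0.0 - any 6.x release
#   version = "~&amp;gt; 6.2.1"  # means &amp;gt;= 6.2.1, &amp;lt; 6.3.0 - patch releases only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;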

&lt;h3&gt;
  
  
  Upgrade &lt;code&gt;terraform-aws-modules/eks/aws&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s upgrade the EKS module version: change 20 to 21, also with the "&lt;code&gt;~&amp;gt;&lt;/code&gt;" constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  version = "~&amp;gt; v21.0"
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Karpenter too, I have it as a separate module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~&amp;gt; v21.0"
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform init&lt;/code&gt;, and get the “&lt;strong&gt;&lt;em&gt;does not match configured version constraint&lt;/em&gt;&lt;/strong&gt;” error; I’ve already described it in the &lt;a href="https://rtfm.co.ua/en/?p=32870" rel="noopener noreferrer"&gt;Terraform: “no available releases match the given constraints”&lt;/a&gt; post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform init
...
registry.terraform.io/hashicorp/aws 5.100.0 does not match configured version constraint &amp;gt;= 4.0.0, &amp;gt;= 4.36.0, &amp;gt;= 4.47.0, &amp;gt;= 5.0.0, ~&amp;gt; 5.14, &amp;gt;= 6.0.0
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; still contains the old version of the AWS provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat envs/test-1-33/.terraform.lock.hcl | grep -A 5 5.100
  version = "5.100.0"
  constraints = "&amp;gt;= 4.0.0, &amp;gt;= 4.33.0, &amp;gt;= 4.36.0, &amp;gt;= 4.47.0, &amp;gt;= 5.0.0, ~&amp;gt; 5.14, &amp;gt;= 5.95.0"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can drop the file and run &lt;code&gt;terraform init&lt;/code&gt; again, or you can run &lt;code&gt;terraform init -upgrade&lt;/code&gt; to pull all upgrades at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform init -upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; again - now everything is OK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git diff .terraform.lock.hcl
diff --git a/terraform/envs/test-1-33/.terraform.lock.hcl b/terraform/envs/test-1-33/.terraform.lock.hcl
index bd44714..cb2eace 100644
--- a/terraform/envs/test-1-33/.terraform.lock.hcl
+++ b/terraform/envs/test-1-33/.terraform.lock.hcl
@@ -24,98 +24,85 @@ provider "registry.terraform.io/alekc/kubectl" {
 }

 provider "registry.terraform.io/hashicorp/aws" {
- version = "5.100.0"
- constraints = "&amp;gt;= 4.0.0, &amp;gt;= 4.33.0, &amp;gt;= 4.36.0, &amp;gt;= 4.47.0, &amp;gt;= 5.0.0, ~&amp;gt; 5.14, &amp;gt;= 5.95.0"
+ version = "6.7.0"
+ constraints = "&amp;gt;= 4.0.0, &amp;gt;= 4.36.0, &amp;gt;= 4.47.0, &amp;gt;= 5.0.0, &amp;gt;= 6.0.0, ~&amp;gt; 6.0"
   hashes = [
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s run &lt;code&gt;terraform plan&lt;/code&gt; and see what breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Renamed variables in &lt;code&gt;terraform-aws-modules/eks/aws&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The first errors, as expected, were about unsupported arguments, because the variables had been renamed in the module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform plan -var-file=test-1-33.tfvars
...
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 34, in module "eks":
│ 34: cluster_name = "${var.env_name}-cluster"
│ 
│ An argument named "cluster_name" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 38, in module "eks":
│ 38: cluster_version = var.eks_version
│ 
│ An argument named "cluster_version" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 42, in module "eks":
│ 42: cluster_endpoint_public_access = var.eks_params.cluster_endpoint_public_access
│ 
│ An argument named "cluster_endpoint_public_access" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 46, in module "eks":
│ 46: cluster_enabled_log_types = var.eks_params.cluster_enabled_log_types
│ 
│ An argument named "cluster_enabled_log_types" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 50, in module "eks":
│ 50: cluster_addons = {
│ 
│ An argument named "cluster_addons" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/eks.tf line 148, in module "eks":
│ 148: cluster_security_group_name = "${var.env_name}-cluster-sg"
│ 
│ An argument named "cluster_security_group_name" is not expected here.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s go to the upgrade documentation and find out what the variables are now called:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cluster_name&lt;/code&gt; =&amp;gt; &lt;code&gt;name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster_version&lt;/code&gt; =&amp;gt; &lt;code&gt;kubernetes_version&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster_endpoint_public_access&lt;/code&gt; =&amp;gt; &lt;code&gt;endpoint_public_access&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster_enabled_log_types&lt;/code&gt; =&amp;gt; &lt;code&gt;enabled_log_types&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster_addons&lt;/code&gt; =&amp;gt; &lt;code&gt;addons&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cluster_security_group_name&lt;/code&gt; =&amp;gt; &lt;code&gt;security_group_name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although, in my opinion, keeping the &lt;code&gt;cluster_*&lt;/code&gt; prefix would have been better: with &lt;code&gt;node_security_group_name&lt;/code&gt; next to &lt;code&gt;cluster_security_group_name&lt;/code&gt;, it was clear which parameter was for what.&lt;/p&gt;

&lt;p&gt;And now there is &lt;code&gt;node_security_group_name&lt;/code&gt; and "some" &lt;code&gt;security_group_name&lt;/code&gt;.&lt;/p&gt;
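&lt;p&gt;So, with the renames applied, the module call looks roughly like this (a sketch with the arguments from the errors above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  version = "~&amp;gt; v21.0"

  # was: cluster_name
  name = "${var.env_name}-cluster"
  # was: cluster_version
  kubernetes_version = var.eks_version
  # was: cluster_endpoint_public_access
  endpoint_public_access = var.eks_params.cluster_endpoint_public_access
  # was: cluster_enabled_log_types
  enabled_log_types = var.eks_params.cluster_enabled_log_types
  # was: cluster_security_group_name
  security_group_name = "${var.env_name}-cluster-sg"

  # was: cluster_addons
  addons = {
    ...
  }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;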

&lt;h3&gt;
  
  
  Removed variables in &lt;code&gt;terraform-aws-modules/eks/aws//modules/karpenter&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;OK, edit the variable names in the main module code and run &lt;code&gt;terraform plan&lt;/code&gt; again: now we have errors for changes in the &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/karpenter" rel="noopener noreferrer"&gt;karpenter&lt;/a&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
 Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/karpenter.tf line 7, in module "karpenter":
│ 7: irsa_oidc_provider_arn = module.eks.oidc_provider_arn
│ 
│ An argument named "irsa_oidc_provider_arn" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/karpenter.tf line 8, in module "karpenter":
│ 8: irsa_namespace_service_accounts = ["karpenter:karpenter"]
│ 
│ An argument named "irsa_namespace_service_accounts" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│ on ../../modules/atlas-eks/karpenter.tf line 14, in module "karpenter":
│ 14: enable_irsa = true
│ 
│ An argument named "enable_irsa" is not expected here.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They were removed because IRSA support is gone — an EKS Pod Identity will now be created for Karpenter instead, see &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/karpenter/main.tf#L92" rel="noopener noreferrer"&gt;&lt;code&gt;main.tf#L92&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I wrote about EKS Pod Identities in &lt;a href="https://rtfm.co.ua/en/aws-eks-pod-identities-a-replacement-for-irsa-simplifying-iam-access-management/" rel="noopener noreferrer"&gt;AWS: EKS Pod Identities — a replacement for IRSA? Simplifying IAM access management&lt;/a&gt; and in &lt;a href="https://rtfm.co.ua/en/terraform-managing-eks-access-entries-and-eks-pod-identities/" rel="noopener noreferrer"&gt;Terraform: managing EKS Access Entries and EKS Pod Identities&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s remove them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  #irsa_oidc_provider_arn = module.eks.oidc_provider_arn
  #irsa_namespace_service_accounts = ["karpenter:karpenter"]
  #enable_irsa = true
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform plan&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Important: Karpenter’s EKS Pod Identity Namespace
&lt;/h3&gt;

&lt;p&gt;And here is an important point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  # module.atlas_eks.module.karpenter.aws_eks_pod_identity_association.karpenter[0] will be created
  + resource "aws_eks_pod_identity_association" "karpenter" {
      ...
      + namespace = "kube-system"
      + region = "us-east-1"
      + role_arn = "arn:aws:iam::492***148:role/KarpenterIRSA-atlas-eks-test-1-33-cluster"
      + service_account = "karpenter"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;eks_pod_identity_association&lt;/code&gt; will be created for the Kubernetes Namespace &lt;code&gt;"kube-system"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you have Karpenter running in a different namespace, you need to specify it explicitly when calling the module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~&amp;gt; v21.0"

  cluster_name = module.eks.cluster_name
  namespace = "karpenter"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Otherwise, Karpenter will break: its Pod will end up in &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, and the WorkerNode Group upgrade will fail because a Node will be waiting for Karpenter to become healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;eks_managed_node_groups&lt;/code&gt;: attribute "taints": map of object required
&lt;/h3&gt;

&lt;p&gt;Now there is an error with node group taints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
│ The given value is not suitable for module.atlas_eks.module.eks.var.eks_managed_node_groups declared at .terraform/modules/atlas_eks.eks/variables.tf:1205,1-35: element "test-1-33-default": attribute "taints": map of object required.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Variable definitions now contain detailed object types in place of the previously used any type.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;See &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/compare/v20.37.2...v21.0.0#diff-aaea88c5bda7b25333fb85570ac1dd5167512fa91699dbedb738d180b2262b41L457" rel="noopener noreferrer"&gt;diff 20 vs 21&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uzzk9p38z5ifs99qhe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uzzk9p38z5ifs99qhe0.png" width="800" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now &lt;code&gt;taints&lt;/code&gt; must be a &lt;code&gt;map(object)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  type = map(object({
    key = string
    value = optional(string)
    effect = string
  }))
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And my &lt;code&gt;taints&lt;/code&gt; are currently passed from a variable typed as &lt;code&gt;set(map(string))&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
variable "eks_managed_node_group_params" {
  description = "EKS Managed NodeGroups setting, one item in the map() per each dedicated NodeGroup"
  type = map(object({
    min_size = number
    max_size = number
    desired_size = number
    instance_types = list(string)
    capacity_type = string
    taints = set(map(string))
    max_unavailable_percentage = number
  }))
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the following values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
eks_managed_node_group_params = {
  default_group = {
    min_size = 1
    max_size = 1
    desired_size = 1
    instance_types = ["t3.medium"]
    capacity_type = "ON_DEMAND"
    taints = [
      {
        key = "CriticalAddonsOnly"
        value = "true"
        effect = "NO_SCHEDULE"
      },
      {
        key = "CriticalAddonsOnly"
        value = "true"
        effect = "NO_EXECUTE"
      }
    ]
    max_unavailable_percentage = 100
  }
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what needs to be done is to change the declaration of the variable in my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
variable "eks_managed_node_group_params" {
  description = "EKS Managed NodeGroups setting, one item in the map() per each dedicated NodeGroup"
  type = map(object({
    min_size = number
    max_size = number
    desired_size = number
    instance_types = list(string)
    capacity_type = string
    #taints = set(map(string))
    taints = optional(map(object({
      key = string
      value = optional(string)
      effect = string
    })))
    max_unavailable_percentage = number
  }))
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And update the values — add keys for the &lt;code&gt;map()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
eks_managed_node_group_params = {
  default_group = {
    min_size = 1
    max_size = 1
    desired_size = 1
    instance_types = ["t3.medium"]
    capacity_type = "ON_DEMAND"
    # taints = [
    #   {
    #     key = "CriticalAddonsOnly"
    #     value = "true"
    #     effect = "NO_SCHEDULE"
    #   },
    #   {
    #     key = "CriticalAddonsOnly"
    #     value = "true"
    #     effect = "NO_EXECUTE"
    #   }
    # ]
    taints = {
      critical_no_sched = {
        key = "CriticalAddonsOnly"
        value = "true"
        effect = "NO_SCHEDULE"
      },
      critical_no_exec = {
        key = "CriticalAddonsOnly"
        value = "true"
        effect = "NO_EXECUTE"
      }
    }
    max_unavailable_percentage = 100
  }
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform plan&lt;/code&gt; again, and now everything works without errors.&lt;/p&gt;

&lt;p&gt;Let’s deploy the updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying changes
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;, and now we have a new resource with EKS Pod Identity Association for Karpenter - &lt;code&gt;module.atlas_eks.module.karpenter.aws_eks_pod_identity_association.karpenter&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froqxc8x4s1mn1o4p6vc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froqxc8x4s1mn1o4p6vc1.png" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which wasn’t there in the old cluster with v20.&lt;/p&gt;

&lt;h3&gt;
  
  
  ALB Controller error: “failed to fetch VPC ID from instance metadata”
&lt;/h3&gt;

&lt;p&gt;There was also a problem with AWS Load Balancer Controller, because after the upgrade it could not connect to IMDS, probably due to switching to v2, see &lt;a href="https://rtfm.co.ua/en/aws-security-instance-metadata-service-v1-vs-imds-v2-kubernetes-pod-and-docker-containers/" rel="noopener noreferrer"&gt;AWS: Instance Metadata Service v1 vs IMDS v2 and working with Kubernetes Pod and Docker containers&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
{"level":"error","ts":"2025-08-06T07:25:40Z","logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually, we can just pass the parameters explicitly; see the documentation &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/de50bdd80b227fb2ed940b30e33c224065d8c035/docs/deploy/installation.md#using-the-amazon-ec2-instance-metadata-server-version-2-imdsv2" rel="noopener noreferrer"&gt;Using the Amazon EC2 instance metadata server version 2 (IMDSv2)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note the &lt;code&gt;--aws-vpc-tag-key&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;optional flag — aws-vpc-tag-key if you have a different key for the tag other than “Name”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, let’s try setting the parameters manually to check that it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxmvawiae32km4z93vph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxmvawiae32km4z93vph.png" width="430" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is working now.&lt;/p&gt;
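&lt;p&gt;For that manual check, I set the region and VPC ID flags directly in the controller’s Deployment args (the VPC ID below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=atlas-eks-test-1-33-cluster
            - --aws-region=us-east-1
            - --aws-vpc-id=vpc-0123456789abcdef0
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;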

&lt;p&gt;Now let’s set the parameters for the Helm chart, see its &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L163" rel="noopener noreferrer"&gt;values.yaml#L163&lt;/a&gt;. My controllers are installed from &lt;a href="https://github.com/aws-ia/terraform-aws-eks-blueprints-addons" rel="noopener noreferrer"&gt;aws-ia/eks-blueprints-addons/aws&lt;/a&gt; in Terraform when creating the cluster, so I set them there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
    values = [
      &amp;lt;&amp;lt;-EOT
        replicaCount: 1
        region: ${var.aws_region}
        vpcId: ${var.vpc_id}
        tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      EOT
    ]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the deployment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2l3gh46n3out0854m1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2l3gh46n3out0854m1a.png" width="516" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue: Node Group Status CREATE_FAILED
&lt;/h3&gt;

&lt;p&gt;Here I will describe a problem that arose only when creating a new EKS cluster with module v21 — upgrading an existing cluster proceeds without these issues.&lt;/p&gt;

&lt;p&gt;Actually, here’s the problem: the cluster was created, everything seems OK, but it hangs for a long time on creating the Node Group, and then fails with the “&lt;strong&gt;&lt;em&gt;unexpected state ‘CREATE_FAILED’&lt;/em&gt;&lt;/strong&gt;” error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
╷
│ Error: waiting for EKS Node Group (atlas-eks-test-1-33-cluster:test-1-33-default-20250801112636765600000014) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-03f2c73c7211880f7: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdqc5cnix06dkgssybly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdqc5cnix06dkgssybly.png" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although an EC2 Auto Scaling Group was created, and it has an EC2 instance up and running.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;So, the problem is that the WorkerNode has been created but cannot join the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;The first thing that comes to mind is to check the Security Group, but everything appears to be correct here — all the rules are correct. I compared it with the current EKS cluster, which was created with AWS EKS Terraform module v20.x — everything is the same.&lt;/p&gt;

&lt;p&gt;A problem with IAM? The EC2 instance doesn’t have permissions to access the cluster? Again, comparing with the old cluster, everything is OK.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Check the logs, Billy!”
&lt;/h3&gt;

&lt;p&gt;The funny thing is that SSH access is configured for my EC2 instances, but only on the nodes created by Karpenter, as I wrote in &lt;a href="https://rtfm.co.ua/en/aws-karpenter-and-ssh-for-kubernetes-workernodes/" rel="noopener noreferrer"&gt;AWS: Karpenter and SSH for Kubernetes WorkerNodes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The current problem arose in the “default” NodeGroup, where various controllers are launched.&lt;/p&gt;

&lt;p&gt;So, let’s connect via the AWS Console and select Connect:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmc3pp27b2b2c20p3ciz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmc3pp27b2b2c20p3ciz.png" width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, in &lt;em&gt;EC2 Instance Connect&lt;/em&gt;, select “&lt;em&gt;Connect using a Private IP&lt;/em&gt;” and select an existing &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-using-eice.html" rel="noopener noreferrer"&gt;EC2 Instance Connect Endpoint&lt;/a&gt; or quickly create a new one.&lt;/p&gt;

&lt;p&gt;Set the username — for Amazon Linux, it is &lt;code&gt;ec2-user&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwsv4cypngch5niqoww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwsv4cypngch5niqoww.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And let’s look at the logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaj5nece568i95b36p3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaj5nece568i95b36p3d.png" width="800" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  “Container runtime network not ready — cni plugin not initialized”
&lt;/h3&gt;

&lt;p&gt;Actually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Aug 01 13:26:04 ip-10-0-48-198.ec2.internal kubelet[1619]: E0801 13:26:04.989799 1619 kubelet.go:3126] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow…&lt;/p&gt;

&lt;p&gt;Okay, what’s the situation with VPC CNI?&lt;/p&gt;

&lt;p&gt;Let’s go check out EKS Add-ons, and…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3np5dz2fi9a9cfq5dwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3np5dz2fi9a9cfq5dwb.png" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s completely empty.&lt;/p&gt;

&lt;p&gt;Let’s look at the &lt;code&gt;terraform apply&lt;/code&gt; log - we see "&lt;em&gt;Read complete&lt;/em&gt;", but there is no "&lt;em&gt;Creating...&lt;/em&gt;":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module.atlas_eks.module.eks.data.aws_eks_addon_version.this["vpc-cni"]: Read complete after 0s [id=vpc-cni]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s check if there are any containers on the node — maybe there are some errors?&lt;/p&gt;
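&lt;p&gt;Assuming the AMI uses containerd, the containers on the node can be listed with &lt;code&gt;crictl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo crictl ps -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;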

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeor4gv4e3lat5xd80qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeor4gv4e3lat5xd80qy.png" width="559" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wow, once again…&lt;/p&gt;

&lt;p&gt;Nothing at all.&lt;/p&gt;

&lt;p&gt;Then I went to the module’s GitHub Issues, searched for “&lt;em&gt;addon&lt;/em&gt;”, and found this one: &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/issues/3446" rel="noopener noreferrer"&gt;Managed EKS Node Groups boot without CNI, but addon is added after node group&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Actually, yes — the problem arose due to the absence of the &lt;code&gt;before_compute&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;Although it’s a little strange: the parameter was &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v19.9.0" rel="noopener noreferrer"&gt;added back in v19.9&lt;/a&gt;, and the last time I deployed a cluster from scratch with v20, this problem did not occur.&lt;/p&gt;

&lt;p&gt;Moreover, when I created the Testing cluster from the master branch, where none of the updates described here had been applied and module version v20 is still used, everything worked without any problems.&lt;/p&gt;

&lt;p&gt;And in &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/compare/v20.37.2...v21.0.0#diff-aaea88c5bda7b25333fb85570ac1dd5167512fa91699dbedb738d180b2262b41L457" rel="noopener noreferrer"&gt;diff 20 vs 21&lt;/a&gt; I don’t see any significant changes related to &lt;code&gt;before_compute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, since this only affects creating a new cluster, there is no need to add &lt;code&gt;before_compute&lt;/code&gt; when simply upgrading an existing one. But if you do add it, the add-ons will be recreated.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;before_compute&lt;/code&gt; parameter itself was added to allow specifying which add-ons should be created before the WorkerNodes and which after. See &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/main.tf#L797" rel="noopener noreferrer"&gt;&lt;code&gt;main.tf#L797&lt;/code&gt;&lt;/a&gt; and the comments on &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks/pull/2478" rel="noopener noreferrer"&gt;PR #2478&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Add it as in the &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks?tab=readme-ov-file#eks-managed-node-group" rel="noopener noreferrer"&gt;EKS Managed Node Group&lt;/a&gt; examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
    vpc-cni = {
      addon_version = var.eks_addon_versions.vpc_cni
      before_compute = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET = "1"
          AWS_VPC_K8S_CNI_EXTERNALSNAT = "true"
        }
      })
    }
    aws-ebs-csi-driver = {
      addon_version = var.eks_addon_versions.aws_ebs_csi_driver
      service_account_role_arn = module.ebs_csi_irsa_role.iam_role_arn
    }
    eks-pod-identity-agent = {
      addon_version = var.eks_addon_versions.eks_pod_identity_agent
      before_compute = true
    }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt; again, and here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module.atlas_eks.module.eks.aws_eks_addon.before_compute["vpc-cni"]: Creating...
...
module.atlas_eks.module.eks.aws_eks_addon.before_compute["vpc-cni"]: Creation complete after 46s [id=atlas-eks-test-1-33-cluster:vpc-cni]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the AWS Console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful51xq6b3hst723mv0ul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful51xq6b3hst723mv0ul.png" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NodeGroup created without errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
module.atlas_eks.module.eks.module.eks_managed_node_group["test-1-33-default"].aws_eks_node_group.this[0]: Still creating... [01m40s elapsed]
module.atlas_eks.module.eks.module.eks_managed_node_group["test-1-33-default"].aws_eks_node_group.this[0]: Creation complete after 1m49s [id=atlas-eks-test-1-33-cluster:test-1-33-default-20250801142042855800000003]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/terraform-aws-eks-terraform-module-update-from-version-20-x-to-version-21/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>terraform</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>Kubernetes: PVC in a StatefulSet, and the “Forbidden updates to statefulset spec” error</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Wed, 17 Sep 2025 12:36:49 +0000</pubDate>
      <link>https://forem.com/setevoy/kubernetes-pvc-in-a-statefulset-and-the-forbidden-updates-to-statefulset-spec-error-5007</link>
      <guid>https://forem.com/setevoy/kubernetes-pvc-in-a-statefulset-and-the-forbidden-updates-to-statefulset-spec-error-5007</guid>
      <description>&lt;h3&gt;
  
  
  Kubernetes: PVC in StatefulSet, and the “Forbidden updates to statefulset spec” error
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have a &lt;a href="https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-logs-single" rel="noopener noreferrer"&gt;VictoriaLogs&lt;/a&gt; Helm chart with a PVC size of 30 GB, which is no longer enough for us, and we need to increase it.&lt;/p&gt;

&lt;p&gt;The problem is that &lt;code&gt;.spec.volumeClaimTemplates[*].spec.resources.requests.storage&lt;/code&gt; in a StatefulSet is immutable, so we can't just change the size through the &lt;code&gt;values.yaml&lt;/code&gt; file: doing so leads to the error &lt;strong&gt;&lt;em&gt;"Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'revisionHistoryLimit', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden"&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The chart values now look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;victoria-logs-single:
  server:
    persistentVolume:
      enabled: true
      storageClassName: gp2-retain
      size: 30Gi
    retentionPeriod: 7d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with the chart's default StatefulSet mode, &lt;code&gt;volumeClaimTemplates&lt;/code&gt; is used to create the PVCs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...  
volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: server-volume
        ...
      spec:
        ...
        resources:
          requests:
            storage: {{ $app.persistentVolume.size }}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the chart used a Deployment instead of a StatefulSet, it would create a standalone PVC — see the &lt;a href="https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-logs-single/templates/pvc.yaml#L3C1-L3C81" rel="noopener noreferrer"&gt;&lt;code&gt;pvc.yaml&lt;/code&gt;&lt;/a&gt; template.&lt;/p&gt;

&lt;p&gt;You could simply create a separate PVC yourself and attach it through the &lt;a href="https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-logs-single/values.yaml#L167" rel="noopener noreferrer"&gt;&lt;code&gt;existingClaim&lt;/code&gt;&lt;/a&gt; value. But we already have a PersistentVolume, and we don't want to create a new one and migrate the data (that is possible if needed — see &lt;a href="https://rtfm.co.ua/en/victoriametrics-migrating-vmsingle-and-victorialogs-data-between-kubernetes-cluster/" rel="noopener noreferrer"&gt;VictoriaMetrics: migrating VMSingle and VictoriaLogs data between Kubernetes clusters&lt;/a&gt; — but it involves downtime). So let's see how to solve this differently: without deleting Pods and without stopping the service.&lt;/p&gt;

&lt;h3&gt;
  
  
  storageClassName and AllowVolumeExpansion
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;storageClass&lt;/code&gt; used to create a Persistent Volume must support &lt;code&gt;AllowVolumeExpansion&lt;/code&gt; - see &lt;a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#allow-volume-expansion" rel="noopener noreferrer"&gt;Volume expansion&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe storageclass gp2-retain
Name: gp2-retain
...
Provisioner: kubernetes.io/aws-ebs
Parameters: &amp;lt;none&amp;gt;
AllowVolumeExpansion: True
MountOptions: &amp;lt;none&amp;gt;
ReclaimPolicy: Retain
VolumeBindingMode: WaitForFirstConsumer
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We create this &lt;code&gt;storageClass&lt;/code&gt; during EKS cluster provisioning, from a simple Terraform manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
resource "kubectl_manifest" "storageclass_gp2_retain" {

  yaml_body = &amp;lt;&amp;lt;YAML
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: gp2-retain
    provisioner: kubernetes.io/aws-ebs
    reclaimPolicy: Retain
    allowVolumeExpansion: true
    volumeBindingMode: WaitForFirstConsumer
  YAML
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform does have a dedicated &lt;a href="https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/storage_class" rel="noopener noreferrer"&gt;&lt;code&gt;storage_class&lt;/code&gt;&lt;/a&gt; resource, though, and it would be better to use it instead of the &lt;code&gt;kubectl_manifest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And the &lt;code&gt;kubernetes.io/aws-ebs&lt;/code&gt; driver is already deprecated (OMG, &lt;a href="https://aws.amazon.com/blogs/containers/amazon-ebs-csi-driver-is-now-generally-available-in-amazon-eks-add-ons/" rel="noopener noreferrer"&gt;since Kubernetes 1.17&lt;/a&gt;!), so it's time to migrate to &lt;code&gt;ebs.csi.aws.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But we’ll fix this later, right now the goal is to simply increase the disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducing the issue
&lt;/h3&gt;

&lt;p&gt;For the test, let’s write our own STS with &lt;code&gt;volumeClaimTemplates&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-sts
spec:
  serviceName: demo-sts-svc
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2-retain
        resources:
          requests:
            storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;volumeClaimTemplates&lt;/code&gt;, set the &lt;code&gt;storageClassName&lt;/code&gt; and the size to 1 gigabyte.&lt;/p&gt;

&lt;p&gt;Deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-sts-pvc.yaml 
statefulset.apps/demo-sts created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the PVC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
data-demo-sts-0   Bound    pvc-31a9a547-7547-4d34-bb2d-2c7015b9e0f3   1Gi        RWO            gp2-retain     &amp;lt;unset&amp;gt;                 15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if we want to increase the size via &lt;code&gt;volumeClaimTemplates&lt;/code&gt; from 1Gi to 2Gi:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  volumeClaimTemplates:
    ...
        resources:
          requests:
            storage: 2Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we get an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-sts-pvc.yaml 
The StatefulSet "demo-sts" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'revisionHistoryLimit', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;But we can get around this very easily:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;edit the PVC manually and set a new size&lt;/li&gt;
&lt;li&gt;delete STS with the &lt;code&gt;--cascade=orphan&lt;/code&gt; - see &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/use-cascading-deletion/#set-orphan-deletion-policy" rel="noopener noreferrer"&gt;Delete owner objects and orphan dependents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;create STS again&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;li&gt;profit!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s try it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: before changing disks, don’t forget about backups!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Edit the PVC manually —  change &lt;code&gt;resources.requests.storage&lt;/code&gt; from 1Gi to 2Gi:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favh7wrurqnhk8v516xau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favh7wrurqnhk8v516xau.png" width="266" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the Events of this PVC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pvc data-demo-sts-0
...
  Normal ExternalExpanding 40s volume_expand CSI migration enabled for kubernetes.io/aws-ebs; waiting for external resizer to expand the pvc
  Normal Resizing 40s external-resizer ebs.csi.aws.com External resizer is resizing volume pvc-31a9a547-7547-4d34-bb2d-2c7015b9e0f3
  Normal FileSystemResizeRequired 35s external-resizer ebs.csi.aws.com Require file system resize of volume on node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after a few more seconds, it’s done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  Normal FileSystemResizeSuccessful 19s kubelet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;CAPACITY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
data-demo-sts-0   Bound    pvc-31a9a547-7547-4d34-bb2d-2c7015b9e0f3   2Gi        RWO            gp2-retain     &amp;lt;unset&amp;gt;                 4m7s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;2Gi&lt;/code&gt;, everything is OK.&lt;/p&gt;

&lt;p&gt;And now we also have 2 gigabytes in the Pod itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk exec -ti demo-sts-0 -- df -h /data
Filesystem      Size   Used    Available   Use%   Mounted on
/dev/nvme7n1    1.9G   24.0K   1.9G        0%     /data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if we try to deploy the changes to &lt;code&gt;volumeClaimTemplates.spec.resources.requests.storage&lt;/code&gt; again, we will still get an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-sts-pvc.yaml 
The StatefulSet "demo-sts" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'revisionHistoryLimit', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, delete the STS itself, but leave all its dependent objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl delete statefulset demo-sts --cascade=orphan 
statefulset.apps "demo-sts" deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if the Pod is alive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get pod
NAME         READY   STATUS    RESTARTS   AGE
demo-sts-0   1/1     Running   0          3m13s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we just create STS again, with a new value in the &lt;code&gt;volumeClaimTemplates.spec.resources.requests.storage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-sts-pvc.yaml 
statefulset.apps/demo-sts created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/kubernetes-pvc-v-statefulset-and-the-forbidden-updates-to-statefulset-spec-error/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>Kubernetes: what are the Kubernetes Operator and CustomResourceDefinition</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Tue, 16 Sep 2025 07:27:07 +0000</pubDate>
      <link>https://forem.com/setevoy/kubernetes-what-are-the-kubernetes-operator-and-customresourcedefinition-26id</link>
      <guid>https://forem.com/setevoy/kubernetes-what-are-the-kubernetes-operator-and-customresourcedefinition-26id</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Probably everyone has used operators in Kubernetes, for example the &lt;a href="https://github.com/zalando/postgres-operator" rel="noopener noreferrer"&gt;PostgreSQL operator&lt;/a&gt; or the &lt;a href="https://docs.victoriametrics.com/operator/" rel="noopener noreferrer"&gt;VictoriaMetrics Operator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But what’s going on under the hood? How and to what are CustomResourceDefinition (CRD) applied, and what is an “operator”?&lt;/p&gt;

&lt;p&gt;And finally, what is the difference between a Kubernetes Operator and a Kubernetes Controller?&lt;/p&gt;

&lt;p&gt;In the previous part — &lt;a href="https://rtfm.co.ua/en/kubernetes-kubernetes-api-api-groups-crds-and-the-etcd/" rel="noopener noreferrer"&gt;Kubernetes: Kubernetes APIs, API Groups, CRDs, etcd&lt;/a&gt; — we dug a little deeper into how the Kubernetes API works and what a CRD is, and now we can try to write our own micro-operator, a simple MVP, and use it as an example to understand the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes Controller vs Kubernetes Operator&lt;/li&gt;
&lt;li&gt;What is: Kubernetes Controller&lt;/li&gt;
&lt;li&gt;What is: Kubernetes Operator&lt;/li&gt;
&lt;li&gt;Kubernetes Operator frameworks&lt;/li&gt;
&lt;li&gt;Creating a CustomResourceDefinition&lt;/li&gt;
&lt;li&gt;Creating a Kubernetes Operator with Kopf&lt;/li&gt;
&lt;li&gt;Resource templates: Kopf and Kubebuilder&lt;/li&gt;
&lt;li&gt;And what about in real operators?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes Controller vs Kubernetes Operator
&lt;/h3&gt;

&lt;p&gt;So, what is the main difference between Controllers and Operators?&lt;/p&gt;

&lt;h3&gt;
  
  
  What is: Kubernetes Controller
&lt;/h3&gt;

&lt;p&gt;Simply put, a &lt;em&gt;Controller&lt;/em&gt; is just a service that monitors resources in a cluster and brings their state in line with how that state is described in the database — &lt;code&gt;etcd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Kubernetes, we have a set of default controllers — Core Controllers within the &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/" rel="noopener noreferrer"&gt;Kube Controller Manager&lt;/a&gt;, such as the ReplicaSet Controller, which checks the number of pods in the Deployment against the replicas value, or the Deployment Controller, which controls the creation and update of ReplicaSets, or the PersistentVolume Controller and PersistentVolumeClaim Binder for working with disks, etc.&lt;/p&gt;

&lt;p&gt;In addition to these default controllers, you can create your own controller or use an existing one, such as ExternalDNS Controller. These are examples of custom controllers.&lt;/p&gt;

&lt;p&gt;Controllers work in a &lt;strong&gt;&lt;em&gt;control loop&lt;/em&gt;&lt;/strong&gt;  — a cyclic process in which they constantly check the resources assigned to them — either to change existing resources in the system or to respond to the addition of new ones.&lt;/p&gt;

&lt;p&gt;During each check (a &lt;strong&gt;&lt;em&gt;reconciliation loop&lt;/em&gt;&lt;/strong&gt;), the Controller compares the &lt;em&gt;current state&lt;/em&gt; of the resource with the &lt;em&gt;desired state&lt;/em&gt; — that is, the parameters specified in its manifest when the resource was created or updated.&lt;/p&gt;

&lt;p&gt;If the desired state does not correspond to the current state, the controller performs the necessary actions to bring these states into alignment.&lt;/p&gt;
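&lt;p&gt;The control loop described above can be sketched in a few lines of plain Python (a conceptual model only, not real controller code):&lt;/p&gt;

```python
# Conceptual model of a controller's reconciliation step: compare the
# desired state (from the manifest) with the current state and return
# the actions needed to converge them.

def reconcile(desired: dict, current: dict) -> list:
    actions = []
    # e.g. the ReplicaSet Controller comparing replica counts
    diff = desired["replicas"] - current["replicas"]
    if diff > 0:
        actions.append(f"create {diff} pod(s)")
    elif diff != 0:
        actions.append(f"delete {-diff} pod(s)")
    return actions  # an empty list means the states already match

print(reconcile({"replicas": 3}, {"replicas": 1}))  # ['create 2 pod(s)']
print(reconcile({"replicas": 2}, {"replicas": 2}))  # []
```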

&lt;h3&gt;
  
  
  What is: Kubernetes Operator
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Kubernetes Operator&lt;/em&gt;, in turn, is a kind of “controller on steroids”: in fact, Operator is a Custom Controller in the sense that it has its own service in the form of a Pod that communicates with the Kubernetes API to receive and update information about resources.&lt;/p&gt;

&lt;p&gt;But if ordinary controllers work with “default” resource types (Pod, Endpoint Slice, Node, PVC), then for Operator we describe our own custom resources using a manifest with Custom Resource.&lt;/p&gt;

&lt;p&gt;What these resources look like and what parameters they have is defined through CustomResourceDefinitions, which are written to the Kubernetes database and added to the Kubernetes API — and this is what allows our custom Controller to operate on these resources.&lt;/p&gt;

&lt;p&gt;That is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Controller&lt;/strong&gt; is a component, a service, and &lt;strong&gt;Operator&lt;/strong&gt; is a combination of one or more custom Controllers and corresponding CRDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controller&lt;/strong&gt; responds to changes in resources, while &lt;strong&gt;Operator&lt;/strong&gt; adds new resource types plus a controller that manages them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes Operator frameworks
&lt;/h3&gt;

&lt;p&gt;There are several solutions that simplify the creation of operators.&lt;/p&gt;

&lt;p&gt;The main ones are &lt;a href="https://book.kubebuilder.io/introduction" rel="noopener noreferrer"&gt;Kubebuilder&lt;/a&gt;, a framework for creating controllers in Go, and &lt;a href="https://kopf.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Kopf&lt;/a&gt;, a framework in Python.&lt;/p&gt;

&lt;p&gt;There is also the &lt;a href="https://sdk.operatorframework.io/" rel="noopener noreferrer"&gt;Operator SDK&lt;/a&gt;, which allows you to work with controllers even with Helm, without code.&lt;/p&gt;

&lt;p&gt;At first, I was thinking of doing it in bare Go, without any frameworks, to better understand how everything works under the hood — but this post started to turn into 95% Golang.&lt;/p&gt;

&lt;p&gt;And since the main idea of the post was to show conceptually what a Kubernetes Operator is, what role CustomResourceDefinitions play, and how they interact with each other and allow you to manage resources, I decided to use &lt;a href="https://kopf.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Kopf&lt;/a&gt; because it’s very simple and quite suitable for these purposes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a CustomResourceDefinition
&lt;/h3&gt;

&lt;p&gt;Let’s start with writing the CRD.&lt;/p&gt;

&lt;p&gt;Actually, CustomResourceDefinition is just a description of what fields our custom resource will have so that the controller can use them through the Kubernetes API to create real resources — whether they are some resources in Kubernetes itself, or external ones like AWS Load Balancer or AWS Route 53.&lt;/p&gt;

&lt;p&gt;What we will do: we will write a CRD that will describe the &lt;code&gt;MyApp&lt;/code&gt; resource, and this resource will have fields for the Docker image and a custom field with some text that will then be written to the Kubernetes Pod logs.&lt;/p&gt;

&lt;p&gt;Kubernetes documentation on CRD — &lt;a href="https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/" rel="noopener noreferrer"&gt;Extend the Kubernetes API with CustomResourceDefinitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Create the file &lt;code&gt;myapp-crd.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.demo.rtfm.co.ua
spec:
  group: demo.rtfm.co.ua
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string
                banner: 
                  type: string
                  description: "Optional banner text for the application"
  scope: Namespaced
  names:
    plural: myapps
    singular: myapp
    kind: MyApp
    shortNames:
      - ma
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;spec.group: demo.rtfm.co.ua&lt;/code&gt;: create a new API Group, all resources of this type will be available at &lt;code&gt;/apis/demo.rtfm.co.ua/...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;versions&lt;/code&gt;: list of versions of the new resource&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;name.v1&lt;/code&gt;: we will have only one version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;served: true&lt;/code&gt;: add a new resource to the Kube API - you can do &lt;code&gt;kubectl get myapp&lt;/code&gt; (&lt;code&gt;GET /apis/demo.rtfm.co.ua/v1/myapps&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;storage: true&lt;/code&gt;: this version will be used for storage in &lt;code&gt;etcd&lt;/code&gt; (if several versions are described, only one should be with &lt;code&gt;storage: true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema&lt;/code&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openAPIV3Schema&lt;/code&gt;: describe the API scheme according to the &lt;a href="https://swagger.io/specification/" rel="noopener noreferrer"&gt;OpenAPI v3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type: object&lt;/code&gt;: describe an object with nested fields (key: value)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;properties&lt;/code&gt;: what fields the object will have&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spec&lt;/code&gt;: what we can use in YAML manifests when creating&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type: object&lt;/code&gt; - describe the following fields:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;properties&lt;/code&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;image.type: string&lt;/code&gt;: a Docker image&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;banner.type: string&lt;/code&gt;: our custom field through which we will add some entry to the resource logs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scope: Namespaced&lt;/code&gt;: all resources of this type will exist in a specific Kubernetes Namespace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;names&lt;/code&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;plural: myapps&lt;/code&gt;: the resources will be available through &lt;code&gt;/apis/demo.rtfm.co.ua/v1/namespaces/&amp;lt;ns&amp;gt;/myapps/&lt;/code&gt;, and how we can "access" the resource (&lt;code&gt;kubectl get myapp&lt;/code&gt;), used in RBAC where you need to specify &lt;code&gt;resources:["myapps"]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;singular: myapp&lt;/code&gt;: alias for convenience&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shortNames:[ma]&lt;/code&gt;: short alias for convenience&lt;/li&gt;
&lt;/ul&gt;
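&lt;p&gt;In essence, the API server uses this &lt;code&gt;openAPIV3Schema&lt;/code&gt; to validate every incoming &lt;code&gt;MyApp&lt;/code&gt; manifest. A toy pure-Python approximation of that type check (deliberately simplified — real validation is far more involved):&lt;/p&gt;

```python
# Toy approximation of the openAPIV3Schema type check for MyApp:
# both 'image' and 'banner' are declared as strings in the CRD.

SCHEMA = {"image": str, "banner": str}  # properties from the CRD schema

def validate_spec(spec: dict) -> list:
    errors = []
    for field, value in spec.items():
        expected = SCHEMA.get(field)
        if expected is None:
            continue  # unknown fields are pruned by default, not rejected
        if not isinstance(value, expected):
            errors.append(f"spec.{field}: expected string")
    return errors

print(validate_spec({"image": "nginx:latest", "banner": "hello"}))  # []
print(validate_spec({"image": 42}))  # ['spec.image: expected string']
```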

&lt;p&gt;Let’s start Minikube:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ minikube start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f myapp-crd.yaml 
customresourcedefinition.apiextensions.k8s.io/myapps.demo.rtfm.co.ua created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s look at the Groups API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$kubectl api-versions 
...
demo.rtfm.co.ua/v1 
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a new resource in this API Group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl api-resources --api-group=demo.rtfm.co.ua
NAME     SHORTNAMES   APIVERSION           NAMESPACED   KIND
myapps   ma           demo.rtfm.co.ua/v1   true         MyApp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK  —  we have created a CRD, and now we can even create a CustomResource (CR).&lt;/p&gt;

&lt;p&gt;Create the file &lt;code&gt;myapp-example-resource.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: demo.rtfm.co.ua/v1 # matches the CRD's group and version
kind: MyApp # kind from the CRD's 'spec.names.kind'
metadata:
  name: example-app # name of this custom resource
  namespace: default # namespace (CRD has scope: Namespaced)
spec:
  image: nginx:latest # container image to use (from our schema)
  banner: "This pod was created by MyApp operator 🚀"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f myapp-example-resource.yaml 
myapp.demo.rtfm.co.ua/example-app created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get myapp
NAME          AGE
example-app   15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there are no Pod resources yet, because we don’t have a controller that works with this resource type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Kubernetes Operator with Kopf
&lt;/h3&gt;

&lt;p&gt;So, we will use Kopf to create a Kubernetes Pod, but using our own CRD.&lt;/p&gt;

&lt;p&gt;Create a Python virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv venv 
$ . ./venv/bin/activate 
(venv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add dependencies to a &lt;code&gt;requirements.txt&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kopf 
kubernetes
PyYAML
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install them  —  with &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;uv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s write the operator code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import kopf
import kubernetes
import yaml

# use kopf to register a handler for the creation of MyApp custom resources
@kopf.on.create('demo.rtfm.co.ua', 'v1', 'myapps')
# this function will be called when a new MyApp resource is created
def create_myapp(spec, name, namespace, logger, **kwargs):
    # get image value from the spec of the CustomResource manifest
    image = spec.get('image')
    if not image:
        raise kopf.PermanentError("Field 'spec.image' must be provided.")

    # get optional banner value from the CR manifest spec
    banner = spec.get('banner')

    # load pod template YAML from file
    path = os.path.join(os.path.dirname(__file__), 'pod.yaml')
    with open(path, 'rt') as f:
        pod_template = f.read()

    # render pod YAML with provided values
    pod_yaml = pod_template.format(
        name=f"{name}-pod",
        image=image,
        app_name=name,
    )
    # create Pod definition from the rendered YAML
    # it uses PyYAML to parse the YAML string into a Python dictionary
    # which can be used by Kubernetes API client
    # it is used to create a Pod object in Kubernetes
    pod_spec = yaml.safe_load(pod_yaml)

    # inject banner as environment variable if provided
    if banner:
        # it is used to add a new environment variable into the container spec
        container = pod_spec['spec']['containers'][0]
        env = container.setdefault('env', [])
        env.append({
            'name': 'BANNER',
            'value': banner
        })

    # create Kubernetes CoreV1 API client
    # used to interact with the Kubernetes API
    api = kubernetes.client.CoreV1Api()

    try:
        # it sends a request to the Kubernetes API to create a new Pod
        # uses 'create_namespaced_pod' method to create the Pod in the specified namespace
        # 'namespace' is the namespace where the Pod will be created
        # 'body' is the Pod specification that was created from the YAML template
        api.create_namespaced_pod(namespace=namespace, body=pod_spec)
        logger.info(f"Pod {name}-pod created.")
    except kubernetes.client.exceptions.ApiException as e:
        logger.error(f"Failed to create pod {name}-pod: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a template that will be used by our Operator to create resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: {name}
  labels:
    app: {app_name}
spec:
  containers:
    - name: {app_name}
      image: {image}
      ports:
        - containerPort: 80
      env:
        - name: BANNER
          value: "" # will be overridden in code if provided
      command: ["/bin/sh", "-c"]
      args:
        - |
          if [ -n "$BANNER" ]; then
            echo "$BANNER";
          fi
          exec sleep infinity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the operator with &lt;code&gt;kopf run myoperator.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We already have a CustomResource created, and the Operator should see it and create a Kubernetes Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kopf run myoperator.py --verbose
...
[2025-07-18 13:59:58,201] kopf._cogs.clients.w [DEBUG] Starting the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
[2025-07-18 13:59:58,201] kopf._cogs.clients.w [DEBUG] Starting the watch-stream for myapps.v1.demo.rtfm.co.ua cluster-wide.
[2025-07-18 13:59:58,305] kopf.objects [DEBUG] [default/example-app] Creation is in progress: {'apiVersion': 'demo.rtfm.co.ua/v1', 'kind': 'MyApp', 'metadata': {'annotations': {'kubectl.kubernetes.io/last-applied-configuration': '{"apiVersion":"demo.rtfm.co.ua/v1","kind":"MyApp","metadata":{"annotations":{},"name":"example-app","namespace":"default"},"spec":{"banner":"This pod was created by MyApp operator 🚀","image":"nginx:latest","replicas":3}}\n'}, 'creationTimestamp': '2025-07-18T09:55:42Z', 'generation': 2, 'managedFields': [{'apiVersion': 'demo.rtfm.co.ua/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:kubectl.kubernetes.io/last-applied-configuration': {}}}, 'f:spec': {'.': {}, 'f:banner': {}, 'f:image': {}, 'f:replicas': {}}}, 'manager': 'kubectl-client-side-apply', 'operation': 'Update', 'time': '2025-07-18T10:48:27Z'}], 'name': 'example-app', 'namespace': 'default', 'resourceVersion': '2955', 'uid': '8b674a99-05ab-4d4b-8205-725de450890a'}, 'spec': {'banner': 'This pod was created by MyApp operator 🚀', 'image': 'nginx:latest', 'replicas': 3}}
...
[2025-07-18 13:59:58,325] kopf.objects [INFO] [default/example-app] Pod example-app-pod created.
[2025-07-18 13:59:58,326] kopf.objects [INFO] [default/example-app] Handler 'create_myapp' succeeded.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get pod
NAME              READY   STATUS    RESTARTS   AGE
example-app-pod   1/1     Running   0          68s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And its logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk logs -f example-app-pod 
This pod was created by MyApp operator 🚀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, the Operator launched the Pod from our CustomResource: it took the &lt;code&gt;spec.banner&lt;/code&gt; field with the string &lt;em&gt;“This pod was created by MyApp operator 🚀”&lt;/em&gt;, injected it into the container as the &lt;code&gt;BANNER&lt;/code&gt; environment variable, and the &lt;code&gt;/bin/sh -c&lt;/code&gt; command in the Pod printed it.&lt;/p&gt;
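&lt;p&gt;The banner-injection step the handler performs can be exercised in isolation, with no cluster or PyYAML needed (the helper name here is mine, not from the operator code):&lt;/p&gt;

```python
def inject_banner(pod_spec: dict, banner: str) -> dict:
    """Add a BANNER env var to the first container, as the operator does."""
    if banner:
        container = pod_spec['spec']['containers'][0]
        # setdefault returns the existing 'env' list or creates an empty one
        env = container.setdefault('env', [])
        env.append({'name': 'BANNER', 'value': banner})
    return pod_spec

# minimal parsed Pod spec, as yaml.safe_load() would produce it
spec = {'spec': {'containers': [{'name': 'web', 'image': 'nginx:latest'}]}}
inject_banner(spec, 'This pod was created by MyApp operator')
print(spec['spec']['containers'][0]['env'])
```

&lt;p&gt;An empty banner leaves the spec untouched, which is why the template’s default &lt;code&gt;value: ""&lt;/code&gt; only matters when the CustomResource provides one.&lt;/p&gt;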

&lt;h3&gt;
  
  
  Resource templates: Kopf and Kubebuilder
&lt;/h3&gt;

&lt;p&gt;Instead of having a separate &lt;code&gt;pod.yaml&lt;/code&gt; file, we could describe everything directly in the operator code.&lt;/p&gt;

&lt;p&gt;That is, you can describe something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
    # get optional banner value
    banner = spec.get('banner', '')

    # define Pod spec as a Python dict
    pod_spec = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{name}-pod",
            "labels": {
                "app": name,
            },
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "env": [
                        {
                            "name": "BANNER",
                            "value": banner
                        }
                    ],
                    "command": ["/bin/sh", "-c"],
                    "args": ['echo "$BANNER"; exec sleep infinity'],
                    "ports": [
                        {
                            "containerPort": 80
                        }
                    ]
                }
            ]
        }
    }

    # create Kubernetes API client
    api = kubernetes.client.CoreV1Api()
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the case of Kubebuilder, a function is usually created that uses the CustomResource manifest (&lt;code&gt;cr *myappv1.MyApp&lt;/code&gt;) and forms an object of type &lt;code&gt;*corev1.Pod&lt;/code&gt; using the Go structures &lt;a href="https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3880" rel="noopener noreferrer"&gt;&lt;code&gt;corev1.PodSpec&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2752" rel="noopener noreferrer"&gt;&lt;code&gt;corev1.Container&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
// newPod is a helper function that builds a Kubernetes Pod object
// based on the custom MyApp resource. It returns a pointer to corev1.Pod,
// which is later passed to controller-runtime's client.Create(...) to create the Pod in the cluster.
func newPod(cr *myappv1.MyApp) *corev1.Pod {
    // `cr` is a pointer to your CustomResource of kind MyApp
    // type MyApp is generated by Kubebuilder and lives in your `api/v1/myapp_types.go`
    // it contains fields like cr.Spec.Image, cr.Spec.Banner, cr.Name, cr.Namespace, etc.
    return &amp;amp;corev1.Pod{
        // corev1.Pod is a Go struct representing the built-in Kubernetes Pod type
        // it's defined in "k8s.io/api/core/v1" package (aliased here as corev1)
        // we return a pointer to it (`*corev1.Pod`) because client-go methods like
        // `client.Create()` expect pointer types

        ObjectMeta: metav1.ObjectMeta{
            // metav1.ObjectMeta comes from "k8s.io/apimachinery/pkg/apis/meta/v1"
            // it defines metadata like name, namespace, labels, annotations, ownerRefs, etc.
            Name: cr.Name + "-pod", // generate Pod name based on the CR's name
            Namespace: cr.Namespace, // place the Pod in the same namespace as the CR
            Labels: map[string]string{ // set a label for identification or selection
                "app": cr.Name, // e.g., `app=example-app`
            },
        },

        Spec: corev1.PodSpec{
            // corev1.PodSpec defines everything about how the Pod runs
            // including containers, volumes, restart policy, etc.

            Containers: []corev1.Container{
                // define a single container inside the Pod

                {
                    Name: cr.Name, // use CR name as container name (must be DNS compliant)
                    Image: cr.Spec.Image, // container image (e.g., "nginx:1.25")

                    Env: []corev1.EnvVar{
                        // corev1.EnvVar is a struct that defines environment variables
                        {
                            Name: "BANNER", // name of the variable
                            Value: cr.Spec.Banner, // value from the CR spec
                        },
                    },

                    Command: []string{"/bin/sh", "-c"},
                    // override container ENTRYPOINT to run a shell command

                    Args: []string{
                        // run a command that prints the banner and sleeps forever
                        // fmt.Sprintf(...) injects the value at runtime into the string
                        fmt.Sprintf(`echo "%s"; exec sleep infinity`, cr.Spec.Banner),
                    },

                    // optional: could also add ports, readiness/liveness probes, etc.
                },
            },
        },
    }
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  And what about in real operators?
&lt;/h3&gt;

&lt;p&gt;But we did this for “internal” Kubernetes resources.&lt;/p&gt;

&lt;p&gt;What about external resources?&lt;/p&gt;

&lt;p&gt;Here’s just an example — I haven’t tested it, but the general idea is this: just take an SDK (in the Python example, it’s &lt;code&gt;boto3&lt;/code&gt;), and using the fields from the CustomResource (for example, &lt;code&gt;subnets&lt;/code&gt; or &lt;code&gt;schema&lt;/code&gt;), make the appropriate API requests to AWS through the SDK.&lt;/p&gt;

&lt;p&gt;An example of such a CustomResource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: demo.rtfm.co.ua/v1
kind: MyIngress
metadata:
  name: myapp
spec:
  subnets:
    - subnet-abc
    - subnet-def
  scheme: internet-facing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the code that could create an AWS ALB from it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import kopf
import boto3
import botocore
import logging

# create a global boto3 client for AWS ELBv2 service
# this client will be reused for all requests from the operator
# NOTE: region must match where your subnets and VPC exist
elbv2 = boto3.client("elbv2", region_name="us-east-1")

# define a handler that is triggered when a new MyIngress resource is created
@kopf.on.create('demo.rtfm.co.ua', 'v1', 'myingresses')
def create_ingress(spec, name, namespace, status, patch, logger, **kwargs):
    # extract the list of subnet IDs from the CustomResource 'spec.subnets' field
    # these subnets must belong to the same VPC and be public if scheme=internet-facing
    subnets = spec.get('subnets')

    # extract optional scheme (default to 'internet-facing' if not provided)
    scheme = spec.get('scheme', 'internet-facing')

    # validate input: at least 2 subnets are required to create an ALB
    if not subnets:
        raise kopf.PermanentError("spec.subnets is required.")

    # attempt to create an ALB in AWS using the provided spec
    # using the boto3 ELBv2 client
    try:
        response = elbv2.create_load_balancer(
            Name=f"{name}-alb", # ALB name will be derived from CR name
            Subnets=subnets, # list of subnet IDs provided by user
            Scheme=scheme, # 'internet-facing' or 'internal'
            Type='application', # we are creating an ALB (not NLB)
            IpAddressType='ipv4', # only IPv4 supported here (could be 'dualstack')
            Tags=[ # add tags for ownership tracking
                {'Key': 'ManagedBy', 'Value': 'kopf'},
            ]
        )
    except botocore.exceptions.ClientError as e:
        # if AWS API fails (e.g. invalid subnet, quota exceeded), retry later
        raise kopf.TemporaryError(f"Failed to create ALB: {e}", delay=30)

    # parse ALB metadata from AWS response
    lb = response['LoadBalancers'][0] # ALB list should contain exactly one entry
    dns_name = lb['DNSName'] # external DNS of the ALB (e.g. abc.elb.amazonaws.com)
    arn = lb['LoadBalancerArn'] # unique ARN of the ALB (used for deletion or listeners)

    # log the creation for operator diagnostics
    logger.info(f"Created ALB: {dns_name}")

    # save ALB info into the CustomResource status field
    # this updates .status.alb.dns and .status.alb.arn in the CR object
    patch.status['alb'] = {
        'dns': dns_name,
        'arn': arn,
    }

    # return a dict, will be stored in the finalizer state
    # used later during deletion to clean up the ALB
    return {'alb-arn': arn}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
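&lt;p&gt;I haven’t tested this against real AWS either, but a matching deletion handler could look roughly like this: Kopf persists a create-handler’s return value under &lt;code&gt;status.&amp;lt;handler_name&amp;gt;&lt;/code&gt;, so the ARN returned by &lt;code&gt;create_ingress&lt;/code&gt; is available when the CustomResource is deleted. The ELBv2 client is passed in explicitly here (my own choice, for testability without AWS credentials):&lt;/p&gt;

```python
def delete_alb(status: dict, elbv2_client) -> bool:
    """Delete the ALB recorded by create_ingress; True if one was deleted."""
    # kopf stored the create handler's return dict under status.create_ingress
    arn = (status.get('create_ingress') or {}).get('alb-arn')
    if not arn:
        # nothing was created, or the status was lost: nothing to clean up
        return False
    # boto3 ELBv2: delete the load balancer by its ARN
    elbv2_client.delete_load_balancer(LoadBalancerArn=arn)
    return True
```

&lt;p&gt;In the operator itself this would be wrapped in a &lt;code&gt;@kopf.on.delete('demo.rtfm.co.ua', 'v1', 'myingresses')&lt;/code&gt; handler that receives &lt;code&gt;status&lt;/code&gt; and calls &lt;code&gt;delete_alb(status, elbv2)&lt;/code&gt;.&lt;/p&gt;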



&lt;p&gt;In the case of Go and Kubebuilder, we would use the &lt;a href="https://github.com/aws/aws-sdk-go-v2" rel="noopener noreferrer"&gt;&lt;code&gt;aws-sdk-go-v2&lt;/code&gt;&lt;/a&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "context"
    "fmt"

    elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
    "github.com/aws/aws-sdk-go-v2/aws"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    networkingv1 "k8s.io/api/networking/v1"
)

func newALB(ctx context.Context, client *elbv2.Client, cr *networkingv1.Ingress) (string, error) {
    // build input for the ALB
    input := &amp;amp;elbv2.CreateLoadBalancerInput{
        Name: aws.String(fmt.Sprintf("%s-alb", cr.Name)),
        Subnets: []string{"subnet-abc123", "subnet-def456"}, // replace with real subnets
        Scheme: elbv2.LoadBalancerSchemeEnumInternetFacing,
        Type: elbv2.LoadBalancerTypeEnumApplication,
        IpAddressType: elbv2.IpAddressTypeIpv4,
        Tags: []types.Tag{
            {
                Key: aws.String("ManagedBy"),
                Value: aws.String("MyIngressOperator"),
            },
        },
    }

    // create ALB
    output, err := client.CreateLoadBalancer(ctx, input)
    if err != nil {
        return "", fmt.Errorf("failed to create ALB: %w", err)
    }

    if len(output.LoadBalancers) == 0 {
        return "", fmt.Errorf("ALB was not returned by AWS")
    }

    // return the DNS name of the ALB
    return aws.ToString(output.LoadBalancers[0].DNSName), nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the real &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v1.1/" rel="noopener noreferrer"&gt;AWS ALB Ingress Controller&lt;/a&gt;, the creation of an ALB is called in the &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/3241ca92eecaf0167a2ad9edad0ba09f9091ba73/pkg/aws/services/elbv2.go#L290" rel="noopener noreferrer"&gt;&lt;code&gt;elbv2.go&lt;/code&gt;&lt;/a&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
func (c *elbv2Client) CreateLoadBalancerWithContext(ctx context.Context, input *elasticloadbalancingv2.CreateLoadBalancerInput) (*elasticloadbalancingv2.CreateLoadBalancerOutput, error) {
  client, err := c.getClient(ctx, "CreateLoadBalancer")
  if err != nil {
    return nil, err
  }
  return client.CreateLoadBalancer(ctx, input)
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually, that’s all there is to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/kubernetes-what-are-the-kubernetes-operator-and-customresourcedefinition/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>todayilearned</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS: creating an OpenSearch Service cluster and configuring authentication and authorization</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Tue, 16 Sep 2025 06:55:43 +0000</pubDate>
      <link>https://forem.com/aws-heroes/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization-5aih</link>
      <guid>https://forem.com/aws-heroes/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization-5aih</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the previous part, &lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/" rel="noopener noreferrer"&gt;AWS: Getting Started with OpenSearch Service as a Vector Store&lt;/a&gt;, we looked at AWS OpenSearch Service in general, figured out how data is organized in it, what shards and nodes are, and what types of instances we actually need for data nodes.&lt;/p&gt;

&lt;p&gt;The next step is to create a cluster and look at authentication, which, in my opinion, is even more complicated than AWS EKS. Although, maybe it’s just a matter of habit.&lt;/p&gt;

&lt;p&gt;What we’re going to do today is manually create an AWS OpenSearch Service cluster, look at the main options for creating a cluster, and then dive into the settings for accessing the cluster and OpenSearch Dashboards with AWS IAM and Fine-grained access control of OpenSearch itself and its Security plugin.&lt;/p&gt;

&lt;p&gt;And in the next part, if I have time to write it, we’ll get to Terraform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manually creating a cluster in AWS Console&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;li&gt;Nodes&lt;/li&gt;
&lt;li&gt;Network&lt;/li&gt;
&lt;li&gt;Access &amp;amp;&amp;amp; permissions&lt;/li&gt;
&lt;li&gt;Authentication and authorization&lt;/li&gt;
&lt;li&gt;Configuring Domain Access policy&lt;/li&gt;
&lt;li&gt;Resource-based policy&lt;/li&gt;
&lt;li&gt;IP-based policies and access to the OpenSearch Dashboards&lt;/li&gt;
&lt;li&gt;Identity-based policy&lt;/li&gt;
&lt;li&gt;Fine-grained access control&lt;/li&gt;
&lt;li&gt;Configuring the Fine-grained access control&lt;/li&gt;
&lt;li&gt;Creating an OpenSearch Role&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Manually creating a cluster in AWS Console
&lt;/h3&gt;

&lt;p&gt;We will do a minimal PoC to play around, i.e., with t3 instances in one Availability Zone and without Master Nodes.&lt;/p&gt;

&lt;p&gt;In Production, we also plan to have one small cluster with three dev/staging/prod indexes as a vector store for AWS Bedrock Knowledge Base.&lt;/p&gt;

&lt;p&gt;Documentation from AWS — &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains" rel="noopener noreferrer"&gt;Creating OpenSearch Service domains&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Go to Amazon OpenSearch Service &amp;gt; Domains, click “Create domain”.&lt;/p&gt;

&lt;p&gt;Set a name, select “Standard create” to have access to all options:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwinj22oxyuf3m8670dxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwinj22oxyuf3m8670dxe.png" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In “Templates”, select “Dev/test” — then you can choose a configuration without Master Nodes and deploy in a single Availability Zone.&lt;/p&gt;

&lt;p&gt;In “Deployment option(s)”, select “Domain without standby”  —  then you will have access to &lt;code&gt;t3&lt;/code&gt; instances:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feav5m11apr0ztcdm1o50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feav5m11apr0ztcdm1o50.png" width="800" height="701"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The creation wizard conveniently shows us the entire setup right away.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;We discussed the number of shards per cluster in the previous post. Let’s assume that we plan to have a maximum of 20–30 GiB of data, so we will create 1 primary shard and 1 replica. But the shards will be configured later, when we create indexes with Terraform and &lt;a href="https://registry.terraform.io/providers/opensearch-project/opensearch/latest/docs/resources/index_template" rel="noopener noreferrer"&gt;&lt;code&gt;opensearch_index_template&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
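&lt;p&gt;As a rough sketch of what that later Terraform step could look like (the template name and index pattern here are illustrative, not from our setup):&lt;/p&gt;

```hcl
# Illustrative only: "vectors" and the "vectors-*" pattern are made-up names.
resource "opensearch_index_template" "vectors" {
  name = "vectors"
  body = jsonencode({
    index_patterns = ["vectors-*"]
    template = {
      settings = {
        # 1 primary shard + 1 replica, as planned above
        number_of_shards   = 1
        number_of_replicas = 1
      }
    }
  })
}
```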

&lt;p&gt;And for these two shards, we will create two Data Nodes — one for the primary shard and one for the replica.&lt;/p&gt;

&lt;p&gt;“Engine options” are described in &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/features-by-version.html" rel="noopener noreferrer"&gt;Features by engine version in Amazon OpenSearch Service&lt;/a&gt;. Just leave the default value, the latest version.&lt;/p&gt;

&lt;p&gt;For “Instance family” select “General purpose”, and for “Instance type,” select &lt;code&gt;t3.small.search&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For the “EBS storage size per node” we will take 50 GiB — 20–30 gigabytes for data and a little extra for the operating system itself:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxirojee502h2ql37orea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxirojee502h2ql37orea.png" width="800" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes
&lt;/h3&gt;

&lt;p&gt;Leave “Number of master nodes” and “Dedicated coordinator nodes” unchanged, i.e. without them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9v14vuasm3qn62mkdxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9v14vuasm3qn62mkdxn.png" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Network
&lt;/h3&gt;

&lt;p&gt;We are not changing anything in “Custom endpoint” yet, but later you can add your own domain from Route53 with a certificate from AWS Certificate Manager to access the cluster, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/customendpoint.html" rel="noopener noreferrer"&gt;Creating a custom endpoint for Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the “Network”, we are going with the simplest option for now, “Public access”, but for Production, we will do it inside the VPC:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7xlhoammsl999omwp9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7xlhoammsl999omwp9v.png" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, you will need to test access to Dashboards, because if the cluster is created in VPC subnets, IP-based policies cannot be applied to it, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html#vpc-security" rel="noopener noreferrer"&gt;About access policies on VPC domains&lt;/a&gt;. We will discuss IP-based policies further here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access &amp;amp;&amp;amp; permissions
&lt;/h3&gt;

&lt;p&gt;Fine-grained access control (FGAC) — we’ll disable it for now and take a closer look at this mechanism later. Although I’m not sure it will be necessary, because you can easily divide access to different indexes in a single cluster using IAM.&lt;/p&gt;

&lt;p&gt;SAML, JWT, and IAM Identity Center depend on FGAC, so we’ll skip them too, and I don’t plan to use them in the future, as they are not relevant to our case.&lt;/p&gt;

&lt;p&gt;Cognito is also out of the question — we don’t use it (although later, I may look into integrating with Auth0 or Cognito for Dashboards):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F804%2F0%2AhbO7zW7mPNoman55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F804%2F0%2AhbO7zW7mPNoman55.png" width="800" height="955"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The “Access policy” can be compared to an S3 Access Policy, or to an IAM Policy for EKS, which allows IAM users to access the cluster.&lt;/p&gt;

&lt;p&gt;We will discuss this in more detail in the section on authentication. For now, let’s just leave the default “Do not set domain level access policy” option selected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F774%2F0%2AmpgWqGKZT3lWsdrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F774%2F0%2AmpgWqGKZT3lWsdrf.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The “Off-peak window” is the time of lowest load for installing updates and performing Auto-tune operations.&lt;/p&gt;

&lt;p&gt;Our off-peak time will be at night in the US, so for Production it will be around 05:00 UTC, which is nighttime in US Central Time (CT).&lt;/p&gt;

&lt;p&gt;But since this is a test PoC, we’ll skip that too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/auto-tune.html" rel="noopener noreferrer"&gt;Auto-Tune&lt;/a&gt; is also well documented, but it is unavailable for our &lt;code&gt;t3&lt;/code&gt; instances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/service-software.html" rel="noopener noreferrer"&gt;Automatic software update&lt;/a&gt; is a useful feature for Production and will be performed at the time specified in the Off-peak window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F567%2F0%2A9sGqBzgPWWjyMdCc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F567%2F0%2A9sGqBzgPWWjyMdCc.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the “Advanced cluster settings” you can disable &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_AdvancedOptionsStatus.html" rel="noopener noreferrer"&gt;&lt;code&gt;rest.action.multi.allow_explicit_index&lt;/code&gt;&lt;/a&gt;, but I don't know how our queries will be built, and I think I read somewhere that it can break the Dashboard, so let's leave the default &lt;em&gt;enabled&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F679%2F0%2AvQLR88Gad9BnqDj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F679%2F0%2AvQLR88Gad9BnqDj4.png" width="679" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it, as a result we have the following setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2Af2mecNhF7oH29PWT.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F319%2F0%2Af2mecNhF7oH29PWT.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “Create” and go have some tea, because creating a cluster takes a long time — longer than EKS: in my case, the OpenSearch cluster took about 20 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication and authorization
&lt;/h3&gt;

&lt;p&gt;Now, perhaps, the most interesting part — users and access.&lt;/p&gt;

&lt;p&gt;After creating a cluster, by default we have limited access rights to the OpenSearch API itself:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AdiUVzukwjEOm9Alt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AdiUVzukwjEOm9Alt.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because in the “Security Configuration” we have an explicit Deny:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F841%2F0%2Ak6vHHTj76j_2c3oD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F841%2F0%2Ak6vHHTj76j_2c3oD.png" width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Access to AWS OpenSearch Service has three “levels” — network, IAM, and OpenSearch’s own Security Plugin.&lt;/p&gt;

&lt;p&gt;In IAM, we have two entities —  &lt;strong&gt;Domain Access Policy&lt;/strong&gt; , which we see in Security Configuration &amp;gt; Access Policy (attribute &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/opensearch_domain#access_policies-1" rel="noopener noreferrer"&gt;&lt;code&gt;access_policies&lt;/code&gt;&lt;/a&gt; in Terraform), and &lt;strong&gt;Identity-based policies&lt;/strong&gt; - which are regular AWS IAM Policies.&lt;/p&gt;

&lt;p&gt;If we talk about these levels in more detail, they look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network: the Network &amp;gt; VPC access or Public access parameter: we set the access limit at the network level (see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html" rel="noopener noreferrer"&gt;Launching your Amazon OpenSearch Service domains within a VPC&lt;/a&gt;)
&lt;ul&gt;
&lt;li&gt;or, if we take an analogy with EKS, these are Public and Private API endpoints, or with RDS, creating an instance in public or private subnets&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;AWS IAM:
&lt;ul&gt;
&lt;li&gt;Domain Access Policies:
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource" rel="noopener noreferrer"&gt;Resource-based policies&lt;/a&gt;: policies that are described directly in the cluster settings; access is set for an IAM Role, IAM User, or AWS Account to a specific OpenSearch domain&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-ip" rel="noopener noreferrer"&gt;IP-based policies&lt;/a&gt;: essentially the same as Resource-based policies, but with the ability to allow access without authentication for specific IPs (only if the access type is Public, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html#vpc-security" rel="noopener noreferrer"&gt;VPC versus public domains&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-identity" rel="noopener noreferrer"&gt;Identity-based policies&lt;/a&gt;: if Resource-based policies are part of the cluster’s security policy settings, then Identity-based policies are regular AWS IAM Policies that are added to a specific user or role&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html" rel="noopener noreferrer"&gt;Fine-grained access control&lt;/a&gt; (FGAC): OpenSearch’s own Security Plugin  —  the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/opensearch_domain#advanced_security_options-1" rel="noopener noreferrer"&gt;&lt;code&gt;advanced_security_options&lt;/code&gt;&lt;/a&gt; attribute in Terraform
&lt;ul&gt;
&lt;li&gt;if in Resource-based policies and Identity-based policies we set rules at the cluster (domain) and index levels, then in FGAC we can additionally describe restrictions on specific documents or fields&lt;/li&gt;
&lt;li&gt;and even if Resource-based policies and Identity-based policies allow access to a resource in the cluster, it can be “trimmed” through Fine-grained access control&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is, the authentication and authorization flow will be as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The AWS API receives a request from the user, for example &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/APIReference/Welcome.html" rel="noopener noreferrer"&gt;&lt;code&gt;es:ESHttpGet&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS IAM performs authentication: checks the access key ID and secret access key, or the session token&lt;/li&gt;
&lt;li&gt;AWS IAM performs authorization:
&lt;ul&gt;
&lt;li&gt;checks the user’s IAM Policy (&lt;strong&gt;Identity-based policy&lt;/strong&gt;); if there is an explicit Allow here, the check passes&lt;/li&gt;
&lt;li&gt;checks the cluster’s Domain Access Policy (&lt;strong&gt;Resource-based policy&lt;/strong&gt;); if there is an explicit Allow here, the check passes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The request reaches OpenSearch itself&lt;/li&gt;
&lt;li&gt;If Fine-grained access control is not enabled, the request is allowed&lt;/li&gt;
&lt;li&gt;If Fine-grained access control is configured, internal roles are checked, and if the user is allowed, the request is executed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s configure some access and see how it all works.&lt;/p&gt;
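&lt;p&gt;The flow above can be sketched as a small decision function. This is a hypothetical illustration, not an AWS API; it models same-account access, where an explicit Allow in either policy is enough at the IAM stage:&lt;/p&gt;

```python
# Hypothetical sketch of the authorization flow; not an AWS API.
def request_allowed(identity_policy_allows: bool,
                    domain_policy_allows: bool,
                    fgac_enabled: bool,
                    fgac_role_allows: bool) -> bool:
    # AWS IAM authorization: for same-account access, an explicit Allow
    # in EITHER the Identity-based policy or the Domain Access Policy
    # (Resource-based policy) lets the request through.
    if not (identity_policy_allows or domain_policy_allows):
        return False
    # The request reaches OpenSearch itself.
    if not fgac_enabled:
        return True
    # Fine-grained access control can still "trim" what IAM has granted.
    return fgac_role_allows
```

&lt;p&gt;So even a full &lt;code&gt;es:*&lt;/code&gt; Allow at the IAM level can still be rejected by an FGAC role.&lt;/p&gt;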

&lt;h3&gt;
  
  
  Configuring Domain Access policy
&lt;/h3&gt;

&lt;p&gt;The basic option is to add IAM User access to the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource-based policy
&lt;/h3&gt;

&lt;p&gt;Edit the “Access policy” and specify your user, API operation types, and domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::492***148:user/arseny.zinchenko"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:492***148:domain/test/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F639%2F0%2AlEFLvcRQNeVBOYjR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F639%2F0%2AlEFLvcRQNeVBOYjR.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait a minute or two for the change to apply, and we have access to the OpenSearch API (the Cluster health shown in the AWS Console is obtained from OpenSearch itself, see the &lt;a href="https://docs.opensearch.org/latest/api-reference/cluster-api/cluster-health/" rel="noopener noreferrer"&gt;Cluster Health API&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOr6UlHivW2wHAB3F.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOr6UlHivW2wHAB3F.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now we can use &lt;code&gt;curl&lt;/code&gt; and &lt;a href="https://curl.se/docs/manpage.html" rel="noopener noreferrer"&gt;&lt;code&gt;--aws-sigv4&lt;/code&gt;&lt;/a&gt; to access the cluster (see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html" rel="noopener noreferrer"&gt;Authenticating Requests (AWS Signature Version 4)&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl --aws-sigv4 "aws:amz:us-east-1:es" \
&amp;gt; --user "AKI ***B7A:pAu*** 2gW" \
&amp;gt; https://search-test-***.us-east-1.es.amazonaws.com/_cluster/health?pretty
{
  "cluster_name" : "492***148:test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
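&lt;p&gt;Under the hood, &lt;code&gt;--aws-sigv4&lt;/code&gt; signs every request with a key derived from the secret access key. The derivation chain itself is small enough to sketch with Python’s standard library; this follows the signing-key steps from the AWS SigV4 documentation, and is only the key-derivation part, not a full request signer:&lt;/p&gt;

```python
import hashlib
import hmac

def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    # SigV4 key derivation: kSecret -> kDate -> kRegion -> kService -> kSigning
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")

# For the curl call above the region is "us-east-1" and the service is "es";
# the secret key here is obviously a placeholder.
key = sigv4_signing_key("placeholder-secret", "20250101", "us-east-1", "es")
```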



&lt;h3&gt;
  
  
  IP-based policies and access to the OpenSearch Dashboards
&lt;/h3&gt;

&lt;p&gt;Similarly, through Domain Access Policy, we can open access to Dashboards — the simplest option, but it only works with Public domains. If the cluster is in VPC, additional authentication will be required, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/dashboards.html#dashboards-access" rel="noopener noreferrer"&gt;Controlling access to Dashboards&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Edit the policy, add condition &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_aws_deny-ip.html" rel="noopener noreferrer"&gt;&lt;code&gt;IpAddress.aws:SourceIp&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::492***148:user/arseny.zinchenko"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:492***148:domain/test/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:ESHttp*",
      "Resource": "arn:aws:es:us-east-1:492***148:domain/test/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "178. ***.***.184"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we have access to the Dashboards:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ai9atXE7MAu3yPeeA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ai9atXE7MAu3yPeeA.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
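&lt;p&gt;The &lt;code&gt;IpAddress&lt;/code&gt; / &lt;code&gt;aws:SourceIp&lt;/code&gt; condition is plain CIDR matching, and a bare address like the one above behaves as a /32. A quick sketch of that check with Python’s &lt;code&gt;ipaddress&lt;/code&gt; module (an illustration of the matching logic, not AWS’s implementation):&lt;/p&gt;

```python
import ipaddress

def source_ip_allowed(source_ip: str, condition_value: str) -> bool:
    # aws:SourceIp accepts a single address or a CIDR block;
    # a bare address behaves like a /32 network.
    if "/" not in condition_value:
        condition_value += "/32"
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(condition_value)
```

&lt;p&gt;To open access for a whole office network, you would put a CIDR block such as &lt;code&gt;178.0.0.0/24&lt;/code&gt; into the condition instead of a single address.&lt;/p&gt;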

&lt;h3&gt;
  
  
  Identity-based policy
&lt;/h3&gt;

&lt;p&gt;Now, the second option is to create a separate IAM User and attach a separate IAM Policy to it.&lt;/p&gt;

&lt;p&gt;Add a user in AWS IAM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AJCWVpo0xSuWcUtuY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AJCWVpo0xSuWcUtuY.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can just take one of the ready-made &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac-managed.html" rel="noopener noreferrer"&gt;AWS managed policies for Amazon OpenSearch Service&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A-HyXuvJI4Pc8lXy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A-HyXuvJI4Pc8lXy5.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we simply create access keys for the Command Line Interface (CLI) and, without changing anything in the cluster’s Access policy, check access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl --aws-sigv4 "aws:amz:us-east-1:es" --user "AKI ***YUK:fXV*** 34I" https://search-test-***.us-east-1.es.amazonaws.com/_cluster/health?pretty
{
  "cluster_name" : "492***148:test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now we have a Domain Access Policy that grants access specifically to my user, and there is a separate IAM Policy — an identity-based policy — that grants access to the test user.&lt;/p&gt;

&lt;p&gt;But there is one important point here: it matters whether the &lt;code&gt;Resource&lt;/code&gt; in the IAM Policy refers to the entire domain or only to its sub-resources.&lt;/p&gt;

&lt;p&gt;That is, if instead of the &lt;code&gt;AmazonOpenSearchServiceFullAccess&lt;/code&gt; policy we create our own policy in which we specify &lt;code&gt;"Resource": "***:domain/test/*"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "es:*"
            ],
            "Resource": "arn:aws:es:us-east-1:492***148:domain/test/*"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can execute &lt;code&gt;es:ESHttpGet&lt;/code&gt; (&lt;code&gt;GET _cluster/health&lt;/code&gt;), but we cannot execute cluster-level operations such as &lt;code&gt;es:AddTags&lt;/code&gt;, even though the &lt;code&gt;Action&lt;/code&gt; of the IAM policy allows all calls (&lt;code&gt;es:*&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws --profile test-os opensearch add-tags --arn arn:aws:es:us-east-1:492***148:domain/test --tag-list Key=environment,Value=test

An error occurred (AccessDeniedException) when calling the AddTags operation: User: arn:aws:iam::492 ***148:user/test-opesearch-identity-based-policy is not authorized to perform: es:AddTags on resource: arn:aws:es:us-east-1:492*** 148:domain/test because no identity-based policy allows the es:AddTags action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to allow such cluster-level operations, we need to include the domain ARN itself, &lt;code&gt;"arn:aws:es:us-east-1:492***148:domain/test"&lt;/code&gt; (without the &lt;code&gt;/*&lt;/code&gt; suffix), in &lt;code&gt;"Resource"&lt;/code&gt;, and then we can add tags.&lt;/p&gt;

&lt;p&gt;See all API actions in &lt;a href="https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonopensearchservice.html" rel="noopener noreferrer"&gt;Actions, resources, and condition keys for Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;
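&lt;p&gt;This behavior is just wildcard matching on the ARN: &lt;code&gt;domain/test/*&lt;/code&gt; covers sub-resources such as index and API paths, but not the bare domain ARN that calls like &lt;code&gt;es:AddTags&lt;/code&gt; target. A rough illustration (the helper is hypothetical; IAM’s &lt;code&gt;*&lt;/code&gt; matches any characters, including &lt;code&gt;/&lt;/code&gt;):&lt;/p&gt;

```python
from fnmatch import fnmatchcase

def resource_covered(policy_resource: str, request_arn: str) -> bool:
    # IAM's "*" in a Resource element matches any run of characters,
    # including "/", so shell-style glob matching is close enough here.
    return fnmatchcase(request_arn, policy_resource)

policy = "arn:aws:es:us-east-1:492***148:domain/test/*"
# an es:ESHttpGet against an API path is covered:
http_arn = "arn:aws:es:us-east-1:492***148:domain/test/_cluster/health"
# es:AddTags targets the bare domain ARN, which is NOT covered:
domain_arn = "arn:aws:es:us-east-1:492***148:domain/test"
```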

&lt;h3&gt;
  
  
  Fine-grained access control
&lt;/h3&gt;

&lt;p&gt;Documentation — &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html" rel="noopener noreferrer"&gt;Fine-grained access control in Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basic idea is very similar to Kubernetes RBAC.&lt;/p&gt;

&lt;p&gt;In OpenSearch, there are three main concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;users&lt;/strong&gt;  — like Kubernetes Users and ServiceAccounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;roles&lt;/strong&gt;  — like Kubernetes RBAC Roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mappings&lt;/strong&gt;  — like Kubernetes Role Bindings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users can be from both AWS IAM and the internal OpenSearch database.&lt;/p&gt;

&lt;p&gt;As in Kubernetes, OpenSearch has a set of default roles — see &lt;a href="https://docs.opensearch.org/1.0/security-plugin/access-control/users-roles/#predefined-roles" rel="noopener noreferrer"&gt;Predefined roles&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the same time, roles, as in Kubernetes, can be cluster-wide or index-specific, analogous to a ClusterRole and a namespaced Role in Kubernetes; in addition, OpenSearch FGAC can have document-level or field-level permissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the Fine-grained access control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Important note&lt;/strong&gt; : once FGAC is enabled, you will not be able to revert to the old scheme. However, all accesses from IAM will remain, even if you switch to the internal database.&lt;/p&gt;

&lt;p&gt;Edit “Security configuration” and enable “Fine-grained access control”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACrUCeN88_UIAzLAc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACrUCeN88_UIAzLAc.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we need to set up a Master user, which can be specified from IAM or created locally in OpenSearch.&lt;/p&gt;

&lt;p&gt;If we create a user via the “Create master user” option, we specify a regular username and password, and in this case OpenSearch will use its internal user database (&lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/opensearch_domain#internal_user_database_enabled-1" rel="noopener noreferrer"&gt;&lt;code&gt;internal_user_database_enabled&lt;/code&gt;&lt;/a&gt; in Terraform).&lt;/p&gt;

&lt;p&gt;If we use the internal OpenSearch database, we can have regular users and perform HTTP basic authentication. See the AWS documentation — &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac-http-auth.html" rel="noopener noreferrer"&gt;Tutorial: Configure a domain with the internal user database and HTTP basic authentication&lt;/a&gt; and &lt;a href="https://docs.opensearch.org/latest/security/access-control/users-roles/" rel="noopener noreferrer"&gt;Defining users and roles&lt;/a&gt; in the OpenSearch documentation itself, as these are its internal mechanisms.&lt;/p&gt;

&lt;p&gt;This makes sense if you don’t want to use Cognito or SAML, and if each cluster will have its own user settings.&lt;/p&gt;

&lt;p&gt;If you set an IAM user, the scheme will be similar to RDS with IAM database authentication: access to the cluster is controlled by AWS IAM, but internal access to schemas and databases is controlled by PostgreSQL or MariaDB roles, see &lt;a href="https://rtfm.co.ua/aws-rds-z-iam-database-authentication-eks-pod-identities-ta-terraform/" rel="noopener noreferrer"&gt;AWS: RDS with IAM database authentication, EKS Pod Identities, and Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this case, AWS IAM will only perform user authentication, while authorization (access rights verification) will be handled by the Security plugin and OpenSearch roles.&lt;/p&gt;

&lt;p&gt;Let’s try a local database, and I think we’ll use this scheme in Production as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F820%2F0%2AvA5RDXZGFalLk_NY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F820%2F0%2AvA5RDXZGFalLk_NY.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can leave “Access Policy” as it is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F820%2F0%2AtQPRfTfRaSEFKC3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F820%2F0%2AtQPRfTfRaSEFKC3s.png" width="800" height="813"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Switching to the internal database will take some time because it will trigger a blue/green deployment of the new cluster, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-configuration-changes.html" rel="noopener noreferrer"&gt;Making configuration changes in Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And it did take a long time, &lt;strong&gt;more than an hour&lt;/strong&gt;, even though the cluster holds no data of ours.&lt;/p&gt;

&lt;p&gt;Once the changes are applied, Dashboards will now ask for a login and password. Use our Master user:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F380%2F0%2AZ86bTUcjCpD4QOfG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F380%2F0%2AZ86bTUcjCpD4QOfG.png" width="380" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The master user receives two connected roles: &lt;code&gt;all_access&lt;/code&gt; and &lt;code&gt;security_manager&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It is &lt;code&gt;security_manager&lt;/code&gt; that provides access to the Security and Users sections in the dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F822%2F0%2AMWquXL1t7t1nvuon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F822%2F0%2AMWquXL1t7t1nvuon.png" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ADKi2Y9yfb_hL_8Sc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ADKi2Y9yfb_hL_8Sc.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, we still have access for our IAM users, and we can continue to use &lt;code&gt;curl&lt;/code&gt;: IAM users are mapped to the &lt;code&gt;default_role&lt;/code&gt;, which allows GET/PUT on all indexes, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html" rel="noopener noreferrer"&gt;About the default_role&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AQTycy0HCfNYZgwnB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AQTycy0HCfNYZgwnB.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s check our test user’s access now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl --aws-sigv4 "aws:amz:us-east-1:es" --user "AKI ***YUK:fXV*** 34I" https://search-test-***.us-east-1.es.amazonaws.com/_cluster/health?pretty
{
  "cluster_name" : "492***148:test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s cut off access to all IAM users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating an OpenSearch Role
&lt;/h3&gt;

&lt;p&gt;To see how it works, let’s add a test index and map our test user with access to this index.&lt;/p&gt;

&lt;p&gt;Add the index:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APQUDJumbZSXrBFmU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APQUDJumbZSXrBFmU.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F866%2F0%2A60U1pDJYwQYxHBvV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F866%2F0%2A60U1pDJYwQYxHBvV.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to Security &amp;gt; Roles, add a role:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AMLrRYcAFOBA8vmZl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AMLrRYcAFOBA8vmZl.png" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set Index permissions — full access to the index (&lt;code&gt;crud&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AhToY-wA5E8fBqeLf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AhToY-wA5E8fBqeLf.png" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;
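&lt;p&gt;The same role can also be created without the UI, through the Security plugin REST API (&lt;code&gt;PUT _plugins/_security/api/roles/&amp;lt;role-name&amp;gt;&lt;/code&gt;). A sketch of the request body, with the index pattern and the &lt;code&gt;crud&lt;/code&gt; action group mirroring what we selected in Dashboards; the role name and exact index pattern are illustrative:&lt;/p&gt;

```python
import json

# Request body for PUT _plugins/_security/api/roles/test-role
# (the role name "test-role" is illustrative).
role_body = {
    "cluster_permissions": [],
    "index_permissions": [
        {
            "index_patterns": ["test-allowed-index"],
            "allowed_actions": ["crud"],  # the action group picked in Dashboards
        }
    ],
}
payload = json.dumps(role_body)
```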

&lt;p&gt;Next, in this role, we move on to Mapped users &amp;gt; Map users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APECCieUOba6ggmoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2APECCieUOba6ggmoq.png" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And add the ARN of our test user:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F924%2F0%2AwnJsiGZqHWh-TdxF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F924%2F0%2AwnJsiGZqHWh-TdxF.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delete the default role:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A6dhZJhZTvhPEGaE9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A6dhZJhZTvhPEGaE9.png" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now our user has no access to &lt;code&gt;GET _cluster/health&lt;/code&gt;; here we get a &lt;strong&gt;403, no permissions&lt;/strong&gt; error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl --aws-sigv4 "aws:amz:us-east-1:es" --user "AKI ***YUK:fXV*** 34I" https://search-test-***.us-east-1.es.amazonaws.com/_cluster/health?pretty
{
  "error" : {
    ...
    "type" : "security_exception",
    "reason" : "no permissions for [cluster:monitor/health] and User [name=arn:aws:iam::492***148:user/test-opesearch-identity-based-policy, backend_roles=[], requestedTenant=null]"
  },
  "status" : 403
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But it still has access to the test index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl --aws-sigv4 "aws:amz:us-east-1:es" --user "AKI ***YUK:fXV*** 34I" https://search-test-***.us-east-1.es.amazonaws.com/test-allowed-index/_search?pretty -d '{
    "query": {
      "match_all": {}
    }
  }' -H 'Content-Type: application/json'
{
  "took" : 78,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : []
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>Kubernetes: Kubernetes API, API groups, CRDs, and the etcd</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Mon, 15 Sep 2025 22:41:13 +0000</pubDate>
      <link>https://forem.com/setevoy/kubernetes-kubernetes-api-api-groups-crds-and-the-etcd-1can</link>
      <guid>https://forem.com/setevoy/kubernetes-kubernetes-api-api-groups-crds-and-the-etcd-1can</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I actually started to write about creating my own Kubernetes Operator, but decided to make a separate topic about what a Kubernetes CustomResourceDefinition is, and how creating a CRD works at the level of the Kubernetes API and the &lt;code&gt;etcd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is, to start with how Kubernetes actually works with resources, and what happens when we create or edit resources.&lt;/p&gt;

&lt;p&gt;The second part: &lt;a href="https://rtfm.co.ua/kubernetes-shho-take-kubernetes-operator-ta-customresourcedefinition/" rel="noopener noreferrer"&gt;Kubernetes: what is Kubernetes Operator and CustomResourceDefinition&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes API&lt;/li&gt;
&lt;li&gt;Kubernetes API Groups and Kind&lt;/li&gt;
&lt;li&gt;Kubernetes and etcd&lt;/li&gt;
&lt;li&gt;CustomResourceDefinitions and Kubernetes API&lt;/li&gt;
&lt;li&gt;Kubernetes API Service&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes API
&lt;/h3&gt;

&lt;p&gt;So, all communication with the Kubernetes Control Plane takes place through its main endpoint — the Kubernetes API, which is a component of the Kubernetes Control Plane — see &lt;a href="https://kubernetes.io/docs/concepts/architecture/" rel="noopener noreferrer"&gt;Cluster Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Documentation — &lt;a href="https://kubernetes.io/docs/concepts/overview/kubernetes-api/" rel="noopener noreferrer"&gt;The Kubernetes API&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/reference/using-api/api-concepts/" rel="noopener noreferrer"&gt;Kubernetes API Concepts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Through the API, we communicate with Kubernetes, and all resources and information about them are stored in its database, &lt;code&gt;etcd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Other components of the Control Plane are the &lt;a href="https://medium.com/@nuwanwe/understanding-the-kubernetes-controller-manager-b34db5b92dd4" rel="noopener noreferrer"&gt;Kube Controller Manager&lt;/a&gt; with a set of default controllers that are responsible for working with resources, and the Scheduler, which is responsible for how resources will be placed on Worker Nodes.&lt;/p&gt;

&lt;p&gt;The Kubernetes API is just a regular HTTPS REST API that we can access even from &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To access the cluster, we can use &lt;code&gt;kubectl proxy&lt;/code&gt;, which will take the parameters from &lt;code&gt;~/.kube/config&lt;/code&gt; with the API Server address and token, and create a tunnel to it.&lt;/p&gt;

&lt;p&gt;I have access to AWS EKS configured, so the connection will go to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl proxy --port=8080 
Starting to serve on 127.0.0.1:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we turn to the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080 | jq
{
  "paths": [
    "/.well-known/openid-configuration",
    "/api",
    "/api/v1",
    "/apis",
    ...
    "/version"
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually, what we see is a list of API endpoints supported by the Kubernetes API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/api/&lt;/code&gt;: information on the Kubernetes API itself and the entry point to the core API Groups (see below)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/api/v1&lt;/code&gt;: core API group with Pods, ConfigMaps, Services, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/apis/&lt;/code&gt;: APIGroupList - the rest of the API Groups in the system and their versions, including API Groups created from different CRDs&lt;/li&gt;
&lt;li&gt;for example, for the API Group &lt;code&gt;operator.victoriametrics.com&lt;/code&gt; we can see support for two versions - "&lt;code&gt;operator.victoriametrics.com/v1&lt;/code&gt;" "&lt;code&gt;operator.victoriametrics.com/v1beta1&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/version&lt;/code&gt;: information on the cluster version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then we can go deeper and see what’s inside each endpoint, for example, to get information about all Pods in the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/api/v1/pods | jq
...
    {
      "metadata": {
        "name": "backend-ws-deployment-6db58cc97c-k56lm",
      ...
        "namespace": "staging-backend-api-ns"
        "labels": {
          "app": "backend-ws",
          "component": "backend",
      ...
      "spec": {
        "volumes": [
          {
            "name": "eks-pod-identity-token",
      ...
        "containers": [
          {
            "name": "backend-ws-container",
            "image": "492***148.dkr.ecr.us-east-1.amazonaws.com/challenge-backend-api:v0.171.9",
            "command": [
              "gunicorn",
              "websockets_backend.run_api:app",
      ...
            "resources": {
              "requests": {
                "cpu": "200m",
                "memory": "512Mi"
              }
            },
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we can see information about the Pod named “&lt;em&gt;backend-ws-deployment-6db58cc97c-k56lm&lt;/em&gt;”, which lives in the Kubernetes Namespace “&lt;em&gt;staging-backend-api-ns&lt;/em&gt;”, and the rest of the information about it: its volumes, containers, resources, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes API Groups and Kind
&lt;/h3&gt;

&lt;p&gt;API Groups are a way to organize resources in Kubernetes: resources are grouped by group, version, and resource type (Kind).&lt;/p&gt;

&lt;p&gt;That is the structure of the API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Group&lt;/li&gt;
&lt;li&gt;versions&lt;/li&gt;
&lt;li&gt;kind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, in &lt;code&gt;/api/v1&lt;/code&gt; we see the Kubernetes Core API Group, in &lt;code&gt;/apis&lt;/code&gt; - API Groups &lt;code&gt;apps&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, &lt;code&gt;events&lt;/code&gt;, and so on.&lt;/p&gt;

&lt;p&gt;The structure will be as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/apis/&amp;lt;group&amp;gt;&lt;/code&gt; - the group itself and its versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/apis/&amp;lt;group&amp;gt;/&amp;lt;version&amp;gt;&lt;/code&gt; - a specific version of the group with specific resources (Kind)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/apis/&amp;lt;group&amp;gt;/&amp;lt;version&amp;gt;/&amp;lt;resource&amp;gt;&lt;/code&gt; - access to a specific resource and objects in it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: Kind vs resource: Kind is the name of the resource type as it appears in the resource schema, while resource is the name used to build the URI when requesting the API Server.&lt;/em&gt;&lt;/p&gt;
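&lt;p&gt;To make the distinction concrete, here is a tiny illustrative Python helper (the function name and signature are my own, not a Kubernetes API) that builds a request URI from the resource name rather than the Kind:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resource_uri(group, version, resource, namespace=None, name=None):
    """Build a Kubernetes API path from the resource name (not the Kind)."""
    # the core group lives under /api/v1, all other groups under /apis/GROUP/VERSION
    base = "/api/v1" if group == "" else f"/apis/{group}/{version}"
    parts = [base]
    if namespace:
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name:
        parts.append(name)
    return "/".join(parts)

# "Deployment" is the Kind, but the URI uses the resource name "deployments":
print(resource_uri("apps", "v1", "deployments", namespace="default"))
# /apis/apps/v1/namespaces/default/deployments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;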

&lt;p&gt;For example, for the API Group &lt;code&gt;apps&lt;/code&gt; we have the version &lt;code&gt;v1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/apis/apps | jq
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "apps",
  "versions": [
    {
      "groupVersion": "apps/v1",
      "version": "v1"
    }
  ],
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And inside the version — the resources, for example, &lt;code&gt;deployments&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/apis/apps/v1 | jq
{
...
    {
      "name": "deployments",
      "singularName": "deployment",
      "namespaced": true,
      "kind": "Deployment",
      "verbs": [
        "create",
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "update",
        "watch"
      ],
      "shortNames": [
        "deploy"
      ],
      "categories": [
        "all"
      ],
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And using this group, version, and specific resource type (kind), we get all the objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/apis/apps/v1/deployments/ | jq
{
  "kind": "DeploymentList",
  "apiVersion": "apps/v1",
  "metadata": {
    "resourceVersion": "1534"
  },
  "items": [
    {
      "metadata": {
        "name": "coredns",
        "namespace": "kube-system",
        "uid": "9d7f6de3-041e-4afe-84f4-e124d2cc6e8a",
        "resourceVersion": "709",
        "generation": 2,
        "creationTimestamp": "2025-07-12T10:15:33Z",
        "labels": {
          "k8s-app": "kube-dns"
        },
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, so we’ve accessed the API — but where does it get all that data that we’re being shown?&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes and &lt;code&gt;etcd&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;For storing data in Kubernetes, we have another key component of the Control Plane  —  &lt;a href="https://etcd.io/" rel="noopener noreferrer"&gt;&lt;code&gt;etcd&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Actually, this is just a key:value database with all the data that forms our cluster — all its settings, all resources, all states of these resources, RBAC rules, etc.&lt;/p&gt;

&lt;p&gt;When the Kubernetes API Server receives a request, for example, &lt;code&gt;POST /apis/apps/v1/namespaces/default/deployments&lt;/code&gt;, it first checks if the object matches the resource schema (validation), and only then saves it to &lt;code&gt;etcd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;etcd&lt;/code&gt; database consists of a set of keys. For example, a Pod named &lt;em&gt;"nginx-abc&lt;/em&gt;" will be stored in a key named &lt;code&gt;/registry/pods/default/nginx-abc&lt;/code&gt;.&lt;/p&gt;
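&lt;p&gt;The validate-then-store flow and this key scheme can be sketched in a few lines of Python (heavily simplified: real validation uses the full OpenAPI schema, and the actual etcd write is Protobuf-encoded; the function here is purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REQUIRED_FIELDS = {"apiVersion", "kind", "metadata"}

def store_object(db, obj):
    """Minimally validate an object, then save it under a /registry key."""
    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        raise ValueError(f"invalid object, missing fields: {sorted(missing)}")
    resource = obj["kind"].lower() + "s"   # naive pluralization: Pod becomes pods
    ns = obj["metadata"].get("namespace", "default")
    key = f"/registry/{resource}/{ns}/{obj['metadata']['name']}"
    db[key] = obj                          # stand-in for the etcd write
    return key

etcd = {}
print(store_object(etcd, {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nginx-abc"},
}))
# /registry/pods/default/nginx-abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;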

&lt;p&gt;See the documentation &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/" rel="noopener noreferrer"&gt;Operating etcd clusters for Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In AWS EKS, we don’t have access to &lt;code&gt;etcd&lt;/code&gt; (and that's a good thing), but we can start Minikube and have a look at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ minikube start
...
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the system pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
coredns-674b8bbfcf-68q8p 0/1 ContainerCreating 0 57s
etcd-minikube 1/1 Running 0 62s
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect to the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ minikube ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we had started Minikube with &lt;code&gt;minikube start --driver=virtualbox&lt;/code&gt;, then &lt;code&gt;minikube ssh&lt;/code&gt; would take us into the VirtualBox instance.&lt;/p&gt;

&lt;p&gt;But since we have the default &lt;code&gt;docker&lt;/code&gt; driver, we simply enter the &lt;code&gt;minikube&lt;/code&gt; container.&lt;/p&gt;

&lt;p&gt;Install &lt;code&gt;etcd&lt;/code&gt; here to get the &lt;code&gt;etcdctl&lt;/code&gt; CLI utility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker@minikube:~$ sudo apt update 
docker@minikube:~$ sudo apt install etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker@minikube:~$ etcdctl -version 
etcdctl version: 3.3.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we can see what’s in the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker@minikube:~$ sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/minikube/certs/etcd/ca.crt \
  --cert=/var/lib/minikube/certs/etcd/server.crt \
  --key=/var/lib/minikube/certs/etcd/server.key \
  get "" --prefix --keys-only
...
/registry/namespaces/kube-system
/registry/pods/kube-system/coredns-674b8bbfcf-68q8p
/registry/pods/kube-system/etcd-minikube
...
/registry/services/endpoints/default/kubernetes
/registry/services/endpoints/kube-system/kube-dns
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data in the keys is stored in the Protobuf (&lt;a href="https://github.com/protocolbuffers/protobuf" rel="noopener noreferrer"&gt;Protocol Buffers&lt;/a&gt;) format, so with a plain &lt;code&gt;etcdctl get KEY&lt;/code&gt; the data will look somewhat garbled.&lt;/p&gt;

&lt;p&gt;Let’s see what is in the database about the Pod of &lt;code&gt;etcd&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker@minikube:~$ sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/var/lib/minikube/certs/etcd/ca.crt --cert=/var/lib/minikube/certs/etcd/server.crt --key=/var/lib/minikube/certs/etcd/server.key get "/registry/pods/kube-system/etcd-minikube"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fls9d4aorbnk2xzva17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fls9d4aorbnk2xzva17.png" width="800" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OK.&lt;/p&gt;

&lt;h3&gt;
  
  
  CustomResourceDefinitions and Kubernetes API
&lt;/h3&gt;

&lt;p&gt;So, when we create a CRD, we extend the Kubernetes API by creating our own API Group with our own name, version, and a new resource type (Kind) that is described in the CRD.&lt;/p&gt;

&lt;p&gt;Documentation — &lt;a href="https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/" rel="noopener noreferrer"&gt;Extend the Kubernetes API with CustomResourceDefinitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s write a simple CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.mycompany.com
spec:
  group: mycompany.com
  names:
    kind: MyApp
    plural: myapps
    singular: myapp
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use the existing API Group &lt;code&gt;apiextensions.k8s.io&lt;/code&gt; and version &lt;code&gt;v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;from it take the schema of the CustomResourceDefinition object&lt;/li&gt;
&lt;li&gt;and based on this schema, we create our own API Group named &lt;code&gt;mycompany.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;in this API Group, we describe a single resource type  — &lt;code&gt; kind: MyApp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;and one version  — &lt;code&gt; v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;then using &lt;code&gt;openAPIV3Schema&lt;/code&gt; we describe the schema of our resource - what fields it has and their types, and here you can also set default values (see &lt;a href="https://swagger.io/specification/" rel="noopener noreferrer"&gt;OpenAPI Specification&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this CRD, we will be able to create new Custom Resources with a manifest in which we pass the &lt;code&gt;apiVersion&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, and &lt;code&gt;spec.image&lt;/code&gt; fields from the &lt;code&gt;schema.openAPIV3Schema.properties.spec.properties.image&lt;/code&gt; of our CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: mycompany.com/v1
kind: MyApp
metadata:
  name: example
spec:
  image: nginx:1.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-crd.yaml 
customresourcedefinition.apiextensions.k8s.io/myapps.mycompany.com created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check in the Kubernetes API (you can use the &lt;code&gt;| jq '.groups[] | select(.name == "mycompany.com")'&lt;/code&gt; selector):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/apis/ | jq
...
{
  "name": "mycompany.com",
  "versions": [
    {
      "groupVersion": "mycompany.com/v1",
      "version": "v1"
    }
  ],
  ...
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the API Group &lt;code&gt;mycompany.com&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:8080/apis/mycompany.com/v1 | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "mycompany.com/v1",
  "resources": [
    {
      "name": "myapps",
      "singularName": "myapp",
      "namespaced": true,
      "kind": "MyApp",
      "verbs": [
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "create",
        "update",
        "watch"
      ],
      "storageVersionHash": "MZjF6nKlCOU="
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the &lt;code&gt;etcd&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker@minikube:~$ sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/var/lib/minikube/certs/etcd/ca.crt --cert=/var/lib/minikube/certs/etcd/server.crt --key=/var/lib/minikube/certs/etcd/server.key get "" --prefix --keys-only
/registry/apiextensions.k8s.io/customresourcedefinitions/myapps.mycompany.com
...
/registry/apiregistration.k8s.io/apiservices/v1.mycompany.com
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;/registry/apiextensions.k8s.io/customresourcedefinitions/myapps.mycompany.com&lt;/code&gt; key stores information about the new CRD itself - the CRD structure, its OpenAPI schema, versions, etc., and the &lt;code&gt;/registry/apiregistration.k8s.io/apiservices/v1.mycompany.com&lt;/code&gt; key registers the API Service for this group so it can be accessed via the Kubernetes API.&lt;/p&gt;

&lt;p&gt;And of course, we can see the CRD with &lt;code&gt;kubectl&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get crd
NAME                   CREATED AT
myapps.mycompany.com   2025-07-12T11:23:19Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create the CustomResource itself from the manifest we wrote above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk apply -f test-resource.yaml
myapp.mycompany.com/example created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Test it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe MyApp
Name:         example
Namespace:    default
Labels:       &amp;lt;none&amp;gt;
Annotations:  &amp;lt;none&amp;gt;
API Version:  mycompany.com/v1
Kind:         MyApp
Metadata:
  Creation Timestamp:  2025-07-12T13:34:52Z
  Generation:          1
  Resource Version:    4611
  UID:                 a88e37fd-1477-4a7e-8c00-46c925f510ac
Spec:
  Image:  nginx:1.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But for now this is just data in &lt;code&gt;etcd&lt;/code&gt; - we don't have any real Pod resources, because there is no controller that handles resources of &lt;code&gt;Kind: MyApp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: looking ahead to the next post: a Kubernetes Operator is essentially a set of CRDs plus a controller that “controls” resources of the specified Kind.&lt;/em&gt;&lt;/p&gt;
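&lt;p&gt;Looking ahead a bit more: the missing controller's job can be sketched as a reconcile function that compares the desired state (our &lt;code&gt;MyApp&lt;/code&gt; objects) with the actual state. This is an illustrative sketch only; a real controller watches the API Server instead of comparing plain dicts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reconcile(desired, actual):
    """Return the actions needed to make 'actual' match 'desired'."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

desired = {"example": {"image": "nginx:1.25"}}   # from the MyApp objects
actual = {}                                       # nothing is running yet
print(reconcile(desired, actual))
# [('create', 'example', {'image': 'nginx:1.25'})]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;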

&lt;h3&gt;
  
  
  Kubernetes API Service
&lt;/h3&gt;

&lt;p&gt;When we add a new CRD, Kubernetes not only has to create a new key in &lt;code&gt;etcd&lt;/code&gt; with the new API Group and the corresponding resource schema, but also add a new endpoint to its routes - as we do in Python with &lt;code&gt;@app.get("/")&lt;/code&gt; in FastAPI - so that the API Server knows that a &lt;code&gt;GET&lt;/code&gt; request to &lt;code&gt;/apis/mycompany.com/v1/myapps&lt;/code&gt; should return resources of this type.&lt;/p&gt;
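&lt;p&gt;That route registration can be sketched with a plain dict-based router (purely illustrative; the real API Server wires these routes through its aggregation layer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;routes = {}

def register_api_group(group, version, resource):
    """Add a handler for GET /apis/GROUP/VERSION/RESOURCE."""
    path = f"/apis/{group}/{version}/{resource}"
    routes[path] = lambda: {"kind": "List", "items": []}
    return path

# creating the CRD effectively does this registration for us:
path = register_api_group("mycompany.com", "v1", "myapps")
print(path)             # /apis/mycompany.com/v1/myapps
print(routes[path]())   # {'kind': 'List', 'items': []}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;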

&lt;p&gt;The corresponding API Service will contain a &lt;code&gt;spec&lt;/code&gt; with the group and version:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&lt;/code&gt;&lt;code&gt;&lt;br&gt;
$ kk get apiservice v1.mycompany.com -o yaml&lt;br&gt;
apiVersion: apiregistration.k8s.io/v1&lt;br&gt;
kind: APIService&lt;br&gt;
metadata:&lt;br&gt;
  creationTimestamp: "2025-07-12T11:53:52Z"&lt;br&gt;
  labels:&lt;br&gt;
    kube-aggregator.kubernetes.io/automanaged: "true"&lt;br&gt;
  name: v1.mycompany.com&lt;br&gt;
  resourceVersion: "2632"&lt;br&gt;
  uid: 26fc8c6b-6770-422f-8996-3f35d86be6c7&lt;br&gt;
spec:&lt;br&gt;
  group: mycompany.com&lt;br&gt;
  groupPriorityMinimum: 1000&lt;br&gt;
  version: v1&lt;br&gt;
  versionPriority: 100&lt;br&gt;
...&lt;br&gt;
&lt;/code&gt;&lt;code&gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That is, when we create a new CRD, Kubernetes API Server creates an API Service (writing it to &lt;code&gt;/registry/apiregistration.k8s.io/apiservices/v1.mycompany.com&lt;/code&gt;), and adds it to its own routers in the &lt;code&gt;/apis&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;And now, having an idea of what the API looks like and the database that stores all the resources, we can move on to creating the CRD and controller, that is, to actually write the Operator itself.&lt;/p&gt;

&lt;p&gt;But this is already in the next part.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/kubernetes-kubernetes-api-api-groups-crds-and-the-etcd/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>Kubernetes: Pod resources.requests, resources.limits and Linux cgroup</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Sun, 14 Sep 2025 22:15:02 +0000</pubDate>
      <link>https://forem.com/setevoy/kubernetes-pod-resourcesrequests-resourceslimits-and-linux-cgroup-4ggp</link>
      <guid>https://forem.com/setevoy/kubernetes-pod-resourcesrequests-resourceslimits-and-linux-cgroup-4ggp</guid>
      <description>&lt;h3&gt;
  
  
  Kubernetes: Pod resources.requests, resources.limits, and Linux cgroups
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijo0rv8bgvtv6m6x4z6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How exactly do &lt;code&gt;resources.requests&lt;/code&gt; and &lt;code&gt;resources.limits&lt;/code&gt; in a Kubernetes manifest work "under the hood", and how exactly does Linux allocate and limit resources for containers?&lt;/p&gt;

&lt;p&gt;So, in Kubernetes for Pods, we can set two main parameters for CPU and Memory  —  the &lt;code&gt;spec.containers.resources.requests&lt;/code&gt; and &lt;code&gt;spec.containers.resources.limits&lt;/code&gt; fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;resources.requests&lt;/code&gt;: affects how and where a Pod will be scheduled, and how many resources it is guaranteed to receive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resources.limits&lt;/code&gt;: affects how many resources it can consume at most&lt;/li&gt;
&lt;li&gt;if memory usage grows above &lt;code&gt;resources.limits.memory&lt;/code&gt; - the Pod &lt;strong&gt;&lt;em&gt;can be&lt;/em&gt;&lt;/strong&gt; killed by the OOM Killer, especially when the Worker Node does not have enough free memory (the Node Memory Pressure state)&lt;/li&gt;
&lt;li&gt;if CPU usage goes above &lt;code&gt;resources.limits.cpu&lt;/code&gt; - CPU throttling mode will be enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While everything is quite clear with Memory (we just set a number of bytes), with CPU things are a little more interesting.&lt;/p&gt;

&lt;p&gt;So first, let’s take a look at how the Linux kernel generally plans how much CPU time will be allocated to each process using the Control Groups mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Linux cgroups&lt;/li&gt;
&lt;li&gt;The /sys/fs/cgroup/ directory&lt;/li&gt;
&lt;li&gt;CPU and Memory in cgroups, and cgroups v1 vs cgroups v2&lt;/li&gt;
&lt;li&gt;Why Kubernetes CPU Limits may be a bad idea&lt;/li&gt;
&lt;li&gt;Linux CFS and cpu.weight&lt;/li&gt;
&lt;li&gt;cpu.weight vs process nice&lt;/li&gt;
&lt;li&gt;Linux cgroups summary&lt;/li&gt;
&lt;li&gt;Kubernetes Pod resources and Linux cgroups&lt;/li&gt;
&lt;li&gt;Kubernetes CPU Unit vs cgroup CPU share&lt;/li&gt;
&lt;li&gt;Checking Kubernetes Pod resources in cgroup&lt;/li&gt;
&lt;li&gt;Kubernetes kubepods.slice cgroup&lt;/li&gt;
&lt;li&gt;Kubernetes, cpu.weight and cgroups v2&lt;/li&gt;
&lt;li&gt;Kubernetes Quality of Service Classes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Linux &lt;code&gt;cgroups&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Linux &lt;a href="https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html" rel="noopener noreferrer"&gt;Control Groups&lt;/a&gt; (&lt;code&gt;cgroups&lt;/code&gt;) is one of the two main kernel mechanisms that provide isolation and control over processes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linux namespaces&lt;/strong&gt; : create an isolated namespace with its own process tree (PID Namespace), network interfaces (net namespace), User IDs (User namespace), and so on — see &lt;a href="https://rtfm.co.ua/ru/what-is-linux-namespaces-primery-na-c-clone-pid-i-net-namespaces/" rel="noopener noreferrer"&gt;What is: Linux namespaces, examples of PID and Network namespaces&lt;/a&gt; (in rus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux cgroups&lt;/strong&gt; : a mechanism for controlling resources by processes — how much memory, CPU, network resources and disk I/O operations will be available to the process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Groups&lt;/em&gt; in the name — because all processes are grouped in a parent-child tree.&lt;/p&gt;

&lt;p&gt;Therefore, if a limit of 512 megabytes is set for a parent process, then the sum of the available memory of it and its children cannot exceed 512 MB.&lt;/p&gt;
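&lt;p&gt;This parent-child rule can be sketched as taking the minimum limit along the cgroup path ("&lt;code&gt;max&lt;/code&gt;" meaning unlimited); a simplified illustration, not the kernel's actual accounting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def effective_limit(limits):
    """limits: memory.max values from root to leaf, e.g. ["max", 512, 1024]."""
    numeric = [l for l in limits if l != "max"]
    return min(numeric) if numeric else "max"

# a child asking for 1024 MB under a 512 MB parent is still capped at 512 MB:
print(effective_limit(["max", 512, 1024]))
# 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;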

&lt;p&gt;All groups are defined in the &lt;code&gt;/sys/fs/cgroup/&lt;/code&gt; directory, which is mounted as a separate file system type - &lt;code&gt;cgroup2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mount | grep cgro
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cgroups&lt;/code&gt; has an older version 1 and a newer version 2, see man &lt;a href="https://man7.org/linux/man-pages/man7/cgroups.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;cgroups&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In fact, &lt;code&gt;cgroups&lt;/code&gt; v2 is already the standard, so we will focus on it - but &lt;code&gt;cgroups&lt;/code&gt; v1 still comes up when working with Kubernetes.&lt;/p&gt;

&lt;p&gt;You can check the version using &lt;code&gt;stat&lt;/code&gt; and the &lt;code&gt;/sys/fs/cgroup/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ stat -fc %T /sys/fs/cgroup/ 
cgroup2fs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there is &lt;code&gt;tmpfs&lt;/code&gt; here, then this is &lt;code&gt;cgroups&lt;/code&gt; v1.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;/sys/fs/cgroup/&lt;/code&gt; directory
&lt;/h3&gt;

&lt;p&gt;A typical view of a directory on a Linux host — here’s an example from my home laptop running Arch Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tree /sys/fs/cgroup/ -d -L 2
/sys/fs/cgroup/
├── dev-hugepages.mount
├── dev-mqueue.mount
├── init.scope
...
├── system.slice
│ ├── NetworkManager.service
│ ├── bluetooth.service
│ ├── bolt.service
    ...
└── user.slice
    └── user-1000.slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same hierarchy can be seen with &lt;code&gt;systemctl status&lt;/code&gt; or &lt;code&gt;systemd-cgls&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;systemctl status&lt;/code&gt;, the tree looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ systemctl status
● setevoy-work
    State: running
    ...
    Since: Mon 2025-06-09 12:21:11 EEST; 3 weeks 1 day ago
  systemd: 257.6-1-arch
   CGroup: /
           ├─init.scope
           │ └─1 /sbin/init
           ├─system.slice
           │ ├─NetworkManager.service
           │ │ └─858 /usr/bin/NetworkManager --no-daemon
           ...
           │ └─wpa_supplicant.service
           │ └─1989 /usr/bin/wpa_supplicant -u -s -O /run/wpa_supplicant
           └─user.slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, all processes are grouped by type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system.slice&lt;/code&gt;: all systemd services (&lt;code&gt;nginx.service&lt;/code&gt;, &lt;code&gt;docker.service&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user.slice&lt;/code&gt;: user processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;machine.slice&lt;/code&gt;: virtual machines, containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where &lt;em&gt;slice&lt;/em&gt; is an abstraction of &lt;code&gt;systemd&lt;/code&gt; by which it groups different processes - see man &lt;a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.slice.html" rel="noopener noreferrer"&gt;&lt;code&gt;systemd.slice&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see which &lt;code&gt;cgroup&lt;/code&gt; a process belongs to in its &lt;code&gt;/proc/&amp;lt;PID&amp;gt;/cgroup&lt;/code&gt;, for example, NetworkManager with PID "858":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/858/cgroup 
0::/system.slice/NetworkManager.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cgroup&lt;/code&gt; slice can also be specified in the systemd file of the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /usr/lib/systemd/system/mdmon@.service | grep Slice 
Slice=system.slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CPU and Memory in &lt;code&gt;cgroups&lt;/code&gt;, and &lt;code&gt;cgroups&lt;/code&gt; v1 vs &lt;code&gt;cgroups&lt;/code&gt; v2
&lt;/h3&gt;

&lt;p&gt;So, in the &lt;code&gt;cgroup&lt;/code&gt; for the entire slice, you set how much CPU and Memory the processes of this group can use (from here on, we will talk only about CPU and Memory).&lt;/p&gt;

&lt;p&gt;For example, for my user &lt;code&gt;setevoy&lt;/code&gt; (with ID 1000), we have the files &lt;code&gt;cpu.max&lt;/code&gt; and &lt;code&gt;memory.max&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.max
max 100000

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.max 
max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cpu.max&lt;/code&gt; in cgroups v2 replaced &lt;code&gt;cpu.cfs_quota_us&lt;/code&gt; and &lt;code&gt;cpu.cfs_period_us&lt;/code&gt; from cgroups v1.&lt;/p&gt;

&lt;p&gt;Here, in &lt;code&gt;cpu.max&lt;/code&gt;, we have the settings for how much CPU time will be devoted to my user's processes.&lt;/p&gt;

&lt;p&gt;The format of the file is &lt;code&gt;&amp;lt;quota&amp;gt; &amp;lt;period&amp;gt;&lt;/code&gt;, where &lt;code&gt;&amp;lt;quota&amp;gt;&lt;/code&gt; is the time available to the process (or group), and &lt;code&gt;&amp;lt;period&amp;gt;&lt;/code&gt; is the duration of one period in microseconds (100,000 µs = 100 ms).&lt;/p&gt;

&lt;p&gt;In cgroups v1, these values were set in &lt;code&gt;cpu.cfs_quota_us&lt;/code&gt; (the &lt;code&gt;&amp;lt;quota&amp;gt;&lt;/code&gt; in v2) and &lt;code&gt;cpu.cfs_period_us&lt;/code&gt; (the &lt;code&gt;&amp;lt;period&amp;gt;&lt;/code&gt; in v2).&lt;/p&gt;

&lt;p&gt;That is, in the file above we see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max&lt;/code&gt;: no quota, all CPU time is available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;100000&lt;/code&gt;: 100,000 µs = 100 ms, the length of one CPU period&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CPU period here is the time interval over which the Linux kernel checks how much CPU time the processes in the cgroup have used: if the group has a quota and its processes have exhausted it, they will be suspended until the end of the current period (&lt;em&gt;CPU throttling&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;That is, if a limit of 50,000 (50 ms) is set for a process with a &lt;code&gt;period&lt;/code&gt; of 100,000 microseconds (100 ms), then processes can use only 50 ms in each 100 ms "window".&lt;/p&gt;
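&lt;p&gt;The same quota/period arithmetic as a small Python sketch: what fraction of one CPU a group gets, and how long it is throttled in each period:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cpu_window(quota_us, period_us=100_000):
    """Return (fraction of one CPU, throttled microseconds per period)."""
    if quota_us == "max":
        return (None, 0)   # no quota: never throttled
    return (quota_us / period_us, period_us - quota_us)

# a 50 ms quota in a 100 ms period: half a CPU, throttled 50 ms per window
print(cpu_window(50_000))
# (0.5, 50000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;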

&lt;p&gt;Memory usage can be seen in the file &lt;code&gt;memory.current&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.current 
47336714240
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which gives us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "47336714240 / 1024 / 1024" | bc 
45143
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;45 gigabytes of memory used by the processes of the user with UID 1000.&lt;/p&gt;

&lt;p&gt;You can also check the current resource usage of each group with &lt;code&gt;systemd-cgtop&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4whe98bebmkw38qn0p7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4whe98bebmkw38qn0p7m.png" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or by passing the slice name:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h3pnkpb5rksyqvc4v3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h3pnkpb5rksyqvc4v3y.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For CPU, cumulative statistics for the group since its creation are available in &lt;code&gt;cpu.stat&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.stat 
usage_usec 2863938974603 
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Kubernetes, &lt;code&gt;cpu.max&lt;/code&gt; and &lt;code&gt;memory.max&lt;/code&gt; will be determined when we set &lt;code&gt;resources.limits.cpu&lt;/code&gt; and &lt;code&gt;resources.limits.memory&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Kubernetes CPU Limits may be a bad idea
&lt;/h3&gt;

&lt;p&gt;It is often said that setting CPU limits in Kubernetes is a bad idea.&lt;/p&gt;

&lt;p&gt;Why is this so?&lt;/p&gt;

&lt;p&gt;Because if we set a limit (i.e., a value other than &lt;code&gt;max&lt;/code&gt; in &lt;code&gt;cpu.max&lt;/code&gt;), then as soon as the group of processes uses up its time in the current CPU Time window, these processes will be throttled even though the CPU may still have idle capacity.&lt;/p&gt;

&lt;p&gt;That is, even if there are free cores in the system, but the cgroup has already exhausted its &lt;code&gt;cpu.max&lt;/code&gt; in the current period, the processes of this group will be suspended until the end of the period (&lt;strong&gt;&lt;em&gt;CPU throttling&lt;/em&gt;&lt;/strong&gt;), regardless of the overall system load.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://home.robusta.dev/blog/stop-using-cpu-limits" rel="noopener noreferrer"&gt;For the Love of God, Stop Using CPU Limits on Kubernetes&lt;/a&gt;, and &lt;a href="https://medium.com/@jettycloud/making-sense-of-kubernetes-cpu-requests-and-limits-390bbb5b7c92" rel="noopener noreferrer"&gt;Making Sense of Kubernetes CPU Requests And Limits&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux CFS and &lt;code&gt;cpu.weight&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Above we saw &lt;code&gt;cpu.max&lt;/code&gt;, where my user is allowed to use all available CPU time for each CPU period.&lt;/p&gt;

&lt;p&gt;But if the limit is not set (i.e. &lt;code&gt;max&lt;/code&gt;), and several groups of processes want access to the CPU at the same time, then the kernel must decide who should be allocated more CPU time.&lt;/p&gt;

&lt;p&gt;To do this, another parameter is set in cgroups — &lt;code&gt;cpu.weight&lt;/code&gt; (in cgroups v2) or &lt;code&gt;cpu.shares&lt;/code&gt; (in cgroups v1): this is the relative priority of a group of processes when determining the CPU access queue.&lt;/p&gt;

&lt;p&gt;The value of &lt;code&gt;cpu.weight&lt;/code&gt; is taken into account by the Linux CFS (Completely Fair Scheduler) to allocate CPU proportionally among several cgroups. See &lt;a href="https://docs.kernel.org/scheduler/sched-design-CFS.html" rel="noopener noreferrer"&gt;CFS Scheduler&lt;/a&gt; and &lt;a href="https://www.scaler.com/topics/operating-system/process-scheduling/" rel="noopener noreferrer"&gt;Process Scheduling in Linux&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.weight 
100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The range of values here is from 1 to 10,000, where 1 is the minimum priority and 10,000 is the maximum. The value 100 is the default.&lt;/p&gt;

&lt;p&gt;The higher the priority, the more time CFS will allocate to processes in this group.&lt;/p&gt;

&lt;p&gt;But this is only taken into account when there is contention for CPU time: when the processor is free, all processes get as much CPU time as they need.&lt;/p&gt;
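Under contention, CFS splits CPU time proportionally to the weights. A tiny illustration with hypothetical slice names:

```python
def cpu_share(weights):
    """Each cgroup's share of CPU time under full contention:
    its weight divided by the sum of all sibling weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Two groups with the default weight of 100 and one with 200:
shares = cpu_share({"a.slice": 100, "b.slice": 100, "c.slice": 200})
print(shares)  # c.slice gets half of the CPU, a and b a quarter each
```

So doubling a group's `cpu.weight` doubles its slice of CPU time relative to its siblings, but only while they are all competing for the CPU.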

&lt;p&gt;In Kubernetes, &lt;code&gt;cpu.weight&lt;/code&gt; will be determined from &lt;code&gt;resources.requests.cpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But the value of &lt;code&gt;resources.requests.memory&lt;/code&gt; affects only the Kubernetes Scheduler, which uses it to pick a Kubernetes WorkerNode with enough free memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;cpu.weight&lt;/code&gt; vs process &lt;code&gt;nice&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In addition to &lt;code&gt;cpu.weight&lt;/code&gt;/&lt;code&gt;cpu.shares&lt;/code&gt;, we also have the &lt;strong&gt;&lt;em&gt;process nice&lt;/em&gt;&lt;/strong&gt; value, which sets the priority of a task.&lt;/p&gt;

&lt;p&gt;The difference between them is that &lt;code&gt;cpu.weight&lt;/code&gt; is set at the cgroup level, while &lt;code&gt;nice&lt;/code&gt; is set at the level of a specific process within the same group.&lt;/p&gt;

&lt;p&gt;And if a higher value in &lt;code&gt;cpu.weight&lt;/code&gt; indicates a higher priority, then with &lt;code&gt;nice&lt;/code&gt; it is the opposite: the lower the nice value (the range is from -20, the highest priority, to 19, the lowest), the more time will be allocated to the process.&lt;/p&gt;

&lt;p&gt;If both processes are in the same cgroup, but with different &lt;code&gt;nice&lt;/code&gt;, then &lt;code&gt;nice&lt;/code&gt; will be taken into account.&lt;/p&gt;

&lt;p&gt;And if these are different cgroups, then &lt;code&gt;cpu.weight&lt;/code&gt; will be taken into account.&lt;/p&gt;

&lt;p&gt;That is, &lt;code&gt;cpu.weight&lt;/code&gt; determines which group of processes is more important to the kernel, and &lt;code&gt;nice&lt;/code&gt; determines which process in the group has priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux cgroups summary
&lt;/h3&gt;

&lt;p&gt;So, each Control Group determines how much CPU and memory will be allocated to a process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cpu.max&lt;/code&gt;: determines how much time from each CPU period a process group can spend&lt;/li&gt;
&lt;li&gt;Kubernetes manifest values in &lt;code&gt;resources.limits.cpu&lt;/code&gt; and &lt;code&gt;resources.limits.memory&lt;/code&gt; affect the &lt;code&gt;cpu.max&lt;/code&gt; and &lt;code&gt;memory.max&lt;/code&gt; settings for the cgroup of the corresponding containers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory.max&lt;/code&gt;: how much memory can be used without the risk of being killed by the Out of Memory Killer&lt;/li&gt;
&lt;li&gt;Kubernetes manifest value of &lt;code&gt;resources.requests.memory&lt;/code&gt; affects only the Kubernetes Scheduler for selecting a Kubernetes WorkerNode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpu.weight&lt;/code&gt;: determines the priority of a group of processes when the CPU is under load&lt;/li&gt;
&lt;li&gt;the Kubernetes manifest value in &lt;code&gt;resources.requests.cpu&lt;/code&gt; affects the &lt;code&gt;cpu.weight&lt;/code&gt; setting for the cgroup of the corresponding containers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes Pod resources and Linux cgroups
&lt;/h3&gt;

&lt;p&gt;Okay, now that we’ve figured out cgroups on Linux, let’s take a closer look at how the values in Kubernetes &lt;code&gt;resources.requests&lt;/code&gt; and &lt;code&gt;resources.limits&lt;/code&gt; affect containers.&lt;/p&gt;

&lt;p&gt;When we set &lt;code&gt;spec.containers.resources&lt;/code&gt; in a Deployment or Pod, and Pods are created on a WorkerNode, the &lt;code&gt;kubelet&lt;/code&gt; on that node gets the values from the &lt;code&gt;PodSpec&lt;/code&gt; and passes them to the Container Runtime Interface (CRI) implementation (containerd or CRI-O).&lt;/p&gt;

&lt;p&gt;The CRI converts them into a container specification in JSON, which specifies the appropriate values for the cgroup of this container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes CPU Unit vs cgroup CPU share
&lt;/h3&gt;

&lt;p&gt;In Kubernetes manifests, we specify CPU resources in CPU units: 1 unit == 1 full CPU core — physical or virtual, see &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu" rel="noopener noreferrer"&gt;CPU resource units&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;1 millicpu (or millicore) is 1/1000 of one CPU core.&lt;/p&gt;

&lt;p&gt;One Kubernetes CPU Unit is 1024 CPU shares in the corresponding Linux cgroup.&lt;/p&gt;

&lt;p&gt;That is: 1 Kubernetes CPU Unit == 1000 millicpu == 1024 CPU shares in a cgroup.&lt;/p&gt;
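That conversion can be sketched as follows (a simplified illustration; the actual kubelet also enforces a lower bound of 2 shares, which is ignored here):

```python
def millicpu_to_shares(millicpu):
    """Kubernetes CPU units to cgroup v1 CPU shares: 1000m == 1024 shares."""
    return millicpu * 1024 // 1000

print(millicpu_to_shares(1000))  # 1024 -> one full core
print(millicpu_to_shares(500))   # 512  -> half a core
print(millicpu_to_shares(250))   # 256
```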

&lt;p&gt;In addition, there is a nuance in how Kubernetes calculates the &lt;code&gt;cpu.weight&lt;/code&gt; for Pods: Kubernetes still uses CPU shares internally, which it then translates into &lt;code&gt;cpu.weight&lt;/code&gt; for cgroups v2. We will see how this looks below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking Kubernetes Pod &lt;code&gt;resources&lt;/code&gt; in a cgroup
&lt;/h3&gt;

&lt;p&gt;Let’s create a test Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
  namespace: default
spec:
  containers:
    - name: nginx
      image: nginx
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "1Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it and find the appropriate WorkerNode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pod nginx-test
Name: nginx-test
Namespace: default
Priority: 0
Service Account: default
Node: ip-10-0-32-142.ec2.internal/10.0.32.142
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s connect via SSH and take a look at the cgroups settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes &lt;code&gt;kubepods.slice&lt;/code&gt; cgroup
&lt;/h3&gt;

&lt;p&gt;All parameters for Kubernetes Pods are set in the &lt;code&gt;/sys/fs/cgroup/kubepods.slice/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/
...
drwxr-xr-x. 5 root root 0 Jul 2 12:30 kubepods-besteffort.slice
drwxr-xr-x. 6 root root 0 Jul 2 12:30 kubepods-burstable.slice
drwxr-xr-x. 4 root root 0 Jul 2 12:31 kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To find out which cgroup slice is responsible for our container, let’s check the running pods in the &lt;code&gt;k8s.io&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers ls
CONTAINER IMAGE RUNTIME                  
00d432ee10181ce579af7f0d02a3a04167ced45f8438167f3922e385ed9ab58f 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-pod-identity-agent:v0.1.29 io.containerd.runc.v2    
...
987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f docker.io/library/nginx:latest io.containerd.runc.v2
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt;: the namespaces in &lt;code&gt;ctr&lt;/code&gt; are containerd namespaces, not Linux ones, see &lt;a href="https://blog.mobyproject.org/containerd-namespaces-for-docker-kubernetes-and-beyond-d6c43f565084" rel="noopener noreferrer"&gt;containerd namespaces for Docker, Kubernetes, and beyond&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our container is “987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f”.&lt;/p&gt;

&lt;p&gt;Let’s check all the information on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers info 987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f
...
        "linux": {
            "resources": {
                "devices": [
                    {
                        "allow": false,
                        "access": "rwm"
                    }
                ],
                "memory": {
                    "limit": 1073741824,
                    "swap": 1073741824
                },
                "cpu": {
                    "shares": 1024,
                    "quota": 100000,
                    "period": 100000
                },
                "unified": {
                    "memory.oom.group": "1",
                    "memory.swap.max": "0"
                }
            },
            "cgroupsPath": "kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice:cri-containerd:987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f",
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see &lt;code&gt;resources.memory&lt;/code&gt; and &lt;code&gt;resources.cpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Everything is clear with &lt;code&gt;memory&lt;/code&gt;, and in &lt;code&gt;resources.cpu&lt;/code&gt; we have three fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;shares&lt;/code&gt;: these are our &lt;code&gt;requests&lt;/code&gt; from the Pod manifest (&lt;code&gt;PodSpec&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quota&lt;/code&gt;: these are our &lt;code&gt;limits&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;period&lt;/code&gt;: CPU period, which was mentioned above - the "accounting window" for CFS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;code&gt;cgroupsPath&lt;/code&gt; we see which cgroup slice contains information about this container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice/cri-containerd-987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f.scope/
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.idle
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.max
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.weight
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 memory.max
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the corresponding values in them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.max
100000 100000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is, a maximum of 100,000 microseconds from each window of 100,000 microseconds, because we set &lt;code&gt;resources.limits.cpu&lt;/code&gt; == "1", i.e. one full core.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes, &lt;code&gt;cpu.weight&lt;/code&gt; and cgroups v2
&lt;/h4&gt;

&lt;p&gt;But if we take a look at the &lt;code&gt;cpu.weight&lt;/code&gt; file, the picture is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.weight 
39
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where did the value “39” come from?&lt;/p&gt;

&lt;p&gt;In the container description, we saw &lt;code&gt;shares&lt;/code&gt; == 1024:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
        "linux": {
            "resources": {
                ...
                "cpu": {
                    "shares": 1024,
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cpu.shares&lt;/code&gt; 1024 is the value we set in Kubernetes when we specified &lt;code&gt;resources.requests.cpu&lt;/code&gt; == "1", because, as mentioned above, "One Kubernetes CPU Unit is 1024 CPU shares".&lt;/p&gt;

&lt;p&gt;That is, for cgroups v1 — in the &lt;code&gt;cpu.shares&lt;/code&gt; file we would have a value of 1024.&lt;/p&gt;

&lt;p&gt;But cgroup v2 is a bit more interesting.&lt;/p&gt;

&lt;p&gt;Under the hood, Kubernetes still counts CPU Shares in the format 1 core == 1024 shares, which are then translated into the cgroups v2 format.&lt;/p&gt;

&lt;p&gt;If we look at the total &lt;code&gt;cpu.weight&lt;/code&gt; for the entire &lt;code&gt;kubepods.slice&lt;/code&gt;, we will see a value of 76:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/cpu.weight 
76
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where did the “76” come from?&lt;/p&gt;

&lt;p&gt;Let’s check the number of cores on this instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-32-142 ec2-user]# lscpu | grep -E '^CPU\('
CPU(s): 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The formula for calculating &lt;code&gt;cpu.weight&lt;/code&gt; is described in the file &lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/kubelet/cm/cgroup_manager_linux.go#L566" rel="noopener noreferrer"&gt;&lt;code&gt;cgroup_manager_linux.go#L566&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
func CpuSharesToCpuWeight(cpuShares uint64) uint64 {
  return uint64((((cpuShares - 2) * 9999) / 262142) + 1)
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
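The same formula, ported to Python as a quick sanity check against the values observed on this node:

```python
def cpu_shares_to_cpu_weight(cpu_shares):
    """kubelet's mapping of cgroup v1 CPU shares (range [2, 262144])
    to cgroup v2 cpu.weight (range [1, 10000])."""
    return (cpu_shares - 2) * 9999 // 262142 + 1

print(cpu_shares_to_cpu_weight(1024))  # 39 -> our Pod with requests.cpu == "1"
print(cpu_shares_to_cpu_weight(2048))  # 79 -> two full cores
```

This is where the container's `cpu.weight` of 39 comes from: 1024 shares for one requested core, fed through this mapping.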



&lt;p&gt;Having 2 cores == 2048 CPU shares in cgroups v1 terms, we calculate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((((2048 - 2) * 9999) / 262142) + 1) 
79
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is, the entire &lt;code&gt;kubepods.slice&lt;/code&gt; is assigned a "weight" of 79 &lt;code&gt;cpu.weight&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But we counted all CPU shares of the node, while in fact part of the CPU is reserved for the system and components like &lt;code&gt;kubelet&lt;/code&gt;, so the actual value is slightly lower.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes Quality of Service Classes
&lt;/h4&gt;

&lt;p&gt;See &lt;a href="https://rtfm.co.ua/en/kubernetes-evicted-pods-and-pods-quality-of-service/" rel="noopener noreferrer"&gt;Kubernetes: Evicted pods and Pods Quality of Service&lt;/a&gt;, and &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/" rel="noopener noreferrer"&gt;Pod Quality of Service Classes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the directory &lt;code&gt;/sys/fs/cgroup/kubepods.slice/&lt;/code&gt; we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubepods-besteffort.slice&lt;/code&gt;: &lt;strong&gt;BestEffort QoS&lt;/strong&gt; - when neither &lt;code&gt;requests&lt;/code&gt; nor &lt;code&gt;limits&lt;/code&gt; are specified
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubepods-burstable.slice&lt;/code&gt;: &lt;strong&gt;Burstable QoS&lt;/strong&gt; - when &lt;code&gt;requests&lt;/code&gt; are specified but are not equal to &lt;code&gt;limits&lt;/code&gt; (for example, only &lt;code&gt;requests&lt;/code&gt; are set)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice&lt;/code&gt;: &lt;strong&gt;Guaranteed QoS&lt;/strong&gt; - when &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt; are set and equal to each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our Pod is exactly &lt;em&gt;Guaranteed QoS&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pod nginx-test | grep QoS
QoS Class: Guaranteed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because we set &lt;code&gt;requests&lt;/code&gt; == &lt;code&gt;limits&lt;/code&gt;.&lt;/p&gt;
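The assignment logic can be sketched roughly as follows; note this is a simplified model of the Kubernetes QoS rules for a single container, not the actual kubelet code (the real rules consider every container in the Pod and both cpu and memory):

```python
def qos_class(requests, limits):
    """Simplified Kubernetes QoS classification for a single container,
    given its resource requests and limits as dicts."""
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "1", "memory": "1Gi"},
                {"cpu": "1", "memory": "1Gi"}))  # Guaranteed -> our nginx-test Pod
print(qos_class({"cpu": "500m"}, {}))            # Burstable
print(qos_class({}, {}))                         # BestEffort
```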

&lt;p&gt;And since we set 1 full core in &lt;code&gt;requests&lt;/code&gt;, Kubernetes through cgroups allocates it half of the total &lt;code&gt;cpu.weight&lt;/code&gt; available for &lt;code&gt;kubepods.slice&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;38*2 76
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly the value we saw in &lt;code&gt;kubepods.slice/cpu.weight&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That’s why Linux CFS will always give our container half of the available CPU time on both cores, or “one whole core”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://victoriametrics.com/blog/kubernetes-cpu-go-gomaxprocs/#cpu-weight" rel="noopener noreferrer"&gt;How CPU Weight Is Calculated&lt;/a&gt; on VictoriaMetrics blogs&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@jettycloud/making-sense-of-kubernetes-cpu-requests-and-limits-390bbb5b7c92" rel="noopener noreferrer"&gt;Making Sense of Kubernetes CPU Requests And Limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinheinz.dev/blog/91" rel="noopener noreferrer"&gt;Cgroups — Deep Dive into Resource Management in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu" rel="noopener noreferrer"&gt;Resource Management for Pods and Containers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://wiki.archlinux.org/title/Cgroups" rel="noopener noreferrer"&gt;cgroups&lt;/a&gt; (Arch Wiki)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxera.org/cpu-memory-management-kubernetes-cgroupsv2/" rel="noopener noreferrer"&gt;CPU and Memory Management on Kubernetes with Cgroupsv2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/kubernetes-pod-resources-requests-resources-limits-ta-linux-cgroup/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>linux</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>AWS: introduction to the OpenSearch Service as a vector store</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Sun, 14 Sep 2025 22:07:32 +0000</pubDate>
      <link>https://forem.com/aws-heroes/aws-introduction-to-the-opensearch-service-as-a-vector-store-55na</link>
      <guid>https://forem.com/aws-heroes/aws-introduction-to-the-opensearch-service-as-a-vector-store-55na</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65j5emcorj18ix3o5qr.png" width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are currently using AWS OpenSearch Service as a vector store for our RAG with AWS Bedrock Knowledge Base.&lt;/p&gt;

&lt;p&gt;We will talk more about RAG and Bedrock another time, but today let’s take a look at AWS OpenSearch Service.&lt;/p&gt;

&lt;p&gt;The task is to migrate our AWS OpenSearch Service Serverless to Managed, primarily due to (surprise) cost issues — because with Serverless, we constantly have unexpected spikes in OpenSearch Compute Units (OCU — processor, memory, and disk) usage, even when there are no changes in the data.&lt;/p&gt;

&lt;p&gt;The main task is to plan the cluster size: disks, CPU and memory, and select the appropriate instance types.&lt;/p&gt;

&lt;p&gt;In the second part, we will talk about access settings — &lt;a href="https://rtfm.co.ua/en/aws-creating-an-opensearch-service-cluster-and-configuring-authentication-and-authorization/" rel="noopener noreferrer"&gt;AWS: creating an OpenSearch Service cluster and setting up authentication and authorization&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Elasticsearch vs OpenSearch vs AWS OpenSearch Service&lt;/li&gt;
&lt;li&gt;AWS OpenSearch Service: an intro&lt;/li&gt;
&lt;li&gt;Data schema: documents, indexes, and shards&lt;/li&gt;
&lt;li&gt;Data, Master, and Coordinator Nodes&lt;/li&gt;
&lt;li&gt;Pricing&lt;/li&gt;
&lt;li&gt;Hot, UltraWarm, and Cold storage in the OpenSearch Service&lt;/li&gt;
&lt;li&gt;Planning an AWS OpenSearch Service domain&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;li&gt;Choosing the size of disks&lt;/li&gt;
&lt;li&gt;Number of shards&lt;/li&gt;
&lt;li&gt;Choosing a type of Data Nodes&lt;/li&gt;
&lt;li&gt;Instance types&lt;/li&gt;
&lt;li&gt;Data Node Storage&lt;/li&gt;
&lt;li&gt;Data Node CPU&lt;/li&gt;
&lt;li&gt;Data Node RAM&lt;/li&gt;
&lt;li&gt;Calculating RAM for logs&lt;/li&gt;
&lt;li&gt;RAM calculation for vector store&lt;/li&gt;
&lt;li&gt;RAG, AWS Bedrock Knowledge Base, data, and vector creation&lt;/li&gt;
&lt;li&gt;Number of vectors&lt;/li&gt;
&lt;li&gt;Useful links&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Elasticsearch vs OpenSearch vs AWS OpenSearch Service
&lt;/h3&gt;

&lt;p&gt;In fact, OpenSearch is essentially the same as Elasticsearch: when Elasticsearch changed its license terms in 2021, AWS launched its own fork, naming it &lt;a href="https://docs.opensearch.org/latest/" rel="noopener noreferrer"&gt;OpenSearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenSearch is compatible with Elasticsearch up to version 7.10, but unlike Elasticsearch, OpenSearch has a completely free license.&lt;/p&gt;

&lt;p&gt;I once wrote about launching Elasticsearch as part of the ELK stack for logs here — &lt;a href="https://rtfm.co.ua/en/elastic-stack-an-overview-and-elk-installation-on-ubuntu-20-04/" rel="noopener noreferrer"&gt;Elastic Stack: overview and installation of ELK on Ubuntu&lt;/a&gt; (2022), but that article is more about self-hosted solutions and working with indexes in general. Now we will look specifically at the solution from AWS.&lt;/p&gt;

&lt;p&gt;AWS OpenSearch Service is a fully AWS-managed service: as with Kubernetes, AWS takes care of all deployment, updates, and backups, and has tight integration with other AWS services — IAM, VPC, S3, and Bedrock, which is what we use it with.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS OpenSearch Service: an intro
&lt;/h3&gt;

&lt;p&gt;Here and below, I will mainly talk about the Managed OpenSearch Service.&lt;/p&gt;

&lt;p&gt;The basic concepts of AWS OpenSearch Service are &lt;em&gt;domains&lt;/em&gt;, &lt;em&gt;nodes&lt;/em&gt;, &lt;em&gt;indexes&lt;/em&gt; (“databases”), and &lt;em&gt;shards&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A “domain” is the cluster itself, which we configure to the desired number and type of nodes, and indexes are divided into shards (data blocks) that are distributed among the nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb8f7780jedcm128gatq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb8f7780jedcm128gatq.png" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The nodes in the cluster are essentially regular EC2 instances, just as regular compute instances run under the hood of RDS or even an AWS Load Balancer.&lt;/p&gt;

&lt;p&gt;For the AWS OpenSearch Service cluster, as with Elastic Kubernetes Service, separate control nodes (&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-dedicatedmasternodes.html" rel="noopener noreferrer"&gt;master nodes&lt;/a&gt;) are created, but unlike EKS, we do not need to manage the Data Plane and WorkerNodes separately here.&lt;/p&gt;

&lt;p&gt;As with RDS, we can set up automatic backups for the OpenSearch cluster.&lt;/p&gt;

&lt;p&gt;For data visualization, AWS provides &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/dashboards.html" rel="noopener noreferrer"&gt;OpenSearch Dashboards&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data schema: documents, indexes, and shards
&lt;/h3&gt;

&lt;p&gt;To understand what types of instances to choose for our cluster, let’s take a look at what &lt;em&gt;indexes&lt;/em&gt; are in OpenSearch (or Elasticsearch, because they are essentially the same thing).&lt;/p&gt;

&lt;p&gt;So, an index is a collection of documents that have some common features. Each index has a unique name, just like a database in RDS PostgreSQL or MariaDB.&lt;/p&gt;

&lt;p&gt;Although an index is often compared to a database, in practice it is more convenient to think of an index as a table, and the “database” as the entire cluster.&lt;/p&gt;

&lt;p&gt;A document is a JSON object in an index and represents a basic unit of data storage. If we draw an analogy with the SQL databases, it is like a row in a table.&lt;/p&gt;

&lt;p&gt;Each document has a set of key-value fields, where values can be strings, integers, dates, or more complex structures such as arrays or objects.&lt;/p&gt;

&lt;p&gt;Indexes are divided into parts called &lt;em&gt;shards&lt;/em&gt; for better performance, where each shard contains a portion of the index data. Each document is stored in only one shard, and searches can be performed in parallel across multiple shards.&lt;/p&gt;

&lt;p&gt;Although technically not entirely accurate, shards can be thought of as separate mini-indexes or mini-databases.&lt;/p&gt;

&lt;p&gt;Shards can be &lt;em&gt;primary&lt;/em&gt; or &lt;em&gt;replica&lt;/em&gt;: the primary accepts all write operations and can also serve reads, while replicas handle read-only operations.&lt;/p&gt;

&lt;p&gt;At the same time, a replica is always created on another data node for fault tolerance, and a replica can become primary if the node with the primary shard fails.&lt;/p&gt;

&lt;p&gt;The default value for the number of shards per index in AWS OpenSearch Service is 5, but it can be configured separately (i.e., with 5 primary shards, we will have 10 shards in total, because there will also be replicas).&lt;/p&gt;

&lt;p&gt;It is recommended that shards be between 10 and 50 gigabytes in size: each shard requires CPU and memory to work with it, so a large number of small shards will increase the need for resources, while shards that are too large will slow down operations on them.&lt;/p&gt;

&lt;p&gt;In the open-source OpenSearch (and Elasticsearch), the default number of primary shards is 1.&lt;/p&gt;
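As a rough sketch, the 10-50 GB shard-size guidance above can be turned into a primary-shard estimate; the 30 GB target and the ~10% indexing-overhead factor used here are assumptions (rules of thumb), not AWS-documented constants:

```python
import math

def estimate_primary_shards(source_data_gb, target_shard_gb=30.0,
                            index_overhead=1.1):
    """Rough primary-shard count: stored index size (source data plus an
    assumed ~10% indexing overhead) divided by the desired shard size,
    which should land in the recommended 10-50 GB range."""
    return max(1, math.ceil(source_data_gb * index_overhead / target_shard_gb))

print(estimate_primary_shards(300))  # 11 primary shards of roughly 30 GB each
print(estimate_primary_shards(40))   # 2
```

Remember that each primary shard also gets a replica on another data node, so the total shard count (and the per-shard CPU/memory overhead) is at least double this estimate.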

&lt;p&gt;New documents are distributed evenly among all available shards.&lt;/p&gt;

&lt;p&gt;Related:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-how-many-shards-do-i-need/" rel="noopener noreferrer"&gt;Amazon OpenSearch Service 101: How many shards do I need&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data, Master, and Coordinator Nodes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data Nodes&lt;/strong&gt;  — store data and shards, and execute search and aggregation queries. These are the main “working units” of the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-dedicatedmasternodes.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Master Nodes&lt;/strong&gt;&lt;/a&gt; — store metadata about indexes, mapping, cluster status, manage primary/replica shards, perform rebalancing — but do not process search queries. That is, their task is exclusively to control the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/Dedicated-coordinator-nodes.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Coordinator nodes&lt;/strong&gt;&lt;/a&gt; (client nodes) do not store any data and do not participate in its processing. The role of these nodes is to act as a kind of “proxy” between the client and the data nodes: they receive a query from the client, split it into subqueries (&lt;em&gt;scatter&lt;/em&gt;), send them to the appropriate data nodes, then collect the results (&lt;em&gt;gather&lt;/em&gt;) and return them to the client. On large clusters, it is advisable to have dedicated Coordinator nodes in order to reduce the load on the Master and Data nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;As with most similar AWS services, we pay for compute resources (CPU, RAM), for disks (EBS), and for traffic. Traffic, however, has some nuances (for the better): for multi-AZ deployments, we don’t pay for traffic between nodes in different Availability Zones (the same is true for RDS, I believe), nor do we pay for traffic between UltraWarm/Cold Nodes and AWS S3.&lt;/p&gt;

&lt;p&gt;Full documentation on pricing is available at &lt;a href="https://aws.amazon.com/opensearch-service/pricing/" rel="noopener noreferrer"&gt;Amazon OpenSearch Service Pricing&lt;/a&gt;, and here are the main points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;t3.medium.search&lt;/code&gt;: 2 vCPU, 4 GB RAM - $0.073 per hour (a regular &lt;code&gt;t3.medium&lt;/code&gt; EC2 costs less - $0.044/hour)&lt;/li&gt;
&lt;li&gt;General Purpose SSD (gp3) EBS: $0.122 per GB/month (regular EBS for EC2 — $0.08 per GB/month)&lt;/li&gt;
&lt;/ul&gt;
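&lt;p&gt;As a quick sanity check on these numbers, here is a rough monthly estimate in Python. The node count, disk size, and 730 hours/month are illustrative assumptions, not AWS-published totals:&lt;/p&gt;

```python
# Rough monthly cost sketch using the on-demand prices listed above:
# t3.medium.search at $0.073/hour and gp3 EBS at $0.122 per GB/month.
HOURS_PER_MONTH = 730  # assumption: average hours in a month

def monthly_cost_usd(nodes, node_price_per_hour, ebs_gb_per_node,
                     ebs_price_per_gb=0.122):
    compute = nodes * node_price_per_hour * HOURS_PER_MONTH
    storage = nodes * ebs_gb_per_node * ebs_price_per_gb
    return compute + storage

# 3 x t3.medium.search data nodes with 100 GB gp3 each: ~$196/month
print(round(monthly_cost_usd(3, 0.073, 100), 2))
```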

&lt;p&gt;Similar to AWS EKS, OpenSearch Service has two tiers of version support, Standard and Extended, and, of course, Extended is more expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hot, UltraWarm, and Cold storage in the OpenSearch Service
&lt;/h3&gt;

&lt;p&gt;Data (indexes) storage in OpenSearch Service can be organized either on EBS on the data node itself (Hot), cached on a node with a “backend” in S3 (UltraWarm), or only in S3 (Cold):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot storage&lt;/strong&gt;: regular data nodes on regular EC2 with EBS — for the most relevant data; provides the fastest access&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ultrawarm.html" rel="noopener noreferrer"&gt;&lt;strong&gt;UltraWarm storage&lt;/strong&gt;&lt;/a&gt;: for data that is still relevant but not frequently accessed — data is stored in S3, with a cache on dedicated nodes of a separate instance type, such as &lt;code&gt;ultrawarm1.medium.search&lt;/code&gt;:
&lt;ul&gt;
&lt;li&gt;fast access to data that is in the cache, slower access to data that has not been accessed for a long time&lt;/li&gt;
&lt;li&gt;the nodes themselves are more expensive (&lt;code&gt;ultrawarm1.medium.search&lt;/code&gt; costs $0.238/hour), but savings come from storing data in S3 instead of EBS&lt;/li&gt;
&lt;li&gt;data is read-only&lt;/li&gt;
&lt;li&gt;not available if there are T2 or T3 instances in the cluster :-(&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cold-storage.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Cold storage&lt;/strong&gt;&lt;/a&gt;: data is stored exclusively in S3 and can be accessed via the OpenSearch Service API:
&lt;ul&gt;
&lt;li&gt;slow access, but here we only pay for S3&lt;/li&gt;
&lt;li&gt;to use it, UltraWarm storage must be configured&lt;/li&gt;
&lt;li&gt;similarly, not available if there are T2 or T3 instances in the cluster :-(&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is well described in &lt;a href="https://aws.amazon.com/blogs/big-data/choose-the-right-storage-tier-for-your-needs-in-amazon-opensearch-service/" rel="noopener noreferrer"&gt;Choose the right storage tier for your needs in Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Automatic backups are free and stored for 14 days.&lt;/p&gt;

&lt;p&gt;Manual backups are charged for S3 storage, but there is no charge for the traffic used to store them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Planning an AWS OpenSearch Service domain
&lt;/h3&gt;

&lt;p&gt;OK, now that we’ve covered the basics, let’s think about how we’re going to build the cluster — its capacity planning and the selection of instance types for Data Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Choosing the size of disks
&lt;/h3&gt;

&lt;p&gt;A very important point to start with is to determine how much space your index or indexes will take up.&lt;/p&gt;

&lt;p&gt;This is well described in the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-storage.html" rel="noopener noreferrer"&gt;Calculating storage requirements&lt;/a&gt; documentation, but let’s calculate it ourselves.&lt;/p&gt;

&lt;p&gt;For example, we will have 3 data nodes, and we will store some logs.&lt;/p&gt;

&lt;p&gt;We record 10 GiB of logs per day, which we store for 30 days, resulting in 300 gigabytes of occupied space. With three nodes, that’s 100 gigabytes per node.&lt;/p&gt;

&lt;p&gt;But we also need to consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Number of replicas&lt;/strong&gt; : each replica shard is a copy of the primary shard, so it will take up about the same amount of space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch indexing overhead&lt;/strong&gt; : OpenSearch takes up additional space for its own indexes; this is another +10% of the size of the data itself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating system reserved space&lt;/strong&gt; : 5% of space on EBS is reserved by the operating system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Service overhead&lt;/strong&gt; : another 20% — but no more than 20 gigabytes — is reserved on &lt;strong&gt;each node&lt;/strong&gt; by OpenSearch Service itself for its own work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The documentation provides an interesting clarification on the last point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if we have 3 nodes, each with a 500 GB disk, then together we will have 1.5 terabytes, while the total maximum amount of space reserved for OpenSearch will be 60 GB — 20 GB for each node&lt;/li&gt;
&lt;li&gt;if we have 10 nodes, each with a 100 GB disk, then together we will have 1 terabyte, but the maximum amount of space reserved for OpenSearch will be 200 GB — 20 per node.&lt;/li&gt;
&lt;/ul&gt;
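&lt;p&gt;The “20%, but no more than 20 GB per node” rule is easy to express in code, reproducing both examples from the documentation:&lt;/p&gt;

```python
# OpenSearch Service reserves 20% of each node's EBS volume for its own
# needs, capped at 20 GB per node.
def service_reserved_gb(nodes, disk_gb_per_node):
    return nodes * min(0.2 * disk_gb_per_node, 20)

print(service_reserved_gb(3, 500))   # 3 x 500 GB disks -> 60 GB reserved
print(service_reserved_gb(10, 100))  # 10 x 100 GB disks -> 200 GB reserved
```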

&lt;p&gt;The formula for calculating space looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source data * (1 + number of replicas) * (1 + indexing overhead) / (1 - Linux reserved space) / (1 - OpenSearch Service overhead) = minimum storage requirement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is, if we need to store 300 GB of logs, we calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source data: 300 GiB&lt;/li&gt;
&lt;li&gt;1 primary + 1 replica&lt;/li&gt;
&lt;li&gt;1 + indexing overhead = 1.1 (+10% of 1)&lt;/li&gt;
&lt;li&gt;1 — Linux reserved space = 0.95 (5%)&lt;/li&gt;
&lt;li&gt;1 — OpenSearch Service overhead = 0.8 (but this is true if the disks are less than 100 GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, for our 300 GiB of logs, we need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;300*2*1.1/0.95/0.8
868.42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 868 GiB of total space.&lt;/p&gt;

&lt;p&gt;Or there is a simpler formula — just use a coefficient of 1.45:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source data * (1 + number of replicas) * 1.45 = minimum storage requirement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it turns out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;300*2*1.45
870.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Almost the same result as with the full formula.&lt;/p&gt;
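&lt;p&gt;Both formulas above can be checked with a few lines of Python; the 1.45 coefficient is just an approximation of 1.1 / 0.95 / 0.8 (~1.447):&lt;/p&gt;

```python
# Full formula: replicas, indexing overhead, Linux reserved space, and
# the OpenSearch Service overhead (the flat 20% holds for disks up to
# 100 GB, where the reservation is below the 20 GB per-node cap).
def min_storage_gb(source_gb, replicas=1, indexing_overhead=0.10,
                   linux_reserved=0.05, service_overhead=0.20):
    return (source_gb * (1 + replicas) * (1 + indexing_overhead)
            / (1 - linux_reserved) / (1 - service_overhead))

# Simplified rule of thumb with the 1.45 coefficient
def min_storage_simple_gb(source_gb, replicas=1):
    return source_gb * (1 + replicas) * 1.45

print(round(min_storage_gb(300)))         # ~868 GiB
print(round(min_storage_simple_gb(300)))  # 870 GiB
```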

&lt;h3&gt;
  
  
  Number of shards
&lt;/h3&gt;

&lt;p&gt;The second important point, which is also described in the documentation, is &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-sharding.html" rel="noopener noreferrer"&gt;Choosing the number of shards&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The essence is that in AWS OpenSearch Service, the index is split into 5 primary shards without replicas by default (in self-hosted Elasticsearch/OpenSearch, the default is 1 primary and 1 replica).&lt;/p&gt;

&lt;p&gt;Once the index is created, you cannot simply change the number of shards, because the routing of requests to documents is tied to specific shards (this is well described here: &lt;a href="https://codingexplained.com/coding/elasticsearch/understanding-sharding-in-elasticsearch" rel="noopener noreferrer"&gt;Distributing Documents across Shards (Routing)&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The recommended shard size is 10–30 GiB for search-heavy indexes and 30–50 GiB for write-heavy indexes.&lt;/p&gt;

&lt;p&gt;Indexing overhead, which we mentioned above, must be added to the size of the index itself — 10%.&lt;/p&gt;

&lt;p&gt;If we consider a case where we write logs (i.e., write-intensive workload), the maximum index size will be 300 GiB + 10% == 330 GiB.&lt;/p&gt;

&lt;p&gt;If we want to have primary shards of, say, 30 gigabytes, we get 11 primary shards.&lt;/p&gt;
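&lt;p&gt;The shard count above can be computed with a small helper (the 10% indexing overhead and the 30 GiB target are the figures from this section):&lt;/p&gt;

```python
import math

# Number of primary shards = (source data + ~10% indexing overhead)
# divided by the target shard size, rounded up.
def primary_shards(source_gb, target_shard_gb=30, indexing_overhead=0.10):
    index_gb = round(source_gb * (1 + indexing_overhead), 6)  # avoid float noise
    return math.ceil(index_gb / target_shard_gb)

print(primary_shards(300))  # 330 GiB / 30 GiB -> 11 primary shards
```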

&lt;p&gt;Changing the number of primary shards requires creating a new index and performing a reindex — copying data from the old index to the new one, see &lt;a href="https://opensearch.org/blog/optimize-opensearch-index-shard-size/" rel="noopener noreferrer"&gt;Optimize OpenSearch index shard sizes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;See also &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-how-many-shards-do-i-need/" rel="noopener noreferrer"&gt;Amazon OpenSearch Service 101: How many shards do I need&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html#bp-sharding-strategy" rel="noopener noreferrer"&gt;Shard strategy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But!&lt;/p&gt;

&lt;p&gt;If the index is planned to be small, it is better to have one shard + 1 replica; otherwise, the cluster will create unnecessary empty shards that still consume resources.&lt;/p&gt;

&lt;p&gt;In this case, it is still recommended to have three nodes: one will hold the primary shard, the second the replica, and the third will act as a standby:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if node-1 with the primary fails, node-2 will make the replica the new primary&lt;/li&gt;
&lt;li&gt;and node-3 will receive a new replica&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing a type of Data Nodes
&lt;/h3&gt;

&lt;p&gt;Another important point is how to choose the right type of data node.&lt;/p&gt;

&lt;p&gt;What we need to understand in order to choose a node are the CPU, RAM, and disk requirements.&lt;/p&gt;

&lt;p&gt;The documentation &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-instances.html" rel="noopener noreferrer"&gt;Choosing instance types and testing&lt;/a&gt; states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;try starting with a configuration closer to 2 vCPU cores and 8 GiB of memory for every 100 GiB of your storage requirement&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this is just for “starting”, and it is recommended to run some load tests and monitor the results.&lt;/p&gt;

&lt;p&gt;We will talk about monitoring separately, but for now, let’s try to make our own estimate for the hardware we need.&lt;/p&gt;

&lt;p&gt;Another useful resource is here: &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html#bp-sharding-strategy" rel="noopener noreferrer"&gt;Operational best practices for Amazon OpenSearch Service&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instance types
&lt;/h3&gt;

&lt;p&gt;See &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-instance-types.html" rel="noopener noreferrer"&gt;Supported instance types in Amazon OpenSearch Service&lt;/a&gt; and &lt;a href="https://aws.amazon.com/opensearch-service/pricing/" rel="noopener noreferrer"&gt;Amazon OpenSearch Service Pricing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The general rules here are the same as for regular EC2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;General Purpose&lt;/strong&gt; (&lt;code&gt;t3&lt;/code&gt;, &lt;code&gt;m7g&lt;/code&gt;, &lt;code&gt;m7i&lt;/code&gt;): standard servers with balanced CPU/RAM&lt;/li&gt;
&lt;li&gt;well suited for master nodes or data nodes on small clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Optimized&lt;/strong&gt; (&lt;code&gt;c7g&lt;/code&gt;, &lt;code&gt;c7i&lt;/code&gt;): more CPU, less memory&lt;/li&gt;
&lt;li&gt;suitable for data nodes that need more CPU (indexing, complex searches, and aggregations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Optimized&lt;/strong&gt; (&lt;code&gt;r7g&lt;/code&gt;, &lt;code&gt;r7gd&lt;/code&gt;, &lt;code&gt;r7i&lt;/code&gt;): conversely, more memory, less CPU&lt;/li&gt;
&lt;li&gt;suitable for data nodes that need more RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Optimized&lt;/strong&gt; (&lt;code&gt;i4g&lt;/code&gt;, &lt;code&gt;i4i&lt;/code&gt;): better SSDs (NVMe SSDs) with high IOPS&lt;/li&gt;
&lt;li&gt;suitable for data nodes that need to perform many write operations (logs, metrics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Optimized&lt;/strong&gt; (&lt;code&gt;om2&lt;/code&gt;, &lt;code&gt;or2&lt;/code&gt;): "tuned" instances from AWS itself with an optimal CPU/RAM ratio and disks, easier to configure&lt;/li&gt;
&lt;li&gt;this is something for rich and large clusters :-)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The suffixes in the instance type names mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;g&lt;/code&gt;: Graviton processors (ARM64 from AWS) - efficient for multi-threaded computations, better in terms of price:performance, but there may be compatibility issues&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;i&lt;/code&gt;: Intel (x86) - the classic option, compatible with everything, better for heavy single-threaded computations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;d&lt;/code&gt;: "drive" - has an additional NVMe SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Node Storage
&lt;/h3&gt;

&lt;p&gt;We seem to have figured out the disk in &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-sharding.html" rel="noopener noreferrer"&gt;Choosing the number of shards&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10–30 gigabytes per shard if we plan to have more search operations&lt;/li&gt;
&lt;li&gt;30–50 GiB per shard if there are more write operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, we select the instance type so that it has enough storage, because there is still a limit on disk size — see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;EBS volume size quotas&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Node CPU
&lt;/h3&gt;

&lt;p&gt;In the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html#bp-sharding-strategy" rel="noopener noreferrer"&gt;Shard to CPU ratio&lt;/a&gt; section, there is a recommendation to plan for “&lt;em&gt;1.5 vCPU per shard”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That is, if we plan to have 4 shards per data node, we allocate 6 vCPUs. We can add 1 (preferably 2) more cores for the needs of the operating system itself.&lt;/p&gt;

&lt;p&gt;However, again, a lot depends on how the data will be processed.&lt;/p&gt;

&lt;p&gt;If there are many search-heavy operations, then 1.5 CPU per shard is quite justified.&lt;/p&gt;

&lt;p&gt;For write-intensive operations, you can consider 0.5 CPU per shard, and for warm and cold nodes, even less.&lt;/p&gt;
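&lt;p&gt;A sketch of the vCPU estimate described above; the 1.5 and 0.5 vCPU-per-shard ratios come from this section, while the two extra OS cores are this article's suggestion rather than an AWS rule:&lt;/p&gt;

```python
import math

# vCPUs per data node: vCPU-per-shard ratio times shards on the node,
# plus cores reserved for the operating system itself.
def vcpus_per_node(shards_per_node, vcpu_per_shard=1.5, os_cores=2):
    return math.ceil(shards_per_node * vcpu_per_shard) + os_cores

print(vcpus_per_node(4))                      # search-heavy: 6 + 2 = 8
print(vcpus_per_node(4, vcpu_per_shard=0.5))  # write-heavy: 2 + 2 = 4
```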

&lt;p&gt;See &lt;a href="https://opster.com/guides/opensearch/opensearch-basics/threadpool/" rel="noopener noreferrer"&gt;OpenSearch Threadpool&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Node RAM
&lt;/h3&gt;

&lt;p&gt;Now for the most interesting part: how do we calculate the required memory?&lt;/p&gt;

&lt;p&gt;Here, the calculations will depend heavily on the type of index and data — whether it is simply documents in the form of logs or, as in our case, a vector store.&lt;/p&gt;

&lt;p&gt;Before we calculate the requirements, let’s take a quick look at how memory is distributed on the instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM Heap Size: by default, it is set to 50% of RAM (but no more than 32 gigabytes): in the JVM Heap, we will have various internal OpenSearch data — metadata and shard/index management (mappings, routing, cluster status), query and response objects, search coordination, various internal caches and buffers — that is, purely the internal needs of OpenSearch itself&lt;/li&gt;
&lt;li&gt;off-heap memory (the operating system’s own memory):
&lt;ul&gt;
&lt;li&gt;when using the index as a vector store — HNSW graphs (k-NN search) + the Linux page cache for data that is loaded from disk into OS memory for fast access&lt;/li&gt;
&lt;li&gt;for simple logs — only the Linux page cache for data that is loaded from disk into OS memory&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Calculating RAM for logs
&lt;/h4&gt;

&lt;p&gt;We plan to allocate 16 gigabytes for the JVM heap, keeping in mind that this should be about 50% of the node’s RAM. Alternatively, we could start with 8 gigabytes and then monitor &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html#managedomains-cloudwatchmetrics-cluster-metrics" rel="noopener noreferrer"&gt;&lt;code&gt;JVMMemoryPressure&lt;/code&gt;&lt;/a&gt; (we will cover monitoring in more detail in a following blog post; it’s already in drafts).&lt;/p&gt;

&lt;p&gt;Next, we estimate the off-heap memory: Linux will &lt;code&gt;mmap&lt;/code&gt; the index files and read data blocks from disk into the page cache as the process requests them.&lt;/p&gt;

&lt;p&gt;Here we will have the “hot data” — that is, the data that is often needed by clients. For example, we know that most often we will be searching the logs for the last 24 hours, and we write 10 gigabytes of logs per day.&lt;/p&gt;

&lt;p&gt;To these 10 GB, we should add 10–50 percent for the OpenSearch structures themselves, so the index will grow by 11–15 GB per day.&lt;/p&gt;

&lt;p&gt;Of these 11–15 gigabytes, let’s say about half will be actively used for searches, so we’ll allocate roughly 6–8 GiB of RAM for the “hot” OS page cache.&lt;/p&gt;
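&lt;p&gt;Putting the log-index numbers together in a quick sketch; the 20% structures overhead and the 50% “hot” fraction are taken from the middle of the ranges estimated above:&lt;/p&gt;

```python
# Per-node RAM for a log index: a fixed JVM heap plus an OS page cache
# sized as a fraction of the daily "hot" index growth.
def log_node_ram_gb(daily_logs_gb, heap_gb=16,
                    structures_overhead=0.2, hot_fraction=0.5):
    daily_index_gb = daily_logs_gb * (1 + structures_overhead)  # ~12 GB/day
    page_cache_gb = daily_index_gb * hot_fraction               # ~6 GB hot cache
    return heap_gb + page_cache_gb

print(log_node_ram_gb(10))  # 16 GB heap + ~6 GB page cache -> ~22 GB
```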

&lt;h4&gt;
  
  
  RAM calculation for vector store
&lt;/h4&gt;

&lt;p&gt;If we use OpenSearch as a vector database, we need to consider the memory requirements for each graph for data search.&lt;/p&gt;

&lt;p&gt;The size of the graph depends on the algorithm, but let’s take the default one — HNSW (Hierarchical Navigable Small Worlds). The choice of algorithm is well described in &lt;a href="https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/" rel="noopener noreferrer"&gt;Choose the k-NN algorithm for your billion-scale use case with OpenSearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to estimate how much memory the HNSW structure will take up, we need to know the number of vectors in the index, their dimension (&lt;em&gt;embedding dimension&lt;/em&gt;), and the number of connections between each node in the graph (how many neighbors to store for each point in this graph).&lt;/p&gt;

&lt;p&gt;What do we have in the “vector” anyway?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a set of numbers specified in the dimension embedding model (&lt;code&gt;[0.12, -0.88, ...]&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;metadata: various key:value pairs with information about which document this vector belongs to, source, and so on&lt;/li&gt;
&lt;li&gt;optionally — the original text itself (the &lt;code&gt;_source&lt;/code&gt; field does not affect the graph, but increases the size of the index)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id: "doc1-chunk1"
knn_vector: [0.12, -0.33, ...] // number set by dimension parameter
metadata: {doc_id: "doc1", chunk: 1, text: "some text"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  RAG, AWS Bedrock Knowledge Base, data, and vector creation
&lt;/h4&gt;

&lt;p&gt;The RAG process itself is well described in this diagram (see &lt;a href="https://aws.amazon.com/blogs/machine-learning/implementing-knowledge-bases-for-amazon-bedrock-in-support-of-gdpr-right-to-be-forgotten-requests/" rel="noopener noreferrer"&gt;Implementing Amazon Bedrock Knowledge Bases in support of GDPR (right to be forgotten) requests&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4flfzuhqj7kl38vroewv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4flfzuhqj7kl38vroewv.png" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an overview of how RAG works and the role of the vector database in it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client (e.g., a mobile app) sends a request to our Backend API, which runs on Kubernetes.&lt;/li&gt;
&lt;li&gt;The Backend API receives it and generates a &lt;a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html" rel="noopener noreferrer"&gt;&lt;code&gt;RetrieveAndGenerate&lt;/code&gt;&lt;/a&gt; request to Bedrock, passing the Knowledge Base ID and the text of the client's request&lt;/li&gt;
&lt;li&gt;Bedrock launches the RAG pipeline, in which it:
&lt;ul&gt;
&lt;li&gt;sends the request to the embedding model to convert it into a vector (or vectors)&lt;/li&gt;
&lt;li&gt;performs a k-NN search in the OpenSearch index to find the most relevant data&lt;/li&gt;
&lt;li&gt;forms an extended prompt that contains the original request plus the data returned by OpenSearch&lt;/li&gt;
&lt;li&gt;calls the GenAI model, passing it this extended prompt&lt;/li&gt;
&lt;li&gt;receives a response from it&lt;/li&gt;
&lt;li&gt;returns it in JSON format to our Backend API&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The Backend API sends the result to the client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How the process of converting text to vectors looks like in AWS Bedrock Knowledge Base:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have some source — for example, a txt file in S3&lt;/li&gt;
&lt;li&gt;Bedrock reads it, and if it is large, it divides it into chunks with a size specified in the Bedrock parameters&lt;/li&gt;
&lt;li&gt;Bedrock sends each chunk of text to the embedding LLM model, which converts this chunk into a vector of fixed length (dimension) and returns it to the Bedrock pipeline&lt;/li&gt;
&lt;li&gt;Bedrock sends this vector along with metadata to the AWS OpenSearch vector store, where it is indexed for k-NN search&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Number of vectors
&lt;/h4&gt;

&lt;p&gt;The number of vectors in the index primarily depends on the data corpus (the size of all the input data we are working with) and how many chunks they will be divided into.&lt;/p&gt;

&lt;p&gt;What you need to understand: vectors are not created for individual tokens, but for parts of text, for whole phrases.&lt;/p&gt;

&lt;p&gt;Each embedding model has a limit on the number of tokens it can process at a time (the maximum “input length”).&lt;/p&gt;

&lt;p&gt;If the text is long, it is broken down into chunks, and a separate vector is created for each chunk.&lt;/p&gt;

&lt;p&gt;If we take, for example, an embedding model with a limit of 512 tokens and a dimension (&lt;code&gt;d&lt;/code&gt;) of 1024 numbers, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the phrase “&lt;em&gt;hello, world&lt;/em&gt;” fits into one “window” for embedding, and 1 vector will be created&lt;/li&gt;
&lt;li&gt;a 300-word paragraph of English text will yield approximately 400 tokens — this also fits into the window, and 1 embedding vector will also be created&lt;/li&gt;
&lt;li&gt;an article of 1,000 words will give approximately 1,300–1,400 tokens, so it will be divided into three chunks, and a separate vector will be created for each:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;chunk_1 =&amp;gt; [vector_1 with 1024 numbers]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chunk_2 =&amp;gt; [vector_2 with 1024 numbers]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chunk_3 =&amp;gt; [vector_3 with 1024 numbers]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
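&lt;p&gt;A rough way to estimate how many vectors a text will produce, assuming one vector per chunk; the ~1.33 tokens-per-word ratio for English is a common rule of thumb, not a property of any specific model:&lt;/p&gt;

```python
import math

# One embedding vector is created per chunk; a chunk is bounded by the
# model's token window (512 tokens in the example above).
def estimated_vectors(words, tokens_per_word=1.33, window_tokens=512):
    tokens = words * tokens_per_word
    return math.ceil(tokens / window_tokens)

print(estimated_vectors(300))   # ~400 tokens -> 1 vector
print(estimated_vectors(1000))  # ~1330 tokens -> 3 vectors
```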

&lt;p&gt;&lt;code&gt;d&lt;/code&gt; (&lt;em&gt;dimension&lt;/em&gt;): is set by the embedding model, which converts data into vectors for storage in the vector store. For example, in Amazon Titan Embeddings, dimension=1024. This same parameter is specified when creating an index.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;m&lt;/code&gt; (&lt;em&gt;Maximum number of bi-directional links&lt;/em&gt;): the number of links between each node in the graph, this is a parameter of the HNSW graph, specified when we create an index, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"bedrock-knowledge-base-default-vector": {
  "type": "knn_vector",
  "dimension": 1024,
  "method": {
    "name": "hnsw",
    "engine": "faiss",
    "parameters": {
      "m": 16,
      "ef_construction": 512
    },
    "space_type": "l2"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, knowing all this data, we can calculate how much memory will be needed to build the graph in memory, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a number of vectors: 1,000,000&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d=1024&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m=16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_vectors * 1.1 * (4 * d + 8 * m)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1.1&lt;/code&gt;: 10% reserve is added for HNSW service structures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;4&lt;/code&gt;: each coordinate (number in the vector) is stored as float32 = 4 bytes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8&lt;/code&gt;: number of bytes for storing the id of each "neighbor" (64-bit int) (the number of which is given by &lt;code&gt;m&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let’s calculate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,000,000 * 1.1 * (4*1024 + 8*16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4,646,400,000 bytes, or ~4.64 gigabytes, is the volume for the HNSW graph across all vectors (excluding replicas and shards, which will be discussed later).&lt;/p&gt;
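&lt;p&gt;The same calculation as a function, so it is easy to re-run with your own vector count, dimension, and &lt;code&gt;m&lt;/code&gt;:&lt;/p&gt;

```python
# HNSW memory estimate: float32 coordinates (4 bytes each), 8-byte
# neighbour ids (m per graph node), plus ~10% for service structures.
def hnsw_graph_bytes(num_vectors, d, m, overhead=1.1):
    return num_vectors * overhead * (4 * d + 8 * m)

total = hnsw_graph_bytes(1_000_000, d=1024, m=16)
print(total / 10**9)  # roughly 4.6 GB, matching the estimate above
```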

&lt;p&gt;Now let’s consider the distribution into chunks and data nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if we have a total index of 100 gigabytes&lt;/li&gt;
&lt;li&gt;divided into 3 primary shards, and for each primary we have 1 replica shard — a total of 6 shards&lt;/li&gt;
&lt;li&gt;we have 3 data nodes — each node will have 2 shards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A separate graph is built for each shard, and replica shards hold their own copies of the graphs, so with 1 replica per primary we multiply the 4.64 gigabytes by 2.&lt;/p&gt;

&lt;p&gt;But since the index is distributed across 3 nodes, we divide the result by 3.&lt;/p&gt;

&lt;p&gt;So the calculation will be as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;graph_total&lt;/code&gt;: our 4.64 gigabytes, the total volume for the graph&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;graph_cluster&lt;/code&gt;: &lt;code&gt;graph_total&lt;/code&gt; * (1 + replicas) (primary + all replicas)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;graph_per_node&lt;/code&gt; = &lt;code&gt;graph_cluster&lt;/code&gt; / number of data nodes in the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formula will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph_total * (1 + replicas) / num_data_nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 1 replica per primary shard, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4.64 gigabytes * 2 / 3 data nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~ 3.1 GiB of memory per node purely for graphs.&lt;/p&gt;
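&lt;p&gt;And the per-node formula in code form, using the graph size and cluster layout from the example above:&lt;/p&gt;

```python
# Per-node graph memory: the whole-index graph size, duplicated once per
# replica, then spread evenly across the data nodes.
def graph_per_node_gb(graph_total_gb, replicas=1, data_nodes=3):
    return graph_total_gb * (1 + replicas) / data_nodes

print(round(graph_per_node_gb(4.64), 2))  # ~3.09 GiB per node
```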

&lt;p&gt;k-NN graphs are stored in off-heap memory, so we can already estimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 (preferably 16) gigabytes under JVM Heap for OpenSearch itself&lt;/li&gt;
&lt;li&gt;3 GiB under graphs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The limit for k-NN graphs is set in &lt;code&gt;knn.memory.circuit_breaker.limit&lt;/code&gt;, and is usually 50% of the off-heap memory - see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html#knn-settings" rel="noopener noreferrer"&gt;k-NN differences, tuning, and limitations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The metric in CloudWatch is &lt;code&gt;KNNGraphMemoryUsage&lt;/code&gt;, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html#managedomains-cloudwatchmetrics-knn" rel="noopener noreferrer"&gt;k-NN metrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or in the OpenSearch API itself —  &lt;code&gt;_plugins/_knn/stats&lt;/code&gt; and &lt;code&gt;_nodes/stats/indices,os,break&lt;/code&gt; (see &lt;a href="https://docs.opensearch.org/latest/api-reference/nodes-apis/nodes-stats/" rel="noopener noreferrer"&gt;Nodes Stats API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;And to this we must add the OS page cache for “hot” data — vectors/metadata/text that are mapped from disk to memory for quick access — as we calculated for the index with logs.&lt;/p&gt;

&lt;p&gt;For the OS page cache, we can add another 20–50% of the total index size on the node, although this depends on the operations that will be performed. Ideally, if money is no object, you can add another 100% of the index size * 2 (for each replica of each shard) / number of nodes.&lt;/p&gt;

&lt;p&gt;So, if we take 1,000,000 vectors in the database, and the database itself is 30 gigabytes, 3 primary shards and 1 replica for each, and 3 data nodes, we get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 (preferably 16) gigabytes under JVM Heap for OpenSearch itself&lt;/li&gt;
&lt;li&gt;3 GB for graphs&lt;/li&gt;
&lt;li&gt;30 * 2 / 3 * 0.5 (50% for OS page cache) == 10 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And add another 10–15% for the operating system itself, and we get (16 + 3 + 10) * 1.15 == ~34 GB RAM.&lt;/p&gt;
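&lt;p&gt;The same estimate, step by step (using the numbers above; the page cache figure is per data node):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jvm_heap   = 16 GB
knn_graphs = 3 GB
page_cache = 30 GB * 2 (replicas) / 3 (nodes) * 0.5 == 10 GB
total      = (16 + 3 + 10) * 1.15 == 33.35, i.e. ~34 GB RAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;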

&lt;p&gt;Read more on this topic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html" rel="noopener noreferrer"&gt;Sizing Amazon OpenSearch Service domains&lt;/a&gt;: general documentation from AWS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.opensearch.org/2.6/search-plugins/knn/knn-index/#supported-faiss-methods" rel="noopener noreferrer"&gt;k-NN Index&lt;/a&gt;: OpenSearch documentation on index parameters&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/" rel="noopener noreferrer"&gt;Choose the k-NN algorithm for your billion-scale use case with OpenSearch&lt;/a&gt;: algorithms and memory calculation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, that’s probably all for now.&lt;/p&gt;

&lt;p&gt;In the next posts (which I hope to write), we will set up a cluster, perhaps directly with Terraform, create an index, look at authentication and access to OpenSearch Dashboards (because things are a bit non-trivial there), and think about monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful links
&lt;/h3&gt;

&lt;p&gt;Elasticsearch/OpenSearch general docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.giffgaff.io/tech/elasticsearch-index-management" rel="noopener noreferrer"&gt;Elasticsearch index management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codingexplained.com/coding/elasticsearch/introduction-elasticsearch-architecture" rel="noopener noreferrer"&gt;Introduction to the Elasticsearch Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codingexplained.com/coding/elasticsearch/understanding-sharding-in-elasticsearch" rel="noopener noreferrer"&gt;Understanding Sharding in Elasticsearch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dattell.com/data-architecture-blog/elasticsearch-shards-definitions-sizes-optimizations-and-more/" rel="noopener noreferrer"&gt;Elasticsearch Shard Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opensearch.org/blog/optimize-opensearch-index-shard-size/" rel="noopener noreferrer"&gt;Optimize OpenSearch index shard sizes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/pcg-dach/how-we-accelerate-financial-and-operational-efficiency-with-amazon-opensearch-6b86b41d50a0" rel="noopener noreferrer"&gt;Reducing Amazon OpenSearch Service Costs: our Journey to over 60% Savings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managing-indices.html" rel="noopener noreferrer"&gt;Managing indexes in Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/rupeshtiwari/ef155a1e7fdf8157430cacaa18a7e79a" rel="noopener noreferrer"&gt;OpenSearch Performance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenSearch as a vector store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/cost-optimized-vector-database-introduction-to-amazon-opensearch-service-quantization-techniques/" rel="noopener noreferrer"&gt;Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html" rel="noopener noreferrer"&gt;k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/aws-introduction-to-the-opensearch-service-as-a-vector-store/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>aws</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>VictoriaLogs: the "rate limit exceeded" error and monitoring ingested logs</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Sat, 13 Sep 2025 13:19:18 +0000</pubDate>
      <link>https://forem.com/setevoy/victorialogs-the-rate-limit-exceeded-error-and-monitoring-ingested-logs-18jl</link>
      <guid>https://forem.com/setevoy/victorialogs-the-rate-limit-exceeded-error-and-monitoring-ingested-logs-18jl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi16jeyi716e5yd9w4cxn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi16jeyi716e5yd9w4cxn.jpeg" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use two systems for collecting logs in the project: Grafana Loki and VictoriaLogs, to which Promtail simultaneously writes all collected logs.&lt;/p&gt;

&lt;p&gt;We cannot get rid of Loki: although developers have long since switched to VictoriaLogs, some alerts are still created from metrics generated by Loki, so it is still present in the system.&lt;/p&gt;

&lt;p&gt;And at some point, we started having two problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the disk on VictoriaLogs was getting clogged up — we had to reduce retention and increase the disk size, even though it had been sufficient before&lt;/li&gt;
&lt;li&gt;Loki started dropping logs with the error “ &lt;strong&gt;&lt;em&gt;Ingestion rate limit exceeded&lt;/em&gt;&lt;/strong&gt; ”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dig in — what exactly is clogging up all the logs, why, and how can we monitor this?&lt;/p&gt;

&lt;h3&gt;
  
  
  Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The issue: Loki — Ingestion rate limit exceeded&lt;/li&gt;
&lt;li&gt;Checking logs ingestion&lt;/li&gt;
&lt;li&gt;Records per second&lt;/li&gt;
&lt;li&gt;Bytes per second&lt;/li&gt;
&lt;li&gt;The cause&lt;/li&gt;
&lt;li&gt;Monitoring logs for the future&lt;/li&gt;
&lt;li&gt;Loki metrics&lt;/li&gt;
&lt;li&gt;VictoriaLogs metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The issue: Loki — Ingestion rate limit exceeded
&lt;/h3&gt;

&lt;p&gt;I started digging with the “ &lt;strong&gt;&lt;em&gt;Ingestion rate limit exceeded&lt;/em&gt;&lt;/strong&gt; ” error in Loki, because the disk space on VictoriaLogs was full for the same reason — too many logs were being written.&lt;/p&gt;

&lt;p&gt;In the alerts for Loki, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dijz57k8iljzwmxd8zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dijz57k8iljzwmxd8zu.png" width="384" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The alert itself is generated from the metric &lt;a href="https://gitee.com/danasmile/Grafana-Loki/blob/master/docs/operations/observability.md" rel="noopener noreferrer"&gt;&lt;code&gt;loki_discarded_samples_total&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: Loki Logs Dropped
        expr: sum by (cluster, job, reason) (increase(loki_discarded_samples_total[5m])) &amp;gt; 0
        for: 1s
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn’t have an alert for VictoriaLogs, but it has a similar metric — &lt;code&gt;vl_rows_dropped_total&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When Loki started dropping logs received from Promtail, I began checking Loki’s own logs, where I found errors with the rate limit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
path=write msg="write operation failed" details="Ingestion rate limit exceeded for user fake (limit: 4194304 bytes/sec) while attempting to ingest '141' lines totaling '1040783' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased" org_id=fake
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I didn’t bother digging deeper, but simply increased the limit via &lt;a href="https://grafana.com/docs/loki/latest/configure/#limits_config" rel="noopener noreferrer"&gt;&lt;code&gt;limits_config&lt;/code&gt;&lt;/a&gt;, see &lt;a href="https://grafana.com/docs/loki/latest/operations/request-validation-rate-limits/#rate-limit-errors:" rel="noopener noreferrer"&gt;Rate-Limit Errors&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
    limits_config:
      ...
      ingestion_rate_mb: 8
      ingestion_burst_size_mb: 16
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For VictoriaLogs, I simply increased the disk size — &lt;a href="https://rtfm.co.ua/en/kubernetes-pvc-v-statefulset-and-the-forbidden-updates-to-statefulset-spec-error/" rel="noopener noreferrer"&gt;Kubernetes: PVC in StatefulSet, and the “Forbidden updates to StatefulSet spec” error&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This seemed to help for a while, but then the errors reappeared, so I had to investigate further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking logs ingestion
&lt;/h3&gt;

&lt;p&gt;So, what we need to do is determine who is writing so many logs.&lt;/p&gt;

&lt;p&gt;We are interested in two parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;number of records per second&lt;/li&gt;
&lt;li&gt;number of bytes per second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And we want to see this broken down by service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Records per second
&lt;/h3&gt;

&lt;p&gt;You can get the log rate per second in VictoriaLogs simply with the &lt;a href="https://docs.victoriametrics.com/victorialogs/logsql/#rate-stats" rel="noopener noreferrer"&gt;&lt;code&gt;rate()&lt;/code&gt;&lt;/a&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{app=~".*"}
| stats by (app) rate() records_per_second
| sort by (records_per_second) desc 
| limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we group on the label &lt;code&gt;app&lt;/code&gt; (&lt;code&gt;sum by (app)&lt;/code&gt; in Loki)&lt;/li&gt;
&lt;li&gt;with &lt;code&gt;rate()&lt;/code&gt; we get the per-second rate of new records in the group &lt;code&gt;app&lt;/code&gt;, and store the result in a new field &lt;code&gt;records_per_second&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;sort by &lt;code&gt;records_per_second&lt;/code&gt; in descending order&lt;/li&gt;
&lt;li&gt;output the top 10 with &lt;a href="https://docs.victoriametrics.com/victorialogs/logsql/#rate-stats" rel="noopener noreferrer"&gt;&lt;code&gt;limit&lt;/code&gt;&lt;/a&gt; (or &lt;code&gt;head&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1q9aohszatqz9ep4tp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1q9aohszatqz9ep4tp3.png" width="697" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, actually…&lt;/p&gt;

&lt;p&gt;We can see that VictoriaLogs is way ahead of the rest :-)&lt;/p&gt;

&lt;p&gt;In addition, the VictoriaLogs graph shows that most logs come from the Namespace &lt;code&gt;ops-monitoring-ns&lt;/code&gt;, where VictoriaLogs lives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxt7pjp97evz66dkfffg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxt7pjp97evz66dkfffg.png" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Loki, you can view the per-second rate with a similar function, &lt;code&gt;rate()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(10, sum by (app) (rate({app=~".+"}[1m])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qrzg2nbfsoyojcukmsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qrzg2nbfsoyojcukmsp.png" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bytes per second
&lt;/h3&gt;

&lt;p&gt;The same pattern can be seen with bytes per second.&lt;/p&gt;

&lt;p&gt;In Loki, we can see this simply with &lt;a href="https://grafana.com/docs/loki/latest/query/metric_queries/#log-range-aggregations" rel="noopener noreferrer"&gt;&lt;code&gt;bytes_over_time()&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(10, sum by (app) (bytes_over_time({app=~".+"}[1m])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6iteiwneu9vkq7d3h81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6iteiwneu9vkq7d3h81.png" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For VictoriaLogs, there is &lt;a href="https://docs.victoriametrics.com/victorialogs/logsql/#rate-stats" rel="noopener noreferrer"&gt;&lt;code&gt;block_stats&lt;/code&gt;&lt;/a&gt;, but out of the box, it does not allow you to display statistics for each stream, see &lt;a href="https://docs.victoriametrics.com/victorialogs/faq/#how-to-determine-which-log-fields-occupy-the-most-of-disk-space" rel="noopener noreferrer"&gt;How to determine which log fields occupy the most disk space?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, there is &lt;code&gt;sum_len()&lt;/code&gt;, with which we can get statistics, for example, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* 
| stats by (app) sum_len() as bytes_used 
| sort (bytes_used) desc
| limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0x1s6mx8lydgzzdxcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0x1s6mx8lydgzzdxcb.png" width="688" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or per-second rate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* 
| stats by (app) sum_len() as rows_len
| stats by (app) rate_sum(rows_len) as bytes_used_rate
| sort (bytes_used_rate) desc
| limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw6w4a4o2c103plzya27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw6w4a4o2c103plzya27.png" width="688" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The cause
&lt;/h3&gt;

&lt;p&gt;Everything turned out to be quite simple.&lt;/p&gt;

&lt;p&gt;All we had to do was look at the VictoriaLogs logs and see that it logs all entries received from Promtail — “ &lt;strong&gt;&lt;em&gt;new log entry&lt;/em&gt;&lt;/strong&gt; ”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhdruf2y9tcbhf0tf9uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhdruf2y9tcbhf0tf9uc.png" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s check out the options for VictoriaLogs in the documentation &lt;a href="https://docs.victoriametrics.com/victorialogs/#list-of-command-line-flags" rel="noopener noreferrer"&gt;List of command-line flags&lt;/a&gt;, where I found the &lt;code&gt;-logIngestedRows&lt;/code&gt; parameter:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;-logIngestedRows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Whether to log all the ingested log entries; this can be useful for debugging of data ingestion; see&lt;/em&gt; &lt;a href="https://docs.victoriametrics.com/victorialogs/data-ingestion/" rel="noopener noreferrer"&gt;&lt;em&gt;https://docs.victoriametrics.com/victorialogs/data-ingestion/&lt;/em&gt;&lt;/a&gt;&lt;em&gt;; see also -logNewStreams&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The default value is not specified here, and I initially thought that it was simply set to “true,” so I went to the values of our chart to set it to “false,” where I saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
victoria-logs-single:
  server:
    ...
    extraArgs:
      logIngestedRows: "true"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ouch…&lt;/p&gt;

&lt;p&gt;I enabled this logging for some reason once and forgot about it.&lt;/p&gt;

&lt;p&gt;Actually, we can switch it to “false” (or just delete it, because it’s “false” by default), deploy it, and the problem is solved.&lt;/p&gt;

&lt;p&gt;At the same time, we can also adjust &lt;code&gt;loggerLevel&lt;/code&gt;, which is set to &lt;code&gt;INFO&lt;/code&gt; by default, to reduce VictoriaLogs’ own log output.&lt;/p&gt;
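&lt;p&gt;A minimal sketch of the resulting values (you can simply delete &lt;code&gt;logIngestedRows&lt;/code&gt; instead of setting it, and the &lt;code&gt;loggerLevel&lt;/code&gt; change is optional):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
victoria-logs-single:
  server:
    ...
    extraArgs:
      logIngestedRows: "false"
      loggerLevel: "WARN"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;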

&lt;p&gt;And here, by the way, there could be an interesting picture: if both Loki and VictoriaLogs wrote a log about every log record they received, then…&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loki receives any record from Promtail&lt;/li&gt;
&lt;li&gt;writes this event to its own log&lt;/li&gt;
&lt;li&gt;Promtail sees a new record from the container with Loki and again transmits it to both Loki and VictoriaLogs&lt;/li&gt;
&lt;li&gt;VictoriaLogs records in its log that it has received this record&lt;/li&gt;
&lt;li&gt;Promtail sees a new record from the container with VictoriaLogs and transmits it to both Loki and VictoriaLogs&lt;/li&gt;
&lt;li&gt;Loki receives this record from Promtail&lt;/li&gt;
&lt;li&gt;records this event in its own log&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A kind of “fork bomb”, but with logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring logs for the future
&lt;/h3&gt;

&lt;p&gt;Here, too, everything is simple: either we use the default metrics from Loki and VictoriaLogs, or we generate our own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loki metrics
&lt;/h3&gt;

&lt;p&gt;In the &lt;a href="https://artifacthub.io/packages/helm/grafana/loki-simple-scalable" rel="noopener noreferrer"&gt;Loki chart&lt;/a&gt;, there is an option called &lt;code&gt;monitoring.serviceMonitor.enabled&lt;/code&gt;. You can simply enable it, and then VictoriaMetrics Operator will create &lt;a href="https://rtfm.co.ua/victoriametrics-stvorennya-kubernetes-monitoring-stack-z-vlasnim-helm-chartom/#VMServiceScrape" rel="noopener noreferrer"&gt;VMServiceScrape&lt;/a&gt; and start collecting metrics.&lt;/p&gt;
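&lt;p&gt;In the chart values this looks something like the following (a sketch; check the key names against your chart version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
monitoring:
  serviceMonitor:
    enabled: true
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;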

&lt;p&gt;The following may be of interest for Loki:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;loki_log_messages_total&lt;/code&gt;: Total number of messages logged by Loki&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loki_distributor_bytes_received_total&lt;/code&gt;: The total number of uncompressed bytes received per tenant&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loki_distributor_lines_received_total&lt;/code&gt;: The total number of lines received per tenant&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loki_discarded_samples_total&lt;/code&gt;: The total number of samples that were dropped&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loki_discarded_bytes_total&lt;/code&gt;: The total number of bytes that were dropped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or we can create our own metrics with information for each &lt;code&gt;app&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: ConfigMap
apiVersion: v1
metadata:
  name: loki-recording-rules
data:
  rules.yaml: |-
  ...
      - name: Loki-Logs-Stats

        rules:

        - record: loki:logs:ingested_rows:sum:rate:5m
          expr: |
            topk(10, 
              sum by (app) (
                rate({app=~".+"}[5m])
              )
            )

        - record: loki:logs:ingested_bytes:sum:rate:5m
          expr: |
            topk(10, 
              sum by (app) (
                bytes_rate({app=~".+"}[5m])
              )
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy and check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z9ty2q67eqi6xfhsnwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z9ty2q67eqi6xfhsnwx.png" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then use these metrics to create alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: Loki Logs Ingested Rows Too High
        expr: sum by (app) (loki:logs:ingested_rows:sum:rate:5m) &amp;gt; 100
        for: 1s
        labels:
          severity: warning
          component: devops
          environment: ops
        annotations:
          summary: 'Loki Logs Ingested Rows Too High'
          description: |-
            Grafana Loki ingested too many log rows
            *App*: `{{ "{{" }} $labels.app }}`
            *Value*: `{{ "{{" }} $value | humanize }}` records per second
          tags: devops

      - alert: Loki Logs Ingested Bytes Too High
        expr: sum by (app) (loki:logs:ingested_bytes:sum:rate:5m) &amp;gt; 50000
        for: 1s
        labels:
          severity: warning
          component: devops
          environment: ops
        annotations:
          summary: 'Loki Logs Ingested Bytes Too High'
          description: |-
            Grafana Loki ingested too many log bytes
            *App*: `{{ "{{" }} $labels.app }}`
            *Value*: `{{ "{{" }} $value | humanize1024 }}` bytes per second
          tags: devops
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  VictoriaLogs metrics
&lt;/h3&gt;

&lt;p&gt;Add metrics collection from VictoriaLogs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
victoria-logs-single:
  server:
    ...
    vmServiceScrape:
      enabled: true
..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vl_bytes_ingested_total&lt;/code&gt;: An estimate of the total volume of ingested log bytes, with the ingestion protocol in a label&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vl_rows_ingested_total&lt;/code&gt;: The total number of log records successfully ingested, with the ingestion protocol in a label&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vl_rows_dropped_total&lt;/code&gt;: The total number of rows dropped during ingestion, with the reason in a label (e.g., debug mode, too many fields, timestamp out of range)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vl_too_long_lines_skipped_total&lt;/code&gt;: The number of lines skipped because they exceed the configured maximum line size&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vl_free_disk_space_bytes&lt;/code&gt;: The free space currently available on the file system hosting the storage path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All metrics can be found in the documentation: &lt;a href="https://docs.victoriametrics.com/victorialogs/metrics/" rel="noopener noreferrer"&gt;Metrics of VictoriaLogs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And add an alert like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: VictoriaLogs Logs Dropped Rows Too High
        expr: sum by (reason) (vl_rows_dropped_total) &amp;gt; 0
        for: 1s
        labels:
          severity: warning
          component: devops
          environment: ops
        annotations:
          summary: 'VictoriaLogs Logs Dropped Rows Too High'
          description: |-
            VictoriaLogs dropped too many log rows
            *Reason*: `{{ "{{" }} $labels.reason }}`
            *Value*: `{{ "{{" }} $value | humanize }}` records dropped
          tags: devops
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But again, &lt;code&gt;vl_rows_ingested_total&lt;/code&gt; won't tell us which app is writing too many logs.&lt;/p&gt;

&lt;p&gt;Therefore, we can add RecordingRules, see &lt;a href="https://rtfm.co.ua/en/victorialogs-creating-recording-rules-with-vmalert/" rel="noopener noreferrer"&gt;VictoriaLogs: creating Recording Rules with VMAlert&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: vmlogs-alert-rules
spec:

  groups:

    - name: VM-Logs-Ingested
      # expressions for the VictoriaLogs datasource
      type: vlogs
      rules:
        - record: vmlogs:logs:ingested_rows:stats:rate
          expr: |
            {app=~".*"} 
            | stats by (app) rate() records_per_second 
            | sort by (records_per_second) desc
            | limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy, check again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziooapcty034reaslgx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziooapcty034reaslgx1.png" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And add an alert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
      - alert: VictoriaLogs Logs Ingested Rows Too High
        expr: sum by (app) (vmlogs:logs:ingested_rows:stats:rate) &amp;gt; 100
        for: 1s
        labels:
          severity: warning
          component: devops
          environment: ops
        annotations:
          summary: 'VictoriaLogs Logs Ingested Rows Too High'
          description: |-
            VictoriaLogs ingested too many log rows
            *App*: `{{ "{{" }} $labels.app }}`
            *Value*: `{{ "{{" }} $value | humanize }}` records per second
          tags: devops
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4g2wlpugpd76f94x57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4g2wlpugpd76f94x57.png" width="489" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, that’s all for now.&lt;/p&gt;

&lt;p&gt;This is basic monitoring for VictoriaLogs and Loki.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/victorialogs-rate-limit-exceeded-and-monitoring-ingested-logs/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>VictoriaMetrics: migrating VMSingle and VictoriaLogs data between Kubernetes clusters</title>
      <dc:creator>Arseny Zinchenko</dc:creator>
      <pubDate>Wed, 23 Jul 2025 11:00:00 +0000</pubDate>
      <link>https://forem.com/setevoy/victoriametrics-migrating-vmsingle-and-victorialogs-data-between-kubernetes-clusters-559f</link>
      <guid>https://forem.com/setevoy/victoriametrics-migrating-vmsingle-and-victorialogs-data-between-kubernetes-clusters-559f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi16jeyi716e5yd9w4cxn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi16jeyi716e5yd9w4cxn.jpeg" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have VictoriaMetrics and VictoriaLogs running on an AWS Elastic Kubernetes Service cluster.&lt;/p&gt;

&lt;p&gt;We do major upgrades to EKS by creating a new cluster, and therefore, we have to transfer monitoring data from the old VMSingle instance to the new one.&lt;/p&gt;

&lt;p&gt;For VictoriaMetrics, there is the &lt;a href="https://docs.victoriametrics.com/victoriametrics/vmctl/" rel="noopener noreferrer"&gt;&lt;code&gt;vmctl&lt;/code&gt;&lt;/a&gt; tool which can migrate data through the APIs of the old and new instances, acting as a proxy between the two instances.&lt;/p&gt;

&lt;p&gt;With VictoriaLogs, the situation is still a bit more complicated and there are currently two options — let’s look at them further.&lt;/p&gt;

&lt;p&gt;So, here’s our setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;old Kubernetes cluster EKS 1.30&lt;/li&gt;
&lt;li&gt;new Kubernetes cluster EKS 1.33&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VictoriaMetrics and VictoriaLogs are deployed with our own Helm-chart which installs &lt;a href="https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-k8s-stack" rel="noopener noreferrer"&gt;victoria-metrics-k8s-stack&lt;/a&gt; and &lt;a href="https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-logs-single" rel="noopener noreferrer"&gt;victoria-logs-single&lt;/a&gt; through dependencies, plus a set of various additional services such as PostgreSQL Exporter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating VictoriaMetrics metrics
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Running &lt;code&gt;vmctl&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;vmctl&lt;/code&gt; supports migration from VMSingle to VMCluster and vice versa, or simply between VMSingle =&amp;gt; VMSingle or VMCluster =&amp;gt; VMCluster instances.&lt;/p&gt;

&lt;p&gt;In our case, these are just two instances of VMSingle.&lt;/p&gt;

&lt;p&gt;You can install &lt;code&gt;vmctl&lt;/code&gt; directly in the Pod with VMSingle, see &lt;a href="https://docs.victoriametrics.com/vmctl/#how-to-build" rel="noopener noreferrer"&gt;How to build&lt;/a&gt;, but since the CLI works through the API anyway, it is easier to create a separate Pod and do everything from it. The Docker image is available here - &lt;a href="https://hub.docker.com/r/victoriametrics/vmctl" rel="noopener noreferrer"&gt;victoriametrics/vmctl&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since the &lt;code&gt;entrypoint&lt;/code&gt; of this image is set to &lt;code&gt;/vmctl-prod&lt;/code&gt;, to keep the container running we can override the command with &lt;code&gt;--command&lt;/code&gt;, print &lt;code&gt;ping&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; in a loop, and then do everything we need from its console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl run vmctl-pod --image=victoriametrics/vmctl --restart=Never --command -- /bin/sh -c "while true; echo ping; do sleep 5; done"
pod/vmctl-pod created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It makes no difference which cluster you run it on.&lt;/p&gt;

&lt;p&gt;Connect to the Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk exec -ti vmctl-pod -- sh 
/ #
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ # /vmctl-prod vm-native --help
NAME:
   vmctl vm-native - Migrate time series between VictoriaMetrics installations via native binary format

USAGE:
   vmctl vm-native [command options] [arguments...]

OPTIONS:
   -s Whether to run in silent mode. If set to true no confirmation prompts will appear. (default: false)
   ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start the migration
&lt;/h3&gt;

&lt;p&gt;The Kubernetes Pod with &lt;code&gt;vmctl&lt;/code&gt; will act as a proxy between the source and destination, so it must have a stable network.&lt;/p&gt;

&lt;p&gt;In addition, if you are migrating a large amount of data, then look towards the &lt;code&gt;--vm-concurrency&lt;/code&gt; option to run the migration in several parallel threads, but keep in mind that each worker will use additional CPU and Memory.&lt;/p&gt;

&lt;p&gt;The documentation also describes possible issues with limits — see &lt;a href="https://docs.victoriametrics.com/victoriametrics/vmctl/#migrating-data-from-victoriametrics" rel="noopener noreferrer"&gt;Migrating data from VictoriaMetrics&lt;/a&gt;, and it is useful to look at the &lt;a href="https://docs.victoriametrics.com/victoriametrics/vmctl/#migration-tips" rel="noopener noreferrer"&gt;Migration tips&lt;/a&gt; section.&lt;/p&gt;
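The migration tips largely boil down to chunking a long migration by time so that a failed chunk can be retried on its own. Below is a minimal sketch of that idea as a dry-run command generator; the endpoints are the hypothetical ones used in this article, and `vmctl` can also do the chunking natively with `--vm-native-step-interval=month`:

```shell
# Sketch: print one vmctl invocation per month instead of one huge run,
# so a failed chunk can be retried alone. Endpoints are hypothetical.
SRC="https://vmsingle.monitoring.1-30.ops.example.co"
DST="https://vmsingle.monitoring.1-33.ops.example.co"
cur="2023-01-01"
end="2023-04-01"
while [ "$(date -d "$cur" +%s)" -lt "$(date -d "$end" +%s)" ]; do
  next=$(date -d "$cur + 1 month" +%Y-%m-%d)
  # Print the command instead of executing it (dry run):
  echo "/vmctl-prod vm-native -s" \
    "--vm-native-src-addr=$SRC --vm-native-dst-addr=$DST" \
    "--vm-native-filter-match='{__name__!~\"vm_.*\"}'" \
    "--vm-native-filter-time-start=$cur --vm-native-filter-time-end=$next"
  cur="$next"
done
```

Piping each printed line to `sh` would run the chunks sequentially, and any failed month can then be re-run by itself.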

&lt;p&gt;It is also recommended to add the &lt;code&gt;--vm-native-filter-match='{__name__!~"vm_.*"}'&lt;/code&gt; filter to avoid migrating metrics that are related to VictoriaMetrics itself, as this can lead to data collision - duplicate time series.&lt;/p&gt;

&lt;p&gt;Although in my case, VMAgent adds a label with the cluster name to all metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  vmagent:
    enabled: true
    spec:
      externalLabels:
        cluster: "eks-ops-1-33"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;resources.limits&lt;/code&gt; are set for the VMSingle Pod, it's better to disable or increase them, and to increase the &lt;code&gt;resources.requests&lt;/code&gt;, because I got 504 errors and Pod evictions a few times.&lt;/p&gt;

&lt;p&gt;It may also make sense to move VMSingle to a dedicated WorkerNode, because in our case t3 and Spot EC2 instances are used for monitoring.&lt;/p&gt;
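A hedged sketch of how that could look in the victoria-metrics-k8s-stack values: `vmsingle.spec` passes through to the VMSingle CRD, which supports `resources` and `nodeSelector`. The node label and resource numbers below are assumptions, not values from this setup:

```yaml
# Sketch only: the node-role label is hypothetical,
# tune the resources to your data volume.
vmsingle:
  spec:
    nodeSelector:
      node-role: monitoring
    resources:
      requests:
        cpu: "1"
        memory: 8Gi
      # limits intentionally omitted for the duration of the migration
```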

&lt;p&gt;What and where we will migrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;source&lt;/strong&gt;: VMSingle on EKS 1.30, endpoint &lt;code&gt;vmsingle.monitoring.1-30.ops.example.co&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destination&lt;/strong&gt;: VMSingle on EKS 1.33, endpoint &lt;code&gt;vmsingle.monitoring.1-33.ops.example.co&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the Pod with the &lt;code&gt;vmctl&lt;/code&gt;, check access to both endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ # apk add curl

/ # curl -X GET -I https://vmsingle.monitoring.1-30.ops.example.co
HTTP/2 400

/ # curl -X GET -I https://vmsingle.monitoring.1-33.ops.example.co
HTTP/2 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And start the migration for the entire period — I don’t remember when exactly this cluster was created, let’s say January 2023:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ # /vmctl-prod vm-native \
&amp;gt; --vm-native-src-addr=https://vmsingle.monitoring.1-30.ops.example.co/ \
&amp;gt; --vm-native-dst-addr=https://vmsingle.monitoring.1-33.ops.example.co \
&amp;gt; --vm-native-filter-match='{__name__!~"vm_.*"}' \
&amp;gt; --vm-native-filter-time-start='2023-01-01'
VictoriaMetrics Native import mode
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The process has started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cen0lwqkoojyx9xt978.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cen0lwqkoojyx9xt978.png" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resources on the source — memory — went up to 5–6 gigabytes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbhde9xuuoyljezx60gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbhde9xuuoyljezx60gi.png" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The destination had a little more CPU, but less memory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z5f2zu6ls1evgi4548l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z5f2zu6ls1evgi4548l.png" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the completion - it took 6.5 hours, but I ran it without &lt;code&gt;--vm-concurrency&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
2025/06/23 19:07:29 Import finished!
2025/06/23 19:07:29 VictoriaMetrics importer stats:
  time spent while importing: 6h30m8.537582366s;
  total bytes: 16.5 GB;
  bytes/s: 705.9 kB;
  requests: 6882;
  requests retries: 405;
2025/06/23 19:07:29 Total time: 6h30m8.541808518s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have a month’s worth of graphs on the new EKS cluster, even though the cluster was created just a week ago:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbodol539scp8g5vqss7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbodol539scp8g5vqss7.png" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  If migration fails
&lt;/h3&gt;

&lt;p&gt;First, build a query - you need to find the old metrics on the new cluster.&lt;/p&gt;

&lt;p&gt;In my case, I can check on the new cluster using the &lt;code&gt;cluster&lt;/code&gt; label - a useful thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s 'http://localhost:8429/prometheus/api/v1/series' -d 'match[]={cluster="eks-ops-1-30"}' | jq
...
    {
      "__name__": "yace_cloudwatch_targetgroupapi_requests_total",
      "cluster": "eks-ops-1-30",
      "job": "yace-exporter",
      "instance": "yace-service:5000",
      "prometheus": "ops-monitoring-ns/vm-k8s-stack"
    }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documentation on deleting metrics and working with the VictoriaMetrics API in general — &lt;a href="https://docs.victoriametrics.com/guides/guide-delete-or-replace-metrics/" rel="noopener noreferrer"&gt;How to delete or replace metrics in VictoriaMetrics&lt;/a&gt; and &lt;a href="https://docs.victoriametrics.com/victoriametrics/url-examples/#apiv1admintsdbdelete_series" rel="noopener noreferrer"&gt;Deletes time series from VictoriaMetrics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Run a request to the &lt;code&gt;/api/v1/admin/tsdb/delete_series&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s 'http://localhost:8429/api/v1/admin/tsdb/delete_series' -d 'match[]={cluster="eks-ops-1-30"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s 'http://localhost:8429/prometheus/api/v1/series' -d 'match[]={cluster="eks-ops-1-30"}' | jq
{
  "status": "success",
  "data": []
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can repeat the migration.&lt;/p&gt;

&lt;p&gt;Another option is to add the &lt;code&gt;-dedup.minScrapeInterval=1ms&lt;/code&gt; flag, so VictoriaMetrics removes duplicates by itself, but I have not tested this option.&lt;/p&gt;
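For reference, a sketch of how that flag could be passed through the chart values, assuming the same victoria-metrics-k8s-stack layout as the vmagent snippet above (the VMSingle CRD accepts arbitrary flags via `extraArgs`); as noted, this path is untested here:

```yaml
# Untested sketch: let VictoriaMetrics deduplicate overlapping samples itself.
vmsingle:
  spec:
    extraArgs:
      dedup.minScrapeInterval: 1ms
```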

&lt;h3&gt;
  
  
  VictoriaLogs migration
&lt;/h3&gt;

&lt;p&gt;With VictoriaLogs, the situation is a little more complicated, because &lt;a href="https://docs.victoriametrics.com/victorialogs/querying/vlogscli/#" rel="noopener noreferrer"&gt;vlogscli&lt;/a&gt; does not yet have a data migration option like the one in &lt;code&gt;vmctl&lt;/code&gt; (hopefully, it will be added).&lt;/p&gt;

&lt;p&gt;And there is a problem here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if there is no data in VictoriaLogs on the new cluster yet, you can simply copy the old data with &lt;code&gt;rsync&lt;/code&gt; to the PVC of the new VictoriaLogs instance&lt;/li&gt;
&lt;li&gt;the same works if the new VMLogs instance already has some data but no days overlap with the old instance: VictoriaLogs stores data in per-day directories, so whole days can be transferred safely&lt;/li&gt;
&lt;li&gt;but if the data and/or days overlap, then for now the only option is to run two VictoriaLogs instances, one with old data and one with new, and put a &lt;code&gt;vlselect&lt;/code&gt; instance in front of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When VictoriaLogs adds Object Storage support (it is already on its &lt;a href="https://docs.victoriametrics.com/victorialogs/roadmap/index.html#" rel="noopener noreferrer"&gt;Roadmap&lt;/a&gt;), this will become easier: you could just keep everything in AWS S3, as we do now with Grafana Loki.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: copying data with &lt;code&gt;rsync&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;So, the first option applies when there is no data in the new VictoriaLogs instance, or when the old and new instances have no records for the same days.&lt;/p&gt;

&lt;p&gt;Here we can simply copy the data, and it will be available on the new Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;See VictoriaLogs documentation — &lt;a href="https://docs.victoriametrics.com/victorialogs/#backup-and-restore" rel="noopener noreferrer"&gt;Backup and restore&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I did it with &lt;code&gt;rsync&lt;/code&gt;, but you can try it with utilities like &lt;a href="https://github.com/BeryJu/korb?tab=readme-ov-file#example-exporting-from-pvc-to-tar" rel="noopener noreferrer"&gt;&lt;code&gt;korb&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s check where the logs are stored in the VictoriaLogs Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pod atlas-victoriametrics-vmlogs-new-server-0
Name: atlas-victoriametrics-vmlogs-new-server-0
...
Containers:
  vlogs:
    ...
    Args:
      --storageDataPath=/storage
    ...
    Mounts:
      /storage from server-volume (rw)
    ...
Volumes:
  server-volume:
    Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: server-volume-atlas-victoriametrics-vmlogs-new-server-0
    ReadOnly: false
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the contents of the &lt;code&gt;/storage&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ $ ls -l /storage/partitions/
total 32
drwxrwsr-x 4 1000 2000 4096 Jun 16 00:00 20250616
drwxrwsr-x 4 1000 2000 4096 Jun 17 00:00 20250617
drwxrwsr-x 4 1000 2000 4096 Jun 18 00:00 20250618
drwxrwsr-x 4 1000 2000 4096 Jun 19 00:00 20250619
drwxrwsr-x 4 1000 2000 4096 Jun 20 00:00 20250620
drwxr-sr-x 4 1000 2000 4096 Jun 21 00:00 20250621
drwxr-sr-x 4 1000 2000 4096 Jun 22 00:00 20250622
drwxr-sr-x 4 1000 2000 4096 Jun 23 00:00 20250623
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there’s no &lt;code&gt;rsync&lt;/code&gt; or SSH in the image itself, and we can't even install them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ $ rsync
sh: rsync: not found
~ $ apk add rsync
ERROR: Unable to lock database: Permission denied
ERROR: Failed to open apk database: Permission denied
~ $ su
su: must be suid to work properly
~ $ sudo -s
sh: sudo: not found
~ $ ssh
sh: ssh: not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So let’s just run &lt;code&gt;rsync&lt;/code&gt; from the old EC2 to the new one.&lt;/p&gt;

&lt;p&gt;How to find the right directory on the host — see in my &lt;a href="https://rtfm.co.ua/en/kubernetes-find-a-directory-with-a-mounted-volume-in-a-pod-on-its-host/" rel="noopener noreferrer"&gt;Kubernetes: find a directory with a mounted volume in a Pod on its host&lt;/a&gt; post.&lt;/p&gt;

&lt;p&gt;Setting up SSH access to EC2 for EKS — in the &lt;a href="https://rtfm.co.ua/en/aws-karpenter-and-ssh-for-kubernetes-workernodes/" rel="noopener noreferrer"&gt;AWS: Karpenter and SSH for Kubernetes WorkerNodes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Check Pod on the old cluster — find its EC2 and Container ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pod atlas-victoriametrics-vmlogs-new-server-0 | grep 'Node\|Container'
Node: ip-10-0-39-190.ec2.internal/10.0.39.190
Containers:
    Container ID: containerd://db9fa73a4d37045b0338ae48438f9815e4f6f92c3fd6546604ca5d1338f19844
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect to the WorkerNode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh -i ~/.ssh/eks_ec2 ec2-user@10.0.39.190
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;mounts[]&lt;/code&gt; section, find the host directory for the &lt;code&gt;/storage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-39-190 ec2-user]# crictl inspect db9fa73a4d37045b0338ae48438f9815e4f6f92c3fd6546604ca5d1338f19844 | jq
...
    "mounts": [
      {
        "containerPath": "/storage",
        "gidMappings": [],
        "hostPath": "/var/lib/kubelet/pods/5192e1f9-20ea-49c6-99ed-775af5e44183/volumes/kubernetes.io~csi/pvc-43c427fa-b05c-45b8-8bdb-92b00bff3496/mount",
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check its content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-39-190 ec2-user]# ll /var/lib/kubelet/pods/5192e1f9-20ea-49c6-99ed-775af5e44183/volumes/kubernetes.io~csi/pvc-43c427fa-b05c-45b8-8bdb-92b00bff3496/mount
total 24
drwxrwsr-x 3 ec2-user 2000 4096 Nov 19 2024 cache
-rw-rw-r-- 1 ec2-user 2000 0 Jun 20 19:20 flock.lock
drwxrws--- 2 root 2000 16384 Sep 4 2024 lost+found
drwxrwsr-x 10 ec2-user 2000 4096 Jun 25 00:25 partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We only need the data from the &lt;code&gt;partitions&lt;/code&gt; directory here.&lt;/p&gt;

&lt;p&gt;Repeat for VictoriaLogs on the new cluster; note that Amazon Linux 2023 does not have &lt;code&gt;crictl&lt;/code&gt; - however, it does have &lt;code&gt;ctr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Check ContainerD Namespaces for containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# ctr ns ls
NAME LABELS 
k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the container with the &lt;code&gt;ctr containers info&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# ctr -n k8s.io containers info 9fd6fefaec92ab76093651239f6e177686e7c7dd012d53d4bf2e6820260aa884
...
            {
                "destination": "/storage",
                "type": "bind",
                "source": "/var/lib/kubelet/pods/4b2f179d-9ada-403e-9680-b76e3507563f/volumes/kubernetes.io~csi/pvc-da384ead-50e8-425f-b3b0-47c35f3a5155/mount",
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the content of the &lt;code&gt;/var/lib/kubelet/pods/4b2f179d-9ada-403e-9680-b76e3507563f/volumes/kubernetes.io~csi/pvc-da384ead-50e8-425f-b3b0-47c35f3a5155/mount&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# ll /var/lib/kubelet/pods/4b2f179d-9ada-403e-9680-b76e3507563f/volumes/kubernetes.io~csi/pvc-da384ead-50e8-425f-b3b0-47c35f3a5155/mount
total 20
-rw-rw-r--. 1 ec2-user 2000 0 Jun 25 12:18 flock.lock
drwxrws---. 2 root 2000 16384 Jun 10 09:41 lost+found
drwxrwsr-x. 10 ec2-user 2000 4096 Jun 25 00:32 partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pay attention to the user and group IDs: they must match on both EC2 instances - in my case, user &lt;code&gt;ec2-user(1000)&lt;/code&gt; and group &lt;code&gt;2000&lt;/code&gt;.&lt;/p&gt;
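A quick way to verify ownership is `stat`. The sketch below runs on a temporary file for illustration; on the real nodes you would point it at the partitions directory and, if the IDs differ, `chown -R` to the expected user and group (1000:2000 in this setup):

```shell
# Demo on a temp file; on a node, replace the path with .../mount/partitions.
d=$(mktemp -d)
touch "$d/parts.json"
owner=$(stat -c '%u:%g' "$d/parts.json")
echo "owner is $owner"
# If the IDs did not match, as root you would run something like:
#   chown -R 1000:2000 "$d"
```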

&lt;p&gt;Create an SSH key on the old cluster and check the connection to the EC2 of the new cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-39-190 ec2-user]# ssh -i .ssh/eks ec2-user@10.0.41.247
...
[ec2-user@ip-10-0-41-247 ~]$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK, we have it.&lt;/p&gt;

&lt;p&gt;Now install &lt;code&gt;rsync&lt;/code&gt; on both instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-39-190 ec2-user]# yum -y install rsync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just in case, back up the data on the new instance first - either with an EBS snapshot or with &lt;code&gt;tar&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One more thing about the retention period - good that I remembered it in time: ours is only 7 days. So if we copied the data now, the older logs would be deleted.&lt;/p&gt;

&lt;p&gt;Let’s change it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... 
retentionPeriod: 30d 
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the new instance, let’s create a directory to receive the data (although this could be done directly into the PVC directory):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# mkdir vmlogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And from the old EC2 run &lt;code&gt;rsync&lt;/code&gt; to the new instance to the &lt;code&gt;$HOME/vmlogs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-39-190 ec2-user]# rsync -avz --progress --delete -e "ssh -i .ssh/eks" \
&amp;gt; /var/lib/kubelet/pods/5192e1f9-20ea-49c6-99ed-775af5e44183/volumes/kubernetes.io~csi/pvc-43c427fa-b05c-45b8-8bdb-92b00bff3496/mount/partitions/ \
&amp;gt; ec2-user@10.0.41.247:/home/ec2-user/vmlogs/
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-a&lt;/code&gt;: archive mode (preserves permissions, timestamps, and directory structure)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v&lt;/code&gt;: verbose output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-z&lt;/code&gt;: compress data in transit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--progress&lt;/code&gt;: show transfer progress&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--delete&lt;/code&gt;: delete files from the destination that no longer exist in the source&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-e&lt;/code&gt;: the remote shell command, here with the SSH key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first argument is the local directory, and the second is where to copy to.&lt;/p&gt;

&lt;p&gt;And for the source, the trailing "&lt;code&gt;/&lt;/code&gt;" at the end of &lt;code&gt;.../mount/partitions/&lt;/code&gt; matters - it copies the contents of the directory, not the directory itself.&lt;/p&gt;

&lt;p&gt;If you get errors with permission denied, add &lt;code&gt;--rsync-path="sudo rsync"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The transfer is complete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
sent 2,483,902,797 bytes received 189,037 bytes 20,614,869.99 bytes/sec
total size is 2,553,861,458 speedup is 1.03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the data on the new instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# ll vmlogs/
total 0
drwxrwsr-x. 4 ec2-user ec2-user 35 Jun 18 00:00 20250618
drwxrwsr-x. 4 ec2-user ec2-user 35 Jun 19 00:00 20250619
drwxrwsr-x. 4 ec2-user ec2-user 35 Jun 20 00:00 20250620
drwxr-sr-x. 4 ec2-user ec2-user 35 Jun 21 00:00 20250621
drwxr-sr-x. 4 ec2-user ec2-user 35 Jun 22 00:00 20250622
drwxr-sr-x. 4 ec2-user ec2-user 35 Jun 23 00:00 20250623
drwxr-sr-x. 4 ec2-user ec2-user 35 Jun 24 00:00 20250624
drwxr-sr-x. 4 ec2-user ec2-user 35 Jun 25 00:00 20250625
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is where I encountered the problem of overlapping data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-41-247 ec2-user]# cp -r vmlogs/* /var/lib/kubelet/pods/84a4ecd3-21a0-4eec-bebc-078a5105bf86/volumes/kubernetes.io~csi/pvc-da384ead-50e8-425f-b3b0-47c35f3a5155/mount/partitions/
cp: overwrite '/var/lib/kubelet/pods/84a4ecd3-21a0-4eec-bebc-078a5105bf86/volumes/kubernetes.io~csi/pvc-da384ead-50e8-425f-b3b0-47c35f3a5155/mount/partitions/20250618/datadb/parts.json'?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I asked the developers whether these JSON files could simply be merged - they cannot.&lt;/p&gt;

&lt;p&gt;If the data doesn’t overlap, then just copy the data and restart the VictoriaLogs Pod.&lt;/p&gt;

&lt;p&gt;In my case, I had to do it a little differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: run two VMLogs + &lt;code&gt;vlselect&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;So, if we have data for the same days on the old and new VictoriaLogs instances, we can do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a second VMLogs instance on the new EKS cluster&lt;/li&gt;
&lt;li&gt;copy data from the old cluster to the PVC of the new VMLogs instance&lt;/li&gt;
&lt;li&gt;add a new Pod with &lt;code&gt;vlselect&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;for the &lt;code&gt;vlselect&lt;/code&gt; we specify two sources - both VMLogs instances&lt;/li&gt;
&lt;li&gt;and then for the Grafana VictoriaLogs data source we use the URL of the &lt;code&gt;vlselect&lt;/code&gt; service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could just add &lt;code&gt;vlselect&lt;/code&gt; and route the requests to the old cluster - but I need to delete the old cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;vlselect&lt;/code&gt; vs VMLogs
&lt;/h3&gt;

&lt;p&gt;In fact, &lt;code&gt;vlselect&lt;/code&gt; is the same binary as VictoriaLogs, which simplifies the whole setup for us - see the &lt;a href="https://docs.victoriametrics.com/victorialogs/cluster/" rel="noopener noreferrer"&gt;VictoriaLogs cluster&lt;/a&gt; documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note that all the VictoriaLogs cluster components — vlstorage, vlinsert, and vlselect — share the same executable — victoria-logs-prod.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, we can simply take another &lt;a href="https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-logs-single" rel="noopener noreferrer"&gt;victoria-logs-single&lt;/a&gt; Helm chart and run everything from it.&lt;/p&gt;

&lt;p&gt;And we’ll actually be building a kind of “minimal VictoriaLogs cluster”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;our current VictoriaLogs instance will play the role of &lt;code&gt;vlinsert&lt;/code&gt; and &lt;code&gt;vlstorage&lt;/code&gt; - new logs of the new cluster are written there&lt;/li&gt;
&lt;li&gt;the new VictoriaLogs instance will play the role of &lt;code&gt;vlstorage&lt;/code&gt; - we will store data from the old cluster in it&lt;/li&gt;
&lt;li&gt;the third VictoriaLogs instance will play the role of &lt;code&gt;vlselect&lt;/code&gt; - it will be an endpoint for Grafana, and will make API requests to search for logs from both VictoriaLogs instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73s0wmuncoyy3ynm86uj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73s0wmuncoyy3ynm86uj.png" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Helm chart update
&lt;/h3&gt;

&lt;p&gt;I’m not ready to run the full version of the VictoriaLogs cluster yet, so let’s just add a couple more dependencies to our current Helm chart.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;Chart.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
dependencies:
...
- name: victoria-logs-single
  version: ~0.11.2
  repository: https://victoriametrics.github.io/helm-charts
  alias: vmlogs-new
- name: victoria-logs-single
  version: ~0.11.2
  repository: https://victoriametrics.github.io/helm-charts
  alias: vmlogs-old
- name: victoria-logs-single
  version: ~0.11.2
  repository: https://victoriametrics.github.io/helm-charts
  alias: vlselect
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we deploy three instances of the same chart, just with different values (see &lt;a href="https://rtfm.co.ua/en/helm-multiple-deployment-of-the-same-chart-with-charts-dependency/" rel="noopener noreferrer"&gt;Helm: multiple deployments of a single chart with Chart’s dependency&lt;/a&gt;), and each one has its own &lt;code&gt;alias&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vmlogs-new&lt;/code&gt;: the current VMLogs instance on the new EKS cluster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vmlogs-old&lt;/code&gt;: a new instance to which we will transfer data from the old EKS cluster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vlselect&lt;/code&gt;: will be our new endpoint for searching logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only caveat: the deployment may fail with an error about the Pod name length, because I initially set names in the &lt;code&gt;alias&lt;/code&gt; that were too long:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
Pod "atlas-victoriametrics-victoria-logs-single-old-server-0" is invalid: metadata.labels: Invalid value: "atlas-victoriametrics-victoria-logs-single-old-server-77cf9cd79d": must be no more than 63 characters 
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the default &lt;a href="https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-logs-single/values.yaml" rel="noopener noreferrer"&gt;values.yaml&lt;/a&gt; of the victoria-logs-single chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  persistentVolume:
    # -- Create/use Persistent Volume Claim for server component. Use empty dir if set to false
    enabled: true
    size: 10Gi
...
  ingress:
    # -- Enable deployment of ingress for server component
    enabled: false
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the &lt;code&gt;vlselect&lt;/code&gt; instance, add the &lt;code&gt;storageNode&lt;/code&gt; parameter with the endpoints of both VictoriaLogs instances separated by commas, and, if necessary, set parameters for the &lt;code&gt;persistentVolume&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
vmlogs-new:
  server:
    persistentVolume:
      enabled: true
      storageClassName: gp2-retain
      size: 30Gi
    retentionPeriod: 14d

vmlogs-old:
  server:
    persistentVolume:
      enabled: true
      storageClassName: gp2-retain
      size: 30Gi
    retentionPeriod: 14d

vlselect:
  server:
    extraArgs:
      storageNode: atlas-victoriametrics-vmlogs-new-server:9428,atlas-victoriametrics-vmlogs-old-server:9428
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy the chart, and check the Pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get pod | grep 'vmlogs\|vlselect'
atlas-victoriametrics-vlselect-server-0 1/1 Running 0 19h
atlas-victoriametrics-vmlogs-new-server-0 1/1 Running 0 76s
atlas-victoriametrics-vmlogs-old-server-0 1/1 Running 0 76s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk get svc | grep 'vmlogs\|vlselect'
atlas-victoriametrics-vlselect-server ClusterIP None &amp;lt;none&amp;gt; 9428/TCP 22h
atlas-victoriametrics-vmlogs-new-server ClusterIP None &amp;lt;none&amp;gt; 9428/TCP 42s
atlas-victoriametrics-vmlogs-old-server ClusterIP None &amp;lt;none&amp;gt; 9428/TCP 42s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Promtail on the new cluster keeps writing to &lt;code&gt;atlas-victoriametrics-vmlogs-new-server&lt;/code&gt;, while &lt;code&gt;atlas-victoriametrics-vmlogs-old-server&lt;/code&gt; is, for now, an empty VictoriaLogs instance.&lt;/p&gt;

&lt;p&gt;We can check access to the logs through the &lt;code&gt;vlselect&lt;/code&gt; instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk port-forward svc/atlas-victoriametrics-vlselect-server 9428
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4c9hosn4s10z40jwirc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4c9hosn4s10z40jwirc.png" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;
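&lt;p&gt;The same check can be done without the UI by calling the LogsQL HTTP API through the forwarded port (assuming the &lt;code&gt;kubectl port-forward&lt;/code&gt; above is still running; the query string here is just an example):&lt;br&gt;&lt;/p&gt;

```shell
# vlselect fans the query out to every address from its storageNode
# argument and merges results from both VictoriaLogs instances
curl -s http://localhost:9428/select/logsql/query \
  -d 'query=error' \
  -d 'limit=5'
```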

&lt;h3&gt;
  
  
  Transferring data from the old cluster
&lt;/h3&gt;

&lt;p&gt;Next, we simply repeat what we did above: find the PVC directory, and copy the data from the old cluster there.&lt;/p&gt;

&lt;p&gt;This time, I’ll first copy the data to my work laptop, and then copy it from there to the new Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[setevoy@setevoy-work ~] $ mkdir vmlogs_back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While I was writing this, the VictoriaLogs Pod on the old cluster had already moved to a new EC2 instance, so we have to find the data directory again.&lt;/p&gt;

&lt;p&gt;Switch &lt;code&gt;kubectl&lt;/code&gt; to the old cluster and find the Pod and its WorkerNode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kk describe pod atlas-victoriametrics-victoria-logs-single-server-0 | grep 'Node\|Container'
Node: ip-10-0-38-72.ec2.internal/10.0.38.72
Containers:
    Container ID: containerd://c168d4487282dd7d868aadcfcd1840e4e15cfd360f56f542a98b77978f91e252
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect to the EC2, find the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-38-72 ec2-user]# crictl inspect c168d4487282dd7d868aadcfcd1840e4e15cfd360f56f542a98b77978f91e252
...
    "mounts": [
      {
        "containerPath": "/storage",
        "gidMappings": [],
        "hostPath": "/var/lib/kubelet/pods/f84ef4b9-272f-437e-9f98-649e1707ed09/volumes/kubernetes.io~csi/pvc-43c427fa-b05c-45b8-8bdb-92b00bff3496/mount",
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install &lt;code&gt;rsync&lt;/code&gt; there, and copy the data to the laptop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rsync -avz --progress -e "ssh -i .ssh/eks_ec2" \
&amp;gt; --rsync-path="sudo rsync" \
&amp;gt; ec2-user@10.0.38.72:/var/lib/kubelet/pods/f84ef4b9-272f-437e-9f98-649e1707ed09/volumes/kubernetes.io~csi/pvc-43c427fa-b05c-45b8-8bdb-92b00bff3496/mount/partitions/ \
&amp;gt; /home/setevoy/vmlogs_back/
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check data locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ll ~/vmlogs_back/
total 32
drwxrwsr-x 4 setevoy setevoy 4096 Jun 19 03:00 20250619
drwxrwsr-x 4 setevoy setevoy 4096 Jun 20 03:00 20250620
drwxrwsr-x 4 setevoy setevoy 4096 Jun 21 03:00 20250621
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we’ll move everything to the new cluster, where the &lt;code&gt;atlas-victoriametrics-vmlogs-old-server-0&lt;/code&gt; Pod is running.&lt;/p&gt;

&lt;p&gt;Switch &lt;code&gt;kubectl&lt;/code&gt; to the new cluster, find the WorkerNode and Container ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kd atlas-victoriametrics-vmlogs-old-server-0 | grep 'Node\|Container'
Node: ip-10-0-36-143.ec2.internal/10.0.36.143
Containers:
    Container ID: containerd://f10118b10afab75c43e03adcc0644af5caa8654687cd81e59cdf15bd8c32cb31
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SSH to EC2, and find the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-36-143 ec2-user]# ctr -n k8s.io containers info f10118b10afab75c43e03adcc0644af5caa8654687cd81e59cdf15bd8c32cb31
...
            {
                "destination": "/storage",
                "type": "bind",
                "source": "/var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount",
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check its content; it should be empty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drwxr-sr-x. 2 ec2-user 2000 4096 Jun 26 13:14 partitions
[root@ip-10-0-36-143 ec2-user]# ls -l /var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount/partitions/
total 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install &lt;code&gt;rsync&lt;/code&gt; there, and copy the data from the local directory &lt;code&gt;/home/setevoy/vmlogs_back/&lt;/code&gt; to the new EKS cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rsync -avz --progress -e "ssh -i .ssh/eks_ec2" --rsync-path="sudo rsync" \
&amp;gt; /home/setevoy/vmlogs_back/ \
&amp;gt; ec2-user@10.0.36.143:/var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount/partitions/
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the data there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-36-143 ec2-user]# ls -l /var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount/partitions/
total 32
drwxrwsr-x. 4 ec2-user ec2-user 4096 Jun 19 00:00 20250619
drwx--S---. 2 root ec2-user 4096 Jun 26 13:39 20250620
drwx--S---. 2 root ec2-user 4096 Jun 26 13:39 20250621
drwx--S---. 2 root ec2-user 4096 Jun 26 13:39 20250622
drwx--S---. 2 root ec2-user 4096 Jun 26 13:39 20250623
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the user and group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@ip-10-0-36-143 ec2-user]# chown -R ec2-user:2000 /var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount/partitions/
[root@ip-10-0-36-143 ec2-user]# ls -l /var/lib/kubelet/pods/297b75ec-63fa-4061-bb23-7a6a120da939/volumes/kubernetes.io~csi/pvc-c7373468-f247-4596-b2e2-87852aad71bb/mount/partitions/
total 32
drwxrwsr-x. 4 ec2-user 2000 4096 Jun 19 00:00 20250619
drwxrwsr-x. 4 ec2-user 2000 4096 Jun 20 00:00 20250620
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the &lt;code&gt;atlas-victoriametrics-vmlogs-old-server-0&lt;/code&gt; Pod.&lt;/p&gt;
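&lt;p&gt;A StatefulSet Pod is restarted by simply deleting it; the controller recreates it, and on start VictoriaLogs will pick up the copied partitions:&lt;br&gt;&lt;/p&gt;

```shell
# deleting the Pod makes the StatefulSet recreate it;
# on start, VictoriaLogs opens the partitions/ data copied above
kubectl delete pod atlas-victoriametrics-vmlogs-old-server-0
kubectl get pod atlas-victoriametrics-vmlogs-old-server-0
```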

&lt;h3&gt;
  
  
  Checking the data
&lt;/h3&gt;

&lt;p&gt;Let’s look for something.&lt;/p&gt;

&lt;p&gt;First, search for something with the &lt;code&gt;hostname: "ip-10-0-36-143.ec2.internal"&lt;/code&gt; - this is an EC2 from the new EKS cluster, so these logs should come from the &lt;code&gt;atlas-victoriametrics-vmlogs-new-server-0&lt;/code&gt; instance, i.e. the VictoriaLogs that was already running on the new Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4rwieth4viyyvfngsll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4rwieth4viyyvfngsll.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now logs from a node of the old cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78bijqkksi2ebq51sxsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78bijqkksi2ebq51sxsr.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is there.&lt;/p&gt;

&lt;p&gt;Done.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://rtfm.co.ua/en/victoriametrics-migrating-vmsingle-and-victorialogs-data-between-kubernetes-cluster/" rel="noopener noreferrer"&gt;&lt;em&gt;RTFM: Linux, DevOps, and system administration&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>todayilearned</category>
    </item>
  </channel>
</rss>
