<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Darius Juodokas</title>
    <description>The latest articles on Forem by Darius Juodokas (@netikras).</description>
    <link>https://forem.com/netikras</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F599476%2F6941b8fa-b9a0-4c12-afea-a55678808f1b.gif</url>
      <title>Forem: Darius Juodokas</title>
      <link>https://forem.com/netikras</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/netikras"/>
    <language>en</language>
    <item>
      <title>Installing Linux Mint with LVM and LUKS</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Tue, 08 Apr 2025 10:56:31 +0000</pubDate>
      <link>https://forem.com/netikras/installing-linux-mint-with-lvm-and-luks-2kga</link>
      <guid>https://forem.com/netikras/installing-linux-mint-with-lvm-and-luks-2kga</guid>
      <description>&lt;p&gt;So, I got a new work laptop. I'm one of a few people in our company who prefers to set up the laptop myself. And I'm a Linux kind of person, so of course it will be running some flavour of Linux. LinuxMint, to be exact. Why? Because it works OOTB and I get the best out of the Linux world w/o having to spend days tinkering the environment to my liking. Although I'm considering NixOS too... But that's for another time.&lt;/p&gt;

&lt;p&gt;Now, over the years of owning a Linux laptop (actually, laptop&lt;em&gt;S&lt;/em&gt; -- my personal ones are also Linux-driven), I've noticed several patterns that make my life easier. One of them is using LVM for disk partitioning. 2 VGs are usually enough: &lt;code&gt;system&lt;/code&gt; and &lt;code&gt;user&lt;/code&gt;. System hosts root and swap (if needed), and user is for.. you know.. all the user stuff, starting with /home. &lt;br&gt;
One of the reasons I really like LVM is that you can easily extend your filesystems when you need more space. And here's the trick: to leverage this LVM feature, you should NOT assign all your free space from the beginning. Here's how I normally do it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partition the disk into

&lt;ul&gt;
&lt;li&gt;EFI&lt;/li&gt;
&lt;li&gt;/boot&lt;/li&gt;
&lt;li&gt;LVM shards -- 4-5x unformatted partitions of 50GB each, and the rest -- ~20GB each&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pvcreate&lt;/code&gt; all the shards&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vgcreate&lt;/code&gt; both &lt;code&gt;system&lt;/code&gt; and &lt;code&gt;user&lt;/code&gt; groups with ~40-50GB each (initially)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lvcreate&lt;/code&gt; both &lt;code&gt;root&lt;/code&gt; and &lt;code&gt;home&lt;/code&gt; volumes, create filesystems&lt;/li&gt;
&lt;li&gt;install OS there and use it normally&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vgextend&lt;/code&gt; + &lt;code&gt;lvextend&lt;/code&gt; the filesystem that's running out of space. I don't know in advance which one it will be, so it's really nice to have the flexibility to do this when I need it, without planning ahead.&lt;/li&gt;
&lt;/ul&gt;
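&lt;p&gt;For illustration, that last "grow" step usually boils down to two commands. The names below are hypothetical (the &lt;code&gt;user&lt;/code&gt; VG, its &lt;code&gt;home&lt;/code&gt; LV, and a spare unformatted shard); the commands are only echoed here -- drop the &lt;code&gt;echo&lt;/code&gt; prefixes to apply them for real:&lt;/p&gt;

```shell
## Sketch of growing /home by one spare 20GB shard (hypothetical names).
grow_home() {
    shard="$1"
    echo vgextend user "$shard"                 # add the spare shard to the VG
    echo lvextend -r -L +20G /dev/user/home     # grow the LV; -r also grows the filesystem
}

grow_home /dev/sda7
```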

&lt;p&gt;Now here's the rub. It all works OOTB with my personal laptop, but for a work laptop, I must use encrypted storage. And in order to do anything with an encrypted volume, I first have to unlock (open) it. &lt;/p&gt;

&lt;p&gt;And this post is a list of what I had to do to make LUKS+LVM work together.&lt;/p&gt;
&lt;h1&gt;
  
  
  Partitioning
&lt;/h1&gt;

&lt;p&gt;Let's assume we are on an empty device. At first I wasn't, but I nuked all MS Windows-related filesystems right away and started with a &lt;em&gt;tabula rasa&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;gparted&lt;/code&gt; (or &lt;code&gt;parted&lt;/code&gt; if it's more convenient), create 3 partitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~256MB in size -- EFI (fat32)&lt;/li&gt;
&lt;li&gt;~10GB in size -- /boot (ext4)&lt;/li&gt;
&lt;li&gt;remaining free space -- LUKS (unformatted)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Suppose the LUKS partition is /dev/nvme0n0p3 and /boot is /dev/nvme0n0p2&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
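&lt;p&gt;For the record, the same three partitions can also be created from a plain &lt;code&gt;parted&lt;/code&gt; session (the device name, partition names and sizes below are illustrative -- adjust them to your disk):&lt;/p&gt;

```
parted /dev/nvme0n0
(parted) mklabel gpt
(parted) mkpart EFI fat32 1MiB 257MiB      ## EFI System Partition
(parted) set 1 esp on
(parted) mkpart boot ext4 257MiB 10GiB     ## /boot
(parted) mkpart luks 10GiB 100%            ## to become the LUKS container
(parted) quit
```

&lt;p&gt;Then format the first two with &lt;code&gt;mkfs.fat -F32&lt;/code&gt; and &lt;code&gt;mkfs.ext4&lt;/code&gt; respectively, and leave the third one untouched.&lt;/p&gt;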
&lt;h1&gt;
  
  
  LUKS setup
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Create LUKS container
&lt;/h2&gt;

&lt;p&gt;First, we need to create a LUKS container, and then we'll be able to create volumes inside of it for our OS installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## Create LUKS container on the big partition&lt;/span&gt;
cryptsetup luksFormat /dev/nvme0n0p3
&lt;span class="c"&gt;## Create password&lt;/span&gt;

&lt;span class="c"&gt;## Open the LUKS container as /dev/mapper/lukslvm&lt;/span&gt;
cryptsetup open /dev/nvme0n0p3 lukslvm
&lt;span class="c"&gt;## Enter password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create shards for LVM
&lt;/h2&gt;

&lt;p&gt;Now that the LUKS container is created and unlocked, we can start making a partition table inside of it.&lt;br&gt;
I'm using &lt;code&gt;parted&lt;/code&gt; for this job, because it's easier to make notes of it than screenshotting everything in &lt;code&gt;gparted&lt;/code&gt;. Also, &lt;code&gt;gparted&lt;/code&gt; doesn't really like LUKS-protected partitions...&lt;/p&gt;

&lt;p&gt;Here we create a GPT partition table and 15 partitions of variable size. I'm creating them using relative &lt;code&gt;%&lt;/code&gt; units rather than absolute &lt;code&gt;GB&lt;/code&gt;, because this way parted can automatically align the partitions for best performance (usually this means leaving a bit of unallocated space at the beginning of the disk).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## Make a GPT partition table in the LUKS container&lt;/span&gt;
parted &lt;span class="nt"&gt;-s&lt;/span&gt; /dev/mapper/lukslvm mklabel gpt

&lt;span class="c"&gt;## Open the LUKS container with parted&lt;/span&gt;
parted /dev/mapper/lukslvm

&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; unit s           &lt;span class="c"&gt;## switch to relative units (automatically adjusts geometry for performance)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; print            &lt;span class="c"&gt;## show all partitions -- should be empty&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; print free       &lt;span class="c"&gt;## show free space slots&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 0% 10%   &lt;span class="c"&gt;## Create partition (shard) and assign 10% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 10% 20%  &lt;span class="c"&gt;## Create partition (shard) and assign 10% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 20% 30%  &lt;span class="c"&gt;## Create partition (shard) and assign 10% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 30% 50%  &lt;span class="c"&gt;## Create partition (shard) and assign 10% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 40% 50%  &lt;span class="c"&gt;## Create partition (shard) and assign 10% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 50% 55%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 55% 60%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 60% 65%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 65% 70%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 70% 75%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 75% 80%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 80% 85%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 85% 90%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 90% 95%  &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; mkpart primary 95% 100% &lt;span class="c"&gt;## Create partition (shard) and assign 5% of device's size to it&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; print free       &lt;span class="c"&gt;## show free space slots. Could be a small slot in the beginning of the disk.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;parted&lt;span class="o"&gt;)&lt;/span&gt; print            &lt;span class="c"&gt;## show all the created partitions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
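&lt;p&gt;The fifteen &lt;code&gt;mkpart&lt;/code&gt; lines above follow a simple pattern (five shards of 10% followed by ten of 5%), so if you're lazy like me, you can generate them instead of typing them and paste the output into the &lt;code&gt;(parted)&lt;/code&gt; prompt:&lt;/p&gt;

```shell
## Print the 15 mkpart commands: 5 shards of 10% followed by 10 shards of 5%.
start=0
for size in 10 10 10 10 10 5 5 5 5 5 5 5 5 5 5; do
    end=$((start + size))
    printf 'mkpart primary %s%% %s%%\n' "$start" "$end"
    start="$end"
done
```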



&lt;h2&gt;
  
  
  Create LVM
&lt;/h2&gt;

&lt;p&gt;Now we have all the shards created. Next, we'll create LVM PhysicalVolumes out of them and construct the rest of the LVM system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pvcreate /dev/mapper/lukslvm1
pvcreate /dev/mapper/lukslvm2
pvcreate /dev/mapper/lukslvm3
pvcreate /dev/mapper/lukslvm4
pvcreate /dev/mapper/lukslvm5
pvcreate /dev/mapper/lukslvm6
pvcreate /dev/mapper/lukslvm7
pvcreate /dev/mapper/lukslvm8
pvcreate /dev/mapper/lukslvm9
pvcreate /dev/mapper/lukslvm10
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
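&lt;p&gt;Since the shard names only differ by the index, a loop does the same job (assuming 15 shards, as created above). The commands are only echoed here -- drop the &lt;code&gt;echo&lt;/code&gt; to run them for real:&lt;/p&gt;

```shell
## pvcreate each of the 15 shards in a loop (echoed only, as a dry run).
for i in $(seq 1 15); do
    echo pvcreate "/dev/mapper/lukslvm${i}"
done
```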



&lt;p&gt;Let's attach two PVs to each VG for the initial setup and leave the rest for later, i.e. for when we really need them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vgcreate system /dev/mapper/lukslvm1
vgextend system /dev/mapper/lukslvm2

vgcreate user /dev/mapper/lukslvm3
vgextend user /dev/mapper/lukslvm4

lvcreate &lt;span class="nt"&gt;-L&lt;/span&gt; 50G &lt;span class="nt"&gt;-n&lt;/span&gt; root system
lvcreate &lt;span class="nt"&gt;-L&lt;/span&gt; 50G &lt;span class="nt"&gt;-n&lt;/span&gt; home user

pvs
vgs
lvs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Install OS
&lt;/h1&gt;

&lt;p&gt;At this point, we should have LVM set up. Volumes are available, so we can now install the OS. &lt;br&gt;
Select:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;/dev/nvme0n0p1 fat32 EFI&lt;br&gt;
/dev/nvme0n0p2 ext4 /boot&lt;br&gt;
/dev/mapper/system-root ext4/xfs /&lt;br&gt;
/dev/mapper/user-home ext4/xfs /home&lt;/p&gt;

&lt;p&gt;/dev/nvme0n0 bootloader&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the install is complete, do not reboot just yet.&lt;/p&gt;
&lt;h1&gt;
  
  
  Configure LUKS to be auto-mounted and LVM auto-discovered
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Mount and chroot the installed system
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount /dev/mapper/system-root /mnt &lt;span class="c"&gt;## system-root should be created by lvcreate&lt;/span&gt;
mount /dev/nvme0n0p2 /mnt/boot
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /proc /mnt/proc
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /sys  /mnt/sys
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /run  /mnt/run
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /dev  /mnt/dev

&lt;span class="nb"&gt;chroot&lt;/span&gt; /mnt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we're more or less inside the OS we've just installed.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configure LUKS
&lt;/h2&gt;

&lt;p&gt;In order to auto-unlock the LUKS container on boot, we must create the /etc/crypttab file containing information about that container: where to find it (the device's UUID), what options to use, and what to name it after unlocking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;blkid | &lt;span class="nb"&gt;grep&lt;/span&gt; /dev/nvme0n0p3 &lt;span class="c"&gt;## note down the UUID&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"cryptpool UUID=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;UUID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; none luks,discard"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/etc/crypttab

&lt;span class="c"&gt;## make sure both crypttab and fstab are present and contain valid entries&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/fstab
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/crypttab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
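&lt;p&gt;For reference, /etc/crypttab should end up with a single line like the one below (the UUID is a made-up example). The first field is the name the unlocked container will get under /dev/mapper/, so it has to match the name the rest of your boot tooling expects -- in this post, that's &lt;code&gt;lukslvm&lt;/code&gt;:&lt;/p&gt;

```
## fields: target-name  source-device  key-file  options
lukslvm  UUID=2f8c7e1a-9d4b-4c6e-8a1f-0123456789ab  none  luks,discard
```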



&lt;h2&gt;
  
  
  Configure LVM bootstrap
&lt;/h2&gt;

&lt;p&gt;Now here's the fun part. When Linux is loading, it first extracts and boots into the in-RAM filesystem (initramfs). Then it's the kernel's job to use the tools and configuration available in this filesystem to prepare the host for further booting: ensuring hardware, filesystems, drivers, etc., are all in place. &lt;/p&gt;

&lt;p&gt;Here's the rub: by default, the initramfs will only do one of the two -- it will unlock LUKS, but it will not discover the LVM inside it. We need to give it a little push. The initramfs is fine with opening the LUKS container: if you boot into your freshly installed OS, it will ask for a password, but after that you'll be dropped into a busybox shell, because after opening LUKS the filesystems inside are not yet detected.&lt;/p&gt;

&lt;p&gt;While still in the LiveUSB system and in chroot, create this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;' &amp;gt;/etc/initramfs-tools/scripts/local-top/kpartx
#!/bin/sh
PREREQ="udev"

prereqs() {
    echo "&lt;/span&gt;&lt;span class="nv"&gt;$PREREQ&lt;/span&gt;&lt;span class="sh"&gt;"
}

case &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="sh"&gt; in
    prereqs)
        prereqs
        exit 0
        ;;
esac

. /scripts/functions

# checking if lukslvm is already present
if [ ! -e /dev/mapper/lukslvm ]; then
    echo "lukslvm still unavailable"
    exit 1
fi

echo "[initramfs] running kpartx lukslvm"
kpartx -av /dev/mapper/lukslvm

vgchange -ay

exit 0
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script above will be run by the kernel in initramfs. local-top/ scripts are launched after LUKS is unlocked. This script runs &lt;code&gt;kpartx&lt;/code&gt; to rediscover partitions after luks-unlocking (there's no udev daemon in initramfs, so new devices are not detected automatically). Then it does an LVM scan with &lt;code&gt;vgchange&lt;/code&gt; to make sure all the LVM volumes become available to the system.&lt;/p&gt;

&lt;p&gt;Make sure to make it executable!!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /etc/initramfs-tools/scripts/local-top/kpartx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then pack it into the initramfs of all the available kernels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;update-initramfs &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and... reboot. If everything's done right, you should see your system prompting for a LUKS password, then an OS logo should appear, and then you should be booted into the OS login screen.&lt;/p&gt;

&lt;p&gt;If something did not work as expected, you will likely be dropped into the (initramfs) busybox shell. You can try and make tweaks there, but they will only be ephemeral, i.e. they will disappear after a reboot. To continue booting, simply exit the busybox shell.&lt;/p&gt;
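&lt;p&gt;For completeness, here's roughly what finishing the job by hand from that busybox shell looks like (assuming the device names used throughout this post); after the last command, booting should continue:&lt;/p&gt;

```
(initramfs) cryptsetup open /dev/nvme0n0p3 lukslvm   ## only if it's not unlocked yet
(initramfs) kpartx -av /dev/mapper/lukslvm
(initramfs) vgchange -ay
(initramfs) exit                                     ## resume booting
```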

&lt;h1&gt;
  
  
  References:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.hqcodeshop.fi/archives/273-GNU-Parted-Solving-the-dreaded-The-resulting-partition-is-not-properly-aligned-for-best-performance.html" rel="noopener noreferrer"&gt;https://blog.hqcodeshop.fi/archives/273-GNU-Parted-Solving-the-dreaded-The-resulting-partition-is-not-properly-aligned-for-best-performance.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forums.linuxmint.com/viewtopic.php?t=420706" rel="noopener noreferrer"&gt;https://forums.linuxmint.com/viewtopic.php?t=420706&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://manpages.ubuntu.com/manpages/xenial/man8/initramfs-tools.8.html" rel="noopener noreferrer"&gt;https://manpages.ubuntu.com/manpages/xenial/man8/initramfs-tools.8.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>linux</category>
      <category>luks</category>
      <category>lvm</category>
    </item>
    <item>
      <title>Building my own (new) Linux router. WiFi card properties and their explanations (iw phy)</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Fri, 22 Sep 2023 08:10:03 +0000</pubDate>
      <link>https://forem.com/netikras/building-my-own-new-linux-router-wifi-card-properties-and-their-explanations-iw-phy-59l7</link>
      <guid>https://forem.com/netikras/building-my-own-new-linux-router-wifi-card-properties-and-their-explanations-iw-phy-59l7</guid>
      <description>&lt;p&gt;So, long story short, I'm trying to set up my new Linux-based WiFi router. This time I got a mini-PC with a proper PCIe so I could harness the full power of Wi-Fi cards (instead of messing around with USB dongles).&lt;/p&gt;

&lt;p&gt;Setting up the Wi-Fi AP is not that difficult if everything works OOTB (OutOfTheBox), but sometimes it just doesn't. In this endeavour &lt;code&gt;iw&lt;/code&gt; is the go-to tool to figure out what's happening and to make things right. Simply put, &lt;code&gt;iw&lt;/code&gt; is a CLI tool to access the wireless card's settings. &lt;code&gt;iw phy&lt;/code&gt; allows us to see what these settings are. &lt;/p&gt;

&lt;p&gt;Every time I have to dive into this output I feel like a 5-year-old in a chemistry lab - I have no clue what to look at or what means what. This time I decided to figure each line out with the help of ChatGPT, Google Bard and some manual googling. Bear in mind that I'm not a Wi-Fi-savvy person: I only double-checked with manual Google searches the ChatGPT responses that felt "weak" to me, and I probably haven't caught all the inaccuracies.&lt;/p&gt;

&lt;p&gt;FTR: my Wi-Fi card is: &lt;strong&gt;Intel 8265NGW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the full output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# iw phy
Wiphy phy0                                     ## name/ID of the device: phy0
        wiphy index: 0                         ## index of the device: 0 (hence phy*0*)
        max # scan SSIDs: 20                   ## during a scan, this card can detect up to 20 SSIDs (Wi-Fi devices)
        max scan IEs length: 422 bytes         ## during scan, this card can receive up to 422b of InformationElements (IE) from each SSID/device. Examples of IEs: SSID, supported data rates (1Mbps, 6Mbps, etc.), BSSID (MAC), security information (WPA2-PSK, WPA3-Enterprise, etc.), channel and frequency information, vendor-specific (proprietary) information, country code (specifies which Wi-Fi channels are allowed for use), signal strength (dBm) and noise levels (dB) (to assess signal quality), HighThroughput capabilities, extended capabilities, etc. -- all this information must fit within 422 bytes
        max # sched scan SSIDs: 20             ## this card can schedule periodic scans for up to 20 different Wi-Fi networks. During each scheduled scan cycle, it will send scan requests for these 20 SSIDs to detect and update information about those networks
        max # match sets: 11                   ## match set is a collection of conditions or criteria that a Wi-Fi device uses to determine which network to connect to when multiple Wi-Fi networks are available. Each match set can include specific parameters like SSID (network name), security type, signal strength, and other network characteristics. This card can evaluate up to 11 match sets
        Retry short limit: 7                   ## short frames typically carry less data and are used for low-data-rate transmissions or control purposes in a Wi-Fi network. If a short frame transmission fails (e.g., due to interference or collisions), the Wi-Fi device will make up to 7 retry attempts before considering the transmission as failed.
        Retry long limit: 4                    ## long frames usually carry more extensive data, such as data packets or larger information payloads. If a long-frame transmission fails, the Wi-Fi device will make up to 4 retry attempts before deeming the transmission unsuccessful
                                               ## These retry limits are part of the error recovery mechanism in Wi-Fi networks. They are designed to improve the chances of successfully transmitting frames in noisy or congested environments by allowing multiple retry attempts. After reaching the retry limit without a successful transmission, the Wi-Fi device typically backs off and may choose a different channel or adjust its transmission parameters. 
                                               ## These values are configurable in Wi-Fi devices to some extent, but they are often set to specific defaults by Wi-Fi standards and the device's firmware. The purpose of these retry limits is to strike a balance between maximizing successful transmissions and avoiding excessive network congestion due to repeated retries.
        Coverage class: 0 (up to 0m)           ## this device's driver says the device's coverage range is 0 meters (very small). I don't know why it says so. It can be increased with `iw phy0 set distance &amp;lt;meters&amp;gt;`. However, increasing coverage also increases ACK timeouts (physical distance is longer so the signal may take longer to travel), effectively reducing throughput
        Device supports RSN-IBSS.              ## RSN (Robust Security Network) is a set of security protocols and mechanisms used in Wi-Fi networks to provide robust security features, including data encryption, authentication, and data integrity protection. RSN is often associated with the WPA2 (Wi-Fi Protected Access 2) security standard, which is widely used to secure Wi-Fi communications. IBSS (Independent Basic Service Set) is a mode of operation in Wi-Fi where devices communicate with each other directly in an ad-hoc or peer-to-peer manner, without the need for a central access point (AP). In an IBSS network, devices connect directly to each other without the infrastructure of a traditional Wi-Fi network
        Device supports AP-side u-APSD.        ## it means that the access point (AP) is capable of managing u-APSD (unscheduled Automatic Power Save Delivery) for connected Wi-Fi clients. In other words, the AP can control and optimize the power-saving features for connected devices, allowing them to enter and exit power-saving mode as needed. APSD is a mechanism that allows Wi-Fi devices to enter a power-saving mode when they are not actively transmitting or receiving data. It helps conserve battery life in devices like laptops and smartphones, improves QoS
        Supported Ciphers:                     ## encryption algorithms that a Wi-Fi device or network supports for securing wireless communications. These cyphers are used to encrypt data to protect it from unauthorized access or interception. Different Wi-Fi security protocols support various ciphers
                * WEP40 (00-0f-ac:1)           ## WEP (Wired Equivalent Privacy) encryption protocol with a 40-bit key length and a specific hexadecimal identifier (00-0f-ac:1). WEP supports different key lengths, with 40-bit and 104-bit being common options. A longer key is generally more secure than a shorter one, but WEP's security weaknesses are not solely due to key length
                * WEP104 (00-0f-ac:5)          ## WEP (Wired Equivalent Privacy) encryption protocol with a 104-bit key length and a specific hexadecimal identifier (00-0f-ac:5). 
                * TKIP (00-0f-ac:2)            ## refers to a specific configuration of the TKIP (Temporal Key Integrity Protocol) encryption protocol with a specific hexadecimal identifier (00-0f-ac:2). TKIP is used in WPA standard, which is considered outdated and insecure. It's better to look for AES (CCMP)-based standard support, like WPA2 or WPA3
                * CCMP-128 (00-0f-ac:4)        ## CCMP (Counter Mode with Cipher Block Chaining Message Authentication Code Protocol) encryption protocol with 128-bit encryption key (which is a strong level of encryption; CCMP can also support 256-bit keys) and a specific hexadecimal identifier (00-0f-ac:4)
                * CMAC (00-0f-ac:6)            ## CMAC is a cryptographic algorithm used for message authentication and integrity verification. It is a type of MAC (Message Authentication Code) algorithm that uses a block cypher, such as AES (Advanced Encryption Standard), to generate authentication codes for messages.
        Available Antennas: TX 0x3 RX 0x3      ## a bitmask of the antennas this card can use: 0x1 = antenna 1, 0x2 = antenna 2, 0x3 = both. This card has 2 antennas (2x2 MIMO), and both can be used for transmitting (TX) and for receiving (RX)
        Configured Antennas: TX 0x3 RX 0x3     ## this card is currently configured to use both of its antennas, for both transmitting (TX) and receiving (RX)
        Supported interface modes:             ## different operating modes or configurations that a network interface or wireless adapter is capable of supporting. These modes determine how the network interface interacts with the network and the types of functions it can perform. Different network interfaces may support various interface modes, depending on their capabilities and intended use
                 * IBSS                        ## Ad-Hoc (IBSS) mode allows devices to connect directly to each other in a peer-to-peer manner without the need for a central access point (AP). It's commonly used for creating ad-hoc networks for file sharing or gaming
                 * managed                     ## this mode is often used for standard client devices like laptops and smartphones. In "Managed" mode, the device connects to a wireless access point (AP) to access the network
                 * AP                          ## this card can function as an Access Point, allowing it to create its own wireless network to which other devices can connect. This mode is used to set up Wi-Fi hotspots.
                 * AP/VLAN                     ## dynamic wireless VLANs ("dynamic VLAN" in hostapd) using MAC address RADIUS authentication. This must be a fairly common scenario - trusted clients (e.g. laptops) with a recognised MAC address join the trusted LAN, whereas anything else (mobile phones, Amazon Echo etc.) go into an untrusted "Hot LAN" with Internet access but not much else. This avoids having to have multiple SSIDs on the wireless AP
                 * monitor                     ## in "Monitor" mode, the network interface passively captures and analyzes wireless traffic without actively participating in a network. This mode is often used for network monitoring, packet analysis, and security testing.
                 * P2P-client                  ## in P2P-client mode, a device operates as a client in a Wi-Fi Direct group. It connects to another device that is functioning as a group owner (P2P-GO) to establish a direct peer-to-peer connection. This mode is similar to how a client device connects to a traditional wireless access point (AP) in a standard Wi-Fi network but in the context of Wi-Fi Direct
                 * P2P-GO                      ## a P2P-GO device is essentially the group leader or access point in a Wi-Fi Direct group. It creates and manages the Wi-Fi Direct group, allowing other devices to connect to it. P2P-GO devices facilitate direct communication between multiple P2P-client devices, enabling peer-to-peer interactions within the group. This mode is useful when one device needs to act as a central point for communication among multiple P2P devices, similar to an access point in a traditional Wi-Fi network
                 * P2P-device                  ## P2P-device mode refers to a Wi-Fi Direct-enabled device that is capable of participating in peer-to-peer connections. This mode is a general designation for devices that can either act as P2P-clients or, in some cases, become P2P-GO devices if needed. P2P-devices can discover and connect to other Wi-Fi Direct-capable devices in the vicinity, regardless of whether they are operating as P2P-GO or P2P-clients
        Band 1:                                ## a specific frequency band within the radio spectrum that is used for wireless communication. The radio spectrum is divided into various frequency bands, each with its own characteristics and applications. In other words, "Band 1" == "Frequency Range 1" (2.4GHz/5GHz/6GHz/...)
                Capabilities: 0x11ef           ## a bitmask of the HT (High Throughput) capabilities, decoded in the lines below
                        RX LDPC                ## Low-Density Parity-Check is used for validating integrity of received data streams. This is particularly important in scenarios where the wireless channel may introduce errors or noise during transmission. LDPC can help mitigate these issues and ensure that the received data is as accurate as possible
                        HT20/HT40              ## High Throughput w/ 20MHz/40MHz bandwidth channels. HT20 is the most widely used, mostly in the 2.4GHz band, and HT40 allows doubling the throughput. However, it can potentially lead to increased interference and reduced compatibility with older Wi-Fi devices, as it may overlap with neighboring channels, so use it carefully in Wi-Fi-crowded areas
                        SM Power Save disabled ## "SM Power Save" (SM stands for Spatial Multiplexing) lets a MIMO card power down all but one antenna/RF chain when idle. Keeping it disabled can be useful in situations where low latency and immediate responsiveness are more critical than power conservation. For example, in real-time applications like online gaming or voice-over-IP (VoIP) calls, you might want to keep "SM Power Save" disabled to ensure minimal delay in data transmission
                        RX HT20 SGI            ## SGI (Short Guard Interval) is enabled for data reception on 20MHz channels. The guard interval is a short duration of time inserted between symbols or transmissions to avoid interference and overlap. A short guard interval is a more advanced and efficient technique for managing the guard time between symbols. While it can increase throughput, in congested areas ISI (Inter-Symbol Interference) can increase, causing more corrupted transmissions and effectively reducing throughput.
                        RX HT40 SGI            ## same as above, but for data reception on 40MHz channels.
                        TX STBC                ## "Transmit Space-Time Block Coding" is a coding technique used to enhance the robustness of data transmission. STBC involves transmitting multiple copies of the same data symbol across multiple antennas and at different time instants. This redundancy allows the receiver to better recover the transmitted data even in the presence of interference, signal fading, or other impairments. This technique is especially valuable in wireless communication standards that support multiple-input, multiple-output (MIMO) technology, where both the transmitter and receiver have multiple antennas. When "TX STBC" is enabled, the transmitting device (which has multiple antennas) sends multiple versions of the same data symbol, often at slightly different times and with specific phase shifts. The receiving device (which may also have multiple antennas) can use these multiple copies of the transmitted symbol to improve the accuracy of data recovery. STBC is particularly useful in scenarios with fading channels, where the signal strength and quality can vary rapidly due to obstacles or interference. 
                        RX STBC 1-stream       ## the receiving device is configured to use Space-Time Block Coding (STBC) to improve the reception of one data stream. STBC involves sending multiple copies of the same data symbol across multiple antennas and at different times, allowing the receiver to better recover the transmitted data. "1-stream" suggests that the receiver is focused on improving the reception of a single data stream using STBC. The use of STBC can enhance the reliability of data reception, particularly in scenarios where the wireless channel experiences interference, signal fading, or other impairments. "1-stream" indicates that the receiver is configured to receive one data stream. In MIMO technology, a "stream" refers to an independent data transmission or reception path. MIMO systems can have multiple streams, allowing for simultaneous data transmission and reception on multiple spatial paths.
                        Max AMSDU length: 3839 bytes ## an A-MSDU is essentially a way to group multiple data packets (MSDUs) into a single larger frame before transmitting them over the wireless network. This aggregation can offer several benefits, including: Efficiency (aggregating multiple data packets into one frame can reduce the overhead associated with individual packet headers and improve the overall efficiency of data transmission), Reduced Airtime Usage (by transmitting fewer frames with larger payloads, the network can use airtime more efficiently, which can lead to increased throughput), Lower Latency (aggregation can also reduce the latency associated with transmitting multiple small packets, as they are combined into a single frame). In this case, the maximum A-MSDU length is specified as 3839 bytes, which is relatively large and can be useful for optimizing data transmission efficiency in networks that support such frame sizes
                        DSSS/CCK HT40          ## this flag means the device may use DSSS (Direct Sequence Spread Spectrum) and CCK (Complementary Code Keying) modulation even while operating in HT40 mode. DSSS is an older modulation and encoding technique primarily associated with the 802.11b standard, one of the early Wi-Fi standards. It spreads the signal across a wider frequency band using a pseudorandom spreading code, which improves resistance to interference and provides a more robust communication link. CCK is another modulation and encoding technique, primarily associated with the 802.11b and 802.11g standards; it improves on the original DSSS modulation, providing higher data rates. In practice this is a legacy-compatibility feature: modern devices (802.11n and later) use more efficient modulation and encoding techniques that offer higher data rates, such as those used in 802.11n (HT) or 802.11ac (VHT) (e.g. OFDM, QAM)
                Maximum RX AMPDU length 65535 bytes (exponent: 0x003) ## This specifies the maximum size of an A-MPDU frame that can be received - here 65535 bytes (the exponent encodes the length: 2^(13+3) - 1 = 65535). The WiFi protocol allows the sender to aggregate multiple MPDUs (each carrying an MSDU or A-MSDU) into a single A-MPDU while still allowing CRC checks and retries for each MPDU within the A-MPDU. Thus the WiFi protocol achieves higher MAC efficiency by transmitting A-MPDUs while limiting PERs (Packet Error Rates) and re-transmissions to the individual MPDU level.
                Minimum RX AMPDU time spacing: 4 usec (0x05)          ## This parameter sets the minimum amount of time that should elapse between the reception of consecutive A-MPDU frames - minimum 4 µs (microseconds)
                HT Max RX data rate: 300 Mbps                         ## the maximum data rate achievable when receiving data in HT mode. It's a measure of the device's ability to receive data packets at a rate of 300 megabits per second. 300 Mbps rate is typically associated with devices that support two spatial streams and a 40 MHz channel width. The use of multiple spatial streams and wider channel bandwidths allows for higher data rates. The actual achievable data rate in a Wi-Fi network can vary based on factors such as signal strength, interference, the capabilities of the wireless access point or router, and the quality of the wireless connection. The maximum data rate specified here serves as a theoretical upper limit and may not be sustained under all conditions.
                HT TX/RX MCS rate indexes supported: 0-15             ## the device supports MCS (Modulation and Coding Scheme) rate indexes ranging from 0 to 15. MCS rate indexes define specific combinations of modulation and coding schemes that determine the data rate at which data can be transmitted and received wirelessly. Each MCS rate index corresponds to a particular combination of modulation and coding. In other words, this device can use any of the 16 available MCS rate indexes for both transmission (TX) and reception (RX). MCS index 0 corresponds to the lowest data rate with the simplest modulation and coding. MCS index 15 corresponds to the highest data rate with the most advanced modulation and coding. These MCS rate indexes allow the device to adapt its data rate dynamically based on the current wireless conditions, including signal strength, interference, and noise. When network conditions are favourable, the device can use higher MCS rate indexes to achieve higher data rates, and when conditions are challenging, it can use lower MCS rate indexes for better reliability.
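                                                       ## side note (computed, not part of the `iw` output): the 300 Mbps HT ceiling above follows from MCS 15:
                                                       ## 2 streams x 64-QAM (6 bits/subcarrier) x 5/6 coding x 108 data subcarriers (40MHz) = 1080 bits/symbol,
                                                       ## and 1080 bits / 3.6 us (symbol duration + short GI) = 300 Mbps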
                Bitrates (non-HT):             ## devices that support non-HT (i.e. predating the High Throughput (802.11n) standard) data rates are typically backwards-compatible with older Wi-Fi standards to ensure connectivity with legacy devices and networks. However, when newer devices communicate with each other and with modern Wi-Fi access points, they often use the High Throughput (HT) mode, which offers higher data rates and improved performance.
                        * 1.0 Mbps
                        * 2.0 Mbps (short preamble supported)         ## short preamble is a feature that reduces the preamble length in Wi-Fi frames, allowing for faster transmission but with slightly reduced robustness
                        * 5.5 Mbps (short preamble supported)
                        * 11.0 Mbps (short preamble supported)
                        * 6.0 Mbps
                        * 9.0 Mbps
                        * 12.0 Mbps
                        * 18.0 Mbps
                        * 24.0 Mbps
                        * 36.0 Mbps
                        * 48.0 Mbps
                        * 54.0 Mbps
                Frequencies:                   ## specific radio frequencies or channels used for wireless communication. These frequencies are typically allocated within certain frequency bands by regulatory authorities to avoid interference and ensure the proper functioning of wireless devices
                        * 2412 MHz [1] (22.0 dBm)   ## Frequency: 2412 MHz, Channel Number: 1, Power Level: 22.0 dBm
                        * 2417 MHz [2] (22.0 dBm)
                        * 2422 MHz [3] (22.0 dBm)
                        * 2427 MHz [4] (22.0 dBm)
                        * 2432 MHz [5] (22.0 dBm)
                        * 2437 MHz [6] (22.0 dBm)
                        * 2442 MHz [7] (22.0 dBm)
                        * 2447 MHz [8] (22.0 dBm)
                        * 2452 MHz [9] (22.0 dBm)
                        * 2457 MHz [10] (22.0 dBm)
                        * 2462 MHz [11] (22.0 dBm)
                        * 2467 MHz [12] (22.0 dBm)
                        * 2472 MHz [13] (22.0 dBm)
                        * 2484 MHz [14] (disabled)  ## the "disabled" status means that this specific channel (Channel 14) is not available for use or is not currently enabled. In many regions, Channel 14 is not part of the standard Wi-Fi channel allocation and may have regulatory restrictions. The availability and usage of Channel 14 may vary by country or region
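                        ## side note (not part of the `iw` output): to see which of these 2.4GHz channels nearby networks
                        ## actually occupy, a scan helps pick the least crowded one - e.g. (assuming the interface is
                        ## named wlan0; check yours with `iw dev`):
                        ##     sudo iw dev wlan0 scan | grep -E 'freq:|signal:|SSID:'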
        Band 2:                                ## a specific frequency band within the radio spectrum that is used for wireless communication. The radio spectrum is divided into various frequency bands, each with its own characteristics and applications. In other words, "Band 2" == "Frequency Range 2" - here the 5GHz band (Band 1 above was 2.4GHz)
                Capabilities: 0x11ef           ## a hex bitmask encoding this band's HT capabilities; the indented flags below are this value decoded into human-readable form
                        RX LDPC                ## Low-Density Parity-Check is used for validating the integrity of received data streams. This is particularly important in scenarios where the wireless channel may introduce errors or noise during transmission. LDPC can help mitigate these issues and ensure that the received data is as accurate as possible
                        HT20/HT40              ## High Throughput w/ 20MHz/40MHz bandwidth channels. HT20 is the most widely used, mostly in the 2.4GHz band, and HT40 allows doubling the throughput. However, it can potentially lead to increased interference and reduced compatibility with older Wi-Fi devices, as it may overlap with neighbouring channels, so use it carefully in wifi-crowded areas
                        SM Power Save disabled ## "SM Power Save" (SM stands for Spatial Multiplexing) lets a MIMO station power down all but one receive chain to save energy, at the cost of extra latency when traffic arrives. Keeping it disabled can be useful in situations where low latency and immediate responsiveness are more critical than power conservation. For example, in real-time applications like online gaming or voice-over-IP (VoIP) calls, you might want to keep "SM Power Save" disabled to ensure minimal delay in data transmission
                        RX HT20 SGI            ## SGI (Short Guard Interval) is enabled for data reception on 20MHz channels. The guard interval is a short duration of time inserted between symbols or transmissions to avoid interference and overlap. A short guard interval is a more advanced and efficient technique for managing the guard time between symbols. While it can increase throughput, in congested areas ISI (Inter-Symbol Interference) can increase, causing more corrupted transmissions and effectively reducing throughput.
                        RX HT40 SGI            ## same as above, but for data reception on 40MHz channels.
                        TX STBC                ## "Transmit Space-Time Block Coding" is a coding technique used to enhance the robustness of data transmission. STBC involves transmitting multiple copies of the same data symbol across multiple antennas and at different time instants. This redundancy allows the receiver to better recover the transmitted data even in the presence of interference, signal fading, or other impairments. This technique is especially valuable in wireless communication standards that support multiple-input, multiple-output (MIMO) technology, where both the transmitter and receiver have multiple antennas. When "TX STBC" is enabled, the transmitting device (which has multiple antennas) sends multiple versions of the same data symbol, often at slightly different times and with specific phase shifts. The receiving device (which may also have multiple antennas) can use these multiple copies of the transmitted symbol to improve the accuracy of data recovery. STBC is particularly useful in scenarios with fading channels, where the signal strength and quality can vary rapidly due to obstacles or interference.
                        RX STBC 1-stream       ## the receiving device is configured to use Space-Time Block Coding (STBC) to improve the reception of one data stream. STBC involves sending multiple copies of the same data symbol across multiple antennas and at different times, allowing the receiver to better recover the transmitted data. "1-stream" suggests that the receiver is focused on improving the reception of a single data stream using STBC. The use of STBC can enhance the reliability of data reception, particularly in scenarios where the wireless channel experiences interference, signal fading, or other impairments. "1-stream" indicates that the receiver is configured to receive one data stream. In MIMO technology, a "stream" refers to an independent data transmission or reception path. MIMO systems can have multiple streams, allowing for simultaneous data transmission and reception on multiple spatial paths.
                        Max AMSDU length: 3839 bytes ## an A-MSDU is essentially a way to group multiple data packets (MSDUs) into a single larger frame before transmitting them over the wireless network. This aggregation can offer several benefits, including: Efficiency (aggregating multiple data packets into one frame can reduce the overhead associated with individual packet headers and improve the overall efficiency of data transmission), Reduced Airtime Usage (by transmitting fewer frames with larger payloads, the network can use airtime more efficiently, which can lead to increased throughput), Lower Latency (aggregation can also reduce the latency associated with transmitting multiple small packets, as they are combined into a single frame). In this case, the maximum A-MSDU length is specified as 3839 bytes, which is relatively large and can be useful for optimizing data transmission efficiency in networks that support such frame sizes
                        DSSS/CCK HT40          ## this flag means the device may use DSSS (Direct Sequence Spread Spectrum) and CCK (Complementary Code Keying) modulation even while operating in HT40 mode. DSSS is an older modulation and encoding technique primarily associated with the 802.11b standard, one of the early Wi-Fi standards. It spreads the signal across a wider frequency band using a pseudorandom spreading code, which improves resistance to interference and provides a more robust communication link. CCK is another modulation and encoding technique, primarily associated with the 802.11b and 802.11g standards; it improves on the original DSSS modulation, providing higher data rates. In practice this is a legacy-compatibility feature: modern devices (802.11n and later) use more efficient modulation and encoding techniques that offer higher data rates, such as those used in 802.11n (HT) or 802.11ac (VHT) (e.g. OFDM, QAM)
                Maximum RX AMPDU length 65535 bytes (exponent: 0x003)   ## this specifies the maximum size of an A-MPDU frame that can be received - here 65535 bytes (the exponent encodes the length: 2^(13+3) - 1 = 65535). The WiFi protocol allows the sender to aggregate multiple MPDUs (each carrying an MSDU or A-MSDU) into a single A-MPDU while still allowing CRC checks and retries for each MPDU within the A-MPDU. Thus the WiFi protocol achieves higher MAC efficiency by transmitting A-MPDUs while limiting PERs (Packet Error Rates) and re-transmissions to the individual MPDU level.
                Minimum RX AMPDU time spacing: 4 usec (0x05)            ## this parameter sets the minimum amount of time that should elapse between the reception of consecutive A-MPDU frames - minimum 4 µs (microseconds)
                HT Max RX data rate: 300 Mbps                           ## the maximum data rate achievable when receiving data in HT mode. It's a measure of the device's ability to receive data packets at a rate of 300 megabits per second. 300 Mbps rate is typically associated with devices that support two spatial streams and a 40 MHz channel width. The use of multiple spatial streams and wider channel bandwidths allows for higher data rates. The actual achievable data rate in a Wi-Fi network can vary based on factors such as signal strength, interference, the capabilities of the wireless access point or router, and the quality of the wireless connection. The maximum data rate specified here serves as a theoretical upper limit and may not be sustained under all conditions.
                HT TX/RX MCS rate indexes supported: 0-15               ## the device supports MCS (Modulation and Coding Scheme) rate indexes ranging from 0 to 15. MCS rate indexes define specific combinations of modulation and coding schemes that determine the data rate at which data can be transmitted and received wirelessly. Each MCS rate index corresponds to a particular combination of modulation and coding. In other words, this device can use any of the 16 available MCS rate indexes for both transmission (TX) and reception (RX). MCS index 0 corresponds to the lowest data rate with the simplest modulation and coding. MCS index 15 corresponds to the highest data rate with the most advanced modulation and coding. These MCS rate indexes allow the device to adapt its data rate dynamically based on the current wireless conditions, including signal strength, interference, and noise. When network conditions are favourable, the device can use higher MCS rate indexes to achieve higher data rates, and when conditions are challenging, it can use lower MCS rate indexes for better reliability.
                VHT Capabilities (0x039071b0): ## VHT (Very High Throughput) capabilities of a wireless networking device, defined by the 802.11ac (Wi-Fi 5) standard (802.11ax/Wi-Fi 6 devices advertise them too for backwards compatibility, alongside their native HE capabilities). 0x039071b0 is a hexadecimal value that encodes the various VHT capabilities and parameters. Each bit and bit-field in this value represents a specific capability or setting; the indented lines below are this value decoded according to the Wi-Fi standard's specifications
                        Max MPDU length: 3895  ## the maximum MPDU (MAC Protocol Data Unit) length supported by this device is 3895 bytes. Simply put, this device can "speak" in packets at most 3895 B long
                        Supported Channel Width: neither 160 nor 80+80 ## the device does not support the wider channel widths of 160 MHz or 80+80 MHz. These wider channel widths are typically associated with high-speed data transmission in the 5 GHz band and are part of the 802.11ac and 802.11ax standards. 160 MHz: A channel width of 160 MHz allows for very high data rates but requires a relatively clean and interference-free wireless environment. It provides a wide frequency range for data transmission. 80+80 MHz: This refers to a channel width configuration where two 80 MHz channels are bonded together to form a wider 160 MHz channel. It also supports high data rates but requires careful channel planning and low interference. The device's inability to support these wider channel widths may be due to hardware limitations or regulatory restrictions
                        RX LDPC                ## Low-Density Parity-Check is used for validating the integrity of received data streams. This is particularly important in scenarios where the wireless channel may introduce errors or noise during transmission. LDPC can help mitigate these issues and ensure that the received data is as accurate as possible
                        short GI (80 MHz)      ## "short Guard Interval" (GI) operates on the 80MHz channel width. It reduces the duration of the guard interval compared to a standard or long guard interval. By reducing the guard interval duration, more data can be transmitted in the same amount of time, increasing the effective data rate. This configuration is commonly associated with high-speed data transmission in modern Wi-Fi standards such as 802.11ac (Wi-Fi 5) and 802.11ax (Wi-Fi 6)
                        TX STBC                ## "Transmit Space-Time Block Coding" is a coding technique used to enhance the robustness of data transmission. STBC involves transmitting multiple copies of the same data symbol across multiple antennas and at different time instants. This redundancy allows the receiver to better recover the transmitted data even in the presence of interference, signal fading, or other impairments. This technique is especially valuable in wireless communication standards that support multiple-input, multiple-output (MIMO) technology, where both the transmitter and receiver have multiple antennas. When "TX STBC" is enabled, the transmitting device (which has multiple antennas) sends multiple versions of the same data symbol, often at slightly different times and with specific phase shifts. The receiving device (which may also have multiple antennas) can use these multiple copies of the transmitted symbol to improve the accuracy of data recovery. STBC is particularly useful in scenarios with fading channels, where the signal strength and quality can vary rapidly due to obstacles or interference.
                        SU Beamformee          ## Single-User Beamformee refers to a single user device or client that is the target of a beamforming transmission from the wireless access point or transmitter. Beamforming is a technique used to focus the wireless signal in the direction of the target device, improving signal strength and quality for that specific device.
                        MU Beamformee          ## Multi-User Beamformee refers to multiple user devices or clients that are collectively the targets of a beamforming transmission. Beamforming technology is used to simultaneously improve signal strength and quality for multiple devices in different directions, allowing for better performance in a multi-user environment.
                VHT RX MCS set:                ## This set specifies the MCS (Modulation and Coding Scheme) rates that a wireless device or access point can use to receive (RX) data in the VHT mode (defined by the 802.11ac (Wi-Fi 5) standard). It includes a range of MCS values, typically from the lowest supported MCS (lower data rate) to the highest supported MCS (higher data rate). MCS defines the combination of modulation and coding used to transmit and receive data wirelessly. Higher MCS values typically represent higher data rates with more complex modulation and coding schemes. For example, a VHT RX MCS set might include values such as MCS 0 (lowest data rate) to MCS 9 (highest data rate) for 802.11ac. Each MCS value represents a specific combination of modulation and coding that allows for data transmission at a particular rate.
                        1 streams: MCS 0-9     ## when the device receives data using a single spatial stream, it supports MCS rates ranging from 0 to 9. Each MCS value corresponds to a specific combination of modulation and coding that allows for data transmission at a particular rate. As you go from MCS 0 to MCS 9, the data rate generally increases
                        2 streams: MCS 0-9     ## when the device receives data using two spatial streams, it also supports MCS rates ranging from 0 to 9. This provides flexibility for higher data rates when using multiple spatial streams, as compared to a single stream.
                        3 streams: not supported ## the device does not support three (or more) spatial streams for receiving data in the VHT mode. Three or more spatial streams are often used to achieve even higher data rates, but this device is limited to supporting up to two streams.
                        4 streams: not supported
                        5 streams: not supported
                        6 streams: not supported
                        7 streams: not supported
                        8 streams: not supported
                VHT RX highest supported: 0 Mbps ## the device's highest supported receive data rate in Very High Throughput (VHT) mode is reported as 0 Mbps. This is not an actual 0 Mbps cap: a value of 0 in this field conventionally means "not specified", i.e. the effective limit is derived from the supported MCS set above instead
                VHT TX MCS set:                ## This set specifies the MCS (Modulation and Coding Scheme) rates that a wireless device or access point can use to transmit (TX) data in the VHT mode (defined by the 802.11ac (Wi-Fi 5) standard). It includes a range of MCS values, typically from the lowest supported MCS (lower data rate) to the highest supported MCS (higher data rate). MCS defines the combination of modulation and coding used to transmit and receive data wirelessly. Higher MCS values typically represent higher data rates with more complex modulation and coding schemes. For example, a VHT MCS set might include values such as MCS 0 (lowest data rate) to MCS 9 (highest data rate) for 802.11ac. Each MCS value represents a specific combination of modulation and coding that allows for data transmission at a particular rate.
                        1 streams: MCS 0-9     ## when the device transmits data using a single spatial stream, it supports MCS rates ranging from 0 to 9. Each MCS value corresponds to a specific combination of modulation and coding that allows for data transmission at a particular rate. As you go from MCS 0 to MCS 9, the data rate generally increases
                        2 streams: MCS 0-9     ## when the device transmits data using two spatial streams, it also supports MCS rates ranging from 0 to 9. This provides flexibility for higher data rates when using multiple spatial streams, as compared to a single stream.
                        3 streams: not supported ## the device does not support three (or more) spatial streams for transmitting data in the VHT mode. Three or more spatial streams are often used to achieve even higher data rates, but this device is limited to supporting up to two streams.
                        4 streams: not supported
                        5 streams: not supported
                        6 streams: not supported
                        7 streams: not supported
                        8 streams: not supported
                VHT TX highest supported: 0 Mbps ## the device's highest supported transmit data rate in Very High Throughput (VHT) mode is reported as 0 Mbps. Again, this is not an actual 0 Mbps cap: a value of 0 in this field conventionally means "not specified", i.e. the effective limit is derived from the supported MCS set above instead
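                 ## side note (computed, not part of the `iw` output): the practical VHT ceiling of this 2-stream card, assuming an 80MHz channel w/ short GI:
                 ## 2 streams x 256-QAM (8 bits/subcarrier) x 5/6 coding x 234 data subcarriers (80MHz) = 3120 bits/symbol,
                 ## and 3120 bits / 3.6 us (symbol duration + short GI) = 866.7 Mbps (VHT MCS 9)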
                Bitrates (non-HT):             ## devices that support non-HT (i.e. predating the High Throughput (802.11n) standard) data rates are typically backwards-compatible with older Wi-Fi standards to ensure connectivity with legacy devices and networks. However, when newer devices communicate with each other and with modern Wi-Fi access points, they often use the High Throughput (HT) mode, which offers higher data rates and improved performance.
                        * 6.0 Mbps
                        * 9.0 Mbps
                        * 12.0 Mbps
                        * 18.0 Mbps
                        * 24.0 Mbps
                        * 36.0 Mbps
                        * 48.0 Mbps
                        * 54.0 Mbps
                Frequencies:                                                ## frequencies this device can operate on. Frequencies are assigned to particular channels, hence the format: {FREQ} MHz [{CHAN}] ({STRENGTH} dBm) ({..FLAGS}). Flags refer to restrictions applied to each channel. Restrictions can be enforced by the manufacturer (hardware), the driver, or the user (configurable). Configurable restrictions are mostly linked to the country the device operates in, as each country may have its own regulations. See `iw reg get` below
                        * 5180 MHz [36] (22.0 dBm) (no IR)                  ## on channel 36 this device operates on the 5180MHz frequency at 22.0dBm strength. There's a "no IR" (no Initiating Radiation) flag applied to this channel: the device must not initiate radiation (i.e. transmit first) on this frequency, so it cannot act as an AP on channel 36.
                        * 5200 MHz [40] (22.0 dBm) (no IR)
                        * 5220 MHz [44] (22.0 dBm) (no IR)
                        * 5240 MHz [48] (22.0 dBm) (no IR)
                        * 5260 MHz [52] (22.0 dBm) (no IR, radar detection) ## on channel 52 this device operates on the 5260MHz frequency at 22.0dBm strength. There are two flags restricting this channel: "no IR" (see above, chan. 36) and "radar detection". The RadarDetection flag means that DFS restrictions apply to this frequency. These are regulatory restrictions, subject to jurisdiction/country regulations. Not all devices support DFS. Effectively, channel 52 cannot be used as an AP due to the "no IR" flag; even if this restriction wasn't there, the AP would have to operate in the more complicated DFS mode, assuming it is supported by the hardware/firmware/driver.
                                                                            ## DFS-enabled Wi-Fi access points (APs) continuously monitor the frequency band they are operating in (usually the 5 GHz band) for radar signals. Radar systems, including weather radar and military radar, use this same frequency band for their operations. When a DFS-enabled AP detects a radar signal on its operating channel, it takes specific actions to avoid interference with the radar system. The AP immediately stops transmitting on the detected channel to avoid interfering with the radar signal. This is critical because radar systems operate at much higher power levels than Wi-Fi devices and can be disrupted by Wi-Fi transmissions. After vacating the channel, the AP selects a new channel from a DFS channel list. This list typically includes channels that are not in use by radar systems and are considered safe for Wi-Fi operation. Once the AP has switched to a new channel, it can resume normal Wi-Fi operation, including serving client devices and transmitting data.
                        * 5280 MHz [56] (22.0 dBm) (no IR, radar detection)
                        * 5300 MHz [60] (22.0 dBm) (no IR, radar detection)
                        * 5320 MHz [64] (22.0 dBm) (no IR, radar detection)
                        * 5340 MHz [68] (disabled)                          ## channel 68 with 5340MHz frequency is disabled and cannot be used on this device
                        * 5360 MHz [72] (disabled)
                        * 5380 MHz [76] (disabled)
                        * 5400 MHz [80] (disabled)
                        * 5420 MHz [84] (disabled)
                        * 5440 MHz [88] (disabled)
                        * 5460 MHz [92] (disabled)
                        * 5480 MHz [96] (disabled)
                        * 5500 MHz [100] (22.0 dBm) (no IR, radar detection)
                        * 5520 MHz [104] (22.0 dBm) (no IR, radar detection)
                        * 5540 MHz [108] (22.0 dBm) (no IR, radar detection)
                        * 5560 MHz [112] (22.0 dBm) (no IR, radar detection)
                        * 5580 MHz [116] (22.0 dBm) (no IR, radar detection)
                        * 5600 MHz [120] (22.0 dBm) (no IR, radar detection)
                        * 5620 MHz [124] (22.0 dBm) (no IR, radar detection)
                        * 5640 MHz [128] (22.0 dBm) (no IR, radar detection)
                        * 5660 MHz [132] (22.0 dBm) (no IR, radar detection)
                        * 5680 MHz [136] (22.0 dBm) (no IR, radar detection)
                        * 5700 MHz [140] (22.0 dBm) (no IR, radar detection)
                        * 5720 MHz [144] (22.0 dBm) (no IR, radar detection)
                        * 5745 MHz [149] (22.0 dBm) (no IR)
                        * 5765 MHz [153] (22.0 dBm) (no IR)
                        * 5785 MHz [157] (22.0 dBm) (no IR)
                        * 5805 MHz [161] (22.0 dBm) (no IR)
                        * 5825 MHz [165] (22.0 dBm) (no IR)
                        * 5845 MHz [169] (disabled)
                        * 5865 MHz [173] (disabled)
                        * 5885 MHz [177] (disabled)
                        * 5905 MHz [181] (disabled)
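## A side note: in the 5 GHz band the channel numbers above map to centre frequencies by a simple rule: freq_MHz = 5000 + 5 * channel. A quick sanity check in any POSIX shell:

```shell
# Spot-check the channel->frequency rule against a few entries from the list above:
for chan in 36 52 100 149; do
  echo "channel $chan -> $((5000 + 5 * chan)) MHz"
done
# prints 5180, 5260, 5500 and 5745 MHz -- matching the table above
```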
        Supported commands:                    ## a list of commands supported by this device. Basically, it's a list of things you can do with this network card. These commands provide a wide range of functionality for managing and configuring wireless networking on a Linux system. They are typically used with utilities like `iw`, `wpa_supplicant`, or other wireless management tools to control various aspects of wireless networking, including connecting to Wi-Fi networks, configuring access points, and managing mesh networks, among other tasks
                 * new_interface               ## create a new wireless network interface
                 * set_interface               ## configure settings for a wireless network interface
                 * new_key                     ## create a new encryption key for securing wireless communication
                 * start_ap                    ## start an access point (AP) for creating a Wi-Fi hotspot
                 * new_station                 ## add a new station (client) to an access point or wireless network
                 * new_mpath                   ## create a new mesh path for mesh networking
                 * set_mesh_config             ## configure settings for a mesh network
                 * set_bss                     ## configure settings for a Basic Service Set (BSS), which represents a single wireless network
                 * authenticate                ## authenticate a client device to the network
                 * associate                   ## associate a client device with the network
                 * deauthenticate              ## deauthenticate a client device from the network
                 * disassociate                ## disassociate a client device from the network
                 * join_ibss                   ## join an Independent Basic Service Set (IBSS), often used for ad-hoc wireless networking
                 * join_mesh                   ## join a mesh network
                 * remain_on_channel           ## request to remain on a specific channel for a certain duration
                 * set_tx_bitrate_mask         ## configure a bitmask for selecting transmit data rates
                 * frame                       ## send a custom frame or packet over the wireless network
                 * frame_wait_cancel           ## send a custom frame and wait for a response or cancel it
                 * set_wiphy_netns             ## set the wireless network namespace for a device
                 * set_channel                 ## set the operating channel for a wireless interface
                 * start_sched_scan            ## start a scheduled Wi-Fi scan
                 * probe_client                ## probe a client device on the network
                 * set_noack_map               ## configure the No Acknowledgment (NoAck) map to improve performance
                 * register_beacons            ## register custom beacon frames for broadcasting
                 * start_p2p_device            ## start a peer-to-peer (P2P) device for Wi-Fi Direct communication
                 * set_mcast_rate              ## set the multicast rate for data transmission
                 * connect                     ## connect to a wireless network
                 * disconnect                  ## disconnect from a wireless network
                 * channel_switch              ## switch the operating channel of a wireless interface
                 * set_qos_map                 ## set Quality of Service (QoS) mapping settings
                 * add_tx_ts                   ## add a transmission timestamp
                 * set_multicast_to_unicast    ## configure multicast-to-unicast conversion
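## Scripts can check this list before relying on a feature, e.g. making sure `start_ap` is present before launching hostapd. A minimal sketch -- the saved snippet below stands in for live `iw phy phy0 info` output (phy0 is an example name), so it runs anywhere:

```shell
# Normally you would pipe the live capability dump: iw phy phy0 info | grep ...
iw_info='Supported commands:
         * new_interface
         * start_ap
         * connect'
if printf '%s\n' "$iw_info" | grep -q '\* start_ap'; then
  echo "start_ap supported - AP mode should work"
fi
```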
        WoWLAN support:                        ## Wake-on-Wireless-LAN (WoWLAN) capabilities and features supported by a network interface or wireless device. WoWLAN allows a computer or device to remain in a low-power state while still being able to wake up or respond to specific network-related events. These WoWLAN features are useful for scenarios where you want a device to conserve power but still be responsive to network activity or specific network events. For example, it can be valuable for laptops or mobile devices to save battery power while remaining connected to a network, waking up only when needed for network-related tasks or communication.
                 * wake up on disconnect       ## the device can wake up from a low-power state when it disconnects from a wireless network.
                 * wake up on magic packet     ## the device can wake up in response to a "magic packet," which is a specially formatted network packet that is used to wake up a sleeping or powered-off device remotely
                 * wake up on pattern match, up to 20 patterns of 16-128 bytes,
                   maximum packet offset 0 bytes    ## the device can wake up when it detects specific patterns in incoming network traffic. It supports up to 20 patterns of 16-128 bytes each, with a maximum packet offset of 0 bytes.
                 * can do GTK rekeying         ## the device can perform Group Temporal Key (GTK) rekeying, which is a security feature for maintaining encryption keys in a wireless network
                 * wake up on GTK rekey failure     ## the device can wake up when there is a failure in the GTK rekeying process, which may be related to security issues or key rotation
                 * wake up on EAP identity request  ## the device can wake up when it receives an Extensible Authentication Protocol (EAP) identity request, often used in the authentication process for secure network access
                 * wake up on 4-way handshake  ## the device can wake up when it detects a 4-way handshake, which is part of the process of establishing a secure connection in WPA/WPA2-protected networks
                 * wake up on rfkill release   ## the device can wake up when the hardware or software rfkill (radio frequency kill) switch is released or disabled, allowing the device's radio to be re-enabled
                 * wake up on network detection, up to 11 match sets ## the device can wake up when it detects specific network-related events or conditions. It supports up to 11 match sets for network detection
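## These triggers are armed per-phy with `iw` (root required), e.g. `iw phy phy0 wowlan enable magic-packet disconnect`, and inspected with `iw phy phy0 wowlan show` (phy0 being an example name). The magic packet itself is nothing mystical: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. A sketch of building one (the MAC is made up):

```shell
mac=112233445566              # example target MAC, colons stripped
payload=ffffffffffff          # synchronisation header: 6 x 0xFF
i=0
while [ "$i" -lt 16 ]; do     # the MAC address, repeated 16 times
  payload="$payload$mac"
  i=$((i+1))
done
echo "${#payload} hex chars = $(( ${#payload} / 2 )) bytes"   # 204 hex chars = 102 bytes
```

## Any tool that can send these 102 bytes in a broadcast UDP datagram (e.g. wakeonlan or etherwake) will wake the armed device.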
        software interface modes (can always be added): ## virtual interface types implemented purely in software (by the mac80211 stack); they can be created on top of the existing interfaces at any time, regardless of what the hardware is currently doing
                 * AP/VLAN                     ## dynamic wireless VLANs ("dynamic VLAN" in hostapd) using MAC address RADIUS authentication. This must be a fairly common scenario - trusted clients (e.g. laptops) with a recognised MAC address join the trusted LAN, whereas anything else (mobile phones, Amazon Echo etc.) goes into an untrusted "Hot LAN" with Internet access but not much else. This avoids having to have multiple SSIDs on the wireless AP
                 * monitor                     ## in "Monitor" mode, the network interface passively captures and analyzes wireless traffic without actively participating in a network. This mode is often used for network monitoring, packet analysis, and security testing.
        valid interface combinations:          ## allowable combinations of different interface modes for a wireless network interface. These combinations are defined based on the capabilities and limitations of the hardware and the wireless driver. These restrictions are in place to ensure that the operation of the wireless interfaces is within the capabilities of the hardware and to prevent conflicts or excessive interference that could degrade network performance. For example, a typical configuration might involve one network interface in managed mode for connecting to a Wi-Fi network, one in AP mode for creating a hotspot, and one in P2P client or P2P-GO mode for Wi-Fi Direct communication
                 * #{ managed } &amp;lt;= 1, #{ AP, P2P-client, P2P-GO } &amp;lt;= 1, #{ P2P-device } &amp;lt;= 1,
                   total &amp;lt;= 3, #channels &amp;lt;= 2  ## `#{ managed } &amp;lt;= 1`: This part indicates that there can be a maximum of one network interface in "managed" mode. Managed mode is the typical client mode used for connecting to Wi-Fi networks. `#{ AP, P2P-client, P2P-GO } &amp;lt;= 1`: This part specifies that there can be a maximum of one network interface in any of the following modes: Access Point (AP), Peer-to-Peer (P2P) client mode, or P2P Group Owner (P2P-GO) mode. These modes are often associated with creating or participating in Wi-Fi networks. `#{ P2P-device } &amp;lt;= 1`: This part states that there can be a maximum of one network interface in P2P device mode. P2P device mode is used for Wi-Fi Direct communication between devices. `total &amp;lt;= 3`: The total number of network interfaces across all modes should not exceed three. This means that you can have up to three network interfaces in various modes simultaneously. `#channels &amp;lt;= 2`: The total number of different Wi-Fi channels used by all interfaces should not exceed two. This is likely in consideration of channel availability and potential interference issues.
        HT Capability overrides:               ## these properties let you override specific parameters related to High Throughput (HT) capabilities for a wireless network interface. These settings can be used to modify default behaviour and fine-tune the HT features
                 * MCS: ff ff ff ff ff ff ff ff ff ff ## each value (ff) corresponds to a Modulation and Coding Scheme index, which represents a specific combination of modulation and coding used for data transmission. Specifying particular MCS rates makes it possible to control the allowed data rates for transmissions
                 * maximum A-MSDU length       ## this parameter sets the maximum length of an Aggregated MAC Service Data Unit (A-MSDU), a frame-aggregation technique used to improve efficiency in data transmission
                 * supported channel width     ## this setting allows you to specify the supported channel width. Channel width determines how much frequency spectrum is allocated for data transmission. We can configure this parameter to support specific channel widths, such as 20 MHz or 40 MHz.
                 * short GI for 40 MHz         ## "GI" stands for "Guard Interval," which is a time interval between symbols to prevent interference and signal overlap. This parameter enables or disables the use of a short Guard Interval (GI) when using a 40 MHz channel width. A short GI can improve data transmission efficiency by reducing the GI duration.
                 * max A-MPDU length exponent  ## this parameter sets the maximum length of an Aggregated MAC Protocol Data Unit (A-MPDU), which is another frame aggregation technique. We can configure this parameter to control the maximum size of A-MPDU frames.
                 * min MPDU start spacing      ## this parameter specifies the minimum spacing between the start of consecutive MPDUs (MAC Protocol Data Units) in an A-MPDU. It can be configured to meet specific requirements or constraints of the wireless network.
        Device supports TX status socket option.                         ## supports the SO_WIFI_STATUS option, defined in the Linux kernel's `/usr/include/asm-generic/socket.h`. This option gives access to packet delivery information (e.g. the ACK timestamp)
        Device supports HT-IBSS.                                         ## the device can create or join ad-hoc wireless networks that take advantage of the enhanced data rates and features provided by the 802.11n standard. High Throughput Independent Basic Service Set (HT-IBSS) is an extension of the traditional Independent Basic Service Set (IBSS) mode, which is commonly known as ad-hoc mode. In ad-hoc mode, devices can connect to each other directly without the need for a central access point (AP). HT-IBSS mode is commonly used in scenarios where devices need to communicate with each other in a peer-to-peer fashion without relying on a centralized access point. This mode is suitable for applications like file sharing, gaming, or communication between devices in a temporary or ad-hoc network. 
        Device supports SAE with AUTHENTICATE command                    ## this corresponds to `NL80211_FEATURE_SAE` wiphy feature. Related to WPA3 and WiFi 6
        Device supports low priority scan.                               ## Low-priority scans are a type of Wi-Fi scan that are performed with lower urgency or priority compared to regular or high-priority scans. During a low-priority scan, the device may prioritize tasks such as maintaining an existing network connection or conserving power over actively searching for new networks. Support for low-priority scans can be beneficial in scenarios where power efficiency, background network discovery, or coexistence with other wireless tasks is essential. It allows the device to manage scanning in a way that aligns with its operational priorities
        Device supports scan flush.                                      ## Scan flush refers to the ability to cancel or prematurely terminate an ongoing Wi-Fi scan. When a scan flush operation is initiated, any ongoing scans are halted, and the scanning process is interrupted
        Device supports per-vif TX power setting                         ## the device has the capability to adjust and configure the transmit (TX) power settings on a per-virtual interface (vif) basis. This feature allows for independent control of the transmission power for each virtual interface on the device
        P2P GO supports CT window setting                                ## as a Group Owner, the device can configure the CTWindow (Client Traffic Window): the period right after each beacon during which the GO is guaranteed to stay awake, so that power-saving clients know when they can reach it. It is part of the P2P opportunistic power save mechanism
        P2P GO supports opportunistic powersave setting                  ## this device can enable or configure opportunistic power-saving settings. Opportunistic power-saving is a feature designed to reduce power consumption in Wi-Fi devices when they are not actively transmitting or receiving data. Opportunistic power-saving is a mechanism in Wi-Fi networks that allows devices to enter a low-power sleep mode when they are not actively communicating over the network
        Driver supports full state transitions for AP/GO clients         ## corresponds to the NL80211_FEATURE_FULL_AP_CLIENT_STATE flag: in AP/GO mode the driver can track the full client state machine (unauthenticated -> authenticated -> associated -> authorized), instead of only learning about clients once they are fully associated
        Driver supports a userspace MPM                                  ## this device can delegate mesh peering to a userspace MPM (Mesh Peering Management) entity, e.g. wpa_supplicant: instead of the kernel establishing and tearing down peer links in an 802.11s mesh, a userspace daemon does it, which is what makes authenticated/encrypted mesh peerings possible
        Driver/device bandwidth changes during BSS lifetime (AP/GO mode) ## this card can dynamically adjust its bandwidth or channel width during the lifetime of a Basic Service Set (BSS) in Access Point (AP) or Group Owner (GO) mode. The BSS encompasses the area covered by a single AP or GO, and it defines the network's operational characteristics. The network device can adjust its channel width or bandwidth in response to changing network conditions, usage patterns, or requirements. Wi-Fi networks can operate with different channel widths, including 20 MHz, 40 MHz, 80 MHz, and 160 MHz, depending on the specific Wi-Fi standard (e.g., 802.11n, 802.11ac, 802.11ax). Dynamic bandwidth adjustment allows the device to switch between these channel widths as needed. 
        Device adds DS IE to probe requests                              ## this card includes a DS (Distribution System) Information Element (IE) in the probe requests it sends during Wi-Fi scanning. A probe request is a message sent by a client device to discover nearby wireless networks. These requests are used by client devices to identify available access points (APs) and initiate the process of associating with a specific AP. It means that the client device includes information about the channel it is currently using in the probe requests it broadcasts. This information helps APs and other devices understand which channel the client is operating on
        Device can update TPC Report IE                                  ## this card can update or modify the TPC (Transmit Power Control) Report Information Element (IE) in its communication with other devices on a Wi-Fi network. The TPC Report IE is used to convey information about the device's transmit power control capabilities and settings. TPC allows devices to dynamically adjust their transmit power levels based on network conditions. It helps in optimizing the coverage, minimizing interference, and conserving power
        Device supports static SMPS                                      ## SMPS (Spatial Multiplexing Power Save) reduces power consumption by keeping only one receive chain active. In static SMPS the device stays in single-chain mode permanently, so peers must send it single-spatial-stream frames only; the trade-off is lower throughput for lower power consumption
        Device supports dynamic SMPS                                     ## in dynamic SMPS the device also keeps a single receive chain active, but wakes its remaining chains up on demand: the peer precedes multi-stream transmissions with an RTS frame, giving the device time to enable the other chains. This preserves most of the MIMO throughput while still saving power during idle periods
        Device supports WMM-AC admission (TSPECs)                        ## this card supports the use of Traffic Specification (TSPEC) as a part of the Wi-Fi Multimedia (WMM) Admission Control (AC) mechanism. WMM-AC is a quality of service (QoS) mechanism in Wi-Fi networks that allows for prioritization and management of different types of traffic based on their specific requirements. AC involves the process of determining whether a new traffic flow or stream can be admitted to the network without negatively impacting existing traffic flows. TSPEC allows a device or application to request specific QoS parameters for a new traffic flow. TSPECs provide detailed information about the traffic's characteristics, such as the desired data rate, traffic pattern, and timing requirements. Effectively, this card can receive and process TSPEC requests from other devices or applications; make TSPEC requests to request specific QoS parameters for its traffic flows; engage in the admission control process, which involves evaluating TSPEC requests to determine whether a new traffic flow can be admitted while maintaining QoS for existing flows
        Device supports configuring vdev MAC-addr on create.             ## this card can configure the MAC (Media Access Control) address for a virtual device (vdev) when creating or setting up that virtual device.  Vdev allows multiple logical interfaces to share a single physical network interface. This capability provides flexibility in network configuration by allowing you to specify the MAC address for virtual devices, which can be useful for various networking scenarios, including virtualization and network testing (MAC spoofing)
        Device supports randomizing MAC-addr in scans.                   ## Randomizing the MAC address during scans helps protect user privacy by making it more difficult for passive observers to track and profile a device based on its MAC address. MAC address randomization is a privacy enhancement because it prevents third parties, such as Wi-Fi access points and tracking entities, from consistently identifying and tracking a device based on its MAC address. Instead, the device generates a random MAC address for each scan. In some regions or jurisdictions, MAC address randomization may be required or recommended for compliance with privacy regulations.
        Device supports randomizing MAC-addr in sched scans.             ## the same as above, but for scheduled (background) scans
        Device supports randomizing MAC-addr in net-detect scans.        ## the same as above, but for net-detect scans, i.e. the scans used by the wake-on-network-detection WoWLAN trigger
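## With `iw`, randomisation can be requested explicitly at scan time: `iw dev wlan0 scan randomise` (wlan0 is an example interface name). A randomised address must be a locally-administered unicast MAC: in the first octet, the locally-administered bit (0x02) is set and the multicast bit (0x01) is cleared. Sketched on one example octet:

```shell
# Force the locally-administered bit on and the multicast bit off:
octet=92                      # 0x5c, an example first octet
printf 'first octet 0x%02x -> 0x%02x\n' "$octet" "$(( (octet | 0x02) & ~0x01 ))"
# -> first octet 0x5c -> 0x5e
```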
        max # scan plans: 2                                              ## a scan plan refers to a predefined set of scanning parameters and behaviour that the device can use during Wi-Fi scans. Scans are performed by devices to discover available Wi-Fi networks, access points (APs), and other wireless devices in the vicinity. A scan plan consists of a set of parameters and rules that govern how a device conducts its scanning operations. These parameters include scan interval, scan duration, channels to scan, and more. Scan plans allow for flexibility in how scans are performed and can be used to optimize network discovery, power consumption, and scanning efficiency.
        max scan plan interval: 65535                                    ## the maximum interval between iterations of a scheduled-scan plan, in seconds
        max scan plan iterations: 254                                    ## the maximum number of scan iterations a single scan plan may specify
        Supported TX frame types:                                        ## the 802.11 management frame type/subtype values (as frame-control bytes) that userspace is allowed to transmit, listed per interface mode
                 * IBSS: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * managed: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * AP: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * AP/VLAN: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * mesh point: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * P2P-client: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * P2P-GO: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
                 * P2P-device: 0x00 0x10 0x20 0x30 0x40 0x50 0x60 0x70 0x80 0x90 0xa0 0xb0 0xc0 0xd0 0xe0 0xf0
        Supported RX frame types:                                        ## the 802.11 management frame type/subtype values that userspace can register to receive, per interface mode (e.g. 0x40 = probe request, 0xb0 = authentication, 0xd0 = action frame)
                 * IBSS: 0x40 0xb0 0xc0 0xd0
                 * managed: 0x40 0xb0 0xd0
                 * AP: 0x00 0x20 0x40 0xa0 0xb0 0xc0 0xd0
                 * AP/VLAN: 0x00 0x20 0x40 0xa0 0xb0 0xc0 0xd0
                 * mesh point: 0xb0 0xc0 0xd0
                 * P2P-client: 0x40 0xd0
                 * P2P-GO: 0x00 0x20 0x40 0xa0 0xb0 0xc0 0xd0
                 * P2P-device: 0x40 0xd0
        Supported extended features:                                     ## additional nl80211 extended feature flags advertised by this driver/device
                * [ VHT_IBSS ]: VHT-IBSS                                 ## "VHT-IBSS" stands for Very High Throughput Independent Basic Service Set. This feature indicates that the device supports the use of VHT (Very High Throughput) in Independent Basic Service Set (IBSS) mode, also known as ad-hoc mode. VHT is an extension of the IEEE 802.11 standard that provides higher data rates and improved performance compared to earlier Wi-Fi standards. VHT-IBSS means that devices operating in ad-hoc mode can take advantage of these higher data rates.
                * [ RRM ]: RRM                                           ## "RRM" stands for Radio Resource Management. This feature indicates that the device supports Radio Resource Management capabilities. RRM encompasses a set of techniques and mechanisms used in Wi-Fi networks to optimize the allocation and utilization of radio resources, including channels and transmit power, to improve network performance and reliability. The support for Radio Resource Management (RRM) allows the device to participate in dynamic channel selection, transmit power control, and other optimization techniques that contribute to better network performance and reduced interference.
                * [ MU_MIMO_AIR_SNIFFER ]: MU-MIMO sniffer               ## the purpose of a MU-MIMO sniffer is to gather information about MU-MIMO communication, including the number of streams used, devices involved, and the efficiency of the transmissions.
                * [ SCAN_START_TIME ]: scan start timestamp              ## the exact timestamp at which a Wi-Fi scan operation was initiated by the device
                * [ BSS_PARENT_TSF ]: BSS last beacon/probe TSF          ## this card can record and provide information related to the TSF (Timestamp Synchronization Function) of the last beacon or probe response frame received from a specific BSS (Basic Service Set)
                * [ FILS_STA ]: STA FILS (Fast Initial Link Setup)
                * [ FILS_MAX_CHANNEL_TIME ]: FILS max channel attribute override with dwell time
                * [ ACCEPT_BCAST_PROBE_RESP ]: accepts broadcast probe response
                * [ OCE_PROBE_REQ_HIGH_TX_RATE ]: probe request TX at high rate (at least 5.5Mbps)
                * [ OCE_PROBE_REQ_DEFERRAL_SUPPRESSION ]: probe request tx deferral and suppression
                * [ CONTROL_PORT_OVER_NL80211 ]: control port over nl80211
                * [ TXQS ]: FQ-CoDel-enabled intermediate TXQs
                * [ EXT_KEY_ID ]: Extended Key ID support
                * [ CONTROL_PORT_NO_PREAUTH ]: disable pre-auth over nl80211 control port support
                * [ DEL_IBSS_STA ]: deletion of IBSS station support
                * [ SCAN_FREQ_KHZ ]: scan on kHz frequency support
                * [ CONTROL_PORT_OVER_NL80211_TX_STATUS ]: tx status for nl80211 control port support
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lines referring to "regulatory" matters reflect country-specific legal requirements. The WiFi card can be set to a specific country's regulatory domain (with all the requirements linked to that country code) in order to apply that country's regulations/restrictions.&lt;/p&gt;

&lt;p&gt;To see which country the card is set to (and which regulations apply to it), use the &lt;code&gt;iw reg get&lt;/code&gt; command. To change the country, use &lt;code&gt;iw reg set CC&lt;/code&gt;, where CC is a two-letter country code (LT, DE, US, etc.; 00 being "global")&lt;br&gt;
&lt;/p&gt;
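&lt;p&gt;For scripting, the active country code can be extracted from the &lt;code&gt;iw reg get&lt;/code&gt; output. A sketch -- the saved sample below stands in for the live command, so it runs without the hardware:&lt;/p&gt;

```shell
# Normally: iw reg get | awk '...'; here a saved sample replaces the live output:
reg_output='global
country LT: DFS-ETSI'
printf '%s\n' "$reg_output" | awk '/^country/ { sub(":", "", $2); print $2; exit }'
# -> LT
```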

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# iw reg get
global                                                                   ## this regulatory domain information applies globally, meaning it is the default regulatory domain for your wireless network interface.
country LT: DFS-ETSI                                                     ## the country code, which in this case is "LT" (Lithuania), and also indicates that it follows the DFS (Dynamic Frequency Selection) rules set by ETSI (European Telecommunications Standards Institute).
        (2400 - 2483 @ 40), (N/A, 20), (N/A)                             ## the 2.4 GHz band, ranging from 2400 MHz to 2483 MHz, with a maximum channel width of 40 MHz. The second group is (maximum antenna gain, maximum EIRP): the antenna gain is not restricted (N/A), and the EIRP (Equivalent Isotropically Radiated Power) is capped at 20 dBm. The final (N/A) is the DFS CAC time, which does not apply to this band.
        (5150 - 5250 @ 80), (N/A, 23), (N/A), NO-OUTDOOR, AUTO-BW        ## for the 5150-5250 MHz band: (1) the maximum channel bandwidth is 80 MHz; (2) the antenna gain is not restricted and the maximum EIRP is 23 dBm; (3) no DFS CAC time applies; (4) outdoor use of this band is not allowed (NO-OUTDOOR); (5) AUTO-BW means the device may compute the usable channel bandwidth automatically by combining this rule with adjacent frequency ranges.
        (5250 - 5350 @ 80), (N/A, 20), (0 ms), NO-OUTDOOR, DFS, AUTO-BW
        (5470 - 5725 @ 160), (N/A, 26), (0 ms), DFS                      ## for the 5470-5725 MHz band: (1) the maximum channel bandwidth is 160 MHz; (2) the antenna gain is not restricted and the maximum EIRP is 26 dBm; (3) DFS applies: this band is shared with radar systems (weather, military), so the device must check a channel for radar before using it and keep monitoring afterwards, vacating the channel if radar is detected. The `(0 ms)` value is the DFS CAC (channel availability check) time reported for this rule.
        (5725 - 5875 @ 80), (N/A, 13), (N/A)
        (5945 - 6425 @ 160), (N/A, 23), (N/A), NO-OUTDOOR
        (57000 - 66000 @ 2160), (N/A, 40), (N/A)

phy#0 (self-managed)                                                     ## the phy#0 device is managing its own regulatory settings
country 00: DFS-UNSET                                                    ## the country code is set to 00 - the "world" regulatory domain, the most restrictive rule set, used when the device does not know which country it is operating in.
        (2402 - 2437 @ 40), (6, 22), (N/A), AUTO-BW, NO-HT40MINUS, NO-80MHZ, NO-160MHZ           ## the following is true for the 2402-2437 MHz frequency band: 
                                                                                                 ## - the maximum channel bandwidth is 40 MHz; 
                                                                                                 ## - the maximum antenna gain is 6 dBi and the maximum EIRP is 22 dBm [the (6, 22) pair];
                                                                                                 ## - outdoor use of this band is allowed (there is no NO-OUTDOOR flag);
                                                                                                 ## - the usable channel bandwidth may be calculated automatically by combining adjacent rules [AUTO-BW];
                                                                                                 ## - HT40- is not allowed [NO-HT40MINUS]; HT40- places the secondary 20 MHz half of a 40 MHz channel below the primary channel;
                                                                                                 ## - 80 MHz channels are not allowed [NO-80MHZ];
                                                                                                 ## - 160 MHz channels are not allowed [NO-160MHZ].
                                                                                                 ## The (N/A) field is the DFS Channel Availability Check time - not applicable, as no radar detection is required in this band.
        (2422 - 2462 @ 40), (6, 22), (N/A), AUTO-BW, NO-80MHZ, NO-160MHZ
        (2447 - 2482 @ 40), (6, 22), (N/A), AUTO-BW, NO-HT40PLUS, NO-80MHZ, NO-160MHZ            ## the following is true for the 2447-2482 MHz frequency band: 
                                                                                                 ## - the maximum channel bandwidth is 40 MHz;
                                                                                                 ## - the maximum antenna gain is 6 dBi and the maximum EIRP is 22 dBm;
                                                                                                 ## - outdoor use of this band is allowed (there is no NO-OUTDOOR flag);
                                                                                                 ## - the usable channel bandwidth may be calculated automatically [AUTO-BW];
                                                                                                 ## - HT40+ is not allowed [NO-HT40PLUS]; HT40+ places the secondary 20 MHz half of a 40 MHz channel above the primary channel;
                                                                                                 ## - 80 MHz channels are not allowed [NO-80MHZ];
                                                                                                 ## - 160 MHz channels are not allowed [NO-160MHZ]; 160 MHz channels are the widest Wi-Fi channels currently available.
                                                                                                 ## The (N/A) field is the DFS Channel Availability Check time - not applicable in this band.
        (5170 - 5190 @ 80), (6, 22), (N/A), NO-OUTDOOR, AUTO-BW, IR-CONCURRENT, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN  ## the following is true for the 5170-5190 MHz frequency band:
                                                                                                                       ## - the maximum channel bandwidth is 80 MHz; 
                                                                                                                       ## - the maximum antenna gain is 6 dBi and the maximum EIRP is 22 dBm [the (6, 22) pair]; 
                                                                                                                       ## - outdoor use of this band is not allowed [NO-OUTDOOR];
                                                                                                                       ## - the usable channel bandwidth may be calculated automatically [AUTO-BW];
                                                                                                                       ## - IR-CONCURRENT means the device may initiate radiation (e.g. act as an AP or P2P group owner) on this channel, but only while it is concurrently connected as a client to an AP already operating on it - the AP's presence is taken as evidence that transmitting here is permitted;
                                                                                                                       ## - HT40- is not allowed [NO-HT40MINUS];
                                                                                                                       ## - 160 MHz channels are not allowed [NO-160MHZ];
                                                                                                                       ## - the device may only passively scan this band [PASSIVE-SCAN]: it must not send out probe requests, only listen for beacon frames from access points. This is a regulatory requirement, not a power-saving feature.
                                                                                                                       ## The (N/A) field is the DFS Channel Availability Check time - no radar check is required in this band.
        (5190 - 5210 @ 80), (6, 22), (N/A), NO-OUTDOOR, AUTO-BW, IR-CONCURRENT, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5210 - 5230 @ 80), (6, 22), (N/A), NO-OUTDOOR, AUTO-BW, IR-CONCURRENT, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5230 - 5250 @ 80), (6, 22), (N/A), NO-OUTDOOR, AUTO-BW, IR-CONCURRENT, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5250 - 5270 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN  ## the following is true for the 5250-5270 MHz frequency band:
                                                                                                  ## - the maximum channel bandwidth is 80 MHz;
                                                                                                  ## - the maximum antenna gain is 6 dBi and the maximum EIRP is 22 dBm;
                                                                                                  ## - the device must listen for radar signals and vacate the channel if radar is detected [DFS], since this band is shared with radar systems;
                                                                                                  ## - the usable channel bandwidth may be calculated automatically [AUTO-BW];
                                                                                                  ## - HT40- is not allowed [NO-HT40MINUS];
                                                                                                  ## - 160 MHz channels are not allowed [NO-160MHZ];
                                                                                                  ## - the device may only passively scan this band [PASSIVE-SCAN].
                                                                                                  ## The (0 ms) field is the DFS Channel Availability Check time; a value of 0 ms here most likely means the self-managed driver does not expose the real check time - the firmware performs the radar checks internally.
        (5270 - 5290 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5290 - 5310 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5310 - 5330 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5490 - 5510 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5510 - 5530 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5530 - 5550 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5550 - 5570 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5570 - 5590 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5590 - 5610 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5610 - 5630 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5630 - 5650 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5650 - 5670 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5670 - 5690 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5690 - 5710 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5710 - 5730 @ 80), (6, 22), (0 ms), DFS, AUTO-BW, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5735 - 5755 @ 80), (6, 22), (N/A), AUTO-BW, IR-CONCURRENT, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5755 - 5775 @ 80), (6, 22), (N/A), AUTO-BW, IR-CONCURRENT, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5775 - 5795 @ 80), (6, 22), (N/A), AUTO-BW, IR-CONCURRENT, NO-HT40MINUS, NO-160MHZ, PASSIVE-SCAN
        (5795 - 5815 @ 80), (6, 22), (N/A), AUTO-BW, IR-CONCURRENT, NO-HT40PLUS, NO-160MHZ, PASSIVE-SCAN
        (5815 - 5835 @ 20), (6, 22), (N/A), AUTO-BW, IR-CONCURRENT, NO-HT40MINUS, NO-HT40PLUS, NO-80MHZ, NO-160MHZ, PASSIVE-SCAN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
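The rule lines above follow a regular enough pattern that they can be parsed mechanically. Below is a minimal Python sketch (not part of `iw` itself; the field layout - frequency range, (max antenna gain, max EIRP), DFS Channel Availability Check time, flags - is assumed from the dump above, and a sample line is hard-coded):

```python
import re

# One frequency rule as printed by `iw reg get` (layout assumed from the dump above)
LINE = "(5150 - 5250 @ 80), (N/A, 23), (N/A), NO-OUTDOOR, AUTO-BW"

RULE_RE = re.compile(
    r"\((\d+) - (\d+) @ (\d+)\), "   # frequency range (MHz) and max channel bandwidth
    r"\((N/A|\d+), (\d+)\), "        # (max antenna gain dBi, max EIRP dBm)
    r"\((N/A|\d+ ms)\)"              # DFS Channel Availability Check time
    r"(?:, (.*))?"                   # optional comma-separated flags
)

def parse_rule(line):
    m = RULE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised rule line: " + line)
    start, end, bw, gain, eirp, cac, flags = m.groups()
    return {
        "mhz": (int(start), int(end)),
        "max_bw_mhz": int(bw),
        "max_ant_gain_dbi": None if gain == "N/A" else int(gain),
        "max_eirp_dbm": int(eirp),
        "dfs_cac": cac,
        "flags": flags.split(", ") if flags else [],
    }

print(parse_rule(LINE))
```

The same function handles both the global rules and the self-managed *phy#0* ones, since they share the same layout.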



&lt;p&gt;So, if I run &lt;code&gt;iw reg set LT&lt;/code&gt; I should be able to switch the card to Lithuanian mode, effectively applying LT-specific regulations. Unfortunately, the &lt;code&gt;iw reg set&lt;/code&gt; command only changes the &lt;em&gt;global&lt;/em&gt; section and has no effect on the &lt;em&gt;phy#0&lt;/em&gt; one. &lt;/p&gt;

&lt;p&gt;The fact that &lt;em&gt;phy0&lt;/em&gt; claims to be self-managed suggests that it attempts to do things automatically. And that's where the problem lies. My card is Intel-based, and its driver uses LAR (Location Aware Regulatory) to automatically detect and apply regional settings to the card. The automation is quite straightforward: look at what others are doing and do the same. Basically, LAR listens to WiFi beacons transmitted by surrounding APs and inspects them to see what country code they are operating with. Then it sets the same country code. But have you ever run an &lt;code&gt;airodump-ng&lt;/code&gt; scan in your apartment/house to see what country codes the surrounding devices broadcast? I did. I found lots of WiFi APs, and none of them were broadcasting LT (that's where I live - Lithuania): some were DE, some US, some CN, but most of them were nomads, i.e. not broadcasting any country code in their beacons at all. It's very likely that in an environment like this Intel's LAR gets confused and simply falls back to the restrictive &lt;strong&gt;Global&lt;/strong&gt; mode, meaning "no IR". This effectively disallows using any of the 5GHz channels for an AP (they are either &lt;em&gt;disabled&lt;/em&gt; or carry the &lt;em&gt;NO_IR&lt;/em&gt; flag). A simple workaround used to be unloading the &lt;code&gt;iwlwifi&lt;/code&gt; module and reloading it with LAR disabled: &lt;code&gt;modprobe iwlwifi lar_disable=1&lt;/code&gt;. Unfortunately, Intel removed this parameter around 2019/2020, and since Intel maintains its Linux driver itself (rather than the community) and pushes it directly into the Linux kernel's source tree, there is nothing we can do at this point. &lt;strong&gt;I cannot use this (or any Intel-based) Wi-Fi card as a 5GHz AP on Linux&lt;/strong&gt;.&lt;/p&gt;
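As a rough illustration of the failure mode - this is only a mental model, NOT Intel's actual firmware logic - LAR-style country selection can be pictured like this:

```python
from collections import Counter

WORLD_DOMAIN = "00"  # the restrictive global fallback ("country 00" above)

def pick_country(beacon_country_codes):
    """Toy model of LAR-style selection: adopt the country code most
    surrounding APs advertise; if nobody advertises one (all 'nomads'),
    fall back to the restrictive world regulatory domain."""
    votes = Counter(code for code in beacon_country_codes if code)
    if not votes:
        return WORLD_DOMAIN
    return votes.most_common(1)[0][0]

# A neighbourhood like the one described: a few foreign codes, mostly nomads (None)
print(pick_country(["DE", "US", "CN", None, None, None, None]))
```

In a street full of nomad APs the "votes" list is empty and the toy model, like the card, ends up in the world domain.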

&lt;p&gt;So I guess I better go shopping for another vendor's WiFi5/6 card that doesn't use Intel's chipset :/ &lt;/p&gt;

&lt;p&gt;P.S. The airodump-ng scan is easy to carry out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemctl stop NetworkManager
# iwconfig wlan0 mode monitor   ## wlan0 - that's your WiFi interface's; change it to whatever is in your setup
# airodump-ng --beacons wlan0 --manufacturer --uptime --band a -w /tmp/wifi_dump_5GHz.pcap
# ## ↑↑ let it run for a while
# airodump-ng --beacons wlan0 --manufacturer --uptime -w /tmp/wifi_dump_2GHz.pcap
# ## ↑↑ let it run for a while
# 
# ## Now reenable your wifi interface - back to normal mode
# ifconfig wlan0 down
# iwconfig wlan0 mode managed
# ifconfig wlan0 up
# systemctl restart NetworkManager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open the &lt;code&gt;cap&lt;/code&gt; files with Wireshark, find Beacon frames and see what country code they broadcast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12csvtnchkycc93vwous.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12csvtnchkycc93vwous.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
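If clicking through frames in Wireshark gets tedious, the country code can also be decoded straight from a beacon's tagged parameters: per 802.11, element ID 7 is the Country information element, and its value starts with the 2-letter country string. A minimal sketch with a hand-crafted byte blob (not a real capture - extracting the tagged-parameter bytes from a .cap file is left out here):

```python
COUNTRY_IE = 7  # 802.11 element ID of the Country information element

def country_from_ies(ies):
    """Walk a beacon frame's tagged parameters (id byte + length byte + value)
    and return the advertised 2-letter country code, or None if the AP is
    a 'nomad' that broadcasts no Country element."""
    i = 0
    while True:
        header = ies[i : i + 2]
        if len(header) != 2:
            return None                  # ran off the end: no Country element
        eid, length = header[0], header[1]
        value = ies[i + 2 : i + 2 + length]
        if eid == COUNTRY_IE:
            return value[:2].decode("ascii", "replace")
        i += 2 + length

# Hand-crafted tagged parameters: an SSID element ("home") followed by a
# Country element advertising "LT", indoor, channels 1-13 at 20 dBm
ies = bytes([0, 4]) + b"home" + bytes([7, 6]) + b"LT " + bytes([1, 13, 20])
print(country_from_ies(ies))
```

An AP that broadcasts no Country element at all - the "nomads" mentioned above - simply yields `None`.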

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@renaudcerrato/how-to-build-your-own-wireless-router-from-scratch-part-3-d54eecce157f" rel="noopener noreferrer"&gt;https://medium.com/@renaudcerrato/how-to-build-your-own-wireless-router-from-scratch-part-3-d54eecce157f&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.iptel.com.au/wifi-and-the-problem-with-radar" rel="noopener noreferrer"&gt;https://blog.iptel.com.au/wifi-and-the-problem-with-radar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openwrt-devel.openwrt.narkive.com/S2ThWHw0/weak-signal-mynet-n600" rel="noopener noreferrer"&gt;https://openwrt-devel.openwrt.narkive.com/S2ThWHw0/weak-signal-mynet-n600&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;AP/VLAN, Dynamic VLAN &lt;a href="https://www.radiusdesk.com/old_wiki/technical_discussions/dynamicvlan" rel="noopener noreferrer"&gt;https://www.radiusdesk.com/old_wiki/technical_discussions/dynamicvlan&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;AP/VLAN, Dynamic VLAN &lt;a href="http://linuxwireless.sipsolutions.net/en/users/Documentation/hostapd/__v24.html" rel="noopener noreferrer"&gt;http://linuxwireless.sipsolutions.net/en/users/Documentation/hostapd/__v24.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AP/VLAN, Dynamic VLAN &lt;a href="https://www.clearos.com/clearfoundation/social/community/dynamic-wireless-vlans-using-radius" rel="noopener noreferrer"&gt;https://www.clearos.com/clearfoundation/social/community/dynamic-wireless-vlans-using-radius&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AP/VLAN, Dynamic VLAN &lt;a href="https://linux-wireless.vger.kernel.narkive.com/7uJUknkF/help-guidance-on-ap-vlan-mode" rel="noopener noreferrer"&gt;https://linux-wireless.vger.kernel.narkive.com/7uJUknkF/help-guidance-on-ap-vlan-mode&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.computerworld.com/article/2481909/what-every-it-professional-should-know-about-802-11n--part-1-.html" rel="noopener noreferrer"&gt;https://www.computerworld.com/article/2481909/what-every-it-professional-should-know-about-802-11n--part-1-.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;AMSDU, AMPDU &lt;a href="https://arxiv.org/pdf/1704.07015.pdf" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/1704.07015.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SO_WIFI_STATUS &lt;a href="https://www.spinics.net/lists/netdev/msg176403.html" rel="noopener noreferrer"&gt;https://www.spinics.net/lists/netdev/msg176403.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SO_WIFI_STATUS &lt;a href="https://www.spinics.net/lists/netdev/msg176415.html" rel="noopener noreferrer"&gt;https://www.spinics.net/lists/netdev/msg176415.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NL80211_FEATURE_SAE &lt;a href="https://github.com/raspberrypi/linux/issues/4718" rel="noopener noreferrer"&gt;https://github.com/raspberrypi/linux/issues/4718&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;lar_disable &lt;a href="https://unix.stackexchange.com/questions/253933/wifi-iw-reg-set-us-has-no-effect" rel="noopener noreferrer"&gt;https://unix.stackexchange.com/questions/253933/wifi-iw-reg-set-us-has-no-effect&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LAR &lt;a href="https://unix.stackexchange.com/questions/385565/what-is-iwlwifis-lar-disable/385590#385590" rel="noopener noreferrer"&gt;https://unix.stackexchange.com/questions/385565/what-is-iwlwifis-lar-disable/385590#385590&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;lar_disable removed from iwlwifi &lt;a href="https://github.com/torvalds/linux/commit/f06021a18fcf8d8a1e79c5e0a8ec4eb2b038e153#diff-1bb86765baf2eb0ae5f8999777c4f6b45c641006088444e0a2cb855e3dab8e9e" rel="noopener noreferrer"&gt;https://github.com/torvalds/linux/commit/f06021a18fcf8d8a1e79c5e0a8ec4eb2b038e153#diff-1bb86765baf2eb0ae5f8999777c4f6b45c641006088444e0a2cb855e3dab8e9e&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LAR is causing issues &lt;a href="https://bugzilla.kernel.org/show_bug.cgi?id=205695" rel="noopener noreferrer"&gt;https://bugzilla.kernel.org/show_bug.cgi?id=205695&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LAR cannot be disabled &lt;a href="https://www.reddit.com/r/archlinux/comments/11ussvc/patching_the_broken_iwlwifi_due_to_intel_removing/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/archlinux/comments/11ussvc/patching_the_broken_iwlwifi_due_to_intel_removing/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openwrt.org/docs/guide-user/network/wifi/wifi_countrycode" rel="noopener noreferrer"&gt;https://openwrt.org/docs/guide-user/network/wifi/wifi_countrycode&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>wifi</category>
      <category>linux</category>
      <category>router</category>
      <category>wireless</category>
    </item>
    <item>
      <title>HP LoadRunner results analysis. Queries</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Thu, 27 Jul 2023 15:30:09 +0000</pubDate>
      <link>https://forem.com/netikras/hp-loadrunner-results-analysis-queries-ohg</link>
      <guid>https://forem.com/netikras/hp-loadrunner-results-analysis-queries-ohg</guid>
      <description>&lt;h1&gt;
  
  
  About
&lt;/h1&gt;

&lt;p&gt;I've been working as a performance engineer for a number of years. I started my PE career with JMeter, spent lots of time with HP LoadRunner and culminated with k6. The most annoying of the three was HP LR. Mostly, because it is very rigid (at least from a user's perspective), only runs on Windows and can't be talked into doing anything the OSS way. Our team was lean, fast and innovative, so it was only a matter of time before LR had to either budge and do it my way, or hit the highway. &lt;/p&gt;

&lt;p&gt;I won.&lt;/p&gt;

&lt;h1&gt;
  
  
  Struggles
&lt;/h1&gt;

&lt;p&gt;HP LR is a massive, very powerful tool with proper customer support and quite a large community. However, it's married to MS Windows. If you need anything more than the defaults it provides, you have to open the results files with the HP LR tool, creatively (/s) called "Analysis". And it only works on MS Windows. Want to integrate performance testing into your CI? Not happening! Unless you are willing to run a dedicated Windows server with Analysis installed on it, acting as a SPOF and chewing up your resources even when idle.&lt;/p&gt;

&lt;h1&gt;
  
  
  I said NO
&lt;/h1&gt;

&lt;p&gt;Running a Windows machine 24/7, or even worse - spinning it up on demand and hoping it will start up correctly before the CI times out - was not an option for me. Not to mention the licensing nuances I'd have to sort out in the corporation I worked for. I needed a lean, simple and easily maintainable approach. I searched for hints in the HPLR community forums and asked others for assistance, but the answers I got were either along the lines of "forget it" or "use a windows server for that - Analysis has a CLI and &amp;lt;... I stopped reading at that point&amp;gt;".&lt;/p&gt;

&lt;h1&gt;
  
  
  Discovery
&lt;/h1&gt;

&lt;p&gt;I decided to have a darn good look at the resources I have. My main question was: is the HPLR results file proprietary, or can I perhaps open it somehow and extract what I need from it myself? I downloaded all the run files and analyzed them closely. One of the zip archives contained a particularly big file with the extension &lt;code&gt;.mdb&lt;/code&gt;. A quick Google search told me that's an Access DB file. Now, how do I access it with Java...? Then I found the &lt;a href="https://ucanaccess.sourceforge.net/site.html"&gt;ucanaccess&lt;/a&gt; JDBC driver.&lt;/p&gt;

&lt;p&gt;I added 2 + 2 and a potential solution was born in my head.&lt;/p&gt;

&lt;h1&gt;
  
  
  Struggles, again
&lt;/h1&gt;

&lt;p&gt;The .mdb file structure is not well documented, so it took quite some time to understand what's where. The "meat" is in the &lt;code&gt;event_meter&lt;/code&gt; table, but its structure is not immediately obvious. I struggled with the &lt;code&gt;acount&lt;/code&gt; concept (to be fair, I no longer remember what it means, but I know it's something about data aggregation - perhaps how many such events happened in that single second? IDK). A few weeks later, the dragon was tamed and the queries were born.&lt;/p&gt;
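Assuming `acount` really is the number of raw events folded into each one-second row (an assumption - as noted above, the exact semantics are fuzzy), any average computed over `event_meter` has to be weighted by it; a plain average of the row values gives a different, misleading number. A sketch with made-up numbers:

```python
# Made-up event_meter rows as (value, acount) pairs. Assumption: each row is a
# one-second aggregate and acount is how many raw events it folds in.
rows = [(0.200, 10), (0.500, 2), (0.300, 8)]

# Naive average treats every row equally, regardless of how many events it represents
naive_avg = sum(v for v, _ in rows) / len(rows)

# Weighted average restores the per-event mean
weighted_avg = sum(v * n for v, n in rows) / sum(n for _, n in rows)

print(round(naive_avg, 4), round(weighted_avg, 4))
```

With these numbers the naive average is ~0.333 while the event-weighted one is 0.27 - the rare slow second inflates the naive figure.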

&lt;h1&gt;
  
  
  Queries
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://gitlab.com/-/snippets/2575521"&gt;https://gitlab.com/-/snippets/2575521&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  run_details
&lt;/h2&gt;

&lt;p&gt;This query returns a run summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+---------------+-------------+-----------------+---------------+----------+----------+-------------+-----+-----------+
|TOTAL_VU|VU_RAMPUP_START|VU_RAMPUP_END|VU_RAMPDOWN_START|VU_RAMPDOWN_END|test_start|test_end  |test_duration|tz   |result_name|
+--------+---------------+-------------+-----------------+---------------+----------+----------+-------------+-----+-----------+
|25000   |82             |294          |948              |1327           |1690377335|1690378663|1328         |28800|res9694.lrr|
+--------+---------------+-------------+-----------------+---------------+----------+----------+-------------+-----+-----------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;run_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                             &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                           &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;result_name&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;vuser_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inout_flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;inout&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;vuser&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
             &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;vuserevent_meter&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
                      &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;vuserstatus&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;vuser&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;vuser&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
             &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inout_flag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vuser_events&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'RUN'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_vu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vuser_events&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'RUN'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vu_rampup_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vuser_events&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'RUN'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vu_rampup_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vuser_events&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'RUN'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vu_rampdown_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vuser_events&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'RUN'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inout_flag&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vu_rampdown_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_start&lt;/span&gt;                                                                      &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_end&lt;/span&gt;                                                                        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_duration&lt;/span&gt;                                                                   &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tz&lt;/span&gt;                                                                              &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result_name&lt;/span&gt;                                                                     &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;result_name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;run_details&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  run_errors
&lt;/h2&gt;

&lt;p&gt;This query returns test run errors linked to scripts, actions, and VU IDs, along with an offset into the test run window (a timestamp of sorts).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+------------+-----+-------+---------+-------------+-----------------+------------+---------------------------------------------------------------------------------+
|END_TIME|EVENT_NAME  |VU_ID|LINE_NO|ITERATION|SCRIPT_NAME  |ACTION_NAME      |LG_NAME     |MESSAGE                                                                          |
+--------+------------+-----+-------+---------+-------------+-----------------+------------+---------------------------------------------------------------------------------+
|84.365  |Error -26366|47   |23     |1        |01_MyScript_1|Create_Cart_Items|srvhplrlg001|Create_Cart_Items.c(23) Error -26366 "Text="cart_id1"" not found for web_reg_find|
+--------+------------+-----+-------+---------+-------------+-----------------+------------+---------------------------------------------------------------------------------+
&amp;lt;........&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;test_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                                   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
                               &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Script&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                                   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
                               &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;event_map&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vu_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;script_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vuser&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vu_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;script_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;action_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                                         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;
                                                                                                     &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Error_meter&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
                            &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;errormessage&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                              &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
                          &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ScriptActions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;lg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt;
                          &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;sorted_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vu_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;script_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lg_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vu_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;test_script&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;script_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;      &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;lg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;          &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;lg_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
                      &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;test_script&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;script_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_script&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;lg&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                      &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
             &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vu_id&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sorted_errors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  run_results
&lt;/h2&gt;

&lt;p&gt;This is a big one. It's the core query my automation revolves around. It summarizes test results for each Action (transaction), making it easy to pluck those numbers out of the database and use them in my reports. It also makes comparing runs with each other quite simple.&lt;/p&gt;
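Since the query boils each run down to one row per transaction, comparing two runs is mostly a matter of joining on the transaction name. A minimal sketch of that idea (the column names follow the run_results output; the helper itself and the sample data are illustrative, not part of my automation):

```python
# Sketch: compare two runs' summaries keyed by transaction name.
# Each run is a list of dicts shaped like the run_results rows;
# fetching them from the database is out of scope here.

def compare_runs(baseline, candidate, metric="AVERAGE_RT"):
    """Return {transaction: (baseline, candidate, delta)} for one metric."""
    base = {row["TRANSACTION_NAME"]: row[metric] for row in baseline}
    diff = {}
    for row in candidate:
        name = row["TRANSACTION_NAME"]
        if name in base:
            diff[name] = (base[name], row[metric], row[metric] - base[name])
    return diff

# Illustrative sample rows:
run_a = [{"TRANSACTION_NAME": "01_MyScript_1_Create_Cart_Items", "AVERAGE_RT": 4.082}]
run_b = [{"TRANSACTION_NAME": "01_MyScript_1_Create_Cart_Items", "AVERAGE_RT": 3.750}]
print(compare_runs(run_a, run_b))
```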

&lt;p&gt;&lt;strong&gt;NOTA BENE&lt;/strong&gt;: this query is memory-hungry; I ended up allocating 8GB to the Alpine containers running it. Your memory requirements will likely depend on how long your tests run and how many scripts/actions you have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------+------------+------------+-----------+------------+------------+-----------+-------+----------+----------+----------+----------------+----------------+
|TRANSACTION_NAME               |PASSED_COUNT|FIRST_PASSED|LAST_PASSED|FAILED_COUNT|FIRST_FAILED|LAST_FAILED|AVG_TPS|MAXIMUM_RT|MINIMUM_RT|AVERAGE_RT|RT_PERCENTILE_90|RT_PERCENTILE_50|
+-------------------------------+------------+------------+-----------+------------+------------+-----------+-------+----------+----------+----------+----------------+----------------+
|#TOTAL                         |616917      |81          |198        |2963        |84          |198        |3427   |29.829    |0.026     |1.059     |0               |0               |
|01_MyScript_1_Create_Cart_Items|3403        |81          |198        |283         |84          |198        |18     |23.546    |0.337     |4.082     |10.5            |2.559           |
+-------------------------------+------------+------------+-----------+------------+------------+-----------+-------+----------+----------+----------+----------------+----------------+
&amp;lt;........&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: in lines 3 and 4 of the query, set the correct timeframe to generate the report for. The values are seconds from the start of the test.&lt;br&gt;
&lt;/p&gt;
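The &lt;code&gt;%d&lt;/code&gt; placeholders on those lines are printf-style slots that get substituted before the query is executed. A minimal sketch of how such a substitution might look (the template variable and helper names are illustrative, not my actual tooling):

```python
# Sketch: fill the %d timeframe placeholders before running the query.
# QUERY_TEMPLATE stands in for the full run_results query; only the
# test_configuration CTE is reproduced here.
QUERY_TEMPLATE = """
with test_configuration(timeframe_start, timeframe_end)
         as (select %d as timeframe_start,
                    %d as timeframe_end
             from dual)
select * from test_configuration;
"""

def render_query(start_s: int, end_s: int) -> str:
    """Substitute the report window, in seconds from the start of the test."""
    if start_s >= end_s:
        raise ValueError("timeframe_start must precede timeframe_end")
    return QUERY_TEMPLATE % (start_s, end_s)

# E.g. report on the window from 5 minutes to 1 hour into the run:
print(render_query(300, 3600))
```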

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;event_meter&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;test_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeframe_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeframe_end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;timeframe_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;timeframe_end&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dual&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;run_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                             &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                           &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;result_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;test_duration&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;all_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;                 &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;           &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;           &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt;               &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="k"&gt;describe&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;event_meter&lt;/span&gt; &lt;span class="n"&gt;data_raw&lt;/span&gt;
                      &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;event_map&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
                           &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_raw&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
             &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;all_events_steady&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;value&lt;/span&gt;       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;acount&lt;/span&gt;      &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;end_time&lt;/span&gt;    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;event_name&lt;/span&gt;  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;event_type&lt;/span&gt;  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;status1&lt;/span&gt;     &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;describe_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;all_events&lt;/span&gt;
             &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;timeframe_start&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;test_configuration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;timeframe_end&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;test_configuration&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
     &lt;span class="n"&gt;passed_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;all_events_steady&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;
                      &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;transactionendstatus&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;
                           &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt;
                      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;
             &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Transaction'&lt;/span&gt;
               &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Pass'&lt;/span&gt;
             &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;data_all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;failed_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;
             &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;all_events_steady&lt;/span&gt; &lt;span class="n"&gt;data_all_failed&lt;/span&gt;
                      &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;transactionendstatus&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;
                           &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status1&lt;/span&gt;
                      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;
             &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;data_all_failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Transaction'&lt;/span&gt;
               &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx_status&lt;/span&gt;&lt;span class="p"&gt;.[&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Fail'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;grouped_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;event_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;value_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;pctl_id_98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pctl_id_95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pctl_id_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pctl_id_50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;98&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pctl_id_98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pctl_id_95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pctl_id_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pctl_id_50&lt;/span&gt;
                                                                         &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;passed_events&lt;/span&gt; &lt;span class="n"&gt;data_passed&lt;/span&gt;
                                                                         &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                                  &lt;span class="n"&gt;data_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;summary_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;passed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;avg_tps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;maximum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minimum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;average_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;rt_percentile_90_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;rt_percentile_50_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;                                    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;transaction_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_count&lt;/span&gt;                                   &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;passed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_event&lt;/span&gt;                                   &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event&lt;/span&gt;                                    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                                                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
                                                                      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                                                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
                                                                      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                                                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
                                                                      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;
                                                                      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;timeframe_end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;timeframe_start&lt;/span&gt;
                                                                       &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;test_configuration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                                                         &lt;span class="p"&gt;)&lt;/span&gt;                                                 &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_tps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_max&lt;/span&gt;                                     &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;maximum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_min&lt;/span&gt;                                     &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;minimum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_avg&lt;/span&gt;                                     &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;average_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pctl_id_90&lt;/span&gt;                                    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_90_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
                                                                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;
                                                                            &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;passed_events&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;
                                                                            &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;
                                                                      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rownum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pctl_id_90&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pctl_id_50&lt;/span&gt;                                    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_50_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
                                                                      &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;
                                                                            &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;passed_events&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;
                                                                            &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sorted_by_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;
                                                                      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;rows_numbered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rownum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pctl_id_50&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_50&lt;/span&gt;
                                                              &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;grouped_events&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;transaction_name&lt;/span&gt;           &lt;span class="n"&gt;transaction_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;passed_count&lt;/span&gt;               &lt;span class="n"&gt;passed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;first_passed&lt;/span&gt;               &lt;span class="n"&gt;first_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_passed&lt;/span&gt;                &lt;span class="n"&gt;last_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;failed_count&lt;/span&gt;               &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;first_failed&lt;/span&gt;               &lt;span class="n"&gt;first_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_failed&lt;/span&gt;                &lt;span class="n"&gt;last_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_tps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="n"&gt;avg_tps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maximum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="n"&gt;maximum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minimum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="n"&gt;minimum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;average_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="n"&gt;average_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rt_percentile_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rt_percentile_50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_50&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;summary_table&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="s1"&gt;'#TOTAL'&lt;/span&gt;                                                       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;transaction_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;passed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                       &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;failed_events&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;timeframe_end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;timeframe_start&lt;/span&gt;
                                         &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;test_configuration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_tps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;maximum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;minimum_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;average_rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;                                                            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;                                                            &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rt_percentile_50&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;passed_events&lt;/span&gt; &lt;span class="n"&gt;all_passed&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;transaction_name&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There. Have a blast! &lt;/p&gt;

&lt;p&gt;One caveat: MS Access doesn't seem to like CTEs in my queries, but &lt;code&gt;ucanaccess&lt;/code&gt; deals with them just fine.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.myloadtest.com/blog/loadrunner-analysis-sql-query/"&gt;https://www.myloadtest.com/blog/loadrunner-analysis-sql-query/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://gitlab.com/-/snippets/2575521"&gt;https://gitlab.com/-/snippets/2575521&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Stack Trace / Thread Dump Analysis</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Fri, 02 Sep 2022 12:04:05 +0000</pubDate>
      <link>https://forem.com/netikras/stack-trace-thread-dump-analysis-7lj</link>
      <guid>https://forem.com/netikras/stack-trace-thread-dump-analysis-7lj</guid>
      <description>&lt;p&gt;Over my career as a Performance Engineer, I've had the pleasure of working with a number of mid-senior IT specialists (developers included). In many cases I had to explain how I could see a problem in an application's runtime when they couldn't - so as not to be dismissed as someone who is imagining things and wasting their precious time. And quite often my findings are based on StackTraces. Unfortunately, many software developers, when handed a StackTrace, did not even know where to begin looking at it. Then it struck me: a StackTrace is such an informative data structure, and so easy to obtain, that it's a shame many specialists do not know how to deal with one.&lt;/p&gt;

&lt;p&gt;This post is aimed at you, fellow developers and performance specialists, who want to familiarize yourself with StackTrace analysis, or just to refresh your memory on how to deal with stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's what
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stack trace (ST, STrace)
&lt;/h3&gt;

&lt;p&gt;A stack trace (ST) is a snapshot of what the process is currently &lt;strong&gt;doing&lt;/strong&gt;. An ST does not reveal any data, secrets, or variables. An ST, however, reveals classes (in some cases) and methods/functions being called.&lt;/p&gt;

&lt;p&gt;To better understand an ST consider a simple application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;assertAuthorized&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUsername&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPassword&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
    &lt;span class="nx"&gt;assert&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;assertVariables&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;assertAuhorized&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;doJob&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;task1&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we execute &lt;code&gt;doJob()&lt;/code&gt; and capture an ST at the moment it has called &lt;code&gt;prepare()&lt;/code&gt; → &lt;code&gt;request.getUsername()&lt;/code&gt;, we'll get an ST that looks similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thread main
  at HttpRequest.getUsername
  at prepare
  at doJob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the very top of the ST is a header, which, in this case, tells us the thread name (&lt;code&gt;main&lt;/code&gt;). Under the header is the actual call trace, starting with the &lt;em&gt;most recent&lt;/em&gt; call this thread has made: &lt;code&gt;getUsername&lt;/code&gt;. The &lt;code&gt;getUsername&lt;/code&gt; function was called by &lt;code&gt;prepare()&lt;/code&gt;, so that's the next line in the trace. And &lt;code&gt;prepare()&lt;/code&gt; was called by &lt;code&gt;doJob()&lt;/code&gt;, so that's what we find immediately below &lt;code&gt;prepare&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Reading the ST bottom-up we can see how "&lt;em&gt;deep&lt;/em&gt;" the process currently is. In this particular example, the process has called &lt;code&gt;doJob()&lt;/code&gt; (&lt;em&gt;at doJob&lt;/em&gt;), inside &lt;code&gt;doJob&lt;/code&gt; it called &lt;code&gt;prepare()&lt;/code&gt; (&lt;em&gt;at prepare&lt;/em&gt;), and inside &lt;code&gt;prepare&lt;/code&gt; it called &lt;code&gt;getUsername()&lt;/code&gt; of the HttpRequest class. We don't see the &lt;code&gt;db.findUser()&lt;/code&gt; call in the trace because it hasn't been made yet. In order to make that call, a process has to first obtain the username and the password hash and push them to the stack.&lt;/p&gt;

&lt;p&gt;A different way to visualize an ST is horizontally rather than vertically. What is the thread currently doing? &lt;/p&gt;

&lt;p&gt;&lt;code&gt;doJob()&lt;/code&gt; → &lt;code&gt;prepare()&lt;/code&gt; → &lt;code&gt;HttpRequest.getUsername()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It's exactly the same thing as before, just laid out differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N.B.&lt;/strong&gt; process/thread stack variables are NOT captured in the stack trace.&lt;/p&gt;

&lt;p&gt;Some runtimes can embed more info in a stack trace, e.g. a line number of the source file each function is found on, a library with or without its version that function is a part of, PID/TID/NTID (ProcessID/ThreadID/NativeThreadID), resources' utilization (thread execution time, amount of CPU% consumed by that thread), synchronization primitives engaged (monitors), etc. These additions are all varying from runtime to runtime, from version to version and should not be assumed to always be available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Thread dump (TD, TDump)
&lt;/h3&gt;

&lt;p&gt;Oftentimes, Stack Trace and Thread Dump are used interchangeably. While there is a slight difference, STs and TDs are very much alike - almost the same thing, even.&lt;/p&gt;

&lt;p&gt;A stack trace reflects the sequence of calls a single process or thread is currently making - that's the &lt;strong&gt;trace&lt;/strong&gt; part. These calls (instruction pointers, in low-level terms) are stored in a process memory structure called the stack -- that's the &lt;strong&gt;stack&lt;/strong&gt; part. Sometimes StackTraces are also called CallTraces, although CallTrace is a more generic term and does not refer to threads/processes/stacks explicitly.&lt;/p&gt;

&lt;p&gt;A ThreadDump is a bundle of STs for each thread the application is running. Probably with some additional information, like the timestamp of capture, perhaps some statistics and other related diagnostic info. And that's pretty much the whole difference about it. Capture a bunch of STs - one per each thread the application is running - and jam all of them in a single file. There, you have yourself a ThreadDump!&lt;/p&gt;

&lt;h3&gt;
  
  
  Levels of stack traces
&lt;/h3&gt;

&lt;p&gt;All processes have a stack trace. Perhaps with the sole exception of processes in the Z state (i.e. zombie processes), every process is executing some function at any given time. As soon as the last remaining function (the first one the process ran when it was started) returns, the process terminates. However, in some runtimes, a process ST might not be the call trace we actually need. Consider Java applications. They all run on a JVM - a virtual machine of sorts that parses and translates the code into native binary instructions a processor is capable of understanding. If we find a JVM process's PID and peek at its call trace, we'll see some gibberish, and not a single line will resemble anything Java. That's because JVM processes are binary - they are NOT Java. These processes are the actual JVM processes - the ones that carry the JVM on their shoulders. And our application is running &lt;em&gt;inside&lt;/em&gt; of them. So in order to see our Java threads' call traces, we don't capture process StackTraces. Instead, we use the JRE's tools, like &lt;code&gt;jstack&lt;/code&gt;, which connect to the JVM and capture our application threads' call traces from inside those native processes - from inside the Java Virtual Machine.&lt;/p&gt;

&lt;p&gt;Pretty much the same thing applies to any interpreted or pseudo-interpreted language. The native processes are running the machinery that runs your application, so if you capture their STs, you won't see what your application is doing; instead, you will see what your runtime engine is doing. To see what's happening inside it and what your application threads' call traces look like, use the tools provided by your language's runtime engine. You can get stack traces even for &lt;code&gt;bash&lt;/code&gt; (see: &lt;a href="https://gitlab.com/netikras/bthrow" rel="noopener noreferrer"&gt;bthrow&lt;/a&gt;)!&lt;/p&gt;

&lt;h3&gt;
  
  
  What's inside an ST/TD
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;a strict sequence of method/function calls a process/thread has made&lt;/li&gt;
&lt;li&gt;the oldest call is at the bottom&lt;/li&gt;
&lt;li&gt;the newest (most recent) call is at the top&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is NOT inside an ST/TD
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;operations

&lt;ul&gt;
&lt;li&gt;while, for, switch, if, other built-into-syntax operations&lt;/li&gt;
&lt;li&gt;arithmetic operations (+ - * / ^ etc.)&lt;/li&gt;
&lt;li&gt;assignment operations (=)&lt;/li&gt;
&lt;li&gt;comparison operations (== &amp;gt; &amp;lt; &amp;gt;= &amp;lt;= !=)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;data

&lt;ul&gt;
&lt;li&gt;variables&lt;/li&gt;
&lt;li&gt;constants&lt;/li&gt;
&lt;li&gt;objects/instances&lt;/li&gt;
&lt;li&gt;function/method parameter values&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  When
&lt;/h2&gt;

&lt;p&gt;When troubleshooting applications and their performance, it's a very healthy habit to think about an application from two perspectives: &lt;strong&gt;nouns&lt;/strong&gt; and &lt;strong&gt;verbs&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;nouns&lt;/strong&gt; - this is the data of the application: variables, buffers, strings, bytes, arrays, objects, heap, memory, memory leaks, RAM -- all these keywords refer to &lt;strong&gt;DATA&lt;/strong&gt;. You care about data when you know you have something wrong with the data: either too much of it (e.g. memory leaks), or the data is incorrect (integrity errors). For these cases, you need either a coredump (for native processes) or a heapdump (e.g. for JVM). &lt;strong&gt;Data has nothing to do with threads or call/stack traces&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;verbs&lt;/strong&gt; - these are the actions of the application: functions, methods, procedures, routines, lambdas. These are loaded into the stacks of processes/threads in a very strict sequence, which allows us to trace the flow of execution. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it comes to performance, we often see an application misbehave: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;functions hanging&lt;/li&gt;
&lt;li&gt;threads appearing unresponsive&lt;/li&gt;
&lt;li&gt;processes/threads burning CPU at 100%&lt;/li&gt;
&lt;li&gt;threads locking each other&lt;/li&gt;
&lt;li&gt;slow responses&lt;/li&gt;
&lt;li&gt;huge amounts of files/sockets open&lt;/li&gt;
&lt;li&gt;huge amounts of file/socket IO operations&lt;/li&gt;
&lt;li&gt;... &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all those cases we need either an ST or a TD. Only when we have a snapshot of the actions (verbs) of all the application's threads/processes are we able to tell what is happening inside of it and what the application is doing. At the top of each process's/thread's call trace we will see the last function/method call it has made (or &lt;em&gt;is making&lt;/em&gt;, to be more technically correct). Going down the stack we can see where that call originates from. This way we can trace down WHICH of our precious application's processes/threads are doing all those things, and WHY.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to capture
&lt;/h2&gt;

&lt;p&gt;A Stack Trace / Thread Dump is a snapshot of the application's activity. Normally we don't need to keep an eye on STs. Only when an application is misbehaving do we care to inspect its execution. That being said, capturing STs is &lt;strong&gt;only&lt;/strong&gt; useful at the time of misbehaviour. Not only that, but to have a useful StackTrace we must &lt;strong&gt;capture it at the very moment the application is having issues. Not before and definitely not after the application problem goes away&lt;/strong&gt;. It's like doing an x-ray: we need it done while we are ill, not before and not after the illness. Sometimes it's really hard to capture STs at the right moment, because some issues last only for a second, or even less. ST capturing can be automated and integrated into monitoring/alerting systems as a reactive-diagnostic measure to increase the chances of capturing the right data, or we could be &lt;strong&gt;capturing a series of STs&lt;/strong&gt; every second (or less; or more - depending on the issue), hoping to have at least one ST that reflects the problem.&lt;/p&gt;

&lt;p&gt;Capturing a series of STs has another benefit. If these STs are captured close enough to each other (seconds/milliseconds apart), we can treat the bundle as a profile. Just as it takes 25 frames per second projected on a screen to trick our eyes (mind) into seeing live motion, we can take several STs per second and treat them as the "application's live motion", i.e. in a way we can track the application's actions over time. The accuracy of such tracking degrades the further apart the snapshots are. More on that later.&lt;/p&gt;
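Capturing such a series can be as simple as a shell loop. A minimal sketch - here `1234` is a hypothetical PID and `eu-stack` is just one option (substitute `pstack`, `jstack`, or whatever fits your runtime):

```shell
pid=1234   # hypothetical PID of the misbehaving process
for i in $(seq 1 5); do
    # one timestamped snapshot per second; keep errors too, for context
    eu-stack -p "$pid" > "st_${pid}_$(date +%H%M%S).txt" 2>&1
    sleep 1
done
```

Each snapshot lands in its own file, so afterwards you can diff consecutive captures and see which frames persist (likely bottlenecks) and which come and go.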

&lt;h3&gt;
  
  
  Native (process)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Linux
&lt;/h4&gt;

&lt;p&gt;Getting a StackTrace of a process in Linux is quite easy. However, there's something worth noting: you can access 2 stack traces for any Linux process (though on some distros it's not possible): one in the kernelspace and another in the userspace. That's because each process runs in both regions: it does all its calculations and decisions in the userspace, but when it comes to hardware-related things (IO, sleeps, memory, etc.), a process has to ask the operating system to do the job for it. And the OS does its part in the kernelspace. To make matters simpler, each process gets a share of memory assigned for a stack in both areas, so it can freely execute both user code and kernel code.&lt;/p&gt;

&lt;p&gt;Normally, when a process asks the kernel to do something, it becomes unresponsive to the userspace (it feels frozen). So if your process is not responding -- have a look at its kernel stack. Otherwise -- see its stack in the userspace.&lt;/p&gt;

&lt;p&gt;To see a process stack you will need that process's PID.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;find a PID of the process you're interested in

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ps -ef | grep "&amp;lt;search string&amp;gt;"&lt;/code&gt; -- PID will be in the 2nd column&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ps -efL | grep "&amp;lt;search string&amp;gt;"&lt;/code&gt;-- PID will be in the 2nd column; the &lt;code&gt;-L&lt;/code&gt; argument makes &lt;code&gt;ps&lt;/code&gt; list not only processes but also threads&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pgrep "&amp;lt;search string&amp;gt;"&lt;/code&gt; -- PID will be the only value in the output&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
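Putting the PID-hunting commands together, a quick sketch (using a throwaway `sleep` process as a stand-in for your real application):

```shell
# Spawn a throwaway process to practice on (stand-in for a real application)
sleep 600 &
target=$!

# Each of these should report the same PID as $target
pgrep -f 'sleep 600'
ps -ef | awk '/[s]leep 600/ {print $2}'   # [s] trick keeps awk itself out of the match

kill "$target"   # clean up the demo process
```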

&lt;p&gt;Now the procedure to access the kernel stack and user stack is a little different. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For a &lt;strong&gt;kernelspace stack&lt;/strong&gt;, run &lt;code&gt;sudo cat /proc/${PID}/stack&lt;/code&gt;  and you will get a nice stack trace.  A &lt;code&gt;sleep 9999&lt;/code&gt; stack looks like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[&amp;lt;0&amp;gt;] hrtimer_nanosleep+0x99/0x120
[&amp;lt;0&amp;gt;] common_nsleep+0x44/0x50
[&amp;lt;0&amp;gt;] __x64_sys_clock_nanosleep+0xc6/0x130
[&amp;lt;0&amp;gt;] do_syscall_64+0x59/0xc0
[&amp;lt;0&amp;gt;] entry_SYSCALL_64_after_hwframe+0x44/0xae
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(notice the &lt;code&gt;nsleep&lt;/code&gt; and &lt;code&gt;nanosleep&lt;/code&gt; calls - it's the OS making the actual sleep call)&lt;/p&gt;

&lt;p&gt;And a process currently running in userspace will not be executing anything in the kernelspace, so its kernel stack will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[&amp;lt;0&amp;gt;] ep_poll+0x2aa/0x370
[&amp;lt;0&amp;gt;] do_epoll_wait+0xb2/0xd0
[&amp;lt;0&amp;gt;] __x64_sys_epoll_wait+0x60/0x100
[&amp;lt;0&amp;gt;] do_syscall_64+0x59/0xc0
[&amp;lt;0&amp;gt;] entry_SYSCALL_64_after_hwframe+0x44/0xae
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(notice the &lt;code&gt;wait&lt;/code&gt; and &lt;code&gt;poll&lt;/code&gt; calls - the OS is waiting for tasks to run in the kernelspace).&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;For a &lt;strong&gt;userspace stack&lt;/strong&gt;, you have 2 basic options:

&lt;ul&gt;
&lt;li&gt;use the &lt;code&gt;gdb&lt;/code&gt; debugger to attach to a running process (&lt;code&gt;gdb -p ${pid}&lt;/code&gt; and run &lt;code&gt;bt&lt;/code&gt; inside)&lt;/li&gt;
&lt;li&gt;install tools that can automate that for you, like &lt;code&gt;pstack&lt;/code&gt; or &lt;code&gt;eu-stack&lt;/code&gt; (comes with the &lt;em&gt;elfutils&lt;/em&gt; package)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;pstack&lt;/code&gt; is usually preinstalled and it's very straightforward to use -- just give it a PID and see the stack appear in your console. However, I prefer &lt;code&gt;eu-stack&lt;/code&gt;, because it has more options, like thread printing, verbosity and other nice features (see &lt;code&gt;eu-stack --help&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ $ eu-stack -p 715821
PID 715821 - process
TID 715821:
#0  0x00007f75b5c991b4 clock_nanosleep@@GLIBC_2.17
#1  0x00007f75b5c9eec7 __nanosleep
#2  0x0000555a72b09827
#3  0x0000555a72b09600
#4  0x0000555a72b067b0
#5  0x00007f75b5be0083 __libc_start_main
#6  0x0000555a72b0687e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or with all the fancy options on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~ $ eu-stack -aismdvp 715821
PID 715821 - process
TID 715821:
#0  0x00007f75b5c991b4     __GI___clock_nanosleep - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78:5
#1  0x00007f75b5c9eec7 - 1 __GI___nanosleep - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    /build/glibc-SzIz7B/glibc-2.31/posix/nanosleep.c:27:13
#2  0x0000555a72b09827 - 1 - /usr/bin/sleep
#3  0x0000555a72b09600 - 1 - /usr/bin/sleep
#4  0x0000555a72b067b0 - 1 - /usr/bin/sleep
#5  0x00007f75b5be0083 - 1 __libc_start_main - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    ../csu/libc-start.c:308:16
#6  0x0000555a72b0687e - 1 - /usr/bin/sleep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nice thing about userspace STs is that you don't need &lt;code&gt;sudo&lt;/code&gt; to get them if the process is running under your user.&lt;/p&gt;

&lt;h4&gt;
  
  
  Windows
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;sigh...&lt;br&gt;
...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Probably the easiest approach is through the magnificent &lt;a href="https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer" rel="noopener noreferrer"&gt;ProcessExplorer&lt;/a&gt;, described &lt;a href="https://superuser.com/questions/462989/how-can-i-view-the-call-stack-of-a-running-process-thread" rel="noopener noreferrer"&gt;here&lt;/a&gt; and in lots of other places across the internet.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;download and install ProcessExplorer&lt;/li&gt;
&lt;li&gt;launch ProcessExplorer&lt;/li&gt;
&lt;li&gt;find the PID you want an ST of&lt;/li&gt;
&lt;li&gt;right-click → properties&lt;/li&gt;
&lt;li&gt;select the &lt;em&gt;Threads&lt;/em&gt; tab&lt;/li&gt;
&lt;li&gt;select the thread you'd like to inspect&lt;/li&gt;
&lt;li&gt;click on the &lt;code&gt;Stack&lt;/code&gt; button&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack frames will look cryptic, but if that's good enough for you -- there you go! If you want a clearer ST, you'll have to download and install &lt;a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk" rel="noopener noreferrer"&gt;Debug Tools&lt;/a&gt; and load Debug Symbols into the ProcessExplorer as shown in tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Options → Configure Symbols&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C:\Windows\system32\dbghelp.dll&lt;/code&gt; as the DbgHelp DLL path (or wherever it is in your system)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols&lt;/code&gt; as the Symbols Path (or whatever it is in your system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  JVM
&lt;/h3&gt;

&lt;p&gt;Once again, in the case of JVM, there are at least 2 levels of call traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;native&lt;/li&gt;
&lt;li&gt;application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get a native ST of the JVM process, see the sections above -- it's already been covered.&lt;/p&gt;

&lt;p&gt;To get an Application StackTrace, JVM provides an option to capture a ThreadDump (TD). There are a few ways to capture TDs of a running application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;jstack -l &amp;lt;pid&amp;gt;&lt;/code&gt; - requires the JDK utility &lt;code&gt;jstack&lt;/code&gt; to be available on the machine the JVM is running on. This is the preferred approach in almost all cases. It prints the TD to &lt;code&gt;jstack&lt;/code&gt;'s output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jcmd &amp;lt;pid&amp;gt; Thread.print&lt;/code&gt; - just like the above: requires a JDK utility to be present and prints the TD to &lt;code&gt;jcmd&lt;/code&gt;'s output. Since it's more verbose and harder to remember, it's not a widely used technique&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kill -3 &amp;lt;pid&amp;gt;&lt;/code&gt; or &lt;code&gt;kill -s QUIT &amp;lt;pid&amp;gt;&lt;/code&gt; (both are identical) - does not require any additional software, as &lt;code&gt;kill&lt;/code&gt; usually comes with the Linux OS. Sending a SIGQUIT signal to the JVM causes it to print a TD to the JVM's own output. And that's the main inconvenience compared to the &lt;code&gt;jstack&lt;/code&gt; and &lt;code&gt;jcmd&lt;/code&gt; approaches -- you don't get the TD &lt;em&gt;here&lt;/em&gt; and &lt;em&gt;now&lt;/em&gt;; you have to find the JVM's output (either a log file or a console) and extract the TD from there. It can get very messy... But sometimes that's the only option you have.&lt;/li&gt;
&lt;/ul&gt;
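&lt;p&gt;The options above can be folded into one small helper. A hedged sketch (the &lt;code&gt;capture_td&lt;/code&gt; name is mine; it assumes a POSIX shell and falls back to SIGQUIT when no JDK tools are on the box):&lt;/p&gt;

```shell
# capture_td PID: print a ThreadDump of the given JVM, preferring jstack,
# then jcmd; as a last resort send SIGQUIT (the TD lands in the JVM's own output).
capture_td() {
  local pid="$1"
  if command -v jstack >/dev/null; then
    jstack -l "$pid"
  elif command -v jcmd >/dev/null; then
    jcmd "$pid" Thread.print
  else
    kill -s QUIT "$pid"   # TD goes to the JVM's stdout/log, not to our terminal
  fi
}
```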

&lt;p&gt;Capturing a TD stops the JVM for a split second and resumes it immediately after. In some cases, the JVM might appear stuck and &lt;code&gt;jstack&lt;/code&gt; may recommend using the &lt;code&gt;-F&lt;/code&gt; option. As always, handle &lt;em&gt;the force&lt;/em&gt; with great care, as it might lock all your application threads for good (ask me how I know).&lt;/p&gt;

&lt;h3&gt;
  
  
  Other runtimes
&lt;/h3&gt;

&lt;p&gt;Some runtimes have their own alternatives to JVM's &lt;code&gt;jstack&lt;/code&gt;; others either don't, or support plugins for GNU's &lt;code&gt;gdb&lt;/code&gt; that acquire TDs through a debugger. Refer to your runtime's documentation to see if it has any means to capture a call trace. Asking Google for a second opinion doesn't hurt either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;p&gt;CallTrace analysis can provide us with a lot of information about what's going on with the application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how many threads are doing X at the moment&lt;/li&gt;
&lt;li&gt;how many threads are there&lt;/li&gt;
&lt;li&gt;which actions are "hanging" and at which function calls&lt;/li&gt;
&lt;li&gt;which threads are slacking&lt;/li&gt;
&lt;li&gt;which threads are stuck in infinite loops and which functions have caused them&lt;/li&gt;
&lt;li&gt;which threads are likely to be causing high CPU usage; what are they doing to cause that&lt;/li&gt;
&lt;li&gt;which threads are making outgoing long-lasting IO calls; where do these calls originate from in code&lt;/li&gt;
&lt;li&gt;which threads are blocking other threads (using synchronization primitives) and at which function calls (both blocked and blocking)&lt;/li&gt;
&lt;li&gt;which/how many threads are running in X part of your code (e.g. &lt;code&gt;PaymentService2.transfer()&lt;/code&gt;), should you care to know if a particular method/function is being called&lt;/li&gt;
&lt;li&gt;which threads are likely to blow the Stack up (cause stack overflows); and through which function calls&lt;/li&gt;
&lt;li&gt;how large your thread pools are and whether they are over- or undersized&lt;/li&gt;
&lt;li&gt;other useful info, depending on what the problem is&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Individual STs
&lt;/h3&gt;

&lt;p&gt;Analysing individual plain-text StackTraces might seem challenging at first, but a few stacks later the pattern becomes easier to understand. However, in the case of ThreadDumps, especially ones with lots of running threads (hundreds or thousands), manual analysis is still a difficult task to take on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Find the right threads
&lt;/h4&gt;

&lt;p&gt;The approach to manual ST analysis depends on what issue we're trying to tackle. Usually, we're focusing on busy threads - the ones with long call traces. However, "long" is a relative term. In some applications "long" is 200 stack frames, while in others it could be as low as 30. Allow me to rephrase that: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idle threads&lt;/strong&gt; [short call traces] - threads have started up and initialized and are now waiting for tasks to work on; in standard Java web apps I like to take 30 frames as a point of reference and look for threads with call traces ~30 frames long; could be less, could be more. These threads are idle, regardless of their Thread state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy threads&lt;/strong&gt; [long call traces] - threads have received a task to work on and are currently processing it. While processing a task, their call traces will usually grow significantly longer than idle ones' - easily hundreds of frames. These threads are busy, regardless of their Thread state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Usually, we need to find those busy threads first.&lt;/p&gt;
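&lt;p&gt;Sorting threads by stack depth is easy to script. A minimal sketch for &lt;code&gt;jstack&lt;/code&gt;-style dumps (&lt;code&gt;td_depths&lt;/code&gt; is a made-up helper name; adjust the patterns for other TD formats):&lt;/p&gt;

```shell
# td_depths FILE: print "frame-count  thread-header" per thread, deepest first.
# Busy threads bubble to the top; idle ones sink to the bottom.
td_depths() {
  awk '/^"/ { name = $0; frames[name] = 0 }
       /^[[:space:]]+at / { if (name != "") frames[name]++ }
       END { for (n in frames) printf "%5d  %s\n", frames[n], n }' "$1" |
  sort -rn
}
```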

&lt;h4&gt;
  
  
  There are no busy threads
&lt;/h4&gt;

&lt;p&gt;It may happen that the application is running slow or "hanging" and there are no busy threads. The most likely reason is a bottleneck somewhere outside the application process. I mean, if the application is not doing anything (idling), then it's not slow -- no one is giving it any work to do. Was the request lost or discarded in transit, even before reaching your application process? Or is it stuck in transit?&lt;/p&gt;

&lt;p&gt;The second most likely reason is that the TD was captured at the wrong time: either before or after the slowness took place. This happens more often than I'd like to admit. It will almost always happen if you are not the one capturing the trace and you're working through someone else's hands ("Ops team, please capture the TD"; 5 minutes later: "here it is"; the issue went away 4 minutes ago), especially if that someone does not know how/when to capture it to obtain &lt;strong&gt;useful&lt;/strong&gt; diagnostic information.&lt;/p&gt;

&lt;p&gt;The third reason could be that the ST was captured from the wrong process. This is especially the case when requests are hitting a LoadBalancer of any kind (a DNS server is also a primitive LB) and there is no way to determine which server the request will land on. Often Ops take an ST/TD from a random server/instance in the application pool and assume that's what you've asked for.&lt;/p&gt;

&lt;p&gt;And the fourth reason I've come across is that you've got the idle/busy threshold wrong. In some applications, processing a task could add very few stack frames, as few as 1 to 5. Look at the traces you have, compare them, and see what's at the top of them. Perhaps you've miscalculated what a "long" stack trace is in your particular case.&lt;/p&gt;

&lt;h4&gt;
  
  
  There are some busy threads
&lt;/h4&gt;

&lt;p&gt;Great, now that you've managed to find busy threads, it's time to see what's in there. Probably the quickest way to see what's happening in a thread is to read it top-to-bottom - or, even better, to just take a look at the top 5 or so frames of the call trace. Start from the top. See what the most recent function call was - what's its name? This might hint at what the thread was doing at the very moment this ST was captured. If the function call doesn't make much sense -- look at the frame below. If still nothing - look below. Keep going downwards until you reach a frame that you can make some sense of. If that frame is a good enough explanation of what the thread is doing -- &lt;em&gt;hey presto&lt;/em&gt;! If not -- either go back up, trying to make sense of what the method you've recognised is doing inside, or continue down, trying to find which other function has called it.&lt;/p&gt;

&lt;p&gt;Once you know what that thread is doing - go to another. Don't worry, it will only take this long for the first few call traces. You'll get up to speed relatively quickly once you get the idea of what to look for and where.&lt;/p&gt;

&lt;p&gt;See how many threads are busy doing more or less the same thing. If there are a handful of threads with very similar stack traces - you're probably on to your bottleneck. The higher in the stack you are - the more accurately you can define the nature of the bottleneck. It's best to be able to understand the top 5 frames in the stack.&lt;/p&gt;
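&lt;p&gt;Spotting clusters of threads with the same topmost frame can also be automated. A rough sketch for jstack-style dumps (&lt;code&gt;top_frames&lt;/code&gt; is a hypothetical helper of mine):&lt;/p&gt;

```shell
# top_frames FILE: histogram of each thread's topmost stack frame.
# A tall bucket here is often the bottleneck itself.
top_frames() {
  awk '/^"/ { want = 1 }
       /^[[:space:]]+at / { if (want) { f = $2; sub(/\(.*/, "", f); print f; want = 0 } }' "$1" |
  sort | uniq -c | sort -rn
}
```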

&lt;h4&gt;
  
  
  Examples
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Camel (camel-8) thread #140 - aws-sqs://prod-q" #235 daemon prio=5 os_prio=0 tid=0x00007f5e40f26000 nid=0x113 runnable [0x00007f5c00dcb000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
        at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1346)
        at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:962)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;java.net.SocketInputStream.socketRead0()&lt;/code&gt; -- means the application itself is waiting for a response over the network from a remote service it depends on. Probably the downstream service is lagging.&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"threadPoolTaskExecutor-1" #114 prio=5 os_prio=0 cpu=4226.74ms elapsed=356.38s tid=0x00007f1b5df8cee0 nid=0xd3 waiting on condition  [0x00007f1b568a9000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        - parking to wait for  &amp;lt;0x00001000dbd3afd0&amp;gt; (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.16/AbstractQueuedSynchronizer.java:2081)
        at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:393)
        at org.apache.http.pool.AbstractConnPool.access$300(AbstractConnPool.java:70)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:253)
        - locked &amp;lt;0x00001000db478530&amp;gt; (a org.apache.http.pool.AbstractConnPool$2)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:198)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:306)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection()&lt;/code&gt;, which soon leads to an &lt;code&gt;org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking()&lt;/code&gt; call and parking the thread -- means you have an HttpClient connection pool that's too small to handle the load effectively. You may need to increase that pool and retest.&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"threadPoolTaskExecutor-5" #73 prio=5 os_prio=0 cpu=1190.99ms elapsed=5982.61s tid=0x00007f103db6e940 nid=0xb3 waiting on condition  [0x00007f103ac64000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        - parking to wait for  &amp;lt;0x0000100073c178f0&amp;gt; (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:194)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.16/CompletableFuture.java:1796)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.16/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.16/CompletableFuture.java:1823)
        at java.util.concurrent.CompletableFuture.join(java.base@11.0.16/CompletableFuture.java:2043)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;java.util.concurrent.locks.LockSupport.park()&lt;/code&gt;, when you see some &lt;code&gt;CompletableFuture.join()&lt;/code&gt; or &lt;code&gt;.get()&lt;/code&gt; calls lying around - probably you are using the common ForkJoin thread pool for your CompletableFutures, which by default is limited to a very small number of threads (&lt;code&gt;count(CPU)-1&lt;/code&gt;). Either rewrite the code to use your own custom Executor or increase the size of the common pool.&lt;/p&gt;




&lt;h3&gt;
  
  
  Flame graphs
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What they are
&lt;/h4&gt;

&lt;p&gt;While manual analysis of plain-text STs is tedious, difficult and requires a lot of effort, reading an ST is a very useful skill to gain in order to use more robust ST analysis techniques. My all-time favourite is &lt;a href="https://github.com/brendangregg/FlameGraph" rel="noopener noreferrer"&gt;FlameGraphs&lt;/a&gt;. At first glance, they are funny-looking pictures, kind of diagrams, but very weird ones. Kind of skyscrapers with lots of floors. And the colours don't make them look any prettier.&lt;/p&gt;

&lt;p&gt;I agree, flamegraphs aren't pretty pictures. Because that's not what they are. Instead, they are a &lt;strong&gt;very powerful, interactive, multi-dimensional graphical representation of tree-structured data, like CallTraces&lt;/strong&gt;. You don't have to like them to understand them. I like to call them "multiple call traces laid over each other". As if they were transparent sheets of plastic with some similar curves and shapes on them, and you could pile them up, look through the pile and see how different or similar these curves and shapes are.&lt;/p&gt;

&lt;p&gt;If all the call traces were identical across all the STs captured, in a flamegraph you'd see a single column with identical stack frames. Simple, because they are all the same, and if you lay them all on top of each other -- they will look the same. No differences.&lt;/p&gt;

&lt;p&gt;If we have all traces starting with the same 5 calls and only subsequent calls divided into 2 different sets of function call sequences across all the STs -- in a FlameGraph we'll have a single column 5 frames high, on top of which are 2 narrower columns, representing 2 different sequences of function calls, as stated before.&lt;/p&gt;

&lt;p&gt;Now, since TDs in busy applications usually have a whole lot of different threads with different stack traces, their flamegraphs will look like a city with skyscrapers.&lt;/p&gt;
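&lt;p&gt;You rarely need to draw these by hand -- Brendan Gregg's FlameGraph repo ships &lt;code&gt;stackcollapse-jstack.pl&lt;/code&gt; and &lt;code&gt;flamegraph.pl&lt;/code&gt; for exactly this. As a sketch of what the intermediate "folded" format looks like, here's a minimal collapser for jstack-style dumps (&lt;code&gt;collapse_td&lt;/code&gt; is my own name; the real script handles far more edge cases):&lt;/p&gt;

```shell
# Minimal sketch of the "folded" format FlameGraph's stackcollapse-jstack.pl
# emits: one line per unique stack, frames root-first, joined by ';', plus a count.
collapse_td() {
  awk '/^"/ { if (stack != "") counts[stack]++; stack = "" }
       /^[[:space:]]+at / { frame = $2; sub(/\(.*/, "", frame)
                            stack = (stack == "") ? frame : frame ";" stack }
       END { if (stack != "") counts[stack]++
             for (s in counts) print s, counts[s] }' "$1"
}
```

&lt;p&gt;The output can be fed straight to &lt;code&gt;flamegraph.pl collapsed.txt &gt; flames.svg&lt;/code&gt;.&lt;/p&gt;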

&lt;h4&gt;
  
  
  Dimensions
&lt;/h4&gt;

&lt;p&gt;Each FlameGraph has at least 3 dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;height - that's how deep in the stack the frame is. The higher the frame -- the deeper (the more recent) the call is. Just like in raw StackTraces.&lt;/li&gt;
&lt;li&gt;width - how many such identical sequences of stack frames there are, compared to other stack frames

&lt;ul&gt;
&lt;li&gt;if the bar is as wide as the picture -- all these calls have been seen in all the threads&lt;/li&gt;
&lt;li&gt;if the bar is 50% of the picture width -- 50% of threads had this call in their STs at this same position (with the same sequence of preceding calls); the remaining 50% had different calls at that depth of the stack&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;colours - that's optional, though a handy dimension: it highlights calls made by functions written by us, and dims down function calls inside libraries. It just makes it easier to separate the calls we're primarily interested in from the ones libraries are making internally.&lt;/li&gt;

&lt;li&gt;text - describes the function call (basically an excerpt from the CallTrace)&lt;/li&gt;

&lt;li&gt;other - optionally, there might be on-mouseover pop-ups providing even more info on each frame&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Each frame in the flamegraph is clickable. By clicking on a frame you can expand/zoom in its tree of sub-calls (i.e. make the clicked frame 100% wide and widen all the upper frames respectively). By clicking on frames below the widened one you can zoom out.&lt;/p&gt;

&lt;p&gt;I know, it's a lot to take in. Developers I tried to show flamegraphs to were confused and reluctant to learn to use them, thinking it's an overly complex tool, requiring a lot of knowledge/skill/time to get any use out of. IT'S REALLY NOT!!! If you can read CallTraces, reading FlameGraphs will be just as easy -- you shouldn't need more than 10 minutes of playing with one to understand the concept.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why bother?
&lt;/h4&gt;

&lt;p&gt;Aaahh, a fine question. And it IS worth the bother, because it reduces the amount of data you have to analyse from 70MB of plain-text TDs down to a single image that fits on your screen. Moreover, a FlameGraph has most of the analysis already done for you, and visualises it!&lt;/p&gt;

&lt;p&gt;In general, when I'm analysing FlameGraphs, I'm looking for the largest clusters of identical CallTraces. They will be depicted as skyscrapers with the flattest rooftops. The higher the skyscraper and the flatter the rooftop -- the more likely that's a bottleneck. But how are buildings and CallTraces related..?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the higher the column - the deeper the thread is in its task execution. Remember, when analysing raw STs we are focusing on busy threads -- the ones with large call traces. &lt;/li&gt;
&lt;li&gt;the more threads have that same call trace (up to some particular frame), the wider the tower/column will look&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;Consider this Clojure example (&lt;a href="https://www.janstepien.com/flames/flames.svg" rel="noopener noreferrer"&gt;source&lt;/a&gt;):&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.janstepien.com%2Fflames%2Fflames.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.janstepien.com%2Fflames%2Fflames.svg" alt="Clojure example" width="1200" height="1410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me break it down for you. We're looking at separate "towers". Here we can see 4 significantly different ones:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccphuvha02kto8okz5hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccphuvha02kto8okz5hx.png" alt="Flamegraph breakdown into 4 towers" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;lots of sharp spikes on this one, meaning several threads are working on this kind of task (the base of the tower is somewhat wide), but while processing those tasks threads are doing all kinds of things (different spikes). There's no 1 thing multiple threads would be stuck on (the same frame across multiple threads would be represented as a wide cell: the more threads - the wider the cell). I don't see any problems there.&lt;/li&gt;
&lt;li&gt;The roof of this tower is so wide that it's hard to even call it a tower. More like a plain (geographical)... The stack depth is definitely more than 30 frames. Are these busy or idle threads? If idle - we don't care much for them, but if busy - ... That's worth looking into. It's hard to imagine an idle thread that would be waiting for new tasks while calling &lt;code&gt;Object.hashCode()&lt;/code&gt;, so these must be busy threads. Why are soooo many threads (because the column is very wide) stuck calling &lt;code&gt;Object.hashCode()&lt;/code&gt;? This definitely looks like a severe performance problem, as normally hash calculation is too quick to even appear in StackTraces. And now we have not one, not two, but over 50% of all the threads stuck there! Does this &lt;code&gt;hashCode()&lt;/code&gt; function have some custom implementation? Need to look into it.&lt;/li&gt;
&lt;li&gt;This tower is the narrowest of the 4, and the rooftop has lots of antennae and no flat planes for a helicopter to land on. That looks like a healthy execution - I don't see any problems there!&lt;/li&gt;
&lt;li&gt;That's a relatively wide tower, with a wide base. But look at the top - it has a few spikes and a big flat surface at the very top! This flat rooftop suggests that several threads are stuck at an identical function call and cannot proceed further. What's written on that rooftop...? I can't see well. Zoom in on that tower by clicking on one of the widest frames on it. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff52m2r4m3xzb5cim7tbf.png" alt="Zoomed-in tower 4" width="800" height="422"&gt; There, all better now. Looks like several threads are stuck calling &lt;code&gt;Object.clone()&lt;/code&gt;. It could be normal in some cases because &lt;code&gt;clone()&lt;/code&gt; might have to traverse a large graph of objects (if it's a deep clone), or it might be another performance issue. Might be worth looking into, right after the elephant named &lt;code&gt;Object.hashCode()&lt;/code&gt; leaves the room.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Reverse-roots Flame graph
&lt;/h3&gt;

&lt;p&gt;Once you get a good grip on FlameGraphs, I'd like to introduce Reverse-roots Flamegraph (RrFG) to you. Allow me to do that through an example.&lt;/p&gt;

&lt;p&gt;Consider the FlameGraph below. The application is sluggish and takes its time to respond to some requests, while it deals with other requests very quickly. The common denominator between the slow requests is that the application is making downstream network calls in order to process those requests. So we suspect network-related issues. Now in this FlameGraph we see several high columns, some with flat rooftops, others with spiky ones. How do we know which ones are making network calls? Well, we hover a mouse over each and every rooftop and see what the popup says - whether it's a &lt;code&gt;socketRead0()&lt;/code&gt; or not. And this is a tedious and error-prone job to do. See the FlameGraph below -- the circled rooftops are &lt;code&gt;socketRead0()&lt;/code&gt; calls. Can you tell how many threads are doing that call? I certainly can't, w/o a notepad, a pencil, a calculator and half an hour to spare. Wouldn't it be nice to have all the similar "roofs" aggregated in one place?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq4qiqytgg2aee1lrabj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq4qiqytgg2aee1lrabj.png" alt="Normal FlameGraph" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the RrFG comes in. And it does exactly that: it reverses all the StackTraces and draws the FlameGraph as before. Before reversing, most of the STs started (at the bottom) with &lt;code&gt;Thread.run()&lt;/code&gt; and ended with &lt;code&gt;Object.wait()&lt;/code&gt;. After reversing, these STs now start with &lt;code&gt;Object.wait()&lt;/code&gt; and end with &lt;code&gt;Thread.run()&lt;/code&gt;. Not only does this manoeuvre change the order of reading a stack, it also sums up all the &lt;code&gt;Object.wait()&lt;/code&gt; calls into a single cell at the very bottom. Similarly, it aggregates all the &lt;code&gt;socketRead0()&lt;/code&gt; calls as well. Now we can see the total number of &lt;code&gt;socketRead0()&lt;/code&gt; calls, how they compare to other calls across the threads, and where these calls originated from (by reading up through the column).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmzhgsaaerigut8ji9e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmzhgsaaerigut8ji9e6.png" alt="RrFG" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;
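&lt;p&gt;If you build your graphs with &lt;code&gt;flamegraph.pl&lt;/code&gt;, its &lt;code&gt;--reverse&lt;/code&gt; flag produces this kind of graph for you. Conceptually the reversal is trivial -- every stack in the folded, one-line-per-stack input gets flipped before drawing. A sketch (&lt;code&gt;reverse_folded&lt;/code&gt; is a hypothetical name of mine):&lt;/p&gt;

```shell
# Sketch of what a stack-reversed flame graph does to the folded input:
# "root;...;leaf COUNT" becomes "leaf;...;root COUNT", so identical
# leaves (e.g. socketRead0) merge into one cell at the very bottom.
reverse_folded() {
  awk '{ n = split($1, f, ";"); rev = f[n]
         for (i = n - 1; i >= 1; i--) rev = rev ";" f[i]
         print rev, $2 }' "$1"
}
```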

&lt;h3&gt;
  
  
  Series of STs. Profiling
&lt;/h3&gt;

&lt;p&gt;Analysing a StackTrace gives you plenty of useful information to work with, but all of that assumes you have the right ST. In some cases it might be difficult to obtain an ST reflecting the problem in question, especially when the problem lasts under a second, more so when it's not easy to reproduce or its occurrences are unpredictable. So instead of fishing sprats with a harpoon, we can cast a net and hope to catch at least one of them. We can do so by capturing StackTraces of the same process every 500ms or even more frequently and appending them to a file. This increases the chance that we'll catch the application at the time it's struggling and capture the informative ST we need for analysis.&lt;/p&gt;
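&lt;p&gt;Casting that net can be as simple as a shell loop. A sketch assuming &lt;code&gt;jstack&lt;/code&gt; (swap in your runtime's tool for non-JVM processes; &lt;code&gt;poll_stacks&lt;/code&gt; is my own name):&lt;/p&gt;

```shell
# poll_stacks PID COUNT [INTERVAL]: append COUNT timestamped ThreadDumps
# of PID to td-PID.txt, INTERVAL seconds (default 0.5) apart.
poll_stacks() {
  local pid="$1" count="$2" interval="${3:-0.5}"
  for _ in $(seq 1 "$count"); do
    { echo "=== $(date +%T) ==="; jstack -l "$pid"; } >> "td-$pid.txt"
    sleep "$interval"
  done
}
```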

&lt;p&gt;Suppose you're a micromanager. You have a developer under you and you want to know what he's doing. The easiest approach is to come by his desk every 2 minutes and see for yourself. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;09:00-09:28 he's absent&lt;/li&gt;
&lt;li&gt;09:30-12:44 he's at his desk&lt;/li&gt;
&lt;li&gt;12:46-13:18 he's absent again&lt;/li&gt;
&lt;li&gt;13:20-15:52 he's at his desk&lt;/li&gt;
&lt;li&gt;15:54-15:58 he's absent&lt;/li&gt;
&lt;li&gt;16:00-18:00 he's at his desk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing the results, we can speculate that the developer was 28 minutes late in the morning. Then he was working uninterrupted until 12:46, when he probably took a lunch break. At 13:20 he came back and worked continuously to the EOD, with a single short break at 15:54 - probably a bathroom break. What happened between each 2-minute poll - we don't know; maybe he was working, maybe running around naked - there's no way to know. But from the sequence of data we have, it seems like he's a pretty good and hard-working employee.&lt;/p&gt;

&lt;p&gt;We can do the same thing with a sequence of StackTraces. If we poll the application's stack often enough, we may get a good enough idea of what it's doing over time: where it is fast, where it is slow, and where it idles. The closer the polls are, the better granularity of the application's workings we get, and the more accurate an application profile we can create. Of course, polling for 8 hours every 500ms or less might be overkill, not to mention the amount of data created. It's a good idea to only profile an application during the period it's having issues. As for the number of STs to analyse -- either squash all those STs into a single FlameGraph and see at which functions the application was spending most of its time, or create a FlameGraph for each capture of the STs and compare them with each other. Or, you know, group STs by seconds (or 5sec, or so), generate FlameGraphs out of each of the groups and compare them visually to each other. Or do all of the above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfalls
&lt;/h2&gt;

&lt;p&gt;While StackTraces are tremendously useful diagnostic information, we should be careful both when capturing and when analysing them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capturing
&lt;/h3&gt;

&lt;p&gt;It's very important WHEN the StackTrace is captured. In order to be a useful piece of diagnostic data, it must be captured at the right time -- right when the application is experiencing issues. Even 1 second too late or too soon makes a world of difference and renders the ST unfit for analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysing
&lt;/h3&gt;

&lt;p&gt;Having a tool as powerful as FlameGraph, you might be tempted to start using phrases like "this method took longer to execute than &amp;lt;...&amp;gt;". There's no "longer" in StackTraces, just as there's no "quicker". &lt;strong&gt;There is no time duration in StackTraces&lt;/strong&gt;, just like there's no timespan in a 20-year-old picture of you as a kid. Merely by looking at a picture I would not be able to tell how long you were standing in that pose. And this is especially the case when squashing a ThreadDump with lots of running threads into a FlameGraph. It didn't take &lt;code&gt;Object.hashCode()&lt;/code&gt; 50% longer to execute than any other method in the application. The right way to say it would be: "50% of busy threads were seen calling &lt;code&gt;Object.hashCode()&lt;/code&gt;". And knowing the swift nature of &lt;code&gt;hashCode()&lt;/code&gt; we may &lt;strong&gt;speculate&lt;/strong&gt; that it's not normal for this method to appear in STs unless it takes longer to execute than expected.&lt;/p&gt;

&lt;p&gt;We &lt;em&gt;may&lt;/em&gt; have a sense of time if we capture a series of STs and squash them into a FlameGraph, because now we do have a timespan - these STs have been captured over time. But be careful, as the dimension of time is mixed with the dimension of threads-per-TD. Each snapshot you take may have a number of threads executing some function calls. And as tasks are executed, they tend to be picked up by different threads. And no one says that an execution can't last for 500ms sharp -- just long enough for you not to notice any difference between ST polls. And if you don't notice either thread swaps or execution durations (which you probably won't, as you don't know what happens between ST polls), you'll be tempted to assume a continuous execution of some duration.&lt;/p&gt;

&lt;p&gt;Be careful. StackTraces have nothing to do with the duration of execution. They might suggest it, assuming you capture a sequence of STs with fine enough granularity, but there is always a chance you missed an interruption or change of execution between the polls.&lt;/p&gt;
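&lt;p&gt;To make the "counting, not timing" point concrete, here's a minimal self-sampling sketch using the standard &lt;code&gt;ThreadMXBean&lt;/code&gt; API. It polls all thread stacks a few times and counts how often each top frame is seen -- which is the only kind of claim a ST-based profile can honestly make:&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class StackSampler {

    // Poll all thread stacks `samples` times, `intervalMs` apart, and count
    // how often each top frame shows up. The counts say "X% of samples saw
    // this method on top" -- they say nothing about how long it ran.
    public static Map<String, Integer> sampleTopFrames(int samples, long intervalMs) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length == 0) continue; // thread idle or not started
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        sampleTopFrames(20, 10).forEach((frame, n) -> System.out.println(n + "\t" + frame));
    }
}
```

&lt;p&gt;Real samplers (e.g. async-profiler, or &lt;code&gt;jstack&lt;/code&gt; in a loop) work on the same counting principle.&lt;/p&gt;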

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=ij-threads-stack-trace-threads-1" rel="noopener noreferrer"&gt;https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=ij-threads-stack-trace-threads-1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unix.stackexchange.com/questions/166541/how-to-know-where-a-program-is-stuck-in-linux" rel="noopener noreferrer"&gt;https://unix.stackexchange.com/questions/166541/how-to-know-where-a-program-is-stuck-in-linux&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>performance</category>
      <category>analysis</category>
      <category>stack</category>
      <category>threaddump</category>
    </item>
    <item>
      <title>JVM. Garbage Collector</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Thu, 09 Jun 2022 07:19:13 +0000</pubDate>
      <link>https://forem.com/netikras/jvm-garbage-collector-5hc4</link>
      <guid>https://forem.com/netikras/jvm-garbage-collector-5hc4</guid>
      <description>&lt;h2&gt;
  
  
  Meet the JVM Memory Manager (a.k.a. Garbage Collector, a.k.a. GC)
&lt;/h2&gt;

&lt;p&gt;In a native application, your code asks the OS to allocate memory for its work. In a (pseudo)interpreted application you ask the interpreter/engine/VM to give you some memory. If you recall, JVM decouples application code from the platform nuances, so that the developer would not have to worry about the HOWs. The developer should only be concerned about the application code and leave JVM tuning for the middleware specialists.&lt;/p&gt;

&lt;p&gt;Now, since JVM has its approach when it comes to memory management (memory pools), it's created a great opportunity to tweak things, but also a great problem - how to manage things internally. JVM has this concept of Garbage Collector (GC). It's "a guy", who overlooks all the memory. Call it the JVM Memory Manager if it suits you better.&lt;/p&gt;

&lt;p&gt;Like any other manager IRL, management is a role. You can have many managers, each with its style and approach. Despite their differences, they all have the same goal: to make things happen without a crash. JVM GC is also a role, and there are multiple GC implementations out there - each with its approaches to how they achieve its goals. But they all have the same goals and the same responsibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  GC responsibilities
&lt;/h2&gt;

&lt;p&gt;When we are talking about GC, we are talking about GC-managed memory pools. Basically, there's one major pool, managed by GC - it's Heap. Different sources (even Ora docs) disagree about the location of PermGen/Metaspace (classloaders), nevertheless, this region is also partially managed by GC. In this series of posts, I'll be referring to Heap when talking about GC-managed memory, unless stated otherwise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory allocation
&lt;/h3&gt;

&lt;p&gt;All the memory allocations in Heap are proxied through the GC. Your code asks GC to allocate memory for an object, GC picks the best spot for it, allocates, and returns you a reference to that sweet spot. GC is very much familiar with the generational Heap layout (YoungGen/OldGen/Eden/Survivors) and it's entirely up to the GC to choose what, how, and when to do with all that memory in its possession.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory cleanup
&lt;/h3&gt;

&lt;p&gt;JVM is a very tight ship, perhaps even understaffed. Every crew member has to do their part outstandingly to stay afloat AND to deliver the cargo undamaged. GC is in charge of memory management, and it's the GC's responsibility to make sure there is enough memory to use when needed. Preferably, GC's actions themselves should consume as few resources as possible and impact the ship's journey as little as possible. The goal is to always have available memory when required, keep the used memory undamaged, and not slow down the ship or the crew.&lt;/p&gt;

&lt;p&gt;Now let's return from the analogy. GC is also in charge of memory cleanup. It keeps track of all the objects allocated in the memory and discards them as soon as these objects are no longer needed. However, operating on a references pool that's actively used in real-time is nothing short of heart surgery. To put it mildly, it's a complicated task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory reservation/release (from the OS perspective)
&lt;/h3&gt;

&lt;p&gt;JVM has this concept of "Committed Memory". Imagine memory as a sandbox.  You want to build a castle in a hostile sandbox (other kids want to play too!).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You build a sand wall and reserve some space for your castle&lt;/li&gt;
&lt;li&gt;You build your castle&lt;/li&gt;
&lt;li&gt;You also want to build other buildings, but you're now too close to the wall. So you look around, find an unused spot in the sandbox, and move part of your wall in that direction. Then you rebuild the missing segments of that wall. Hooray! Now you have room for your stables!&lt;/li&gt;
&lt;li&gt;You build other buildings&lt;/li&gt;
&lt;li&gt;You accidentally tear down part of your castle. Rebuilding it with dry sand is a pain, so you decide - what the hell - let's make a smaller castle.&lt;/li&gt;
&lt;li&gt;Now you have plenty of unused space in your territory. Other kids want to play too... So you shrink your territory by moving a part of your wall inwards, this way releasing plenty of space for other kids' toy trucks to move around.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the kids are playing in the same sandbox. Edges of the sandbox represent system memory bounds. You can't build a castle outside the sandbox.&lt;br&gt;
Your wall is GC's committed memory. It reserves a good share of the system memory and allows you to allocate objects in that region. If you have very few objects, there's no need to have a kingdom this large, so the GC shrinks the committed memory to more manageable and reasonable levels. If your kingdom is back to the golden age - the GC grows the Committed area back up and beyond.&lt;/p&gt;

&lt;p&gt;You can tell the GC how to recognize whether your application is in its golden age or dark times.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:MinHeapFreeRatio&lt;/code&gt; (e.g. &lt;code&gt;-XX:MinHeapFreeRatio=10&lt;/code&gt;) will tell the GC to commit more memory if there's only &lt;em&gt;this&lt;/em&gt; much (%) left unused. In this example, as soon as 90% of reserved memory is consumed, GC is allowed to grow its committed region.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxHeapFreeRatio&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxHeapFreeRatio=60&lt;/code&gt;) will tell the GC to uncommit memory if &lt;em&gt;this&lt;/em&gt; much (%) is unused. In this example, if only 40% of reserved memory is actually used, GC is allowed to shrink its committed region, releasing memory to the OS.&lt;/li&gt;
&lt;/ul&gt;
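&lt;p&gt;The ratio the GC compares against those flags can be computed from inside the JVM with the standard &lt;code&gt;Runtime&lt;/code&gt; API -- a small sketch:&lt;/p&gt;

```java
public class HeapFreeRatio {

    // The free ratio the GC weighs against -XX:MinHeapFreeRatio /
    // -XX:MaxHeapFreeRatio: the share of *committed* heap currently unused.
    public static double freeRatioPercent() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory(); // what the GC has reserved from the OS
        long free = rt.freeMemory();       // unused part of the committed region
        return 100.0 * free / committed;
    }

    public static void main(String[] args) {
        System.out.printf("committed=%d MiB, free ratio=%.1f%%%n",
                Runtime.getRuntime().totalMemory() >> 20, freeRatioPercent());
        // Below MinHeapFreeRatio the GC may grow the committed region;
        // above MaxHeapFreeRatio it may shrink it (implementations vary).
    }
}
```

&lt;p&gt;Running it with, say, &lt;code&gt;-Xms16m -Xmx512m&lt;/code&gt; while allocating in a loop lets you watch the committed region (and the ratio) move.&lt;/p&gt;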

&lt;p&gt;Setting those flags doesn't mean the GC will necessarily do as told. Some GCs might ignore those flags, others - use them more as guidelines than rules (&lt;em&gt;just like Hector Barbossa said&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;While this additional complexity may seem unnecessary, it actually is an excellent optimization technique. In fact, some GCs usually ignore the flags suggesting when to shrink the Heap and don't shrink it at all. That's because of how applications allocate memory in RAM. Now, RAM is solely managed by the OS. Any application that wants a sliver of memory must ask the OS to be so kind and provide it. Asking the OS to do &lt;em&gt;anything&lt;/em&gt; means the application has to issue a &lt;em&gt;syscall&lt;/em&gt;. Syscalls are functions of the OS API that applications can invoke to make the OS do something for them. The problem here is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;issuing a syscall means crossing from userspace to kernelspace and returning some value back from kernelspace to userspace. Each syscall invocation runs through some checks and validations, which are time-consuming.&lt;/li&gt;
&lt;li&gt;all the syscalls are synchronous/blocking, meaning the application has to wait for the syscall to return the value from the OS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each memory allocation requires at least 1 syscall (&lt;code&gt;malloc()&lt;/code&gt;/&lt;code&gt;mmap()&lt;/code&gt;), and each time some memory block is no longer needed, it takes 1 syscall to release it (&lt;code&gt;free()&lt;/code&gt;/&lt;code&gt;munmap()&lt;/code&gt;). Knowing how many objects a JVM creates each second, all these operations could add up to visible delays. In order to prevent that, GCs tend to avoid those syscalls as much as possible: they allocate memory only once and hardly ever release it back to the OS. When an object is collected, instead of calling &lt;code&gt;free()&lt;/code&gt;, GC flags that object's region as FREE in its own books with the intention to reuse this memory for other objects. This way the GC ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there's no unnecessary overhead in calling the OS API&lt;/li&gt;
&lt;li&gt;the GC will have an available block of memory for the JVM when it needs one, and other applications leaking memory on the same server will not affect java's performance (i.e. if other processes consume all the available RAM, JVM will still be able to operate within the bounds of memory it has already reserved (&lt;em&gt;committed&lt;/em&gt;))&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This behaviour oftentimes causes headaches for OPS: the &lt;code&gt;java&lt;/code&gt; process memory consumption grows very high and then &lt;em&gt;plateaus&lt;/em&gt; - it looks like the JVM might have a memory leak or, as many falsely believe, the JVM is incredibly memory-hungry / memory-inefficient. These are but common misconceptions based on a lack of knowledge of how the JVM/GC works and why it does things the way it does. I hope I've managed to clear this up - &lt;strong&gt;it's an optimization technique&lt;/strong&gt;.&lt;/p&gt;
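&lt;p&gt;You can see the difference between "memory the process holds" and "memory the application uses" via the standard &lt;code&gt;MemoryMXBean&lt;/code&gt; -- a quick sketch:&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class CommittedVsUsed {
    public static void main(String[] args) {
        // What OPS sees (RSS growing, then plateauing) is mostly `committed`;
        // what the application actually occupies is `used`.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("used      = " + (heap.getUsed() >> 20) + " MiB");
        System.out.println("committed = " + (heap.getCommitted() >> 20) + " MiB");
        System.out.println("max       = " + (heap.getMax() >> 20) + " MiB");
        // used <= committed <= max always holds; a large committed-vs-used gap
        // is reserved-but-reusable memory, not a leak.
    }
}
```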
&lt;h2&gt;
  
  
  GC tuning
&lt;/h2&gt;

&lt;p&gt;As of today, there are 7 GC implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SerialGC&lt;/li&gt;
&lt;li&gt;ParallelGC (default up through jre8)&lt;/li&gt;
&lt;li&gt;CMS&lt;/li&gt;
&lt;li&gt;G1 (defult for jre9 onwards)&lt;/li&gt;
&lt;li&gt;EpsilonGC&lt;/li&gt;
&lt;li&gt;ShenandoahGC&lt;/li&gt;
&lt;li&gt;ZGC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of them approach the problem very differently and use different algorithms to maintain JVM's memory. For this reason, it is very difficult to simply write down instructions on how to tune the GC properly. Each GC has its own knobs and levers, each GC is tuned differently.&lt;/p&gt;
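&lt;p&gt;Before tuning anything, it helps to confirm which collector you're actually running. Each GC registers itself as one or more &lt;code&gt;GarbageCollectorMXBean&lt;/code&gt;s, and the bean names give it away (e.g. "G1 Young Generation"/"G1 Old Generation" for G1, "PS Scavenge"/"PS MarkSweep" for ParallelGC) -- a small sketch:&lt;/p&gt;

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class WhichGC {
    public static void main(String[] args) {
        // One bean per collector; names identify the GC implementation in use.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```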

&lt;p&gt;There still are some things in common that could help you tune GCs. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All the generational GCs (&lt;strong&gt;not all GC implementations are generational&lt;/strong&gt;) have a concept of MinorGC (YoungGen collections) and MajorGC (OldGen collections). It's the MajorGCs that are usually the problem, as they slow down or stop the JVM.&lt;/li&gt;
&lt;li&gt;As soon as OldGen gets full, a MajorGC is triggered; as soon as Eden fills up, a MinorGC is triggered.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can configure JVM aiming for either of the modes (assuming '.....' is application runtime and '##' is GC pauses):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput mode: aims for better overall throughput of the application. The tradeoff usually is longer GC pauses. Graphically this looks like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;....................#######.................######...............#######....
&lt;/code&gt;&lt;/pre&gt;


&lt;ul&gt;
&lt;li&gt;Short-pause mode: aims to keep GC pauses as short as possible. The tradeoff is somewhat lower overall throughput (e.g. the GC cleans garbage while the application does its work, slowing the application down a little). This mode also requires more memory for bookkeeping. Graphically this looks like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.#....#.....##.....###...#......#.....#.......##.....######.......#.....#...
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can choose different GCs for YoungGen and OldGen (although not all the possible combinations are available).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Larger YoungGen causes fewer MinorGCs and fewer objects promoted to OldGen. However, the OldGen itself will be smaller and you're likely to see more MajorGCs unless the objects do not live long enough to be promoted to OldGen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Too small survivors might cause large objects (e.g. large collections) to be promoted directly to the OldGen. If an application creates lots of large collections and discards them quickly, you might want to keep an eye on that survivors' region.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Finding unused objects
&lt;/h3&gt;

&lt;p&gt;An unused object (or a no-longer-needed object) is an object which is no longer referred to. The problem is: referred to from what? From another object? True. But what does that other object have to be referred to by? See where this is going?&lt;/p&gt;

&lt;p&gt;Where is the starting node of the graph?&lt;/p&gt;

&lt;p&gt;There are a few. And they are called GC Roots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Classes loaded by the system class loader (not custom class loaders)&lt;/li&gt;
&lt;li&gt;  Live threads&lt;/li&gt;
&lt;li&gt;  Local variables and parameters of the currently executing methods&lt;/li&gt;
&lt;li&gt;  Local variables and parameters of JNI methods&lt;/li&gt;
&lt;li&gt;  Global JNI references&lt;/li&gt;
&lt;li&gt;  Objects used as a monitor for synchronization&lt;/li&gt;
&lt;li&gt;  Objects held from garbage collection by JVM for its own purposes&lt;/li&gt;
&lt;/ul&gt;
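&lt;p&gt;Reachability from a GC root is easy to demonstrate with a &lt;code&gt;WeakReference&lt;/code&gt;. A static field of a loaded class acts as a root: while it points at the object, the object must survive; cut that path and the object becomes collectable. Note that &lt;code&gt;System.gc()&lt;/code&gt; is only a hint, not a command -- hence the retry loop in this sketch:&lt;/p&gt;

```java
import java.lang.ref.WeakReference;

public class Reachability {
    static Object root; // a static field of a loaded class acts as a GC root

    public static boolean collectedAfterUnrooting() {
        root = new Object();
        WeakReference<Object> ref = new WeakReference<>(root);
        System.gc();                         // still reachable from the root...
        if (ref.get() == null) return false; // ...so it must survive
        root = null;                         // cut the path from the GC root
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc();                     // only a hint -> retry a few times
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;            // unreachable -> eligible for collection
    }

    public static void main(String[] args) {
        System.out.println("collected once unrooted: " + collectedAfterUnrooting());
    }
}
```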

&lt;p&gt;GC traverses all the references starting with each GC root, finds and &lt;strong&gt;marks&lt;/strong&gt; all the objects, that are still being used. This is called &lt;strong&gt;the marking phase&lt;/strong&gt; of collection.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-76.png" alt="Scanning the references' graph"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-82.png" alt="The marking phase"&gt;&lt;/a&gt;&lt;br&gt;
 Once all the live objects are tagged, GC scans all the objects again and removes the ones that do not have the mark. This is &lt;strong&gt;the sweeping phase&lt;/strong&gt; of collection.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2021%2F01%2Fimage-83.png" alt="The sweeping phase"&gt;&lt;/a&gt;&lt;br&gt;
 Since the graph is in use by the application, it's very difficult to mark all the objects in an ever-changing graph. This is why GCs tend to make marking a Stop-The-World phase, during which the application is stopped and only GC threads are running. This is a classical GC behavior, which some GCs implement as-is, while others augment it in some way.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fragmentation
&lt;/h3&gt;

&lt;p&gt;Suppose you have a memory region. &lt;code&gt;#&lt;/code&gt; represents contiguous used memory blocks, and &lt;code&gt;$&lt;/code&gt; represents memory blocks that can be collected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|#$##$$##    $#$#$#$  #$$$# ##$#$$##      |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After collecting garbage, the same layout looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|# ##  ##     # # #   #   # ## #  ##      |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's spotty. It's sparse. The memory became fragmented. Now JVM has to keep track of all the free regions and decide which new objects fit into which slots best.&lt;/p&gt;

&lt;p&gt;Another problem with fragmentation is a premature OutOfMemoryError. Suppose, after the collection you want to allocate 2 contiguous blocks of 6x&lt;code&gt;@&lt;/code&gt; size (large arrays). You can place the first block at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|# ##  ##     # # #   #   # ## #  ##@@@@@@|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But where does the second one go? There's plenty of free space in the memory, but there are no slots to fit the second 6x&lt;code&gt;@&lt;/code&gt; long block. This situation yields a premature OutOfMemoryError.&lt;/p&gt;
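&lt;p&gt;The premature-OOM effect is easy to model: treat the heap as a row of blocks and compare the total free space with the largest &lt;em&gt;contiguous&lt;/em&gt; free run. This is a toy simulation, not JVM code:&lt;/p&gt;

```java
public class Fragmentation {

    // Simulate a heap as a bitmap: true = used block, false = free block.
    // Returns the largest run of contiguous free blocks.
    public static int largestFreeRun(boolean[] heap) {
        int best = 0, run = 0;
        for (boolean used : heap) {
            run = used ? 0 : run + 1;
            if (run > best) best = run;
        }
        return best;
    }

    public static int totalFree(boolean[] heap) {
        int free = 0;
        for (boolean used : heap) if (!used) free++;
        return free;
    }

    public static void main(String[] args) {
        // The post-collection layout from above: '#' = used, ' ' = free
        String layout = "# ##  ##     # # #   #   # ## #  ##      ";
        boolean[] heap = new boolean[layout.length()];
        for (int i = 0; i < heap.length; i++) heap[i] = layout.charAt(i) == '#';
        System.out.println("total free blocks: " + totalFree(heap));
        System.out.println("largest free run : " + largestFreeRun(heap));
        // Plenty of free space in total, but any allocation larger than the
        // largest run fails anyway -- the premature OutOfMemoryError above.
    }
}
```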

&lt;p&gt;To deal with such situations, GCs should be able to compact the remaining memory blocks, preferably into a single contiguous block. Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|###############                          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can easily allocate four 6x&lt;code&gt;@&lt;/code&gt; memory blocks without a fuss. This defragmentation in JVM terms is called &lt;strong&gt;compaction&lt;/strong&gt; of the memory. Different GC implementations compact memory in different ways: some use a mark-and-copy algorithm, others mark-sweep-compact, and so on.&lt;/p&gt;

&lt;p&gt;You can read more about the &lt;a href="https://www.geeksforgeeks.org/mark-and-sweep-garbage-collection-algorithm/" rel="noopener noreferrer"&gt;Mark And Sweep garbage collection phases in this well-written post&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another source of fragmentation is &lt;strong&gt;Local Allocation Buffers&lt;/strong&gt; (Promotion and Thread): PLABs and TLABs.&lt;/p&gt;

&lt;p&gt;YoungGen (Eden) is fragmented by TLABs: Eden is divided into chunks of various sizes (e.g. 5xsize=1, 17xsize=2, 7x3, 83x4, etc... - the number of chunks of each size is estimated based on statistics), and each thread reserves several such memory chunks for future allocations. If all threads only allocate objects of size=4, they will eventually exhaust all the available chunks and will only be left with smaller ones. While larger ones can be split to fit a smaller object, the smaller chunks are not large enough to fit such objects and cannot be joined together. As a result, Eden still has plenty of free space (smaller chunks), but it is unable to allocate a contiguous memory block for another large object. MinorGC is invoked, and statistics are updated.&lt;/p&gt;

&lt;p&gt;OldGen (and YoungGen Survivors) is fragmented by PLABs. With each MinorGC some objects will probably be promoted to the higher memory pool (Eden-&amp;gt;survivor; survivor-&amp;gt;OldGen). It's all great when the promotion is done with just 1 thread. However, when there are multiple threads, we want them to avoid locking each other. Just like TLABs avoid locking in Eden (each thread has its own isolated set of memory chunks in Eden), PLABs avoid locking in OldGen and Survivors. When an object is promoted, a thread moves it to a region in the higher memory pool, that is dedicated to that particular thread. This way threads do not compete for memory regions and promotion is very quick. However, as each thread has some memory regions preallocated (of different sizes), it's only natural that some regions will remain unused. That's fragmentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generational collection
&lt;/h2&gt;

&lt;p&gt;Generational garbage collectors choose which memory regions to clean up first. It's been observed that &amp;gt;90% of newly created objects are only required "here" and "now", and they become garbage soon after. This means that if there was a way to identify them as soon as possible, we could remove &amp;gt;90% of no-longer-needed memory blocks. And that's what generational collectors do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minor collection
&lt;/h3&gt;

&lt;p&gt;New objects are allocated in the Eden - a part of the YoungGen. Once Eden fills up, a GC is invoked. During minor collection the GC performs several steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;collect Survivor region&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean up the active Survivor region from dead objects&lt;/li&gt;
&lt;li&gt;find all the objects in the active Survivor region that have survived N minor collections and move them to the OldGen (N can be set with &lt;code&gt;-XX:MaxTenuringThreshold=N&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;move (hard-copy) all the remaining objects from the active Survivor region to the second Survivor region&lt;/li&gt;
&lt;li&gt;clean the active Survivor region&lt;/li&gt;
&lt;li&gt;deactivate the active Survivor region and activate the second Survivor region (swap them)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;collect the Eden region&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify objects that are no longer needed and remove them&lt;/li&gt;
&lt;li&gt;move all the remaining objects to the currently active Survivor region

&lt;ul&gt;
&lt;li&gt;if there is not enough room in the Survivor region, overflow to the OldGen (i.e. whatever doesn't fit in Survivor, move them directly to the OldGen)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;MinorGC is very effective, as most of the garbage is collected right there. Whatever is not collected upon Eden collection, is likely to be no longer needed and collected before copying from one Survivor to another. TenuringThreshold gives surviving objects more time to become irrelevant while they are still in the YoungGen. If an object survives more collections than set with the MaxTenuringThreshold parameter, the object is considered as &lt;em&gt;old&lt;/em&gt; and is promoted to the OldGen during the Minor GC.&lt;/p&gt;

&lt;p&gt;Typically MinorGC is a completely or mostly StopTheWorld collection, but the small amount of live data makes those pauses short and mostly irrelevant.&lt;/p&gt;
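&lt;p&gt;You can watch MinorGCs happen by churning out short-lived garbage and reading the collection counters. A sketch -- it assumes a default (non-Epsilon) collector and enough allocation to fill Eden at least once:&lt;/p&gt;

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class MinorGCDemo {

    static long totalCollections() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            n += Math.max(0, gc.getCollectionCount()); // count may be -1 if undefined
        return n;
    }

    // Allocate `megabytes` of short-lived garbage. Almost all of it dies in
    // Eden, so the collections this triggers are cheap minor collections.
    public static long collectionsAfterGarbage(long megabytes) {
        long before = totalCollections();
        byte[] chunk;
        for (long i = 0; i < megabytes; i++) {
            chunk = new byte[1 << 20]; // 1 MiB, dropped on the next iteration
            chunk[0] = (byte) i;       // touch it so it isn't optimized away
        }
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        System.out.println("collections triggered: " + collectionsAfterGarbage(4096));
    }
}
```

&lt;p&gt;Adding &lt;code&gt;-verbose:gc&lt;/code&gt; (or &lt;code&gt;-Xlog:gc&lt;/code&gt; on jre9+) shows the same collections in the log.&lt;/p&gt;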

&lt;h3&gt;
  
  
  Major collection
&lt;/h3&gt;

&lt;p&gt;OldGen is a large portion of Heap and it's also collected by the GC. OldGen collections are called &lt;em&gt;Major collections&lt;/em&gt;. Normally, it takes time for OldGen to fill up, as most of the garbage is collected with MinorGC. However, as the JVM runs, the following factors tend to eventually fill the OldGen up as well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each Minor Collection potentially moves some objects to the OldGen. &lt;/li&gt;
&lt;li&gt;Some large objects, that do not fit in Eden, are allocated directly in the OldGen.&lt;/li&gt;
&lt;li&gt;Live objects not collected during Eden collection are moved to the active Survivor region and, if the Survivor fills up, it overflows to the OldGen.&lt;/li&gt;
&lt;li&gt;Garbage objects in Eden with their &lt;code&gt;finalize()&lt;/code&gt; methods overridden are not released during MinorGC - instead they are moved to the active Survivor region, which is likely to fill up and overflow to OldGen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typically, OldGen collections are partially or completely StopTheWorld collections. It's the amount of data that makes Major collection pauses lengthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full collection
&lt;/h3&gt;

&lt;p&gt;This is not an official term. However, it's actively used (and feared) in the industry. A Full collection is nothing but a Minor collection followed by a Major collection. Some GCs manage to merge the two phases and execute them concurrently, while others collect one region first and then the other. Since a Full collection collects everything (YoungGen and OldGen), it's the longest collection. Usually it's triggered when a Minor collection tries to promote survivors to the OldGen but there is not enough space, which triggers a Major collection to free up some of the OldGen space.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's causing the collections?
&lt;/h2&gt;

&lt;p&gt;There's a good enough summary of GC causes on this &lt;a href="https://netflix.github.io/spectator/en/latest/ext/jvm-gc-causes/" rel="noopener noreferrer"&gt;Netflix GitHub page&lt;/a&gt; (since the time of writing it has migrated to the &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#gc-causes" rel="noopener noreferrer"&gt;Atlas docs&lt;/a&gt;). Should this page ever disappear, here's a copy-paste of its contents:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The various GC causes aren't well documented. The list provided here comes from the  &lt;a href="http://hg.openjdk.java.net/jdk8u/hs-dev/hotspot/file/tip/src/share/vm/gc_interface/gcCause.cpp" rel="noopener noreferrer"&gt;gcCause.cpp&lt;/a&gt;  file in the JDK and we include some information on what these mean for the application.&lt;/p&gt;
&lt;h3&gt;
  
  
  System.gc&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#systemgc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Something called  &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#gc()" rel="noopener noreferrer"&gt;System.gc()&lt;/a&gt;. If you are seeing this once an hour it is likely related to the RMI GC interval. For more details see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="http://www-01.ibm.com/support/docview.wss?uid=swg21173431" rel="noopener noreferrer"&gt;Unexplained System.gc() calls due to Remote Method Invocation (RMI) or explict garbage collections&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="http://docs.oracle.com/javase/6/docs/technotes/guides/rmi/sunrmiproperties.html" rel="noopener noreferrer"&gt;sun.rmi.dgc.client.gcInterval&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  FullGCAlot&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#fullgcalot" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Most likely you'll never see this value. In debug builds of the JDK there is an option,  &lt;code&gt;-XX:+FullGCALot&lt;/code&gt;, that will trigger a full GC at a regular interval for testing purposes.&lt;/p&gt;
&lt;h3&gt;
  
  
  ScavengeAlot&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#scavengealot" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Most likely you'll never see this value. In debug builds of the JDK there is an option,  &lt;code&gt;-XX:+ScavengeALot&lt;/code&gt;, that will trigger a minor GC at a regular interval for testing purposes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Allocation_Profiler&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#allocation_profiler" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Before java 8 you would see this if running with the  &lt;code&gt;-Xaprof&lt;/code&gt;  setting. It would be triggered just before the JVM exits. The  &lt;code&gt;-Xaprof&lt;/code&gt;  option was removed in java 8.&lt;/p&gt;
&lt;h3&gt;
  
  
  JvmtiEnv_ForceGarbageCollection&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#jvmtienv_forcegarbagecollection" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Something called the JVM tool interface function  &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jvmti.html#ForceGarbageCollection" rel="noopener noreferrer"&gt;ForceGarbageCollection&lt;/a&gt;. Look at the  &lt;code&gt;-agentlib&lt;/code&gt;  param to java to see what agents are configured.&lt;/p&gt;
&lt;h3&gt;
  
  
  GCLocker_Initiated_GC&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#gclocker_initiated_gc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The GC locker prevents GC from occurring when JNI code is in a  &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jni/functions.html#getprimitivearraycritical-releaseprimitivearraycritical" rel="noopener noreferrer"&gt;critical region&lt;/a&gt;. If GC is needed while a thread is in a critical region, then it will allow them to complete, i.e. call the corresponding release function. Other threads will not be permitted to enter a critical region. Once all threads are out of critical regions a GC event will be triggered.&lt;/p&gt;
&lt;h3&gt;
  
  
  Heap_Inspection_Initiated_GC&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#heap_inspection_initiated_gc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;GC was initiated by an inspection operation on the heap. For example, you can trigger this with  &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/specs/man/jmap.html" rel="noopener noreferrer"&gt;jmap&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ jmap -histo:live &amp;lt;pid&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Heap_Dump_Initiated_GC&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#heap_dump_initiated_gc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;GC was initiated before dumping the heap. For example, you can trigger this with  &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/specs/man/jmap.html" rel="noopener noreferrer"&gt;jmap&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ jmap -dump:live,format=b,file=heap.out &amp;lt;pid&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Another common example would be clicking the Heap Dump button on the Monitor tab in  &lt;a href="https://visualvm.github.io/" rel="noopener noreferrer"&gt;VisualVM&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  WhiteBox_Initiated_Young_GC&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#whitebox_initiated_young_gc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Most likely you'll never see this value. Used for testing hotspot, it indicates something called  &lt;code&gt;sun.hotspot.WhiteBox.youngGC()&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  No_GC&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#no_gc" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Used for CMS to indicate concurrent phases.&lt;/p&gt;
&lt;h3&gt;
  
  
  Allocation_Failure&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#allocation_failure" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Usually this means that there is an allocation request that is bigger than the available space in the young generation and will typically be associated with a minor GC. For G1 this will likely be a major GC and it is more common to see  &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#g1_evacuation_pause" rel="noopener noreferrer"&gt;G1_Evacuation_Pause&lt;/a&gt;  for routine minor collections.&lt;/p&gt;

&lt;p&gt;On Linux the JVM will trigger a GC if the kernel indicates there isn't much memory left via  &lt;a href="http://lwn.net/Articles/267013/" rel="noopener noreferrer"&gt;mem_notify&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tenured_Generation_Full&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#tenured_generation_full" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Not used?&lt;/p&gt;
&lt;h3&gt;
  
  
  Permanent_Generation_Full&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#permanent_generation_full" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Triggered as a result of an allocation failure in  &lt;a href="https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent" rel="noopener noreferrer"&gt;PermGen&lt;/a&gt;. Pre Java 8.&lt;/p&gt;
&lt;h3&gt;
  
  
  Metadata_GC_Threshold&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#metadata_gc_threshold" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Triggered as a result of an allocation failure in  &lt;a href="https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent" rel="noopener noreferrer"&gt;Metaspace&lt;/a&gt;. Metaspace replaced PermGen and was added in java 8.&lt;/p&gt;
&lt;h3&gt;
  
  
  CMS_Generation_Full&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#cms_generation_full" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Not used?&lt;/p&gt;
&lt;h3&gt;
  
  
  CMS_Initial_Mark&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#cms_initial_mark" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Initial mark phase of CMS, for more details see  &lt;a href="https://blogs.oracle.com/jonthecollector/entry/hey_joe_phases_of_cms" rel="noopener noreferrer"&gt;Phases of CMS&lt;/a&gt;. Unfortunately, it doesn't appear to be reported via the mbeans and we just get  &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#no_gc" rel="noopener noreferrer"&gt;No_GC&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  CMS_Final_Remark&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#cms_final_remark" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Remark phase of CMS, for more details see  &lt;a href="https://blogs.oracle.com/jonthecollector/entry/hey_joe_phases_of_cms" rel="noopener noreferrer"&gt;Phases of CMS&lt;/a&gt;. Unfortunately, it doesn't appear to be reported via the mbeans and we just get  &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#no_gc" rel="noopener noreferrer"&gt;No_GC&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  CMS_Concurrent_Mark&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#cms_concurrent_mark" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Concurrent mark phase of CMS, for more details see  &lt;a href="https://blogs.oracle.com/jonthecollector/entry/hey_joe_phases_of_cms" rel="noopener noreferrer"&gt;Phases of CMS&lt;/a&gt;. Unfortunately, it doesn't appear to be reported via the mbeans and we just get  &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#no_gc" rel="noopener noreferrer"&gt;No_GC&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Old_Generation_Expanded_On_Last_Scavenge&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#old_generation_expanded_on_last_scavenge" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Not used?&lt;/p&gt;
&lt;h3&gt;
  
  
  Old_Generation_Too_Full_To_Scavenge&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#old_generation_too_full_to_scavenge" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Not used?&lt;/p&gt;
&lt;h3&gt;
  
  
  Ergonomics&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#ergonomics" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This indicates you are using the adaptive size policy,  &lt;code&gt;-XX:+UseAdaptiveSizePolicy&lt;/code&gt;  and is on by default for recent versions, with the parallel collector (&lt;code&gt;-XX:+UseParallelGC&lt;/code&gt;). For more details see  &lt;a href="https://blogs.oracle.com/jonthecollector/entry/the_unspoken_the_why_of" rel="noopener noreferrer"&gt;The Why of GC Ergonomics&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  G1_Evacuation_Pause&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#g1_evacuation_pause" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;An evacuation pause is the most common young gen cause for G1 and indicates that it is copying live objects from one set of regions, young and sometimes young + old, to another set of regions. For more details see  &lt;a href="https://blogs.oracle.com/poonam/entry/understanding_g1_gc_logs" rel="noopener noreferrer"&gt;Understanding G1 GC Logs&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  G1_Humongous_Allocation&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#g1_humongous_allocation" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A humongous allocation is one where the size is greater than 50% of the G1 region size. Before a humongous allocation, the JVM checks if it should do a routine  &lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#g1_evacuation_pause" rel="noopener noreferrer"&gt;evacuation pause&lt;/a&gt;  without regard to the actual allocation size, but if triggered due to this check the cause will be listed as humongous allocation. This cause is also used for any collections used to free up enough space for the allocation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Last_ditch_collection&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#last_ditch_collection" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;For perm gen (java 7 or earlier) and metaspace (java 8+) the last-ditch collection will be triggered if an allocation fails and the memory pool cannot be expanded.&lt;/p&gt;
&lt;h3&gt;
  
  
  ILLEGAL_VALUE_-&lt;em&gt;last_gc_cause&lt;/em&gt;-_ILLEGAL_VALUE&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#illegal_value_-last_gc_cause-_illegal_value" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Included for completeness, but you should never see this value.&lt;/p&gt;
&lt;h3&gt;
  
  
  unknown_GCCause&lt;a href="https://netflix.github.io/atlas-docs/spectator/lang/java/ext/jvm-gc-causes/#unknown_gccause" rel="noopener noreferrer"&gt;¶&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Included for completeness, but you should never see this value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.fasterj.com/articles/oraclecollectors1.shtml" rel="noopener noreferrer"&gt;http://www.fasterj.com/articles/oraclecollectors1.shtml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freecodecamp.org/news/garbage-collection-in-java-what-is-gc-and-how-it-works-in-the-jvm/" rel="noopener noreferrer"&gt;https://www.freecodecamp.org/news/garbage-collection-in-java-what-is-gc-and-how-it-works-in-the-jvm/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@hasithalgamge/seven-types-of-java-garbage-collectors-6297a1418e82" rel="noopener noreferrer"&gt;https://medium.com/@hasithalgamge/seven-types-of-java-garbage-collectors-6297a1418e82&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/tuning_tradeoffs.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/tuning_tradeoffs.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/mark-and-sweep-garbage-collection-algorithm/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/mark-and-sweep-garbage-collection-algorithm/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://plumbr.io/handbook/garbage-collection-algorithms-implementations" rel="noopener noreferrer"&gt;https://plumbr.io/handbook/garbage-collection-algorithms-implementations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dynatrace.com/resources/ebooks/javabook/the-three-jvms/" rel="noopener noreferrer"&gt;https://www.dynatrace.com/resources/ebooks/javabook/the-three-jvms/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>java</category>
      <category>jvm</category>
      <category>gc</category>
      <category>memory</category>
    </item>
    <item>
      <title>Performance Engineering. Intro</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Thu, 13 May 2021 20:32:11 +0000</pubDate>
      <link>https://forem.com/netikras/performance-engineering-intro-5eg2</link>
      <guid>https://forem.com/netikras/performance-engineering-intro-5eg2</guid>
      <description>&lt;p&gt;Everyone can code these days. At the age of the internet, it's very easy to find all the material you need to learn how to write program code. Invest a few months in it and you can start looking for an entry-level developer's position. No need for formal education, no need to waste years in universities or colleges - anyone can get a golden collar at this golden age of developers!&lt;/p&gt;

&lt;p&gt;While anyone can learn how to write code, not everyone can write quality code. Furthermore, even fewer can write efficient code or make efficient design decisions. And it is all natural, it is all expected. As time goes on, the IT field (and any other, for that matter) becomes more and more diverse. We know more, and there are more topics to cover and to know in detail. The true full-stack developer breed is getting closer to the brink of extinction every day. It is being replaced by diverse developers, devops, noops, servops, secops, perfops, *ops. In this post, I'll try to explain the need for performance engineers (perfops): what we are, what we do, how we do it and why you need us.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is performance?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: A performance engineer is there to balance all the scales in the closed system so that it manages to serve more requests with fewer expenses on resources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's look at the basics first. What is it we are doing with our applications? Why do we write them? Usually, it's because we want to automate some part of the domain. This automation requires computing resources (it used to require human resources before!): CPU time, memory, disk storage, network throughput. And our application translates those resources into the results of some work done. These results are required by application users. So application users (indirectly) demand computing resources to get some work done.&lt;br&gt;
And then there are funds. We sell our resources to the users: we get paid for the work our application does for them. However, we have to spend funds to purchase the resources our application runs on. Naturally, we want our expenses to be as low and our income as high as possible - we want high profits.&lt;/p&gt;

&lt;p&gt;Problem is, resources are always &lt;strong&gt;limited&lt;/strong&gt;. The more/better resources we purchase, the more they cost us. However, there's a good chance we'll be able to serve more requests with more expensive resources. It's a common misconception that we can always buy more resources when we need them. Yes, we can, but there is a point at which each new portion of resources we purchase and add to the system will have a questionable benefit, or even worse - will do us harm.&lt;br&gt;
Another problem is that demand for resources is always &lt;strong&gt;UNlimited&lt;/strong&gt;. This means there is a potential for unlimited profits. However, since resources are limited, our profits will also be limited. &lt;/p&gt;

&lt;p&gt;To squeeze the most out of this situation, we must find a way to use our resources as efficiently as possible: spend as little on resources as we can and satisfy as many requests as possible. &lt;/p&gt;

&lt;p&gt;This efficiency is what performance is all about: using limited resources to satisfy unlimited demands as efficiently as possible. And efficiency is (directly or indirectly) measured by... funds. After all, money is what makes this world go round :) &lt;/p&gt;

&lt;p&gt;Now, because our applications are usually very complex, it is not that easy to tune them for optimal efficiency. Even worse - they are always changing: new chunks of code are introduced every sprint, infrastructure is always changing (networks, internet, servers, disks, etc.). And demand patterns are never the same: is it efficient to have 20 servers running at noon when load peaks - probably yes; is it efficient to keep them running overnight when no one's using them - probably not; is it efficient to have servers in the AMER region when business is expanding to APAC and EMEA - it depends. And there are many, many more variables to it. We want to juggle them right to have the biggest profit possible - that's what we need performance engineers for. We, performance engineers, juggle all those variables, tilt all the scales we have at our disposal one way or another to utilize our resources as efficiently as possible and satisfy as many requests as possible. As my colleague likes to say, "we want the work done good, fast and cheap" :) &lt;/p&gt;

&lt;h2&gt;
  
  
  Why test performance?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Profits
&lt;/h3&gt;

&lt;p&gt;As I've already laid out in the previous chapter, money is what makes the world go round. Consequently, the goal of PE is to change/maintain the closed system of the application so that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;the application generates more income than it requires expenses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;either of the two happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the application generates more income&lt;/li&gt;
&lt;li&gt;the application requires fewer expenses&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the application generates more income &lt;strong&gt;AND&lt;/strong&gt; requires fewer expenses&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;over a period of time. Once we have &lt;strong&gt;1&lt;/strong&gt; accomplished, we want to make it better by either pushing up the income side or pulling down the expenses side of the scale (&lt;strong&gt;2&lt;/strong&gt;). Once we achieve that, we lock ourselves in a loop of &lt;strong&gt;3&lt;/strong&gt; and work on the growth of the profits by both pushing and pulling the scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  Changes
&lt;/h3&gt;

&lt;p&gt;Even though our system is closed, it's always changing: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we accumulate more data, &lt;/li&gt;
&lt;li&gt;we deploy code updates, &lt;/li&gt;
&lt;li&gt;we change our configurations,&lt;/li&gt;
&lt;li&gt;we (or someone else) make infrastructure changes, &lt;/li&gt;
&lt;li&gt;we want to migrate to another infrastructure,&lt;/li&gt;
&lt;li&gt;we get more/fewer requests to serve, &lt;/li&gt;
&lt;li&gt;we get requests from other parts of the world, &lt;/li&gt;
&lt;li&gt;wind changes direction (and breaks a tree which cuts down power on our main datacentre, so we have to failover to our DR site ASAP),&lt;/li&gt;
&lt;li&gt;&amp;lt;...&amp;gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many forces at play. All these forces change the efficiency of our application - usually for the worse. On the other hand, new technology solutions are released quite often, and they tend to be cheaper than their predecessors, which is in line with our goals. However, applying those new technologies itself introduces changes in our closed system.&lt;/p&gt;

&lt;p&gt;Changes can be either good or bad for our application's efficiency. Performance-wise, changes can introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slowdowns&lt;/li&gt;
&lt;li&gt;speedups&lt;/li&gt;
&lt;li&gt;new/more errors under pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly enough, both slowdowns and speedups can be either gain or loss when it comes to performance. Read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  When do we test performance?
&lt;/h2&gt;

&lt;p&gt;I like to divide the "WHEN" into 3 parts of the application lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before GO-LIVE - the application is under active development and it has not been released to the public yet&lt;/li&gt;
&lt;li&gt;GO-LIVE - we have finished developing the first decent version of the application and we are about to release it to the public&lt;/li&gt;
&lt;li&gt;After GO-LIVE - the application is already released to the public and it's being actively used&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Before GO-LIVE
&lt;/h3&gt;

&lt;p&gt;At this phase, the application code is constantly mutating. It's a green-field project that's in the process of taking shape. It hasn't been clearly structured yet - even though we know what it has to do, it's still a shapeless mass that &lt;em&gt;somehow&lt;/em&gt; does what it's expected to. This shapeless mass is constantly mutating and slowly gaining a notion of what it's supposed to look like. &lt;/p&gt;

&lt;p&gt;Even though the application is still in &lt;em&gt;chaos mode&lt;/em&gt;, we do still care about its performance. The Go-Live deadline is coming and we cannot leave perf testing to the end of this phase. What if performance is poor? What if we can only serve 10 requests at a time, while we need 10000? Will we have to rewrite almost everything or just several parts of the code? Have we made a poor architecture choice?&lt;/p&gt;

&lt;p&gt;At this phase, we want to test application performance periodically. We don't need frequent tests, as there's too much going on already. There's a good chance that some performance problems will be resolved without ever being noticed. However, we want the core of our system to be stable and we want to know that we are going in the right direction with our changes. By testing performance before Go-Live we want to answer these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is the application architecture/libraries/code/infrastructure capable of delivering reasonable performance? &lt;/li&gt;
&lt;li&gt;will you have to rewrite 80% of your app to fix performance?&lt;/li&gt;
&lt;li&gt;is the feature design/libraries/code capable of delivering reasonable performance?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naturally, we don't want to rewrite 80% of our code, and that's why we should care about performance this early in the cycle. We want to spot our mistakes as early as possible so that we don't make inefficient code a building block of other parts of the code. Rewriting a slow function is easy. Redesigning a slow solution, on the other hand, is not - and it's expensive. And it's &lt;strong&gt;VERY&lt;/strong&gt; stressful when we have to do it days before the due date.&lt;/p&gt;

&lt;h3&gt;
  
  
  GO-LIVE
&lt;/h3&gt;

&lt;p&gt;Our application now has a firm shape and we know what it does. We know it performs well and meets or exceeds our SLAs. Great! Now what? We buy new servers, deploy the application and make it public, right? Not yet...&lt;br&gt;
At this phase, we want to know how many resources we require to maintain application efficiency in PROD. Prod is a different infrastructure: it could be a different location in the world, a different IaaS vendor, a different ISP, different load patterns, etc. The production environment is going to be different from what we had in our sandbox. So we need to know how (and if) these differences impact our performance and how to adapt to them, so the application is as efficient in prod as it was in our dev environment. &lt;/p&gt;

&lt;p&gt;Testing performance when going Live should provide answers to questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how large a PROD infrastructure do you need? (horizontal measure)&lt;/li&gt;
&lt;li&gt;how powerful a PROD infrastructure do you need? (vertical measure)&lt;/li&gt;
&lt;li&gt;what middleware (JVM, Nginx, etc.) settings to choose?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only when we have those answers can we go ahead and configure our Prod environment, deploy our application on it and proceed with the go-live.&lt;/p&gt;

&lt;h3&gt;
  
  
  After GO-LIVE
&lt;/h3&gt;

&lt;p&gt;Champagne corks are popping, glasses are clinking, everyone is celebrating - we are LIVE! But there's no time to relax. The application was passing our real-life simulation tests, but were they accurate enough? Have we covered everything? Before Go-Live we had 6 testers. Now we have what - thousands? Millions? Billions of testers? Oh, mark my words - they will find ways to give you a hard time using your application! Features used wrong, use-cases not covered by tests, random freezes, steady performance degradation, security vulnerabilities,... You were playing in a sandbox. Now you suddenly find yourself in a hostile world that plays by rules similar to the ones you are used to, but not quite the same.&lt;/p&gt;

&lt;p&gt;While you're busy patching bugs here and there, don't forget to include performance testing. We need performance testing after go-live to tell us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do new code/version/components’ changes slow us down?&lt;/li&gt;
&lt;li&gt;why is the application randomly freezing?&lt;/li&gt;
&lt;li&gt;why is the application slower than it was before?&lt;/li&gt;
&lt;li&gt;can we reduce our infra to save €€€ and still deliver SLA?&lt;/li&gt;
&lt;li&gt;do we need more/larger/different infra to deliver SLA?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the application will keep on slowing down every day. That's mostly because of the data you will be accumulating. While databases are quite good at coping with growing datasets, before go-live you were focusing on a short &lt;a href="https://en.wikipedia.org/wiki/Time_to_market"&gt;TTM&lt;/a&gt; and not on the most efficient data-processing algorithms. That is normal and expected. It should be just as normal to have means in place to detect such inefficiencies: perf testing with a prod dataset, after go-live.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to test performance?
&lt;/h2&gt;

&lt;p&gt;People from other projects often come to us: "Hey, look, we have a performance SLA with our client and we are not meeting it. We need performance testing. Where do we begin? What do we need?" That's all okay - a developer does not necessarily have to know how to introduce perf testing. FWIW, a dev may not even need to know how to test application performance in the first place. There are performance engineers for that! (However, I'd like developers to at least have a notion of how to write efficient code...)&lt;/p&gt;

&lt;p&gt;Performance testing can be introduced before or after go-live. It can be introduced even years after the go-live. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare performance testing plan
&lt;/h3&gt;

&lt;p&gt;Now you have to sit down with the client and agree on a performance testing plan. Hear out the client's expectations for the system and communicate all the requirements. Prepare for this meeting in advance, because after it you will have to know what, how and when you are going to test, and what you expect to see and put in the results/reports.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLI, SLO, SLA&lt;/strong&gt;. Decide with your client what test metrics you want to monitor (SLI), what observed values are acceptable (SLO) and what fraction of all the requests must be in the acceptable range (SLA). For SLIs you can consider transaction response times (RT), transactions per second (TPS), error rate (ERR), number of completed orders, number of completed e2e iterations, etc. SLOs should define the expected max RT of each transaction, the minimum TPS that should be sustained during a test, and what number of errors is tolerable per test or per transaction. SLAs could be: "All RTs' 90th percentile does not exceed 2 seconds"; "We make at least 200 TPS during the Steady phase"; "We make 1500 orders during the ramp-up and steady test phases combined", etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Load patterns&lt;/strong&gt;. This part is a lot easier after go-live because you have historical data to base load patterns on. Anyway, the goal here is to simulate the real-world load. It's normal, that throughout the day/week/month/seasons load varies, and you might want to design different load patterns to simulate each load variation. Or at least a few of them. When designing load pattern, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;how many users&lt;/em&gt; will be using the system at the same time?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;ramp-up -&amp;gt; steady -&amp;gt; ramp-down patterns&lt;/em&gt;: how long will you be warming your system up; at what rate will you simulate new users logging in? How long will the test last? How long will your simulated "load dropping" phase be?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;stress, endurance tests&lt;/em&gt; - do we want to know how much the system can handle? Do we want to know whether its performance is stable under pressure for long periods of time?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;testing schedule&lt;/em&gt; - how often will you be testing? At what time (wall clock; consider real users using the system in parallel while you're testing)?&lt;/li&gt;
&lt;li&gt;anything else that comes to mind?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
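&lt;p&gt;An SLA check like the ones above ("All RTs' 90th percentile does not exceed 2 seconds") boils down to a machine-checkable predicate over the collected SLIs. A minimal sketch in Python - the transaction names, samples and threshold are made up for illustration, not taken from a real test run:&lt;/p&gt;

```python
# Minimal sketch: checking collected SLIs (per-transaction response times)
# against an SLO. Transaction names, samples and threshold are illustrative.

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil(n * pct / 100)
    return ordered[rank - 1]

# SLI: response times per transaction, in seconds, collected during a test run
response_times = {
    "login":       [0.4, 0.6, 0.5, 1.1, 0.7, 0.8, 0.5, 0.6, 0.9, 0.7],
    "add_to_cart": [0.2, 0.3, 0.2, 0.4, 2.5, 0.3, 0.2, 0.3, 0.4, 0.2],
}

SLO_P90_SECONDS = 2.0  # "90th percentile RT must not exceed 2 seconds"

def check_sla(rt_by_tx, slo):
    """Return {transaction: (p90, within_slo)} for every transaction."""
    results = {}
    for tx, samples in rt_by_tx.items():
        p90 = percentile(samples, 90)
        results[tx] = (p90, p90 <= slo)
    return results

for tx, (p90, ok) in check_sla(response_times, SLO_P90_SECONDS).items():
    print(f"{tx}: p90={p90:.2f}s {'PASS' if ok else 'FAIL'}")
```

&lt;p&gt;A real load testing tool (JMeter, Gatling, k6, etc.) computes these percentiles for you; the point is only that the agreed SLA must be expressible as a check like this one.&lt;/p&gt;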

&lt;h3&gt;
  
  
  Decide on e2e user flows to be tested
&lt;/h3&gt;

&lt;p&gt;Performance tests should simulate real-life users on a real-life application, running in a simulated (or not) real-life environment with simulated real-life data. In real life, users don't usually keep on pressing the same button tens of thousands of times in a row. Users tend to log into the system first, then they might want to browse some catalogues, maybe get lost in the menus, open up some pictures until they find what they like (or leave). Then they add that item to their cart, maybe add something else too. Oh, did the user change their mind? Sure, why not! Let's clear out the cart and start anew, with some other product in the shop! Then &amp;lt;...&amp;gt;. &lt;/p&gt;

&lt;p&gt;It is very important to choose real-life-like user flows for your tests. Even better - make several flows and run them in parallel - after all, not all users are going to do the exact same thing. How do you design your flows? Well, application access logs will help you a lot here. Analyze user behaviour per session, find the most common patterns and embed them in your test flows. You might find some Analytics data very useful too.&lt;/p&gt;

&lt;p&gt;When it comes to e2e flows in performance tests, we usually need more than one. I like to have&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;main flows&lt;/strong&gt; - the most obvious, most followed flows in the production environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;background flows&lt;/strong&gt; (noise) - additional actions in the system, could be chosen randomly, that generate background noise. You might want to embed some rarely used application calls here. This way you add additional stress on the system AND test how less used features behave under pressure.&lt;/li&gt;
&lt;/ul&gt;
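&lt;p&gt;The split between main and background flows can be expressed as a weighted mix of virtual users. A minimal sketch, with made-up flow names and weights standing in for the shares you would derive from access logs:&lt;/p&gt;

```python
import random

# Hypothetical e2e flows; weights approximate their share of real traffic.
FLOWS = {
    "browse_and_order": 60,  # main flow: login -> browse -> cart -> order
    "browse_only":      30,  # main flow: login -> browse -> leave
    "background_noise": 10,  # rarely used calls, adds background stress
}

def pick_flow(rng):
    """Pick a flow for the next virtual user, proportionally to its weight."""
    flows, weights = zip(*FLOWS.items())
    return rng.choices(flows, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed so the mix is reproducible
assignment = [pick_flow(rng) for _ in range(1000)]
for flow in FLOWS:
    share = assignment.count(flow) / len(assignment)
    print(f"{flow}: {share:.1%} of virtual users")
```

&lt;p&gt;Most load testing tools let you declare such thread-group weights directly; the sketch only shows the idea of proportioning virtual users across flows.&lt;/p&gt;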

&lt;h3&gt;
  
  
  Acquire test data (users, doctors, rooms, SKUs, etc.)
&lt;/h3&gt;

&lt;p&gt;Your e2e flows will require some test data. What accounts are you going to use? You will need lots of them in your test. What entities will all these virtual users work on? What entities will they have to share? If you are testing an e-shop with products, you might want to not run out of products in your stock during (or after) the test. And you might not want to actually pay for or ship them... So securing test data is a very important, tedious and difficult task. Work closely with developers to choose the right items for the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mock external calls
&lt;/h3&gt;

&lt;p&gt;During your test, you might provoke the application into making calls to external systems, like shipping, tracking, stock, or vendors. You might not want some of those calls, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you are not perf-testing external systems - you are testing your system;&lt;/li&gt;
&lt;li&gt;external services (vendors) might not like being perf-tested and you might be throttled or banned&lt;/li&gt;
&lt;li&gt;you might make state changes you cannot undo (e.g. create a shipping request at FedEx)&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this reason, you might want to stub out some external calls - either completely or selectively, so that only your test requests are routed to mocks. This means the application (or proxies, like Nginx) might need some changes to capture your test requests and treat them slightly differently.&lt;/p&gt;

&lt;p&gt;A word of caution - don't push too far with stubbing. Testing stubs is not what you want, so don't be too liberal with them. Only stub out what has to be short-circuited, nothing more.&lt;/p&gt;
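&lt;p&gt;A stub can be as small as a canned HTTP response. The sketch below fakes a hypothetical shipping vendor endpoint in-process; the URL path, payload and field names are invented for illustration:&lt;/p&gt;

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ShippingStub(BaseHTTPRequestHandler):
    """Fake shipping vendor: accepts any POST, returns a canned response."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        body = json.dumps({"shipmentId": "TEST-0001", "status": "created"})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), ShippingStub)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The application under test would be configured to use this address
# instead of the real vendor's, e.g. SHIPPING_API_URL=http://127.0.0.1:PORT
url = f"http://127.0.0.1:{server.server_port}/shipments"
req = urllib.request.Request(url, data=b"{}", method="POST")
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply)
server.shutdown()
```

&lt;p&gt;In a real setup the application (or a proxy like Nginx in front of the vendor) would route only the requests tagged as test traffic to the stub's address.&lt;/p&gt;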

&lt;h2&gt;
  
  
  How to test performance?
&lt;/h2&gt;

&lt;p&gt;So now you have your testing plan, you know what flows to test and you have data and mocks in place. Now what? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write test scripts&lt;/strong&gt; - translate e2e flows and test data into test scripts. One script is supposed to reflect a single e2e flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assemble scripts into tests&lt;/strong&gt; - combine multiple scripts into tests. In a single test, you might want to run different e2e flows with a different number of users. Set all those parameters to make your tests reflect real-life load on the system. E.g. how many users are browsing the catalogue? How many users are running the complete login-order flow? How many users are getting lost? How many users are generating background noise? etc. A piece of friendly advice: create Sanity tests as well (run them for several minutes before the actual tests). Don't assign many VUsers to them. They are needed to warm up an environment and make sure everything is running smoothly and is ready for the load test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provision environments&lt;/strong&gt; - this requirement often takes clients by surprise. For perf testing you require 2 isolated environments (unless you are testing in prod - then you only need 1 additional environment):

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;testable prod-like environment (PLAB) with prod-like data (quality and quantity)&lt;/em&gt; - this is where your application is deployed and running; all your virtual users (VUsers) will be sending requests to this environment&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;testing environment – load generators (1 server for 1’000 VUs)&lt;/em&gt; - this is "the internet" - your VUsers' requests will be sent from this environment. This must be a separate environment because generating load (sending requests) also puts a load on the infrastructure (networks, hypervisor, servers, etc.). Generating load from the same infrastructure that you are testing will always yield false test results.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up telemetry, persistent logging&lt;/strong&gt; - I cannot stress this part enough. Set up monitoring everywhere you can and monitor everything you can. NO, this is not excessive and I am not exaggerating. You will thank me later. It is always better to have data you did not need than to need data you did not think to collect. Monitor networks, servers, the application, memory, CPU, request sizes/counts, durations, queue depths... everything you can. Don't be cheap on monitoring, because it is your eyes. Load tests will only tell you that you have performance degradation; monitoring will help you identify where and why. I have experienced cases where we spent months pinpointing an issue because the client didn't think it was worth spending an additional $7/month to monitor one more metric...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run tests&lt;/strong&gt; - now you have your tests in place, and you know the schedule when to run them. When the time comes - execute your tests. You don't have to look at the results all the time - just glance at the screen every now and then to see if the test is still running and the application hasn't crashed. Don't chase every tiny rise and fall. They will even out in the end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collect, analyze and compare run results&lt;/strong&gt; - after a successful test, collect the test results (SLIs) and compare them against SLAs. If applicable, compare them to previous tests' results. Comparative analysis works very well here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Draw conclusions (better/worse/status quo)&lt;/strong&gt; - do your SLIs stay within SLAs? Do you have performance degradation compared to previous test runs? Was this run, perhaps, a better one? Are there any suspiciously good response times? Maybe you didn't validate the response body/headers/code in your scripts and the application was actually returning errors? Don't trust unusually good results right away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use telemetry, logs, dumps, live data from the OS and other data to identify causes of degradations&lt;/strong&gt; - if you have degradations, correlate your metrics on the same timeline along with the test results: see what happened when your response times peaked or errors occurred. If required and possible, try a re-run and capture thread/heap dumps of the application to carry out a more extensive analysis. Are your thread pools too small? Does synchronization slow your threads down? Are responses from an external component/system too slow? Are you hitting network limits? You can find answers in thread dumps. Memory-related answers lie in memory telemetry (server, JMX) and heap dumps (core dumps).

&lt;ul&gt;
&lt;li&gt;be creative – know how to tickle the application to make it cry. To provoke some performance problems you may have to alter your scripts or tests, or perhaps intervene in the environment manually during the test. Before running a test, prepare a testing plan: what you want to achieve, what tests you need to run to achieve that, and what each outcome of each test means in your troubleshooting plan. Don't run useless tests or tests of little benefit. Don't waste resources, time and money. Decide on what tests you need in advance and stick to your plan.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Come up with an approach to fix it&lt;/strong&gt; - the fix can be anything that cuts it: additional caching, reducing the number of calls (code), server/service relocation to different provider/infra/geographical location, version update, code fix, architecture modification, configuration change, cluster rebalancing, SQL plan change, DB index creation/removal, etc. Literally, it can be anything that solves the problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fix it yourself - if you can and know how (in some projects/teams you might not have the access, approval or competency to apply fixes yourself)&lt;/li&gt;
&lt;li&gt;recommend to the domain teams/client how to fix it - THAT you can always do: prepare a testing report and include recommendations to dev/infra/dba/other teams on how to fix the performance problem. You can even include a PoC proving that your proposals do in fact alleviate the problem or eradicate it completely. Describing the root cause of the issue might help the client understand it better and perhaps choose a different fix - one that suits them better.&lt;/li&gt;
&lt;li&gt;And retest - once you have the fix in place, run the test again to confirm the issue is no more. Fixing one problem is likely to surface some other problems. Always retest after applying a fix to be sure everything is in order.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
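&lt;p&gt;As a minimal illustration of the thread-dump analysis mentioned above, the JDK's own &lt;code&gt;ThreadMXBean&lt;/code&gt; can surface lock contention without an external profiler. This is a hypothetical helper - the class name and approach are mine, not a standard tool:&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Counts threads that are currently BLOCKED waiting on a monitor.
// A count that grows as the load grows is a hint that synchronization
// (lock contention) is the bottleneck, rather than CPU or IO.
public class ThreadContention {

    public static long blockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long blocked = 0;
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("BLOCKED threads: " + blockedThreadCount());
    }
}
```

&lt;p&gt;The same bean can capture full stack traces with &lt;code&gt;dumpAllThreads(true, true)&lt;/code&gt;, which is essentially a programmatic &lt;code&gt;jstack&lt;/code&gt;.&lt;/p&gt;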

&lt;h2&gt;
  
  
  PPP
&lt;/h2&gt;

&lt;p&gt;The work of a performance engineer can be summarized into 3 Ps. PPP sums up the explanation of what we do and why clients need us.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt; (requirements, pitfalls)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protect&lt;/strong&gt; (from degradations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress&lt;/strong&gt; (maintain performance with lower expenses; improve performance with reasonable/unchanged/lower expenses)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The secret sauce of PE
&lt;/h2&gt;

&lt;p&gt;Performance engineering is a challenging role, requiring extensive knowledge and skills in programming, infrastructure, middleware, databases, algorithm theory and architecture design. Over the years in the industry I have learned many things, but working as a PE has taught me some things that might sound controversial and unintuitive. This is why I like to call them "the secret sauce".&lt;/p&gt;

&lt;h3&gt;
  
  
  There will always be bottlenecks in any given system
&lt;/h3&gt;

&lt;p&gt;All systems have bottlenecks. No matter the scale, size, location or resources available - there always, ALWAYS are bottlenecks. The most common bottlenecks are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;network - network latency, limits, connection handshakes - all these are slow&lt;/li&gt;
&lt;li&gt;database - databases are extraordinary "creatures", as they manage to maintain more or less stable response times regardless of the amount of data they contain. However, this comes at a price - memory, computing power and concurrency. I have seen the most powerful database in the industry (Oracle) brought to its knees with nothing left to tune. We had hit the physical limits.&lt;/li&gt;
&lt;li&gt;disk IO - be it logging, caching, database or anything else - disk speeds have always been, are and probably will be one of the infamous bottlenecks&lt;/li&gt;
&lt;li&gt;number of CPU cores - becomes a problem when you need to synchronize work between them. The more cores you have, the more synchronization will become a bottleneck for you. The best way to maintain a critical section is not synchronization - it's avoiding the critical section in the first place. There are many ways around it, both in local (code - threads) and distributed (microservices, distributed monoliths) systems. However, each of them has its drawbacks - there is no holy grail.&lt;/li&gt;
&lt;li&gt;CPU, speed - this doesn't need an awful lot of explanation. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A PE's job is to know how and when to adjust them
&lt;/h3&gt;

&lt;p&gt;We have now established that there always are bottlenecks in any given system. We might as well say that any given system is a system of bottlenecks. There is no way to solve them all. The best we can do is change how severely they impact us. Usually we only want to address the bottlenecks that are giving us a hard time and leave the others be.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reduce or remove&lt;/strong&gt; - this is quite intuitive. Reducing or removing a bottleneck is expected to improve application performance. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;increase or create&lt;/strong&gt; - this is counter-intuitive, isn't it? However, that doesn't make it any less true. Some bottlenecks are more severe than others; some cause more degradation than others. Removing one bottleneck may shift more load onto another bottleneck - one that was not worth your attention before now becomes the worst performance hog in the system. Applying the same principles, we can reduce some severe bottlenecks by introducing new ones (yes, &lt;code&gt;Thread.sleep(20)&lt;/code&gt; is a perfectly viable solution to some performance problems), or make existing ones more aggressive (e.g. reduce the thread pool size to prevent contention - the price you pay for heavy synchronization).&lt;/li&gt;
&lt;/ul&gt;
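&lt;p&gt;To make the "create a bottleneck" idea concrete, here is a sketch of deliberately capping concurrency with a small, fixed thread pool. The pool size of 2 is purely illustrative:&lt;/p&gt;

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// A deliberately small pool acts as an introduced bottleneck: no matter
// how many jobs arrive, at most 2 of them touch the contended resource
// at a time, so the cost of heavy synchronization never explodes.
public class BoundedPool {

    public static int runJobs(int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            pool.submit(done::incrementAndGet); // stands in for the contended work
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runJobs(100) + " jobs completed");
    }
}
```

&lt;p&gt;Individual jobs may wait longer in the queue, but the system as a whole often gets faster because the shared resource is no longer fought over by hundreds of threads.&lt;/p&gt;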

&lt;p&gt;Performance engineers, once they have test results, telemetry metrics and a good understanding of the architecture and domain, can make an educated call to alter one or more of the bottlenecks in the system to improve performance. Knowing which ones to shift, and how, is why we are getting paid :) &lt;/p&gt;

&lt;h3&gt;
  
  
  Sometimes less is more, sometimes more is less
&lt;/h3&gt;

&lt;p&gt;This sounds like a playfully worded phrase, but it is very true in PE. When people encounter a performance problem, they immediately think "let's add more resources", "let's buy a bigger server", "let's increase the pool size", "let's add more CPUs", "let's add more memory", ... They are not wrong... usually. But the "bigger and stronger" approach only gets you so far. There are many types of resources and each type (even more - each combination) has its own threshold, after which adding more resources gives you less performance benefit, or even introduces performance loss. In such cases, we want to have fewer resources in some particular areas of the system. Or, perhaps, we've been inflating the wrong resource all this time - maybe reducing CPU and adding more memory would give us more benefit!&lt;/p&gt;

&lt;p&gt;Sometimes less is more.&lt;/p&gt;

&lt;h3&gt;
  
  
  PE is all about &lt;strong&gt;balancing&lt;/strong&gt; the scales. Always. Everywhere.
&lt;/h3&gt;

&lt;p&gt;This is a sum-up of all the previously listed secrets. Performance engineering is a challenging job, requiring a good understanding of various aspects of the system: how it works and how it's made. There are oh-so-many variables to juggle. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balancing funds and resources&lt;/strong&gt;. As I have already stated at the beginning of this post, resources are &lt;strong&gt;always&lt;/strong&gt; limited, but demand for them is &lt;strong&gt;always&lt;/strong&gt; unlimited. Physics and funds are the main factors limiting resources: physics sets hard limits for vertical and horizontal scaling of a given resource, while funds define soft limits - how closely we can approach the hard limits. The closer to the limits, the more expensive the resources are, but, if used well, they also have the potential for more income. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balancing bottlenecks&lt;/strong&gt;. Each system has plenty of bottlenecks in it. Some are more severe, others are less. Shifting the configuration of those bottlenecks in order to improve performance is also like balancing the scales: bottlenecks on the left, bottlenecks on the right and performance is the level of the scales.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>performance</category>
      <category>perfeng</category>
    </item>
    <item>
      <title>Should I use a library for that?</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Tue, 20 Apr 2021 19:21:28 +0000</pubDate>
      <link>https://forem.com/netikras/should-i-use-a-library-for-that-16bm</link>
      <guid>https://forem.com/netikras/should-i-use-a-library-for-that-16bm</guid>
      <description>&lt;p&gt;Libraries and frameworks are here to ease our lives. They bundle up tons of logic and present us a remote control with very few buttons and knobs. A button says "blowItUp" and it blows the whole thing up. A button says "get500BitCoin" and it gets you 500 bitcoins. Well not exactly like that, but the point still stands. Libraries and frameworks know how to do THIS or THAT and they do their jobs well. So you don't have to worry about all the nuts and bolts that you may need to get the job done. Libraries and frameworks have thought of all of that for you. Just use them!&lt;/p&gt;

&lt;h2&gt;
  
  
  Libraries and frameworks are great!
&lt;/h2&gt;

&lt;p&gt;I must admit, the idea of libraries is brilliant: someone had a problem, figured out a way to solve it and shared the solution for others to reuse to solve the same (or similar) problems. Let's see what we love them for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick &amp;amp; easy solution
&lt;/h3&gt;

&lt;p&gt;If you have a problem to solve, it might take hours, days or even weeks to solve it yourself. Ever looked at the implementation of a TCP stack? There's a lot going on. Imagine if you needed to implement it yourself. That would take several months, and the implementation would still be buggy.&lt;/p&gt;

&lt;p&gt;What you can do instead is look for a library that solves that particular problem and add it to your project in minutes. And that's it - your problem is solved. In minutes, rather than hours, days, weeks or months.&lt;/p&gt;

&lt;p&gt;Now off to the other problem (i.e. library lookup)&lt;/p&gt;

&lt;h3&gt;
  
  
  Less boilerplate
&lt;/h3&gt;

&lt;p&gt;... and less code to support. This means you are not responsible for the code in the libraries and you don't have to support it. If something doesn't work - just report it as a bug and let the maintainers worry about how to solve it. If you feel like it, you can probably propose your own solution and earn some karma points (and the great feeling of knowing that other projects will be running your solution). With libraries, you go fast. The TTM is short. And you introduce fewer silly bugs. Not to mention you have to write less :) &lt;/p&gt;

&lt;h3&gt;
  
  
  Not reinventing the wheel
&lt;/h3&gt;

&lt;p&gt;Most of the problems you're running into have already been solved by someone else. Probably many, many times. If these folks solved those problems, then why should you repeat what they did? You can simply take their solution and use it in your project. There's also a good chance that by reinventing the wheel you'll introduce bugs you haven't thought of, and your solution will be less efficient. Let's not waste our time and use what's already out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  They are professionals - they can do it better than I can
&lt;/h3&gt;

&lt;p&gt;Libraries and frameworks are often developed by experienced professionals. If the maintainer is a professional, we expect him/her to know a great deal about the topic and to apply only best practices when solving the problem. We want to learn from the best, we want to use what's best, so we simply like to use what these PROs decided is best. In many cases it's a no-brainer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standardized usage
&lt;/h3&gt;

&lt;p&gt;This is one of the key aspects people like libraries for. If I use a library for the job that is popular in the industry, whoever comes after me will most likely know what we're talking about. They will know how to use it, they will know the caveats and possible points for improvement. This is the benefit of using any popular tool, not just libraries, not just frameworks, not just... software. It's easier to maintain the project as a whole when features follow some well-known standards or are even publicly documented and used by others elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  However...
&lt;/h2&gt;

&lt;p&gt;This blog post wouldn't exist if there wasn't the "however" part. While it seems that libraries solve all your problems and might even be solid building blocks to the complete application you're after (you only need to arrange them properly &lt;em&gt;et voila&lt;/em&gt; - the project is done), there are things to consider before diving into the dependency world (or hell).&lt;/p&gt;

&lt;h3&gt;
  
  
  Foreign code
&lt;/h3&gt;

&lt;p&gt;Any code you take in as a part of your project should be considered potentially harmful. It's very easy to modify your &lt;code&gt;pom.xml&lt;/code&gt; or &lt;code&gt;build.gradle&lt;/code&gt; and add that one line pulling in a library that solves the problem for you! However, your build tools are downloading that library from external sources which you have no control over. Who wrote that library? Did they inject a backdoor into it? Will the authors hack my application if I use their library? I don't know - the library code is far too complex for me to digest and catch all the possible trojans (if any). And it's a tremendous waste of time! It would take weeks or even months... Do I want to spend this much time on that?&lt;/p&gt;

&lt;p&gt;Do you think I'm overly paranoid? Sure you do! Have you seen how many attempts there were to inject backdoors into the Linux kernel? :) If you had, you would be too. And the Linux kernel is no different than any other open-source project in this sense!&lt;/p&gt;

&lt;p&gt;Even if the maintainers/authors of the library had no intention to harm you, they might be shipping a backdoor without their own knowledge. Have you heard of the very recent SolarWinds hack? Boy did that cause a mess... EVERYWHERE! Government, Homeland Security, the treasury, the energy sector, corporations (think: Microsoft),... everywhere! This hack started a long time ago - someone injected a backdoor into the codebase - and all these companies used the affected SolarWinds tools. As a result, all these companies willingly (though unknowingly) installed a backdoor in their systems, granting an attacker a wide spectrum of access to huge amounts of services and data. Which could have been exploited for further hacking... SolarWinds is just an example, illustrating that things like this happen at various scales. If Ubuntu maintainers managed to expose their repository passwords, who says your library maintainers haven't shared theirs somewhere by accident?&lt;/p&gt;

&lt;p&gt;Even if the vendors manage to protect their codebase, they usually distribute their products as compiled bundles. For convenience. And that's all fine. What should trigger your red flags is that you almost never download those bundles from the vendor. You download them from another 3rd-party, which specializes in storing those bundles, e.g. Maven repository, Artifactory, DockerHub, etc. That's another segment in the chain that can be potentially accessed illegally and tinkered with. Should any unauthorized party get access to such hubs, they could replace those bundles with their own versions of the library (most likely a modified original library with a backdoor injected). Checksums make that a tad more difficult, but it's not impossible to bypass them too. So that's one more attack vector where somebody could be spinning up a backdoor in your application without you or vendor or maintainers having a clue. If you once again think I'm too paranoid, you should know that these things happen. Repositories get hacked and binaries get replaced. &lt;/p&gt;

&lt;p&gt;I cannot find a reference now, but I recall reading a blog about a guy who carried out an experiment: he added a potential backdoor (a stub of a backdoor - it would only call home instead of connecting to a CnC for further instructions) to his open-source node library. It sat there for months (or years?) and no one ever noticed it. He collected statistics after a while and concluded that he could have easily leveraged that backdoor for his own benefit to exploit millions (or was it billions?) of different projects. How many companies is that? :) &lt;/p&gt;

&lt;p&gt;Whenever you are adding a 3rd-party library, consider it as a potential threat. It most likely is harmless, but there are ways it could become your and your employer's worst nightmare.&lt;/p&gt;

&lt;h3&gt;
  
  
  Version upgrade
&lt;/h3&gt;

&lt;p&gt;You probably have plenty of libraries in your project. As time goes (and your library versions don't change), it's very likely your project will have more &lt;strong&gt;known&lt;/strong&gt; library-related bugs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;security&lt;/li&gt;
&lt;li&gt;functionality&lt;/li&gt;
&lt;li&gt;performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naturally, you'd like to upgrade your library to a newer version, hoping those bugs have been resolved. It's easy - just update the version number in the pom.xml and reload the project! Or is it...&lt;/p&gt;

&lt;p&gt;Most likely, the library's contracts will have changed. Now your code doesn't compile, because some library classes got moved/renamed, others have methods with different signatures, and some others got deprecated and removed, or their fields/methods deprecated and removed! Not to mention cases where the general usage changes. It's hell! Now you have to scan your WHOLE codebase and look for spots where the usage of the library no longer compiles. Or even worse - it compiles, but it's used incorrectly in the context of the new version! You have to &lt;strong&gt;adapt your code to the new version of the library&lt;/strong&gt;. So... you wanted to resolve a single issue (or at least see whether it was resolved in this new version), but now your code no longer compiles, not to mention the correctness of its behaviour. This is often the case with large frameworks, like Liferay or Spring or Hibernate (in Java terms). How long will it take you to test whether that bug was fixed? If it wasn't, you'll have wasted all this time for nothing. Does that sound right to you?&lt;/p&gt;

&lt;h3&gt;
  
  
  Deprecated
&lt;/h3&gt;

&lt;p&gt;If you think a version upgrade causes a mess, I've got a better treat for you! Suppose there is a zero-day security vulnerability revealed in a library you are using, but that library has not been maintained for the last... 8 years. No one has forked it, no one has taken ownership of it - it's simply dead. Now you either have to live with that 0-day vulnerability (unacceptable), or patch the library code (have fun!), or replace the deprecated library with an alternative. &lt;/p&gt;

&lt;p&gt;Patching is tedious, because it is not your code, you don't know it well and you are likely to introduce other bugs with your patch (assuming your patch fixed the 0-day properly in the first place!). Patching also means that you will keep on living with severely outdated, dead code that no one looks after any more - no one but you. So instead of developing one project you now have two. One patch after another, and eventually you'll have rewritten large portions of the foreign code. You might even say you have recreated that library anew. Which is more expensive than creating a new library, because (a) you had to learn the foreign code and (b) you had to fix it iteratively, i.e. without damaging the rest of the code.&lt;/p&gt;

&lt;p&gt;Replacing the library is also not the most tempting idea, because an alternative library will have a different contract, different classes, different methods and, most likely, different behaviour. If you could use your old library by invoking a single static method, you might now have temporal coupling in place, requiring you to prep the library, initiate something, persist something and then call something on something. This, like a library version upgrade, requires a massive scan of the code and lots of code replacements, sometimes even refactoring or changes to flows. It might also require new infrastructure units. Now, once again, you have to &lt;strong&gt;adapt your code to the library&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  It no longer fits my needs perfectly
&lt;/h3&gt;

&lt;p&gt;You have adopted some framework because it promised to tick all the checkboxes in the project requirements' sheet. And it offered an amazingly fast TTM (TimeToMarket) by covering most of the code you'd need to write! Amazing!&lt;/p&gt;

&lt;p&gt;2 years later you find yourself in a position with dozens of framework entities/services extended and overridden, plenty of nasty hacks to keep all the overrides in order. Adding a new feature probably introduces yet another hack to &lt;strong&gt;adapt your code to the framework&lt;/strong&gt;. The problem with hacks is that they have a tendency to introduce unexpected bugs, which are a pain to debug. Another problem is that the project becomes barely maintainable: estimates are loo-ooo-ooong (and yet many of them are too short), and more often than before you close the feature requests with "Won't do: not possible" closure code. You may even have introduced a dedicated Jira label: "Not Possible"!&lt;/p&gt;

&lt;p&gt;A sane thing to do would be to eradicate the framework (or the parts of the framework that are riddled with hacks), but that means you will have to invest lots of time to reinvent the wheel - it's still going to be round, but it will fit your carriage better than the wheel you have now. Who's going to pay for such a months-worth investment? Only once in a blue moon does a client care about those details and understandingly agree to pay for such maintenance.&lt;/p&gt;

&lt;p&gt;On the other hand, would it perhaps be cheaper and faster to rewrite the whole project? And quite often people do choose this option. The framework is rooted SO deeply in the project, that it seems faster to rewrite the whole thing than to sort out all the hacks and remove the framework. Boy is that expensive..&lt;/p&gt;

&lt;h3&gt;
  
  
  I need one more feature
&lt;/h3&gt;

&lt;p&gt;You have adopted some library to do the job for you. And it works miracles! However, months later, another business request comes in - make a feature [that uses &lt;em&gt;THAT&lt;/em&gt; library] also do &lt;em&gt;THIS&lt;/em&gt;. Uh-oh... But that library doesn't do that. You browse all the docs, forums and blog posts looking for instructions that would tell you how to do something at least close to what you need -- nil. You get nothing. Now you either have to extend that library and implement the feature you want, or you have to find another tool for the job. A good recent example I came across is caching. The project used EHCache for in-memory caching. The application needed plenty of data cached, but it only needed it for short periods of time - several seconds. After that, the data is no longer useful. Even worse - the expired data used up memory that was needed for other jobs. So you either boost your RAM by several gigs because you'll need them for cached objects for several seconds per hour (because an expired TTL does not mean data is removed from RAM immediately), OR you limit your cache size, risking that many of the objects won't fit in it, causing slow response times. You would think EHCache has some way to enable a cleaning job that scans the cache periodically and evicts expired entries... But it doesn't (there are projects that extend EHCache and introduce that feature, but that's yet another library!).&lt;/p&gt;

&lt;p&gt;What are your options now? Either augment the library or replace it with a more feature-rich library that covers your requirements... until next time. And switching libraries, as you have already read, is not always that easy!&lt;/p&gt;

&lt;h3&gt;
  
  
  There is a bug! But how do I smash it?
&lt;/h3&gt;

&lt;p&gt;Suppose you have a large framework (e.g. Liferay) you are building your app on. It's great, it works as expected, or even better! Time for a security audit! Auditors scan your application and find severe security problems. You fix most of them, but you struggle with the rest because they are the framework's bugs. You fix whatever can be fixed by summoning the power of manuals, blogs and forums, or even support (if you have a subscription). But what about the rest - the ones, where the support says "you'll need to upgrade the framework to a newer major version to have this fixed"? To those who have used Liferay, this option is a clear no-go, because it's easier to rewrite the whole thing than to upgrade the Liferay's version. You're stuck. It's probably time to introduce some kind of reflection-based hack that patches the security bug, hoping it doesn't open another one.&lt;/p&gt;

&lt;p&gt;And what if you report the bug to the vendor and the vendor says: "thank you, this will be fixed in the next release... in the next 8 months"? What do you do all those months? Sit tight and hope no one finds that flaw in the exploit-db and exploits it in your system? Here come the hacky patches! And the poor maintainability that comes with them! Even if you manage to live long enough without getting hacked with your patch in place and the new version gets released, now you either have forgotten that you wanted to upgrade it (the client definitely has! He won't want to bring this back up as a possible expense), or you're now in library upgrade hell.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code I don't need
&lt;/h3&gt;

&lt;p&gt;If you're into embedded or mobile development, you might be familiar with the problem of too many classes. You don't even need libraries to run into this problem - just use Dagger2 extensively, and it will generate you loooots of classes - more than you'd like. Which causes compilation/packaging or deployment problems.&lt;br&gt;
But even if you don't use Dagger2, or don't develop for Android, bear in mind that you usually invite a foreign library into your codebase to help you out with problem P, while the library is designed to solve problems E, R, G, Y, O, J, B, B1, B2, B6, M, etc. Naturally, you bring far more code into your project than you actually need. Alternatives don't solve your problem completely, so you prefer this 120MB library for a solution that actually needs no more than 5MB. This is a great way to explode your PermGen (or Metaspace) with stuff you don't need and get more OutOfMemoryErrors. It's also a nice way to bloat your application with excessive dependencies and excessive code. And to make your deployments (and applications) slower.&lt;/p&gt;

&lt;p&gt;When it comes to security, the rule of thumb is: "don't have stuff you don't need". This situation clearly violates the rule. Now you have a lot more moving (and potentially harmful) parts in your code. Even worse, if you had to make changes to &lt;strong&gt;adapt your code to run well with the library code you don't use&lt;/strong&gt; (might be the case with Spring's beans).&lt;/p&gt;
&lt;h3&gt;
  
  
  Indirect libraries
&lt;/h3&gt;

&lt;p&gt;Even if you are using libraries from trusted sources and only libraries you truly need, bear in mind that these libraries/frameworks most likely depend on some libraries themselves. As a result, your innocent-looking library introduces even more code into your project than you thought. This is definitely the case in npm-related development. The dependency graphs are enormous!&lt;/p&gt;

&lt;p&gt;Not only are the indirect libraries a potential threat to your project - they might also introduce compatibility issues: if the framework only works with an older version of some utility library AND you want to use a newer version of that utility library, in the end (at least with Java's jars) only one version will be used. Which one? No one knows. Something is definitely going to break.&lt;/p&gt;
&lt;h2&gt;
  
  
  Should I use a library then?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  KISS
&lt;/h3&gt;

&lt;p&gt;The right answer is &lt;strong&gt;it depends&lt;/strong&gt;. Don't be a library whore (like &lt;a href="https://www.npmjs.com/package/is-even"&gt;here&lt;/a&gt;). Then again, it doesn't pay to spend nights reinventing the wheel over and over. Find the middle ground. If possible, set some guidelines in the project: when you are going to introduce a library, and when you are going to implement the thing yourself. I hear you gasping at "implement yourself" :) But that is a legitimate approach to consider. If you need an in-memory cache with an active TTL, after which entries are removed from memory - it takes up to an hour to decorate a &lt;code&gt;HashMap&lt;/code&gt; with synchronized &lt;code&gt;put()&lt;/code&gt; methods and a thread that scans the map every minute and removes all the entries whose TTL has expired. You don't need a fancy library for that. And your custom implementation is no worse than any in-memory caching implementation out there. Remember: &lt;strong&gt;KISS&lt;/strong&gt;!&lt;/p&gt;
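&lt;p&gt;For illustration, a minimal sketch of such a hand-rolled TTL cache. It uses a &lt;code&gt;ConcurrentHashMap&lt;/code&gt; instead of synchronized &lt;code&gt;put()&lt;/code&gt; methods, plus a daemon sweeper thread; all the names and the sweep interval are mine:&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hand-rolled in-memory cache with active TTL eviction: a map of entries
// with expiry timestamps, swept periodically by a background daemon thread.
public class TtlCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "ttl-cache-sweeper");
        t.setDaemon(true); // don't keep the JVM alive for the sweeper
        return t;
    });

    public TtlCache(long sweepIntervalMillis) {
        sweeper.scheduleAtFixedRate(this::evictExpired,
                sweepIntervalMillis, sweepIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public void put(K key, V value, long ttlMillis) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || e.expiresAt < System.currentTimeMillis()) {
            map.remove(key); // lazy eviction on read, in case the sweep hasn't run yet
            return null;
        }
        return e.value;
    }

    // Active eviction: actually frees the memory, unlike a lazy-only TTL.
    void evictExpired() {
        long now = System.currentTimeMillis();
        map.values().removeIf(e -> e.expiresAt < now);
    }
}
```

&lt;p&gt;Roughly an hour of work, and the expired entries genuinely leave RAM instead of lingering until the next read.&lt;/p&gt;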

&lt;p&gt;I like to live by this rule: "&lt;em&gt;If it takes me up to 2 hours to implement the solution, I'll implement it myself rather than use a library&lt;/em&gt;". The reasoning is simple: if I have to adopt some library, it will take me far more than 2 hours to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;carry out the market analysis (what libraries are out there? Which one fits my case best?)&lt;/li&gt;
&lt;li&gt;add the library to my code&lt;/li&gt;
&lt;li&gt;configure and use it right (which means reading the docs)&lt;/li&gt;
&lt;li&gt;suffer from all of the points (and work around them) in the "However" section above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I eventually need the library, I can turn my solution into an adapter (see: &lt;em&gt;Adapter pattern&lt;/em&gt;) for that library, without changing the signatures of my methods - a perfect decorator (see: &lt;em&gt;Decorator pattern&lt;/em&gt;), which means I don't really need to change anything else in my code.&lt;/p&gt;
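&lt;p&gt;The turn-it-into-an-adapter move can be sketched like this (the &lt;code&gt;FancyCacheLib&lt;/code&gt; class below is a hypothetical stand-in for a third-party library, not a real one):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

/** The contract my application code already programs against. */
interface Cache {
    void put(String key, String value);
    String get(String key);
}

/** Hypothetical third-party cache with a different API (stands in for a real library). */
class FancyCacheLib {
    private final Map<Object, Object> data = new HashMap<>();
    public void store(Object key, Object value) { data.put(key, value); }
    public Object fetch(Object key) { return data.get(key); }
}

/** Adapter: the same signatures my code always used, with the library underneath.
 *  Nothing else in the codebase needs to change. */
class FancyCacheAdapter implements Cache {
    private final FancyCacheLib delegate = new FancyCacheLib();
    @Override public void put(String key, String value) { delegate.store(key, value); }
    @Override public String get(String key) { return (String) delegate.fetch(key); }
}
```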
&lt;h3&gt;
  
  
  Native abstraction of foreign code
&lt;/h3&gt;

&lt;p&gt;And the above brings me to the practice that I've come to like the most. This practice solves many of the problems, regardless of whether I write my own solution or use a library. And while it doesn't solve the rest of the problems, it makes their mitigation easier and very non-invasive to the project.&lt;/p&gt;

&lt;p&gt;Whenever you are introducing a library to your code, write an abstraction for it. If you want to introduce iText (a PDF generation utility), write an interface that has a &lt;code&gt;PdfFile toPdf(DocumentToConvert doc);&lt;/code&gt; method. Implement both the data structures and that method, making it use the iText library under the hood. Your code should NOT be using iText directly. Another example is JSON serialization. There are 2 major players out there (with others not far behind): Jackson and GSON. Instead of calling them directly, hide them behind an interface (a contract layer)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public interface JSONSerializer {
    &amp;lt;T&amp;gt; T fromJson(String json, Class&amp;lt;T&amp;gt; type) throws SerializationException;
    String toJson(Object pojo) throws SerializationException;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and write an implementation that uses either Jackson or GSON (or any other library).&lt;/p&gt;
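&lt;p&gt;A sketch of what an implementation behind that contract might look like (the &lt;code&gt;SerializationException&lt;/code&gt; class and the toy &lt;code&gt;FlatMapSerializer&lt;/code&gt; are made up for this example; a real implementation would delegate to Gson or Jackson as shown in the comments):&lt;/p&gt;

```java
// The contract from above, repeated here so the sketch compiles on its own.
class SerializationException extends Exception {
    SerializationException(String msg, Throwable cause) { super(msg, cause); }
}

interface JSONSerializer {
    <T> T fromJson(String json, Class<T> type) throws SerializationException;
    String toJson(Object pojo) throws SerializationException;
}

/* A Gson-backed implementation would simply delegate:
 *   public <T> T fromJson(String json, Class<T> type) { return gson.fromJson(json, type); }
 *   public String toJson(Object pojo)                 { return gson.toJson(pojo); }
 */

/** A toy stand-in implementation - useful e.g. in unit tests, where product
 *  code shouldn't care which serializer sits behind the contract. */
class FlatMapSerializer implements JSONSerializer {
    @Override
    public <T> T fromJson(String json, Class<T> type) {
        throw new UnsupportedOperationException("parsing is out of scope for this sketch");
    }
    @Override
    public String toJson(Object pojo) {
        return "{\"value\":\"" + pojo + "\"}"; // naive: does not escape anything
    }
}
```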

&lt;p&gt;This way you decouple foreign code from your product code. As a result, your code becomes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more testable&lt;/li&gt;
&lt;li&gt;less dependent on the actual libraries&lt;/li&gt;
&lt;li&gt;less fragile (doesn't use library features you can live without)&lt;/li&gt;
&lt;li&gt;easier to maintain (extendable)&lt;/li&gt;
&lt;li&gt;more up-to-date, because you can upgrade/swap any library without a sweat (assuming you have created SOLID abstractions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I like most about this is that I can swap out libraries as often as I like, or even depend on multiple libraries (or custom implementations) of the same feature, without the rest of the code ever needing to know about any of it.&lt;/p&gt;

&lt;p&gt;This approach also pays off when you're unsure which library to choose for the job. You might need to try out multiple libraries before you find the one that works best for you. Or, perhaps, you won't be satisfied with any of them and will write your own implementation. Or a hybrid implementation... it doesn't really matter, as long as the rest of the code doesn't need to know about any of it. Just swap the implementations and try them out easily. You won't need to adapt your application code for the change.&lt;/p&gt;

&lt;p&gt;This is more difficult when it comes to frameworks because they are more integrated into your code. However, you can achieve a good enough setup with such abstractions that even make frameworks look like one of the features your code has.&lt;/p&gt;

&lt;p&gt;Notice the highlighted parts of the blog saying that you have to &lt;strong&gt;adapt your code for xxx&lt;/strong&gt;. You should not adapt your code for libraries. If anything, you should write your code assuming there is a simple utility that does the job. This way you write a library to enrich your application code rather than writing your code to be able to use that library you want. Libraries are tools. They should serve YOU, not the other way around.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Divide et impera&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Some libraries and frameworks are huge and do many things. Especially frameworks. They tend to cover lots of areas and solve lots of problems. Hiding such a framework behind a single interface would most likely be silly: the interface would have dozens or hundreds of methods, and maintaining such a contract would be cumbersome. Here comes the interface segregation principle (the &lt;em&gt;I&lt;/em&gt; in SOLID). Although, instead of splitting that enormous interface, you might want to first split the framework logically. What domains does it cover? Suppose it's some e-commerce framework. Can I extract an interface dealing with carts? Can I extract one for orders? For product listings? For promotions? For anything else? The more fine-grained the interfaces you extract, the easier it will be to maintain the abstraction. No one says you have to write different implementations for all the interfaces - in Java a class can implement multiple interfaces. You can also use the Singleton pattern to back all the implementations with the same instance of your e-commerce framework. &lt;/p&gt;
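&lt;p&gt;The segregated-interfaces-plus-Singleton idea can be sketched like this (all class and method names below are hypothetical):&lt;/p&gt;

```java
/** Segregated, domain-sized contracts instead of one God-interface. */
interface CartOperations {
    void addToCart(String userId, String sku);
}

interface OrderOperations {
    String placeOrder(String userId);
}

/** One Singleton implements all the small interfaces; in a real setup each
 *  method would delegate to the same underlying framework instance. */
class CommerceFacade implements CartOperations, OrderOperations {

    private static final CommerceFacade INSTANCE = new CommerceFacade();
    public static CommerceFacade getInstance() { return INSTANCE; }
    private CommerceFacade() { }

    @Override public void addToCart(String userId, String sku) {
        // in a real setup: framework.cart().add(userId, sku)
    }

    @Override public String placeOrder(String userId) {
        // in a real setup: return framework.orders().create(userId)
        return "order-1"; // placeholder value for the sketch
    }
}
```

&lt;p&gt;Consumers depend on &lt;code&gt;CartOperations&lt;/code&gt; or &lt;code&gt;OrderOperations&lt;/code&gt; only - so any single interface can later be re-implemented without the framework, while the others keep delegating to it.&lt;/p&gt;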

&lt;p&gt;This approach applies to any &lt;em&gt;jack-of-all-trades&lt;/em&gt; library/framework. Divide its responsibilities into smaller sets of features and implement them using the same library if you want. Or your own implementation. Or whatever fits the bill. This segregation gives you the freedom you'll eventually want. &lt;/p&gt;

&lt;p&gt;Don't think of a framework as an almighty know-it-all. Think of it as a collection of features bundled into one. And you can use those features separately if you like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cut the losses - let your profits run
&lt;/h3&gt;

&lt;p&gt;If you are a long-time framework user in some particular project and you notice that you spend more time &lt;strong&gt;adapting your code&lt;/strong&gt; to that framework than working out the actual solution, perhaps it's time to leave that framework behind? It's always a choice on the table. If you've introduced the framework into your codebase as suggested above (nicely segregated abstractions), you can get rid of this bottleneck of a framework in no time. Just write your own implementations of those interfaces - iteratively, if you like - and eventually you'll have eradicated the framework completely. You don't even have to go all the way: you're free to keep the framework behind the interfaces you wouldn't benefit from rewriting, and use your own implementations where you used to experience most of the maintenance bottlenecks. Either way, you're relieved of the duty to keep adapting your code to the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;So should I use a library for that? The answer is: what do your project guidelines say about it? Do the pros of the newly introduced foreign code outweigh potential hazards and maintenance hell? Will it be easier to maintain the code with the library or with a custom implementation? Is it reasonable to write a custom implementation in the first place?&lt;/p&gt;

&lt;p&gt;Define that in your project guidelines. If you want to, you can define your project to be one huge dependency graph with your code as the glue holding the parts together. However, such a project will most likely be unmaintainable (pretty good for a PoC though). If you like to, you can write all the features yourself without any libraries. It will burn a lot of time, create a lot of bugs and you will reinvent the wheel; but you'll have complete control over every aspect of the code. Or don't be a radical and choose the model that suits you best. If you asked me, I'd say use a library if it would take you &amp;gt;2 hours to write your own implementation to solve that problem; but regardless of whether you're using a library or a custom implementation, write an abstraction for it and only use the abstraction in your code.&lt;/p&gt;

&lt;p&gt;I find frameworks very useful to start a project with - later on I might phase them out of the codebase. Libraries are great for PoCs and similar code writeups requiring an extremely short TTM (time-to-market) - I tend to reevaluate the need for them soon after.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sandofsky.com/architecture/third-party-libraries/"&gt;https://sandofsky.com/architecture/third-party-libraries/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>dependencies</category>
      <category>design</category>
      <category>hell</category>
      <category>security</category>
    </item>
    <item>
      <title>JVM. Memory management</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Fri, 09 Apr 2021 10:50:41 +0000</pubDate>
      <link>https://forem.com/netikras/jvm-memory-management-2ljh</link>
      <guid>https://forem.com/netikras/jvm-memory-management-2ljh</guid>
      <description>&lt;h2&gt;
  
  
  Indirect memory access
&lt;/h2&gt;

&lt;p&gt;If you are familiar with low-level programming languages, like C or C++ or ASM, you might recall how memory management is done there. Basically, your application code has to estimate how much memory it will require for a data structure, ask the OS to allocate that memory as a contiguous memory block, access "cells" of that block separately (prepare/read/write) and, when no longer needed, release (free) that memory block explicitly. If you forget to release that memory block, you have a memory leak, which will eventually consume all the RAM you have.&lt;/p&gt;

&lt;p&gt;In JVM life is a lot easier. Writing Java code you only have to worry about WHATs. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I want to create WHAT? A String. (I don't care HOW)&lt;/li&gt;
&lt;li&gt;I want to copy WHAT? A List of Objects. (I don't care HOW)&lt;/li&gt;
&lt;li&gt;I want to pass a WHAT to this function? An array. (I don't care HOW)&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The JVM takes care of all the HOWs (except for arrays - you still need to say how many items, though not bytes, they will contain). It will estimate how many bytes you require and allocate the memory blocks for you. Will it allocate contiguous memory blocks? You don't know. Will it allocate memory in a copy-on-write fashion? You don't know. And you don't need to worry about those details. Worry instead about WHAT your code will do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory pools
&lt;/h2&gt;

&lt;p&gt;JVM has its own understanding of what the memory layout should look like. JVM divides all the memory it &lt;strong&gt;requires&lt;/strong&gt; into pools. Note the highlighted "requires". It does not take up &lt;em&gt;all&lt;/em&gt; the memory or &lt;em&gt;most of&lt;/em&gt; the memory. It only consumes the memory it requires. And demands for each pool are different. Some are accessed more often, others - less often. Some change their contents completely several times per second, others - keep the same content until the application exits. Some contain gigabytes of memory, while others only a few megabytes.&lt;/p&gt;

&lt;p&gt;At runtime, your code will ask the JVM to access all those pools for different reasons: calling a method (function), creating a new object, running the same algorithm over and over, some pools' saturation reached 100%, and so on. It's a somewhat complex system, doing its best to keep your code comfortable, not worrying about HOW (or WHEN) to get or release memory.&lt;/p&gt;

&lt;p&gt;Memory pools are divided into 2 groups: Heap and non-Heap. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Heap is where all your application objects are stored. It is the larger of the two and it's usually the Heap that causes you memory trouble.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Non-Heap memory is everything else. You can think of Heap as the stage in the theatre, and off-heap as all the maintenance rooms and corridors, staircases, cupboards and other places you need to prepare for the show and keep it going.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory pools' layout
&lt;/h2&gt;

&lt;p&gt;Consider the image below. It displays the principal layout of JVM memory pools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx62ld7ltm68sl1zy463.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx62ld7ltm68sl1zy463.png" alt="JVM memory pools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Labels hint in what cases data is allocated in each pool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heap
&lt;/h3&gt;

&lt;p&gt;Heap is where your data structures (objects) are stored in. It is usually the largest portion of the JVM memory. Heap itself is divided into 2 main sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Young generation, which is also divided into

&lt;ul&gt;
&lt;li&gt;Eden&lt;/li&gt;
&lt;li&gt;survivor0&lt;/li&gt;
&lt;li&gt;survivor1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Old generation (also called Tenured space)&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  YoungGen
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Eden
&lt;/h5&gt;

&lt;p&gt;Eden is where objects are initially &lt;em&gt;created&lt;/em&gt;. They stay in Eden for some time - for as long as there is room in this pool - and then the decision is made to either keep them or discard them releasing their memory. If the object is discarded - &lt;em&gt;poof&lt;/em&gt; - the memory is no longer allocated and it's freed (in a way). If, however, the object is still used in your java code, it will be tagged with the "survivor" label and moved over (promoted) to the next memory pool - Survivors.&lt;/p&gt;

&lt;h5&gt;
  
  
  Survivors
&lt;/h5&gt;

&lt;p&gt;There are 2 survivor regions: s0 and s1. One of them is usually empty while the other one is being filled up, and when it fills up, they switch roles. Suppose objects from Eden were promoted to S1. All the other objects that survive Eden will come to stay at S1 for as long as there is any room in the S1 pool. When S1 fills up, an inspector comes and checks each and every object in S1, labelling the objects that are no longer required by your Java code. All the objects that got tagged are immediately released. The remaining objects get a medal for surviving S1 and are moved over to the S0 region. In the meantime, all the objects that survive Eden are now promoted to S0 too. When S0 becomes full, the same inspection takes place, and the survivors are moved to S1 with one more medal. When a survivor gets enough medals, it's ranked as a long-living survivor and is promoted to the ultimate memory pool, where only long-living objects are stored: the OldGen.&lt;/p&gt;

&lt;h5&gt;
  
  
  OldGen
&lt;/h5&gt;

&lt;p&gt;OldGen is usually the largest pool in the Heap. It contains long-living objects, veterans, if you will, that have been serving for a long time. They are so-called mature objects. As more and more objects mature, the OldGen will become more and more populated. &lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Heap
&lt;/h3&gt;

&lt;p&gt;This is the mysterious part of the JVM memory. It's difficult to monitor it, it's small and people usually don't even know it exists, not to mention knowing what it's used for. &lt;/p&gt;

&lt;h4&gt;
  
  
  PermGen / Metaspace
&lt;/h4&gt;

&lt;p&gt;PermGen (Permanent Generation) is a region that was used up until Java 8, when it was renamed to Metaspace and got its limits lifted. PermGen stores classes' metadata (hence the name Metaspace). Not the Class&amp;lt;?&amp;gt;.class instances (they are stored in the Heap, along with other instances), but just the metadata. Class metadata describes what the class is called, what its memory layout is, what methods it has, its static fields and their values. Everything &lt;em&gt;static&lt;/em&gt; is stored in PermGen/Metaspace. When a class loader is created, it gets a memory region allocated in PermGen for that classloader's classes' metadata. Each classloader has its own region - that's why classloaders do not share classes. &lt;br&gt;
PermGen is an area that is fixed in size. You have to tell the JVM in advance (unless you are satisfied with the defaults) how large a PermGen you want, and once the limit is reached, an OutOfMemoryError is thrown. It was also permanent - classes, once loaded, could not be unloaded. This became a problem as more and more applications, libraries and frameworks began to rely on bytecode manipulation. Since Java 8 this region is called Metaspace and the limit was lifted: Metaspace can grow as large as it likes (or for as long as there is memory left on the host). This growth can be capped with JVM parameters.&lt;/p&gt;

&lt;h4&gt;
  
  
  JIT Code Cache
&lt;/h4&gt;

&lt;p&gt;As the JVM runs, it keeps on executing the same parts of the code over and over again - some parts of the code are hotter than others (HotSpots). Over time, the JVM notes down the sets of instructions that keep recurring and are a good fit for optimization. These sets of instructions may be compiled into native machine code, and native code no longer runs on the bytecode interpreter - it runs directly on the CPU. That is a great performance boost, improving the performance of those methods by one or several orders of magnitude. The compilation is done by the JIT (Just-in-Time) compiler, and the compiled machine code is stored in the JIT code cache region.&lt;/p&gt;
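&lt;p&gt;You can watch this happen. The sketch below (names made up) hammers one method until it becomes hot; run it with &lt;code&gt;java -XX:+PrintCompilation HotLoop&lt;/code&gt; and you should see the JIT compiler picking the method up:&lt;/p&gt;

```java
public class HotLoop {

    // Called tens of thousands of times - a prime candidate for JIT compilation.
    static long sumTo(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 20_000; i++) {
            total += sumTo(1_000); // keep the method hot
        }
        System.out.println(total);
    }
}
```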

&lt;h4&gt;
  
  
  GC
&lt;/h4&gt;

&lt;p&gt;This is a tiny region that the Garbage Collector uses for its own needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Symbol
&lt;/h4&gt;

&lt;p&gt;This space contains field names, method signatures, cached numeric values and interned (cached) Strings. Numeric compile-time literals (5, 50L, 3.14D, etc.) are cached and reused by the JVM in order to preserve memory (literals are immutable, remember? They are static final). A similar thing happens with Strings too. Moreover, Strings can be interned manually, at runtime: if the &lt;code&gt;String.intern()&lt;/code&gt; method is called, it will be cached too. Next time a string with the same contents is referenced, the interned String instance will be used instead of creating a new String object. If this region starts growing too large, it might mean that your java code is interning too many different strings.&lt;/p&gt;
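&lt;p&gt;A quick demonstration of interning in action:&lt;/p&gt;

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "hello";             // a compile-time literal - interned automatically
        String b = new String("hello"); // explicitly a new object on the Heap
        String c = b.intern();          // returns the pooled (interned) instance

        System.out.println(a == b); // false: two distinct objects
        System.out.println(a == c); // true:  both references point at the interned String
    }
}
```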

&lt;h4&gt;
  
  
  Shared class space
&lt;/h4&gt;

&lt;p&gt;This is a small memory region that stores .jsa files - java classes' data files, prepared for fast loading. This region is used to speed up JVM startup, particularly the part of the startup where system libraries are loaded. It doesn't have much of an impact during runtime.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compiler
&lt;/h4&gt;

&lt;p&gt;This region is used by the JIT Compiler. It's a working area for the compiler - it does not store the compiled machine code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Logging
&lt;/h4&gt;

&lt;p&gt;Unfortunately, Java docs are not very wordy when it comes to this region. They only say that this region is used for logging. We can only assume it's used actively during runtime, but it's unclear what problems can occur when this region is overutilized.&lt;/p&gt;

&lt;h4&gt;
  
  
  Arguments
&lt;/h4&gt;

&lt;p&gt;This is also a tiny region, that stores command-line arguments of the java.exe command. This memory does not play any significant role at runtime, as it's mainly populated during boot-time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Internal
&lt;/h4&gt;

&lt;p&gt;Quoting the Java docs: "Memory that does not fit the previous categories, such as the memory used by the command line parser, JVMTI, properties, and so on"&lt;/p&gt;

&lt;h4&gt;
  
  
  Other
&lt;/h4&gt;

&lt;p&gt;Quoting the java docs: "Memory not covered by another category"&lt;/p&gt;

&lt;h4&gt;
  
  
  Thread (JVM Stack)
&lt;/h4&gt;

&lt;p&gt;This region can potentially cause problems, especially in heavy-duty applications. This area contains threads' meta info. Whenever a thread calls a method, the JVM pushes the called method's signature (a stack frame) onto that thread's stack, located in the Thread (JVM Stack) area. References to all the passed method parameters are pushed along with it. The more threads you have, and the deeper their call stacks go, the more memory they will require in this region. Stack size can be limited per thread with JVM parameters, but there is no way to limit the number of threads. This effectively means that uncontrolled proliferation of threads might exhaust your &lt;strong&gt;system&lt;/strong&gt; memory (it's off-heap, remember?).&lt;/p&gt;
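&lt;p&gt;A simple way to feel this region's limits (a sketch, names made up): recurse until the stack overflows, then rerun with different &lt;code&gt;-Xss&lt;/code&gt; values and watch the reachable depth change:&lt;/p&gt;

```java
public class StackDepthProbe {

    static int depth = 0;

    static void recurse() {
        depth++;
        recurse(); // each call pushes one more stack frame
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("stack overflowed at depth " + depth);
        }
    }
}
```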

&lt;h4&gt;
  
  
  NMT
&lt;/h4&gt;

&lt;p&gt;This is a tiny region used by Java's NativeMemoryTracking mechanism for its internal needs. NMT is a feature you want enabled if you have memory usage concerns. It's a feature of the JVM that allows us to see what is actually happening off-heap, as there are no other ways to reliably observe off-heap memory usage. However, enabling NMT adds a ~10% performance penalty, so it is not something you want to keep enabled in a live production system on a daily basis.&lt;/p&gt;

&lt;h4&gt;
  
  
  Native allocations
&lt;/h4&gt;

&lt;p&gt;If off-heap is memory that is not stored in the Heap regions (and not restricted by Heap limits), the Native Allocations region can be seen as off-off-heap. It is the part of memory that the JVM does not manage at all. &lt;strong&gt;At all&lt;/strong&gt;. There are very few ways to reach this memory from Java code (&lt;code&gt;DirectByteBuffer&lt;/code&gt; or &lt;code&gt;ByteBuffer.allocateDirect()&lt;/code&gt;). This part of memory is extensively utilized in hybrid Java applications using JNI - Java applications that also use components written in C/C++. This is often the case in high-throughput Java applications and Android development, where some components are written in native code to boost performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory pools' sizes and limits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Heap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Default MAX size

&lt;ul&gt;
&lt;li&gt;jdk1.1: 16MB&lt;/li&gt;
&lt;li&gt;jdk1.2: 64MB&lt;/li&gt;
&lt;li&gt;jdk1.5: &lt;code&gt;Math.min(1GB, totalRam/4)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;jdk6u18: 

&lt;ul&gt;
&lt;li&gt;if total RAM is &amp;lt;192MB: totalRam/2&lt;/li&gt;
&lt;li&gt;if total RAM is &amp;gt;192MB: totalRam/4 (some systems: &lt;code&gt;Math.max(256MB, totalRam/4)&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;jdk11 [up until jdk16]: totalRam/4&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Configuration (&lt;a href="http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/hotspot/share/runtime/arguments.cpp#l1750" rel="noopener noreferrer"&gt;algorithm&lt;/a&gt;, verification: &lt;code&gt;java -XX:+PrintFlagsFinal -version&lt;/code&gt;)

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-Xmx&lt;/code&gt; (e.g. &lt;code&gt;-Xmx10g&lt;/code&gt;) can be used to set custom maximum heap size&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxRAMPercentage&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxRAMPercentage=75&lt;/code&gt;) (since jdk8u191; default: 25) can be used to adjust max Heap size in percent-of-total-ram manner&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxRAMFraction&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxRAMFraction=2&lt;/code&gt;) (since jdk8u131 up to jdk8u191; default: 4) is another way to configure what part of the total RAM the Heap can allocate. It's basically the &lt;strong&gt;x&lt;/strong&gt; in the formula &lt;em&gt;maxHeap = totalRam / &lt;strong&gt;x&lt;/strong&gt;&lt;/em&gt;. On a machine with totalRam=16GB, &lt;code&gt;MaxRAMFraction=1&lt;/code&gt; is equal to setting &lt;code&gt;-Xmx16g&lt;/code&gt;, &lt;code&gt;MaxRAMFraction=2&lt;/code&gt; is equal to &lt;code&gt;-Xmx8g&lt;/code&gt;, &lt;code&gt;MaxRAMFraction=8&lt;/code&gt; is equal to &lt;code&gt;-Xmx2g&lt;/code&gt;, and so on.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxRAM&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxRAM=1073741824&lt;/code&gt;) normally the JVM asks the OS (or cgroups) what the totalRam on the machine is. MaxRAM overrides this value - with this flag you can make the JVM think there are 1073741824 bytes (in the given example) available in the system. The JVM will use this value to calculate memory pools' sizes dynamically. If -Xmx is passed, MaxRAM has no effect.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
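&lt;p&gt;The &lt;code&gt;MaxRAMFraction&lt;/code&gt; arithmetic from the list above, as a quick worked example (the 16GB machine is assumed):&lt;/p&gt;

```java
public class HeapDefaults {
    public static void main(String[] args) {
        long totalRam = 16L * 1024 * 1024 * 1024; // assume a 16 GB machine
        int maxRamFraction = 4;                   // the default value

        long maxHeap = totalRam / maxRamFraction; // maxHeap = totalRam / x
        System.out.println(maxHeap / (1024 * 1024 * 1024) + " GB"); // prints "4 GB"
    }
}
```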

&lt;h4&gt;
  
  
  YoungGen
&lt;/h4&gt;

&lt;p&gt;Some configurations might not work OOTB, because the Adaptive Size Policy might be overriding them. To disable ASP use &lt;code&gt;-XX:-UseAdaptiveSizePolicy&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default MAX size

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NewRatio=2&lt;/code&gt; (2/3 of Heap is OldGen, 1/3 is YoungGen)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-Xmn&lt;/code&gt; (e.g. &lt;code&gt;-Xmn2g&lt;/code&gt;) sets the size (both, min and max) of the YoungGen to some particular value.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:NewRatio&lt;/code&gt; (e.g. &lt;code&gt;-XX:NewRatio=3&lt;/code&gt;) defines the youngGen:oldGen ratio. For example, setting  &lt;code&gt;-XX:NewRatio=3&lt;/code&gt;  means that the ratio between the young and tenured generation is 1:3. In other words, the YoungGen (combined size of the eden and survivor spaces) will be 1/4 of the total heap size. This parameter is ignored if either &lt;code&gt;NewSize&lt;/code&gt; or &lt;code&gt;MaxNewSize&lt;/code&gt; is used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxNewSize&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxNewSize=100m&lt;/code&gt;) sets the maximum size of the YoungGen.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h5&gt;
  
  
  Eden and Survivors
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Default MAX size

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SurvivorRatio=8&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:SurvivorRatio&lt;/code&gt; (e.g. &lt;code&gt;-XX:SurvivorRatio=6&lt;/code&gt;) defines the eden:survivors ratio. In this example, the ratio is 1:6. In other words, each survivor space will be 1/7 the size of &lt;em&gt;eden&lt;/em&gt;, and thus 1/8 the size of the &lt;em&gt;young generation&lt;/em&gt; (not one-seventh, because there are two survivor spaces).  Survivor size can be calculated with this formula: &lt;em&gt;singleSurvivorSize = youngGenSize / (&lt;strong&gt;SurvivorRatio&lt;/strong&gt; + 2)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
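&lt;p&gt;The formula above, worked through for an assumed 800MB YoungGen with &lt;code&gt;SurvivorRatio=6&lt;/code&gt;:&lt;/p&gt;

```java
public class SurvivorSizing {
    public static void main(String[] args) {
        long youngGen = 800L * 1024 * 1024; // assume an 800 MB YoungGen
        int survivorRatio = 6;

        // singleSurvivorSize = youngGenSize / (SurvivorRatio + 2)
        long survivor = youngGen / (survivorRatio + 2); // 100 MB (there are two such spaces)
        long eden = youngGen - 2 * survivor;            // 600 MB -> eden:survivor = 6:1

        System.out.println("eden=" + eden / (1024 * 1024)
                + "MB, each survivor=" + survivor / (1024 * 1024) + "MB");
    }
}
```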

&lt;h3&gt;
  
  
  Off-Heap
&lt;/h3&gt;

&lt;p&gt;Most of these regions are uncapped, meaning they can grow without any limits. Usually, that's not a problem, as most of them are used by internal JVM mechanisms and the memory is very unlikely to leak. However, the Native Memory pool - used by JNI and JNA, as well as by direct buffers in Java code - is more likely to cause memory leaks.&lt;/p&gt;

&lt;h4&gt;
  
  
  PermGen
&lt;/h4&gt;

&lt;p&gt;Up to jdk8&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default Max Size

&lt;ul&gt;
&lt;li&gt;MaxPermSize=64MB on 32-bit systems and 85.(3)MB on 64-bit machines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxPermSize&lt;/code&gt; (e.g.&lt;code&gt;-XX:MaxPermSize=2g&lt;/code&gt;) sets the max size of the PermGen memory pool&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Metaspace
&lt;/h4&gt;

&lt;p&gt;From jdk8 onwards&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default Max Size

&lt;ul&gt;
&lt;li&gt;unlimited&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxMetaspaceSize&lt;/code&gt; (e.g. &lt;code&gt;-XX:MaxMetaspaceSize=500m&lt;/code&gt;) sets the max size of the Metaspace region. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  JIT Code Cache
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Default Max Size

&lt;ul&gt;
&lt;li&gt;jdk1.7 and below: ReservedCodeCacheSize=48MB&lt;/li&gt;
&lt;li&gt;jdk1.8 and above: ReservedCodeCacheSize=240MB with TieredCompilation enabled (by default). When &lt;code&gt;-XX:-TieredCompilation&lt;/code&gt; (disabled), ReservedCodeCacheSize drops to 48MB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:ReservedCodeCacheSize&lt;/code&gt; (e.g. &lt;code&gt;-XX:ReservedCodeCacheSize=100m&lt;/code&gt;) can set max size of the JIT code cache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:UseCodeCacheFlushing&lt;/code&gt; (e.g. &lt;code&gt;-XX:+UseCodeCacheFlushing&lt;/code&gt; or &lt;code&gt;-XX:-UseCodeCacheFlushing&lt;/code&gt;) to enable or disable JIT cache flushing when &lt;a href="https://docs.oracle.com/javase/8/embedded/develop-apps-platforms/codecache.htm" rel="noopener noreferrer"&gt;certain conditions&lt;/a&gt; are met (a full cache is one of them).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  GC
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Symbol
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Shared Class Space
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Compiler
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Logging
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Arguments
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Internal
&lt;/h4&gt;

&lt;p&gt;Uncapped&lt;/p&gt;

&lt;h4&gt;
  
  
  Threads' stacks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Default Max Size

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-Xss1024k&lt;/code&gt; in 64-bit VM and &lt;code&gt;-Xss320k&lt;/code&gt; on 32-bit VMs (since jdk1.6)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configuration

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-Xss&lt;/code&gt; (e.g. &lt;code&gt;-Xss200m&lt;/code&gt;) or &lt;code&gt;-XX:ThreadStackSize&lt;/code&gt; (e.g. &lt;code&gt;-XX:ThreadStackSize=200m&lt;/code&gt;) will limit the stack size of a single thread. However, &lt;a href="http://mail.openjdk.java.net/pipermail/hotspot-dev/2011-July/004290.html" rel="noopener noreferrer"&gt;avoid using ThreadStackSize&lt;/a&gt;. Setting ThreadStackSize to 0 will make the VM use the system (OS) defaults. &lt;a href="https://www.artima.com/insidejvm/ed2/jvm8.html" rel="noopener noreferrer"&gt;There is no easy way&lt;/a&gt; to calculate how large a stack you may need, so you may want to adjust Xss when the default is not enough.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Heap
&lt;/h3&gt;

&lt;p&gt;Heap is rather easy to monitor. Heap usage is tracked closely by default, you just need tools to access that information. &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr014.html" rel="noopener noreferrer"&gt;See here&lt;/a&gt; for more info.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;jmap -heap &amp;lt;pid&amp;gt;&lt;/code&gt; or &lt;code&gt;jhsdb jmap --heap --pid &amp;lt;pid&amp;gt;&lt;/code&gt; displays sizes of each Heap region and PermGen.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jmap -histo &amp;lt;pid&amp;gt;&lt;/code&gt; (warning: might slow down the JVM for some time; the output might be lengthy, so dump it to a file) takes a histogram of all the classes in the Heap. It's like a summary of a HeapDump. Here you can find all the classes, the number of their instances and how much heap each class utilizes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jstat -gc -t &amp;lt;pid&amp;gt; 1000&lt;/code&gt; will print usage of each Heap region every 1000 milliseconds (i.e. every second) - useful for live monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Off-Heap
&lt;/h3&gt;

&lt;p&gt;It's difficult to monitor off-heap memory. It's advised to monitor this 'dark side of the JVM' only when required, i.e. when there are memory-related problems you are trying to debug. That is because the method that gives you the best visibility into off-heap memory, &lt;a href="https://docs.oracle.com/en/java/javase/12/troubleshoot/diagnostic-tools.html#GUID-635E34C2-CDDC-4C1A-8C3E-3C68F1FEC775" rel="noopener noreferrer"&gt;NMT&lt;/a&gt;, adds &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html" rel="noopener noreferrer"&gt;2 words&lt;/a&gt; to each &lt;code&gt;malloc()&lt;/code&gt;, which ends up costing approximately a 5-10% overall performance penalty.&lt;/p&gt;

&lt;p&gt;NMT can be enabled upon JVM startup by passing either the &lt;code&gt;-XX:NativeMemoryTracking=summary&lt;/code&gt; or the &lt;code&gt;-XX:NativeMemoryTracking=detail&lt;/code&gt; parameter to the &lt;code&gt;java&lt;/code&gt; program. Unless you need very detailed information, &lt;code&gt;summary&lt;/code&gt; should suffice (the amount of &lt;code&gt;detail&lt;/code&gt; output might be unnecessarily overwhelming). When the JVM is started with NMT enabled, you can use &lt;code&gt;jcmd&lt;/code&gt; to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;see current off-heap statistics' summary (&lt;code&gt;jcmd &amp;lt;pid&amp;gt; VM.native_memory summary&lt;/code&gt;), &lt;/li&gt;
&lt;li&gt;see detailed current off-heap statistics (&lt;code&gt;jcmd &amp;lt;pid&amp;gt; VM.native_memory detail&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;establish a baseline of NMT records (&lt;code&gt;jcmd &amp;lt;pid&amp;gt; VM.native_memory baseline&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;track diffs over periods of time (&lt;code&gt;jcmd &amp;lt;pid&amp;gt; VM.native_memory detail.diff&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
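&lt;p&gt;For illustration, a typical NMT session might look like the sketch below (the jar name is a placeholder; the &lt;code&gt;jcmd&lt;/code&gt; subcommands are the ones listed above):&lt;/p&gt;

```shell
# Start the JVM with NMT enabled (app.jar is a placeholder)
java -XX:NativeMemoryTracking=summary -jar app.jar &
PID=$!

# Snapshot of current native memory usage, grouped by JVM subsystem
jcmd "$PID" VM.native_memory summary

# Record a baseline, then later inspect what grew since that baseline
jcmd "$PID" VM.native_memory baseline
jcmd "$PID" VM.native_memory summary.diff
```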

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.programmersought.com/article/19994662437/" rel="noopener noreferrer"&gt;https://www.programmersought.com/article/19994662437/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.programmersought.com/article/20784358357/" rel="noopener noreferrer"&gt;https://www.programmersought.com/article/20784358357/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.programmersought.com/article/50492813434/" rel="noopener noreferrer"&gt;https://www.programmersought.com/article/50492813434/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.programmersought.com/article/69393392824/" rel="noopener noreferrer"&gt;https://www.programmersought.com/article/69393392824/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://plumbr.io/handbook/garbage-collection-in-java/memory-pools" rel="noopener noreferrer"&gt;https://plumbr.io/handbook/garbage-collection-in-java/memory-pools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.oracle.com/poonam/about-g1-garbage-collector,-permanent-generation-and-metaspace" rel="noopener noreferrer"&gt;https://blogs.oracle.com/poonam/about-g1-garbage-collector,-permanent-generation-and-metaspace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/10/gctuning/other-considerations.htm#JSGCT-GUID-B29C9153-3530-4C15-9154-E74F44E3DAD9" rel="noopener noreferrer"&gt;https://docs.oracle.com/javase/10/gctuning/other-considerations.htm#JSGCT-GUID-B29C9153-3530-4C15-9154-E74F44E3DAD9&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html" rel="noopener noreferrer"&gt;https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.baeldung.com/native-memory-tracking-in-jvm" rel="noopener noreferrer"&gt;https://www.baeldung.com/native-memory-tracking-in-jvm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/7/docs/technotes/guides/vm/class-data-sharing.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/javase/7/docs/technotes/guides/vm/class-data-sharing.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://habr.com/ru/company/hh/blog/450954/" rel="noopener noreferrer"&gt;https://habr.com/ru/company/hh/blog/450954/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/en/java/javase/11/troubleshoot/diagnostic-tools.html#GUID-92074912-77E2-46B4-9A2F-A27F10331576" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/java/javase/11/troubleshoot/diagnostic-tools.html#GUID-92074912-77E2-46B4-9A2F-A27F10331576&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crunchify.com/jvm-tuning-heapsize-stacksize-garbage-collection-fundamental/" rel="noopener noreferrer"&gt;https://crunchify.com/jvm-tuning-heapsize-stacksize-garbage-collection-fundamental/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/en/java/javase/11/gctuning/ergonomics.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/java/javase/11/gctuning/ergonomics.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dzone.com/articles/best-practices-java-memory-arguments-for-container" rel="noopener noreferrer"&gt;https://dzone.com/articles/best-practices-java-memory-arguments-for-container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html" rel="noopener noreferrer"&gt;https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/81d815b05abb/src/share/vm/runtime/arguments.cpp#l2227" rel="noopener noreferrer"&gt;http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/81d815b05abb/src/share/vm/runtime/arguments.cpp#l2227&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>jvm</category>
      <category>java</category>
      <category>middleware</category>
      <category>memory</category>
    </item>
    <item>
      <title>JVM. Intro</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Fri, 09 Apr 2021 10:43:11 +0000</pubDate>
      <link>https://forem.com/netikras/jvm-intro-2p2m</link>
      <guid>https://forem.com/netikras/jvm-intro-2p2m</guid>
      <description>&lt;h2&gt;
  
  
  Java Virtual Machine
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: The JVM is an application that can be configured by CLI parameters, environment variables and configuration files called java classes. Classes tell the JVM WHAT to do; env variables and CLI parameters tell it HOW to do those things.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Java eats up all the memory!", "java is bloated", "java is slow", ...&lt;br&gt;
All of the above is true if you are using the JVM wrong. The JVM (Java Virtual Machine) is itself an application, like any other on your computer. In GUI applications (Notepad, VLC, Chrome,..) you can ask an application to do things for you by clicking buttons, links, entering data manually. You can ask most applications to do things by passing them command-line parameters. The JVM is also an application that can be asked to do things: via command-line parameters, via environment variables or via long configuration files written in java bytecode, called java class files. As with most applications, the configuration files are passed to the java.exe application when you start the process.&lt;/p&gt;

&lt;p&gt;It's not exactly the traditional perspective, is it? :) Nevertheless, it is correct. Java application code is nothing but a configuration file for the Java Virtual Machine. JVM, once started, reads that java (byte)code and does what it's asked to do. That also holds true for most (if not all) (pseudo)interpreted programming languages: python, groovy, javascript, PHP, bash, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the middle-man?
&lt;/h3&gt;

&lt;p&gt;That is a good question. Well, there are several reasons. The middleware (the middleman) allows you to write code only once and run it on all the platforms that the middleware can run on. Consider Python - you can install Python on nearly any device and when you write python code, you don't need to worry about the CPU architecture, memory (de)allocation, memory layout, system libraries, etc. If you managed to install Python on 15 different systems, it means you will be able to run the same .py file on all of them. &lt;/p&gt;

&lt;p&gt;The middle layer also decouples the "HOWs" from the "WHATs". Your application code doesn't have to cover HOW you want things to be done. You only have to code WHAT you need to be done. All the rest will be taken care of by the middleware.&lt;/p&gt;

&lt;p&gt;And the latter is the reason why I love Java the most. Not only does it decouple application code from platform intricacies, it also allows you to tweak each of them separately, in a platform-independent way! You can configure the network stack, memory management, optimization techniques, security and monitoring; many, many other nuts and bolts are at your disposal. &lt;/p&gt;

&lt;p&gt;In other words, you can configure HOW the middleware works. You have all the means to optimize or kill the JVM's performance. &lt;/p&gt;

&lt;h3&gt;
  
  
  Okay, but middle-man means it's slower, right?
&lt;/h3&gt;

&lt;p&gt;Yes, and no. The JVM was significantly slower in the '90s and 2000s. The hardware was very limited and sluggish, and the JVM was slowish as well -- all of that added up to significantly slower responsiveness compared to native applications (low-level code, like C, C++, ASM, ..., that, once compiled, runs directly on the CPU). However, since those days Java has implemented plenty of optimization tricks to boost its performance. It's still not as fast as compiled C code, but in most cases it is so close that you might not feel the difference. You know how CPUs have branch prediction hardwired, trying to guess what the application will do next? Well, the JVM employs very similar techniques, on top of the branch prediction.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use the JVM "right"?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Complaining that java is bloated just because the default settings are no longer enough for you is like complaining that cars are useless at night because they come with their headlights off by default.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Usually, the default settings are enough for most applications. The rule of thumb is: "keep defaults unless they are no longer enough". Each application requires different system resources in different amounts and patterns, so it's only natural that the defaults will not be enough for some applications. Others run in containerized environments, or on hardware that allows for performance optimization - you might need to change the defaults there too. If you see your JVM growing in size beyond levels you can tolerate, it means the defaults are no longer enough and it's a good time to change them. Complaining that java is bloated just because the default settings are no longer enough for you is like complaining that cars are useless at night because they come with their headlights off by default.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>java</category>
      <category>jvm</category>
      <category>middleware</category>
    </item>
    <item>
      <title>"Avoid Round-Robin in PROD" or "The tale of the bad raspberry"</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Fri, 19 Mar 2021 15:27:48 +0000</pubDate>
      <link>https://forem.com/netikras/avoid-round-robin-in-prod-or-the-tale-of-the-bad-raspberry-m99</link>
      <guid>https://forem.com/netikras/avoid-round-robin-in-prod-or-the-tale-of-the-bad-raspberry-m99</guid>
      <description>&lt;h2&gt;
  
  
  Why do I care?
&lt;/h2&gt;

&lt;p&gt;It's all good while it's good. When things go south, you might want your balancing mechanism to reliably keep the show on the road. Round-robin works in many cases, but there are cases when RR will slow your service down to a complete halt. And adding more instances won't do you any good. Believe me, &lt;strong&gt;this is not a corner-case&lt;/strong&gt; :) &lt;/p&gt;

&lt;h2&gt;
  
  
  What is round-robin
&lt;/h2&gt;

&lt;p&gt;Round-robin is an algorithm that decides which item to choose next from a list. It's the second simplest algorithm there is (the simplest would be "always select the &lt;em&gt;N&lt;/em&gt;th item", and the third simplest would be choosing a random item). Suppose you have 5 raspberries lined up on your table. These are freshly picked, red, big, juicy, sweet raspberries. Oh, I know you want them! Which one will you take first? Which one after that? And then? &lt;br&gt;
Willingly or not, you will apply one algorithm or another to the sequence in which you pick those raspberries off the table and om-nom-nom 'em real nice. You can take them randomly, sequentially (left-to-right or right-to-left), middle-out, out-middle, biggest first, or any other algo.&lt;/p&gt;

&lt;p&gt;Since RR doesn't really apply to items that aren't reusable, let's assume I'm a generous person. Once you take one of the 5 raspberries, I put another one in the place of the one you just took. Sounds good?&lt;/p&gt;

&lt;p&gt;RR looks like this. You assign indices to your raspberries 1 to 5. Say the left-most is 1 and the right-most is 5.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you take the 1st raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 2nd raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 3rd raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 4th raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 5th raspberry [and I put another one in its place]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and then round we go&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you take the 1st raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 2nd raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 3rd raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 4th raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 5th raspberry [and I put another one in its place]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and then round we go&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you take the 1st raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;you take the 2nd raspberry [and I put another one in its place]&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See, it's a very simple algorithm. You always take raspberries in the same sequence, one-by-one, until you reach the last one. And then you begin the same sequence anew.&lt;/p&gt;
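&lt;p&gt;The berry-picking loop above can be sketched in a few lines of shell (a toy simulation; slot numbers stand in for the berries):&lt;/p&gt;

```shell
# Round-robin over 5 slots: pick, advance, wrap around.
slots=5
i=0
sequence=""
for pick in $(seq 1 12); do
  slot=$(( i % slots + 1 ))   # 1..5, then back to 1
  sequence="$sequence$slot"
  i=$(( i + 1 ))
done
echo "$sequence"   # 123451234512
```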
&lt;h2&gt;
  
  
  Scale up the example
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Scaling the berry eaters (consumers)
&lt;/h3&gt;

&lt;p&gt;Let's enhance our example. Now you are not eating them yourself. You are in a room full of people and they keep on coming for the raspberries. A person approaches your desk, you pick the next raspberry and give it to that person. I replace the berry on your desk with another one and the person walks away.&lt;/p&gt;

&lt;p&gt;Nothing really changes, right? You can still apply the RR algorithm. You know which berry you took last and you know which one you'll pick next.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scaling the berry distributors (producers)
&lt;/h3&gt;

&lt;p&gt;Now the example becomes a bit more complex. We're scaling YOU. To make it go faster, we're now assigning 2 more people to distribute the raspberries. Now there are 3 people working in a round-robin manner. This doesn't change anything really; I'm scaling YOU just to make the example more realistic. If you were alone, it would be difficult for you to serve several people at the same time. Now that there are 3 of you, multitasking becomes very realistic.&lt;/p&gt;

&lt;p&gt;However, you are still picking the same 5 raspberries. You don't have 3 different sets of berries. You only have one set.&lt;/p&gt;
&lt;h3&gt;
  
  
  One bad berry
&lt;/h3&gt;

&lt;p&gt;I have made a decision. Every time you pick a berry at spot #3, I'll no longer replace it. Instead, you will have to come to take it from the bucket yourself and put it on the table. This will slow you down considerably. If you could serve 2 people in 1 second before, now you'll find it hard to serve 1 person in 5 seconds. Your throughput dropped 10-fold: from 2pps to 0.2pps (people-per-second).&lt;/p&gt;

&lt;p&gt;But that's alright since there are 3 of you and there still are 4 berries "cached" on the desk all the time!&lt;/p&gt;

&lt;p&gt;Not really though...&lt;/p&gt;
&lt;h3&gt;
  
  
  The halt
&lt;/h3&gt;

&lt;p&gt;Do not expect that the three of you will always be picking berries in the same order. One of you is faster, another one is slower. You will work at different paces. And the people - they will come randomly: sometimes the queue will be 20 peeps long, other times there will be only 2 folks, both of whom you can handle at the same time (there are 3 of you, the distributors).&lt;/p&gt;

&lt;p&gt;And this is the reason why at times 2 of you (or even all 3) will be handing over the same 3rd berry. While you were running towards me to get that berry, the other 2 folks completed the full cycle (4→5→1→2→3) (maybe even twice!) and now they are running after you - to get the 3rd berry from me.&lt;/p&gt;

&lt;p&gt;What's happening at the client side of the desk? People are waiting. They are getting anxious, because there are 3 of you, there are 4 berries on the table and you are running around to get that 1 berry that isn't there.&lt;/p&gt;

&lt;p&gt;Then you all come back, serve the 3rd berry, complete the cycle and again you go running. And again people are waiting.&lt;/p&gt;
&lt;h2&gt;
  
  
  Raspberries in PROD
&lt;/h2&gt;

&lt;p&gt;They can be anything you are iterating over in an RR manner. Be it DNS records, servers in the pool, the next hops in the LoadBalancer, etc. If you have multiple consumers and a set of items you are serving in an RR pattern, and one of the items is considerably slow, all the consumers will notice the slowdown. And every consumer will be slowed down equally, because the slowdown is not diluted by the number of items in the list: if a consumer gets the bad instance, it gets the slowdown.&lt;/p&gt;

&lt;p&gt;The only thing that adding more items alleviates is the frequency: how often the slowdown will occur.&lt;/p&gt;
&lt;h3&gt;
  
  
  If raspberries were servers
&lt;/h3&gt;

&lt;p&gt;Suppose there are 60 web servers in the pool. Normally a webserver responds in 100ms. Great! One of the 60 servers' MEM% reached 100% and it's currently swapping. CPU% immediately sky-rockets to 100% too. It is still capable of serving requests, but... very, VERY slowly. It now takes some 30sec to serve 1 request. And your liveness probe timeout is 40 seconds, so the server responds to health check polls on time. It can still accept new requests.&lt;/p&gt;

&lt;p&gt;What happened to the raspberries happens there too. All the requesters eventually iterate over all the 60 well-working servers and end up at the one that's terribly slow. Since a browser makes many calls to load a single webpage, it's very easy to complete 1 iteration over the set of 60 servers! And if your webpage fetches 120 items to load the page, the same page load will probably hit the bad server twice, if no one else is using the system. Hit that server once - you'll wait 30 seconds+ to load the page. Hit that server twice - you'll wait 60 seconds+ to load the page. And so on.&lt;/p&gt;
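&lt;p&gt;The arithmetic above can be made explicit (numbers taken from the example: 60 servers, 120 fetches per page, a 30-second response on the bad server):&lt;/p&gt;

```shell
# With round-robin, each server gets an equal share of the fetches.
servers=60
fetches=120
slow_response=30   # seconds per request on the bad server

hits_per_load=$(( fetches / servers ))
extra_wait=$(( hits_per_load * slow_response ))
echo "bad-server hits per page load: $hits_per_load"   # 2
echo "extra wait per page load: ${extra_wait}s+"       # 60s+
```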

&lt;p&gt;How many users are willing to wait 30+ seconds for your webpage to load?&lt;/p&gt;
&lt;h2&gt;
  
  
  Why am I telling you this?
&lt;/h2&gt;

&lt;p&gt;Because we've stepped on that very landmine. As I said in the beginning, RR is a very simple and good algorithm as long as everything works. Heck, it's even a great algorithm!&lt;/p&gt;

&lt;p&gt;But it takes &lt;strong&gt;ONE&lt;/strong&gt; bad berry to halt your RR for good. If you can remove the bad berry from the items set - great! RR is now running all fine again. But if you can't, or it's still kicking and doesn't want to go away...&lt;/p&gt;

&lt;p&gt;We are running an application scaled rather widely horizontally. Think hundreds of instances. And we are running load tests generating solid amounts of requests in a very short period of time. During one of the tests, I decided to take a heap dump of one of the JVMs in the cluster. A HD halts all the threads for half a minute or so but doesn't kill the JVM. And then I noticed the phenomenon: even though there were hundreds of other servers working in parallel and I was only freezing one of them, the load on &lt;em&gt;ALL&lt;/em&gt; the servers dropped completely (from 80% CPU% to ~5% CPU%). So freezing a single server froze the entire application. For good! Now, what if I had been taking that HD in the PROD cluster? Users' browsers would have stopped loading the page.&lt;/p&gt;
&lt;h2&gt;
  
  
  Another phenomenon: heavy workers attract more work
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The problem I had
&lt;/h3&gt;

&lt;p&gt;I recall now why I was taking that heap dump. That JVM's memory usage was higher than on other JVMs in the cluster. And the CPU% was higher too. It didn't make a lot of sense: all the instances are the same, why is THAT one getting more load?&lt;/p&gt;

&lt;p&gt;It looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu9n6gv5l6aknsggefd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu9n6gv5l6aknsggefd2.png" alt="Initial problem graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice how the &lt;em&gt;red&lt;/em&gt; server load is significantly higher than on the other servers. Then it drops and another server immediately takes over. Now look at the beginning of the test: the load is ramping up on all the servers, and then that &lt;em&gt;red&lt;/em&gt; server goes rogue and the load drops on all other servers. While the load generators keep on generating the same amount of load.&lt;/p&gt;

&lt;p&gt;I looked everywhere: LB distribution, proxies' access logs, application logs, configurations, thread dumps, GC logs, stickiness... Nothing was off. According to the setup, all the instances should be receiving the same amount of workload. Yes, yes, some requests are heavier than others. But we are talking about tens of thousands of parallel requests and hundreds of servers in the pool. I'd expect more than one server to exhibit that behaviour!&lt;/p&gt;

&lt;p&gt;I thought hard. I was modelling different request paths in my head for several days, and then it hit me: what if the CPU% is the cause and not the effect? Let's try it out. I ran several loops like below on one of the well-performing servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while :; do :; done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to increase cpu% usage and effectively slow down the throughput in that JVM. And it worked. I got control of the phenomenon: I found a way to break an instance. See here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn5bo89308ztg6lmp63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn5bo89308ztg6lmp63.png" alt="While loop slowing down the JVM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I could deliberately make either of the instances &lt;em&gt;bad&lt;/em&gt;. This was a step in the right direction. Does this only work with the CPU%? Let's try a &lt;a href="https://major.io/2009/06/15/two-great-signals-sigstop-and-sigcont" rel="noopener noreferrer"&gt;SIGSTOP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3szastaf820y2oj2v3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3szastaf820y2oj2v3c.png" alt="Effects of SIGSTOP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oh my... That I did NOT expect. You can clearly see where I issued a SIGSTOP &lt;strong&gt;ON A SINGLE INSTANCE&lt;/strong&gt;. All the instances halted. The SIGSTOP was followed, after a short pause of several minutes, by a SIGCONT, to keep the app alive.&lt;/p&gt;

&lt;p&gt;As you see, freezing (not KILLING, not restarting, not shutting down, not removing from the network, but freezing) a single instance in a cluster halts all the other instances. It doesn't happen immediately - there's a delay of several seconds (could be minutes: the more requests are coming, the shorter the delay will be). And it doesn't really help to have thousands of instances in the pool... The result will be the same; the delay might just be slightly longer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did I have that problem
&lt;/h3&gt;

&lt;p&gt;It might seem like there are two problems in my case above, but... really, it's the same one. It's round-robin load balancing and a bad raspberry in the cluster.&lt;/p&gt;

&lt;h4&gt;
  
  
  Single node freezes the whole cluster
&lt;/h4&gt;

&lt;p&gt;Remember the raspberry example? If I slow down one node (or if I freeze you when you try to take raspberry #3), the whole flow eventually aligns up at the slower node and all the requests slow down at that point. If the node is halting the requests, then all the requests will halt. They won't be rejected or dropped - they will just... stop. For as long as the bad node is standing still.&lt;/p&gt;

&lt;h4&gt;
  
  
  Heavy worker steals work
&lt;/h4&gt;

&lt;p&gt;Now for the initial problem. It was a head-scratcher. The problem was that the server was slightly slowing down when processing one of the requests. Let's call that request &lt;em&gt;POST_X&lt;/em&gt;. Processing a POST_X request caused 100% cpu% usage for a very short time, which slowed all the transactions a tiny little bit. However, that little spike slowed the JVM just enough for another two POST_X requests to reach that server. Now the 100% CPU usage lasted twice as long. Which caused another bunch of POST_X requests to get trapped in that server. And so on and so forth. Eventually, that instance was doing nothing but processing POST_X requests (and a few others). It's easy to imagine that the CPU% was at 100% all the time. It became a bad raspberry. And because it was a slow server, it eventually attracted all sorts of requests, not just POST_X. This explains why all the servers lost their load and that one bad berry attracted most of the requests sent to the cluster.&lt;/p&gt;

&lt;p&gt;There was only one bad berry. Other instances also had to process POST_X requests and they also used to slightly spike their cpu%. However, the server that went bad first acted as a bottleneck - requests got held in that single server and fewer requests per second reached all the other servers. Meaning that in other servers the POST_X-induced JVM pause was not long enough for another POST_X request to arrive before the peak ended (there were fewer requests floating around, as e2e flows were stuck in the bad server).&lt;/p&gt;

&lt;p&gt;See the change of winds on the graph? Sometimes the bad berry jumps onto another server. I haven't checked that, but I assumed it could be JVM GC on one (or several) of the good servers that kicked in, held some of the requests (those e2e flows didn't iterate over to the current bad berry) and gave the bad berry some time to cope with its current workload. As it did, its CPU% dropped. As soon as the GC ended, someone had to become the bad berry. If we have 100 nodes in the cluster, every node had a 1/100 probability of becoming the bad berry. It could be the same node, it could be the one that GCed, it could be any other node.&lt;br&gt;
However, that's just a hypothesis I haven't confirmed (nor denied) reliably by factual data.&lt;/p&gt;

&lt;p&gt;The berry became bad because of POST_X requests accumulating, but soon enough there were lots of other requests jamming that same CPU. POST_X was a firestarter. &lt;/p&gt;

&lt;p&gt;The problem was fixed by changing the application code and making the POST_X less CPU-intensive (something about drools and a negative one.. don't ask).&lt;/p&gt;
&lt;h2&gt;
  
  
  What's better than RR?
&lt;/h2&gt;

&lt;p&gt;Well, RR is an amazingly simple and easy algorithm and it seems to be just enough for nearly every use-case. However, that's only true as long as things work well.&lt;/p&gt;

&lt;p&gt;In my particular case, it would be better to load balance between nodes either by applying the least_cpu_load or even better - my favourite - least_active_connections policy. These are more complex, more sensitive policies and require more precise maintenance, but they should prevent your cluster from halting completely if one of the nodes freezes.&lt;/p&gt;
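&lt;p&gt;As a sketch, a least_active_connections pick boils down to "choose the backend with the fewest in-flight requests" (the server names and connection counts below are made up for illustration):&lt;/p&gt;

```shell
# Pick the backend with the fewest active connections.
declare -A active=( [srv-a]=7 [srv-b]=3 [srv-c]=5 )
best=""
min=-1
for s in "${!active[@]}"; do
  if [ "$min" -lt 0 ] || [ "${active[$s]}" -lt "$min" ]; then
    min=${active[$s]}
    best=$s
  fi
done
echo "route next request to: $best"   # srv-b (3 active connections)
```

Note the property RR lacks: a frozen node's connection count only grows, so it naturally stops receiving new work.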

&lt;p&gt;If you want (or have) to stick with RR, make sure you have more than adequate liveness monitoring. For instance, if a server fails to respond to a h/c request in 5 seconds, remove it from the request-receivers pool and let it drain its workload while other nodes handle all the new requests. When that node manages to respond to h/c requests in under 5sec, put it back online. If the node responds to a h/c with an error code, an ECONNRESET or anything else erroneous, remove that node from the request-receivers pool and kill it as soon as its workload drains (if it does). Kubernetes does its routing using iptables, and iptables (netfilter) has a concept of conntrack. Removing the node from the request-receiving pool is as simple as updating the iptables rule matching its IP from&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-j ACCEPT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will allow the server to only accept traffic in already established connections, but won't accept any new connections. The server will also be able to send responses. &lt;/p&gt;
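&lt;p&gt;Put together, draining a node that way might look like the sketch below (illustrative only: the chain and the node IP are placeholders, and the commands require root):&lt;/p&gt;

```shell
# Stop routing NEW connections to the bad node (10.0.0.7 is a placeholder)
# while letting already-established connections finish and drain.
iptables -D FORWARD -d 10.0.0.7 -j ACCEPT
iptables -A FORWARD -d 10.0.0.7 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
```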

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/" rel="noopener noreferrer"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>loadbalancing</category>
      <category>performance</category>
      <category>roundrobin</category>
      <category>truestory</category>
    </item>
    <item>
      <title>Performance and security testing. Why bother?</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Thu, 18 Mar 2021 21:01:41 +0000</pubDate>
      <link>https://forem.com/netikras/performance-and-security-testing-why-bother-4im5</link>
      <guid>https://forem.com/netikras/performance-and-security-testing-why-bother-4im5</guid>
      <description>&lt;h2&gt;
  
  
  Anyone can write an application
&lt;/h2&gt;

&lt;p&gt;Application development is a process during which a product is created or enhanced feature-wise. Usually, clients are so locked in on the goal to "have an application that does THIS and THAT" that they forget the application needs a lot more than just features. Or even worse - they know that and assume it's a part of coding!&lt;br&gt;
What does an application require besides code?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure. Where will it run? Cloud? On-prem? How expensive will that be? Who will maintain it all?&lt;/li&gt;
&lt;li&gt;Stability. What's the estimated app outage per year? How likely is it to break out of the blue? How likely is it to break when the userbase peaks? Who will recover the application? How fast? Will there be a DR?&lt;/li&gt;
&lt;li&gt;Security. Will the application be capable of enforcing access control? Will there be ways to bypass that control? Will developers put security bugs in the code? Or will developers perhaps be unaware of security issues in the (non)commercial tools they will have to use for the app? Or perhaps no one will be aware of such bugs until the app is live and exploited, and your developers will be the ones who fix them for the tools' vendors? Or maybe security will be flawed not in the code, but in the infra? Or at the infra provider level? Or at any other point..?&lt;/li&gt;
&lt;li&gt;Performance. How many requests will the application be able to handle? How many resources does the app require to GO-LIVE and be capable of handling all those requests? How will the app's performance degrade over time? Who will notice it besides the end-users? Will there be proactive performance monitoring and performance control, so that OPS can take action before end-users notice performance degradation?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few more aspects to consider besides just "write the damn code". Anyone can write the code. Even the clients themselves could write the app if they spent a few months learning the language. Does that mean the result is good for the go-live? Would you release it as soon as you think it works? No? Why not?&lt;/p&gt;

&lt;h2&gt;
  
  
  Hire the professionals
&lt;/h2&gt;

&lt;p&gt;Yes, the answer to any business need - just pay someone who knows how to do it. They are professionals - they know how to do it! "I am paying you, so you make sure you do it well!"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;make it fast&lt;/li&gt;
&lt;li&gt;make it secure&lt;/li&gt;
&lt;li&gt;make it pretty&lt;/li&gt;
&lt;li&gt;make it have 4.17 gazillion features&lt;/li&gt;
&lt;li&gt;make it maintainable&lt;/li&gt;
&lt;li&gt;make it stable&lt;/li&gt;
&lt;li&gt;make it without errors&lt;/li&gt;
&lt;li&gt;make it... flawless&lt;/li&gt;
&lt;li&gt;make it cheap&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list of requirements goes on. And that is expected. Business people are (usually) non-technical; they know they want the product to be of high quality and feature-rich. And they are paying professionals for it - the people who claim to be good enough to deserve to be paid for the job. It's only natural to expect the best results!&lt;/p&gt;

&lt;h2&gt;
  
  
  The pickle
&lt;/h2&gt;

&lt;p&gt;Yeah, there are a few problems with the "hire the professionals" approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No matter how good the developer is, you still require a QA. A developer &lt;strong&gt;NEVER&lt;/strong&gt; creates bugs intentionally. No matter how good the developer is, app development is but mental labour. The developer might be tired, moody, feeling sick, stressed, etc. FWIW, a developer might make a (code) design decision that ends up being suboptimal after a few new features. And fixing the code design usually introduces bugs. Unit tests help here a lot, but they are far from enough.&lt;/li&gt;
&lt;li&gt;Say, you got the app created. Say, you've released it to LIVE. What's next? Will it just sit there doing nothing? Will the world stop evolving from that point on? No.
The application users will find ways to intentionally or unintentionally break the system. It is impossible to predict every possible flaw at development time. Believe it or not, even fish in the ocean, wind, solar bursts and the very fact that we are on a rock flying through space around the Sun can be exploited to harm your application. All these factors can harm your application even without a user in the equation - it can &lt;em&gt;just happen&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I'm trying to say is that no matter how good the professionals you hire are, they will not be able to create and release a perfect product for you. The best you can get is &lt;strong&gt;good enough&lt;/strong&gt;. The &lt;em&gt;closer&lt;/em&gt; you want to get to that, the more funds and time it will cost you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application is alive
&lt;/h2&gt;

&lt;p&gt;There's a reason why they call it a &lt;em&gt;go-&lt;strong&gt;LIVE&lt;/strong&gt;&lt;/em&gt;. Once released to the public, your application begins living. And living requires constant maintenance: supply of resources, means to clean up the waste and means to protect itself from hostile forces. And all of those can be both physical and virtual. &lt;/p&gt;

&lt;p&gt;Consider a database with a table containing billions of records. How long would it take you to find a user "Andrew1121" in that registry? Why can't you find it any faster? "I just can't, I'm a human, not a machine" is an almost valid answer. Yes, a machine can find it faster for you. But why can't it find it even faster than that? "Because it can't, it's just a machine, not a {...}". Even machines have limits that, once reached, slow them down considerably. There will come a time when your database is too big to be fast. Your app will become more and more sluggish over time, your app maintenance will grow, expenses will grow and usage (think &lt;em&gt;end users&lt;/em&gt;) will start to decline because of the slowness. Your app was blazing fast in the beginning and its performance degraded over time. The professionals you hired did the job and made a well-performing product, but the &lt;em&gt;life&lt;/em&gt; of the application had its way.&lt;/p&gt;

&lt;p&gt;Consider a webserver. It's the part of your application through which end users are accessing all the features of your web app. It's also like the skin of your application. If anyone tried to harm your app, they would most likely have to get past it, probably by finding some weak spot (a wound, mole, a pore, etc.). The skin heavily depends on its surroundings: atmospheric pressure, temperature, humidity, radiation, etc. Your webservers also heavily depend on the environment: temperature, humidity, radiation, DNS servers, routers' firmware stability, vendors (ISP, cloud, hardware, firmware, software,...), physical network fibres, and everything that can affect them. There are so many variables at play that it's only a matter of time before either one of them is successfully exploited to violate the webserver's security.&lt;/p&gt;

&lt;p&gt;Once released to the public, the application becomes alive and subject to hostile forces and entropy. So just building, deploying and releasing the application is not enough to keep the service alive. Maintenance is mandatory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Security is not just QA
&lt;/h2&gt;

&lt;p&gt;QA usually is a part of the release pipeline in the development cycle. The developer writes code and builds the application, QA tests it to make sure it meets the quality requirements and then the product is shipped. The thing is, QA is usually oriented towards the functionalities of the application. And IMO that's how it's supposed to be. Functionalities are static and the core of the application: the only place they can be changed is the developer's workstation. Once developed/changed, they must be tested to make sure the features are working as expected and don't break the business flows. &lt;/p&gt;

&lt;p&gt;Performance and security are usually considered a part of QA. And that is not right. These factors can ONLY be tested in the LIVE environment. Yes, the code can be challenged performance or security-wise in non-prod environments too, but the results will be off. Why is that so?&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The problem
&lt;/h4&gt;

&lt;p&gt;In PROD we rely on substantially more external forces than in non-prod environments. Different network infrastructure, different servers, locations, different load patterns,.. and the fact that in PROD we potentially have billions of security testers: both human and automated. Do we have those factors in non-prod? Nope. If our pre-prod security testing gives us green results, can we conclude that our application will be secure in PROD? No way. The best we can conclude is that our non-prod application is protected from the attack vectors we have tested for (i.e. we could think of) and that prod is likely to be protected from the same set of attack vectors. What about the other vectors? The ones we haven't thought of, but a 16yo prodigy living in his parents' basement has? What about attacks on the ISP? Or the IaaS? Or human errors in the LIVE environment? Or vendors' employees deciding to profit by illegally downloading your information they have access to and selling it on the black market? Or your competitors?&lt;/p&gt;

&lt;p&gt;We can test security in pre-prod environments, but don't make a mistake and think that these tests reliably reflect your production environment's security. Testing security in the QA phase is better than nothing, as it can catch the &lt;em&gt;most common&lt;/em&gt; security issues before releasing them to prod. But it doesn't mean other issues (the ones we haven't tested for) won't be shipped to prod and it doesn't mean the issues we've fixed or haven't found in non-prod surroundings won't be exploitable in prod conditions.&lt;/p&gt;

&lt;h4&gt;
  
  
  The solution
&lt;/h4&gt;

&lt;p&gt;It is impossible to make an application 100% secure. The best you can get is &lt;strong&gt;good enough&lt;/strong&gt;, and the &lt;em&gt;enough&lt;/em&gt; part is a matter of the SLA. Business and tech people have to agree on security assurance terms and define&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;how often, and at which points in the lifecycle, app security will be tested&lt;/em&gt;&lt;br&gt;
both testing in the QA phase AND periodic tests in a live PROD environment are strongly advised&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;how extensive the tests will be&lt;/em&gt;&lt;br&gt;
only automatic, only manual, or hybrid? None of the approaches will assure 100% security, but the more extensive the testing, the more expensive it will be and the closer to &lt;em&gt;good enough&lt;/em&gt; app security will get.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;what security flaws are tolerable&lt;/em&gt;&lt;br&gt;
sometimes a business might have to make compromises: lose some users for better security or afford a lower security standard; have higher expenses for better security or save some $ hoping no one finds a way to exploit the problem.&lt;br&gt;
Sometimes technicians might have to make compromises too: cover a more serious vulnerability with a less serious one, because the serious one is impossible to fix ATM or ever (e.g. Intel Spectre, Rowhammer, NAT Slipstreaming,...), mitigate a vendor or a third-party dependency vulnerability with less vulnerable workarounds, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no actual solution that makes you secure. You will never be secure. The best you can do is to test for as many attack vectors as you can, both manually and automatically: features, infrastructure, middleware, databases, web servers, caches, network. And prepare for a possible breach: protect your assets and data, and prepare a detailed security incident management plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The problem
&lt;/h4&gt;

&lt;p&gt;While some security testing can be automated as a part of QA, performance testing is even more stubborn. &lt;br&gt;
If the code has security flaws, these flaws will be present in every environment it is deployed to. If the infrastructure has security flaws, they might not be possible to catch early. &lt;br&gt;
If either the code or the infrastructure has performance problems, it's impossible to reliably test them automatically. In performance testing, there's another component that &lt;em&gt;always&lt;/em&gt; plays a part - &lt;strong&gt;under what conditions&lt;/strong&gt;. This component is less pronounced in security testing, while in performance testing it must &lt;em&gt;always&lt;/em&gt; be taken into account. Always.&lt;/p&gt;

&lt;p&gt;If we are testing the performance as part of the QA phase - under what conditions? Will we be able to recreate PROD conditions in a non-prod environment? Are we willing to spend $ for a copy of the PROD environment to test its performance? Will we be able to generate enough load to mimic PROD load? Will we be able to mimic PROD-like e2e flows?&lt;/p&gt;

&lt;p&gt;It's possible, but it's expensive. And difficult. And, more often than not, performance tests show inconclusive results, because conditions are always dynamic, so we need to differentiate between conclusive and inconclusive tests.&lt;/p&gt;

&lt;p&gt;Performance micro testing is as good as unit testing. It's a good way to catch the most common problems early in the development phase, but it will not account for the system as a whole (multiple components, infrastructure, load patterns,...).&lt;/p&gt;

&lt;h4&gt;
  
  
  The good side
&lt;/h4&gt;

&lt;p&gt;Performance testing results, however, tend to be more dependable than security testing results. While it's impossible to know all the possible variables in security testing, we usually know all the variables in performance testing. These are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: frequency, cache sizes, core count&lt;/li&gt;
&lt;li&gt;memory: frequency, capacity, limits&lt;/li&gt;
&lt;li&gt;maintenance tasks: Garbage Collection (duration, frequency), state cleanups (TTL, amount freed, flow changes caused by changed state), ...&lt;/li&gt;
&lt;li&gt;IO: type, frequency of occurrence, added latency, payload size&lt;/li&gt;
&lt;li&gt;storage: capacity, number of sequential reads (and size and latency), number of scattered writes (and size and latency)&lt;/li&gt;
&lt;li&gt;network: call rate, payload size, hops' count, protocol overhead&lt;/li&gt;
&lt;li&gt;limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And all the possible combinations of all of the above. Usually, applications operate in a closed system, which means we know all the components we are working with and we have control over most (if not all) of them. This provides us with the ability to reliably measure latencies, durations and resource utilization -- all of them &lt;strong&gt;under certain conditions&lt;/strong&gt;. I didn't say it's easy to reliably test performance. I said it's possible.&lt;/p&gt;

&lt;h4&gt;
  
  
  The solution
&lt;/h4&gt;

&lt;p&gt;The only reliable way to test performance in a non-prod environment is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;have a copy of PROD (caches and databases no more than a day old) in a non-prod environment; identical infrastructure (resources, configuration, limits, versions)&lt;/li&gt;
&lt;li&gt;ability to restore the environment state after each test (restore databases, caches)&lt;/li&gt;
&lt;li&gt;have another environment to generate load from&lt;/li&gt;
&lt;li&gt;have test scripts for most of the e2e flows (should be a result of prod access logs and analytics tools' statistics analysis)&lt;/li&gt;
&lt;li&gt;set up several different load patterns, e.g. 100 users adding items to the cart while another 70 are browsing the store listings; 5'000 users adding items to the cart while 400 are checking their carts out and 7'000 are browsing product listings and 2'000 new logins are being processed, and so on&lt;/li&gt;
&lt;li&gt;use as many different items as possible, as different properties of the items can trigger different specific flows and produce larger payloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet these test results will only be good enough to claim that the prod application will &lt;em&gt;most likely&lt;/em&gt; perform as well/poorly as it did in non-prod. This is because&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the non-prod data is falling behind prod at every moment&lt;/li&gt;
&lt;li&gt;we are not testing &lt;em&gt;every possible&lt;/em&gt; flow, combination of flows, or combination of parallel flows&lt;/li&gt;
&lt;li&gt;we are not testing the application with &lt;em&gt;every&lt;/em&gt; item (e.g. product, accessory) and combination of items that are possible in prod.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we had such a setup in non-prod and ran a well-written suite of tests, we could be almost sure we'll know how PROD will perform. &lt;/p&gt;

&lt;p&gt;I'd like to stress the importance of having a copy of an as-recent-as-possible PROD database. Databases are complex mechanisms that usually are a bottleneck. I've seen many cases where performance was flawless in PLABs but significantly worse in PROD, even though both environments had nearly identical setups and similar amounts of data. Why is that? Database query execution plans. Most databases keep an eye on each table and its size. They also keep track of how long SQL executions related to each of the tables take. As tables grow in size and SQL executions grow in duration, the DB engine can change the query execution plan. Usually, the newly assigned plan is faster and less resource-intensive. However, there are cases when new plans are slower or significantly slower. There are cases when new plans need some DBA input, like creating a new index (or removing/changing an existing one) to boost the performance of the new plan. The more data the tables contain, the more likely some SQL execution plans are to change. The larger the difference in data volume between the databases (prod and non-prod), the less aligned their performance will be.&lt;/p&gt;
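&lt;p&gt;A toy cost model can show why a plan flips as a table grows. The numbers and formulas below are entirely made up for illustration - no real optimizer is this simple:&lt;/p&gt;

```python
def pick_plan(rows, selectivity, page_rows=100, index_depth=3):
    """Illustrative-only cost model: a full scan reads every page,
    an index lookup pays a fixed tree descent plus one page fetch
    per matching row. The cheapest estimate wins."""
    full_scan_cost = rows / page_rows
    index_cost = index_depth + rows * selectivity
    if index_cost > full_scan_cost:
        return "full scan"
    return "index lookup"

# A tiny table favours the scan; the same query on a huge table
# flips to the index - i.e. the plan changed as the data grew.
small = pick_plan(100, 0.001)        # "full scan"
large = pick_plan(1_000_000, 0.001)  # "index lookup"
```

&lt;p&gt;When a flip goes the other way (the new plan is worse), that's exactly the PROD-only slowdown a stale PLAB database will never reveal.&lt;/p&gt;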

&lt;p&gt;The question is: which business is willing to invest this much in performance testing? I have yet to find one. And this is why clients usually agree to test performance in a live PROD environment (which is no longer a part of the QA pipeline) along with some preliminary testing in a non-prod Performance Lab (PLAB) environment, which has a setup similar to PROD, but not identical - and cheaper. The latter could be classified as QA, while the former - not really. Even more, performance testing can be carried out before the QA phase, i.e. when a developer has several possible solutions to a problem and performance is the main factor deciding which one to implement. In such cases, perf testing can be carried out ad hoc. And this is probably a grey zone between QA and non-QA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Since Security and Performance are not exactly QA, it's a good idea to set up monitoring for them. &lt;/p&gt;

&lt;p&gt;For &lt;em&gt;security&lt;/em&gt; - monitor logs for failed authorization/authentication assertions, request-per-IP rates, deviations from normal e2e flows, limit breaches (request rate, payload size, etc.), and errors in general. Keep track of access logs with original IP addresses, so you can perform a post-mortem analysis in case of an incident.&lt;/p&gt;

&lt;p&gt;For &lt;em&gt;performance&lt;/em&gt; - monitor metrics: response times, transactions per second, error rates, infrastructure resources' utilization: as many and as fine-grained as possible; middleware resources' utilization (e.g. JVM GC, heap, off-heap, threads, JMX metrics,...; nginx connection count, latencies, etc.). In fact, monitor as many metrics as possible, as fine-grained as possible and store them for at least a month. Monitor PLABs and live environments.&lt;/p&gt;
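&lt;p&gt;For response times specifically, averages hide degradations, so percentiles (p50/p99) are the usual metrics to store. A minimal nearest-rank sketch in plain Python, with no particular monitoring stack assumed:&lt;/p&gt;

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest rank = ceil(pct/100 * n); ceiling via negated floor division
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]

# Two slow outliers barely move the median but dominate the tail:
times = [12, 15, 11, 240, 13, 14, 900, 12, 13, 16]
p50 = percentile(times, 50)  # 13
p99 = percentile(times, 99)  # 900
```

&lt;p&gt;This is why monitoring only the average would report a "healthy" service while 1 in 100 users waits nearly a second.&lt;/p&gt;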

&lt;p&gt;That's a lot of investment in monitoring. And, frankly, this price is only a lesser part of the cost of an alternative: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;poor performance and months wasted on investigations; &lt;/li&gt;
&lt;li&gt;damaged image; &lt;/li&gt;
&lt;li&gt;fines for data protection violations (data leaks), not to mention individual lawsuits if security breaches go public.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I always urge my clients to never be cheap when it comes to monitoring. It ALWAYS pays off. Not in increased revenue, but in significantly fewer expenses for maintenance and incident handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's an expensive investment. Is it worth the trouble?
&lt;/h2&gt;

&lt;p&gt;Yes. For the most part - yes. Of course, each client, each business must decide whether performance and security are important to them. If they are, they &lt;strong&gt;must&lt;/strong&gt; be separate bullet points (with sub-points) in the contract. Performance and security are not a part of coding the application. These are completely different services from app coding, and they both require a great deal of investment. So if either security or performance or both are critical for the business, there must be no cutbacks here. Most attempts to "make perf/sec testing cheaper" will render the results of those tests inconclusive and a completely useless waste of funds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>performance</category>
      <category>security</category>
      <category>business</category>
      <category>clients</category>
    </item>
    <item>
      <title>Computer networks. Routing</title>
      <dc:creator>Darius Juodokas</dc:creator>
      <pubDate>Thu, 18 Mar 2021 20:57:50 +0000</pubDate>
      <link>https://forem.com/netikras/computer-networks-routing-an2</link>
      <guid>https://forem.com/netikras/computer-networks-routing-an2</guid>
<description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A router is a guy who knows a guy who knows a guy who ..... who knows where your destination is. Although "should know" is more realistic than "knows".&lt;/p&gt;

&lt;h2&gt;
  
  
  What, why, where
&lt;/h2&gt;

&lt;p&gt;This post is hosted on GitHub servers somewhere on planet Earth (or so I'm told). Your computer is at your house. How do the bits and bytes of this post get delivered to your computer? One way is to find out where these servers are, drive/swim/fly/walk over there, get into the data center, find the right server, plug your USB flash drive in, copy the computer-networks-routing.md file to your drive, go back home, plug your USB drive into your computer and open that file up with your favourite markdown editor/viewer. All that assuming you get clearance to enter the DC (or even to know where it is), plug your USB drive into that server and copy files from it. &lt;/p&gt;

&lt;p&gt;OR.. &lt;/p&gt;

&lt;p&gt;or you could enter this post's HTTP URL in your browser's address bar and hit enter. The post will magically appear on your screen. But how does the browser know where the DataCentre is located? How does it travel there? Does a browser have a map of all the DCs?&lt;/p&gt;

&lt;p&gt;In this post I will be talking about the "map" part of the packet's journey. I'll try to observe how the right route on the map is chosen for the packet and how it is sent over to the destination. A computer networks' GPS system, if you will.&lt;/p&gt;

&lt;h2&gt;
  
  
  The route
&lt;/h2&gt;

&lt;p&gt;If we're planning a trip, naturally we need to get 3 things straight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;destination of our journey - WHERE we want to end up&lt;/li&gt;
&lt;li&gt;route from our current location to the destination - which way do we go&lt;/li&gt;
&lt;li&gt;means of transportation - on foot, car, plane, bike, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same is also the case in computer networks. If we want to send a packet somewhere, we need to know WHERE we are sending it, we need the ROUTE and we need the CARRIER (transportation).&lt;/p&gt;

&lt;p&gt;The carrier is usually the least important. The packet doesn't care whether it's travelling through the switch, router, LB, proxy or another hop.&lt;br&gt;
The destination is also not that difficult of a problem. Usually, we already know the address, e.g. google.com:443, 192.168.1.1:80, and so on. So the destination is any device that identifies itself with the mentioned coordinates.&lt;br&gt;
This raises a problem: if I set up my device to identify as google.com:443, how will you find the real google server, rather than my laptop? We need to know the right route to the real google server.&lt;/p&gt;

&lt;p&gt;The route in computer networks is a little different from a route from Paris to Hawaii. If you're planning a trip to Hawaii, you plan your route ahead and then follow it until you reach your destination. In CN you do not know the route in advance. Even if you did, computer networks are so wildly dynamic that the route can change many times in one second. So instead of planning ahead, a network packet travels from one hop to another. A hop knows which direction the destination is in and sends the packet to another hop that way, that hop does the same thing, and this hopping is repeated until either the destination is reached, or the packet gets too old and dies on the way. &lt;/p&gt;

&lt;p&gt;This fits our Paris-&amp;gt;Hawaii analogy well too. Imagine you don't have a map, a GPS, nor a compass. How do you get from Paris to Hawaii? You go to the nearest person and ask him that question. The person will point you to the nearest information centre, so you go there. The Info Centre will direct you to the airport and point a finger in the airport's direction. You walk in that direction until you meet someone and ask that person for the way from there. And you keep hopping through random people, Info centres, airports, harbours and other hops until you reach Hawaii. Or die along the way... whichever happens first.&lt;/p&gt;

&lt;p&gt;This is how packets travel in Computer Networks. Packets do not know the route in advance. But they know where to start, they know when they reach their destination, and they know who to ask where to find another hop, where they could ask for another hop, where they could ask for another hop, where..... You get the picture.&lt;/p&gt;

&lt;p&gt;Oh, and the death of the packet. Each packet has a concept of TTL (TimeToLive). It's a number, usually either 64, 128 or 256. This is how long the packet can live. The number is decremented at each hop. So if the initial TTL is 64, the packet can travel through 64 hops at most. By looking at the packet's TTL at any point along the way you can guess how many hops the packet has already passed through. I'm saying &lt;em&gt;guess&lt;/em&gt; because if we're observing the packet in transit, we do not know what the initial TTL was. And if the packet dies, the hop it died in calls the packet's home and informs us of its death. In CN terms - the hop sends an ICMP packet to the packet's source address informing it of an expired TTL. This might come in handy when investigating network issues. (I probably should say "might send the ICMP packet" rather than "sends the ICMP packet", as ICMP is often blocked in networks).&lt;/p&gt;
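&lt;p&gt;The TTL mechanics can be sketched like this (hop names and the simplified return values are made up for illustration; a real router would also emit the ICMP message mentioned above):&lt;/p&gt;

```python
def forward(path, ttl=64):
    """Walk a packet through a list of hops, decrementing TTL at each.

    Returns ("delivered", hop_count) on success, or
    ("ttl expired", hop) naming the hop where the packet died.
    """
    for hop in path:
        ttl -= 1
        if ttl == 0:
            return ("ttl expired", hop)
    return ("delivered", len(path))

short_trip = forward(["isp", "backbone", "github"], ttl=64)  # delivered
# a 100-hop path (e.g. a routing loop) kills the packet at hop 64:
looped = forward(["r%d" % i for i in range(100)], ttl=64)
```
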

&lt;p&gt;The route could be pictured as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__________     _______     _______               _______     _______________
| Laptop | --&amp;gt; | Hop | --&amp;gt; | Hop | --&amp;gt;  ...  --&amp;gt; | Hop | --&amp;gt; | Destination |
----------     -------     -------               -------     ---------------
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The router - "the hop"
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;hop&lt;/em&gt; is a router. It picks the next hop on the route and sends the packet there. A router is a smart network device, capable of operating with both Ethernet (OSI-2) and the IP protocol (OSI-3). Usually, routers are smart enough to also handle higher-level protocols, like TCP/UDP (OSI-4); some might be capable of dealing with protocols above OSI-4, but a home user will hardly ever have any business with such routers.&lt;/p&gt;

&lt;p&gt;A router accepts ethernet frames that have their destination set to either ff:ff:ff:ff:ff:ff (broadcast, meaning the frame has been sent to all the devices in the network), or that router's MAC address. You probably don't want to broadcast your HTTP packets with your Facebook login information, so you must know the router's MAC address. &lt;/p&gt;

&lt;p&gt;Once the frame reaches the router, an IP packet is extracted from the frame's Data section. The router looks at the destination IP address and tries to figure out the next hop. To do that, the router has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1. verify whether that IP address is a part of that router's network, i.e. whether both the router and the ip.dst_addr are in the same network. &lt;/li&gt;
&lt;li&gt;2. If the ip.dst_addr is in the same network:

&lt;ul&gt;
&lt;li&gt;2.1. look that IP up in the ARP table to find out which device has that IP assigned&lt;/li&gt;
&lt;li&gt;2.2. if the ARP record is found:

&lt;ul&gt;
&lt;li&gt;2.2.1. rewrite the Ethernet Frame's eth.src_addr with the router's MAC address&lt;/li&gt;
&lt;li&gt;2.2.2. rewrite the Ethernet Frame's eth.dst_addr with the MAC found in the ARP table&lt;/li&gt;
&lt;li&gt;2.2.3. send the modified frame to the destination device through the right physical network port&lt;/li&gt;
&lt;li&gt;2.2.4. &lt;strong&gt;DONE! The packet has been delivered&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;2.3. if the ARP record is not found:

&lt;ul&gt;
&lt;li&gt;2.3.1. if ARP discovery is enabled:

&lt;ul&gt;
&lt;li&gt;2.3.1.1. broadcast (frame.dst_addr=ff:ff:ff:ff:ff:ff) an ARP request, asking for the owner of that IP address to raise a hand by replying with its MAC address (basically to say "yes, my IP address is that and my address is &lt;em&gt;this&lt;/em&gt;")&lt;/li&gt;
&lt;li&gt;2.3.1.2. if any IP owners raise hands:

&lt;ul&gt;
&lt;li&gt;2.3.1.2.1. update the ARP table&lt;/li&gt;
&lt;li&gt;2.3.1.2.2. &lt;strong&gt;go to 2.2.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;2.3.1.3. if no device identifies with this address

&lt;ul&gt;
&lt;li&gt;2.3.1.3.1. extract the ip.src_addr from the IP packet&lt;/li&gt;
&lt;li&gt;2.3.1.3.2. send an ICMP response back to the sender saying that the destination is unreachable&lt;/li&gt;
&lt;li&gt;2.3.1.3.3. &lt;strong&gt;FAIL and drop the packet&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;2.3.2. if ARP discovery is disabled (static ARP table only):

&lt;ul&gt;
&lt;li&gt;2.3.2.1. &lt;strong&gt;go to 2.3.1.3.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;3. if the ip.dst_addr is not in the same network:

&lt;ul&gt;
&lt;li&gt;3.1. look at the router's routing table and try to determine which route matches that IP address&lt;/li&gt;
&lt;li&gt;3.2. if the route is found:

&lt;ul&gt;
&lt;li&gt;3.2.1. rewrite frame.src_addr with router's MAC address&lt;/li&gt;
&lt;li&gt;3.2.2. forward the frame to the device configured in that routing rule&lt;/li&gt;
&lt;li&gt;3.2.3. &lt;strong&gt;DONE! The next hop goes to 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;3.3. if the routing rule is not found:

&lt;ul&gt;
&lt;li&gt;3.3.1. if the router has a gateway configured (default route):

&lt;ul&gt;
&lt;li&gt;3.3.1.1. take the gateway's MAC address from the ARP table&lt;/li&gt;
&lt;li&gt;3.3.1.2. rewrite frame.src_addr with the router's MAC address&lt;/li&gt;
&lt;li&gt;3.3.1.3. rewrite frame.dst_addr with that MAC address&lt;/li&gt;
&lt;li&gt;3.3.1.4. send the frame to the router's gateway&lt;/li&gt;
&lt;li&gt;3.3.1.5. &lt;strong&gt;DONE! The gateway goes to 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;3.3.2. if the router has no gateway to forward the packet to

&lt;ul&gt;
&lt;li&gt;3.3.2.1. &lt;strong&gt;go to 2.3.1.3.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above is the core rule for any router out there. Step 2.3.1 may be disabled by network specialists if the router is meant to rely only on a static ARP table.&lt;/p&gt;
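&lt;p&gt;The whole decision tree can be condensed into a short Python sketch. This is a toy model - the dict-based router, the field names and the &lt;code&gt;arp_discover&lt;/code&gt; helper are all hypothetical - but it follows the numbered steps above:&lt;/p&gt;

```python
import ipaddress

def route_frame(frame, router):
    """Toy version of the routing steps above: deliver locally or forward."""
    dst = ipaddress.ip_address(frame["ip"]["dst_addr"])
    if dst in router["network"]:                      # step 2: same network?
        mac = router["arp_table"].get(str(dst))       # 2.1: ARP table lookup
        if mac is None and router["arp_discovery"]:
            mac = arp_discover(router, str(dst))      # 2.3.1: broadcast who-has
        if mac is None:                               # 2.3.1.3 / 2.3.2: nobody answered
            return ("icmp-unreachable", frame["ip"]["src_addr"])
        frame["eth"]["src_addr"] = router["mac"]      # 2.2.1: rewrite source MAC
        frame["eth"]["dst_addr"] = mac                # 2.2.2: rewrite destination MAC
        return ("delivered", str(dst))                # 2.2.4: done
    for net, next_hop in router["routes"]:            # 3.1: first matching route
        if dst in net:
            frame["eth"]["src_addr"] = router["mac"]  # 3.2.1
            return ("forwarded", next_hop)            # 3.2.3: next hop starts over at 1
    if router.get("gateway"):                         # 3.3.1: fall back to the gateway
        frame["eth"]["src_addr"] = router["mac"]
        frame["eth"]["dst_addr"] = router["arp_table"][router["gateway"]]
        return ("forwarded", router["gateway"])
    return ("icmp-unreachable", frame["ip"]["src_addr"])  # 3.3.2: no route, drop

def arp_discover(router, ip):
    """Stand-in for a real ARP broadcast: 'hears' replies from a canned dict."""
    reply = router.get("wire_replies", {}).get(ip)
    if reply:
        router["arp_table"][ip] = reply               # 2.3.1.2.1: update the ARP table
    return reply
```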

&lt;h2&gt;
  
  
  Gateway
&lt;/h2&gt;

&lt;p&gt;A gateway is nothing but another router. Pretty much any router can be a gateway. Routers usually have their own gateways configured, but some might not have any. A network can function perfectly well without a gateway - a network router would be enough (and in that case, a simple switch might also suffice).&lt;/p&gt;

&lt;p&gt;Consider this network&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      ... [internet?]
                          ↑ ?
                     | Router C |
                         ↑  ↑  192.168.0.1
       __________________|  |_________________
       | 192.168.0.2                         | 192.168.0.3
  | Router A |                          | Router B |
       ↑ 192.168.1.1                      ↑  ↑  ↑ 192.168.2.1
       |                  ________________|  |  |_________________
       | 192.168.1.2      | 192.168.2.2      | 192.168.2.3       | 192.168.2.4
| Application |         | PC |           | Laptop |          | Smart TV |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks like a person is hosting an application in his home network. There are 3 networks depicted: netA, netB and netC. NetC has 3 members: all 3 routers. NetA only has 2 members: RouterA and the Application server. And netB has 4 members: RouterB, PC, Laptop and SmartTV. &lt;/p&gt;

&lt;p&gt;In netA the gateway is RouterA.&lt;br&gt;
In netB the gateway is RouterB.&lt;br&gt;
In netC the gateway is RouterC.&lt;/p&gt;

&lt;p&gt;A gateway is a router positioned at a special place in the topology. Simply put, a gateway is the middle-man between networks. It oversees its own network and can delegate packets to its own gateway if the addressee is not in the same network. In a home network, the gateway is the thing that gates your access to the internet and also hides your network from the internet. In industrial networks the same sentence applies - just replace "internet" with "higher-level network".&lt;/p&gt;

&lt;p&gt;Let's analyze this as a set of examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PC sends a message to the RouterB. They are both in the same local network, they are even connected directly to each other. PC does not need a gateway to reach RouterB - it already knows how to reach it.&lt;/li&gt;
&lt;li&gt;PC sends a message to the Laptop. While they are both on the same LAN, they are not wired directly together, so the PC's frames have to pass through RouterB, which forwards them on to the Laptop.&lt;/li&gt;
&lt;li&gt;PC sends a message to the Application. 192.168.2.2 and 192.168.1.2 are on different networks, so the PC needs a gate to the other network. Gateway 192.168.2.1 (RouterB) will relay that message to its parent network - to RouterC. RouterC doesn't know anything about the 192.168.1.0 network, so it will also send that message to its gateway. If the gateway above RouterC is your ISP, chances are it rejects all requests with ip.src_addr or ip.dst_addr in the private IP address ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), so eventually you will get a Network Unreachable ICMP response. If, however, RouterC's gateway is another router in a home network, it might know where 192.168.1.2 is - but in that case it will most likely be a different device than we expected, i.e. not the Application server from our example topology.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application network is isolated from the home appliances' network, so the PC cannot send packets to the Application and vice versa. If we wanted to make the Application accessible from the appliances' network, we should configure RouterC and add a custom rule to the routing table, which should say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if the ip.dst_addr is in a subnet 192.168.1.0/24, route that traffic to the hop with ip 192.168.0.2&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or a more strict rule&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if the ip.dst_addr is in a subnet 192.168.1.2/32, route that traffic to the hop with ip 192.168.0.2&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That would be quite enough for a NAT-enabled router. However, if we don't have NATting available, such a setup would only let traffic flow one way, from the PC to the Application - the Application would not be able to reply (so even a TCP handshake could never complete). Why? There's no route for the Application to reach the PC! Let's add another route:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if the ip.dst_addr is in a subnet 192.168.2.2/32, route that traffic to the hop with ip 192.168.0.3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now the IP packets that originate in the Application server and are sent to the PC IP will be relayed to the router that manages the network the PC is in.&lt;/p&gt;
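&lt;p&gt;The two quoted rules can be sanity-checked with the stdlib &lt;code&gt;ipaddress&lt;/code&gt; module (a sketch of RouterC's table, not real router config):&lt;/p&gt;

```python
import ipaddress

# Hypothetical RouterC table after adding the two rules quoted above:
routerc_routes = [
    (ipaddress.ip_network("192.168.1.0/24"), "192.168.0.2"),  # netA -> RouterA
    (ipaddress.ip_network("192.168.2.2/32"), "192.168.0.3"),  # the PC -> RouterB
]

def next_hop(dst, routes):
    ip = ipaddress.ip_address(dst)
    for net, hop in routes:
        if ip in net:
            return hop
    return None  # no rule and no default route: the packet is dropped

# Both directions now have a path through RouterC:
assert next_hop("192.168.1.2", routerc_routes) == "192.168.0.2"  # PC -> Application
assert next_hop("192.168.2.2", routerc_routes) == "192.168.0.3"  # Application -> PC
```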

&lt;p&gt;Why do we need this? We wouldn't if we were communicating over UDP. UDP doesn't have any echo. TCP and ICMP, on the other hand, do. "Echo" in this context means that a message sent in one direction causes another message to be sent back. TCP is a verbose protocol and involves lots, LOTS of message exchanges back and forth. It requires a bidirectional communication channel just to open a connection, not to mention the actual data exchange.&lt;/p&gt;

&lt;p&gt;In this case the PC's packet, once it reaches RouterC, will be routed to RouterA. And we know RouterA knows how to find 192.168.1.2. We have now successfully joined two isolated networks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Routing table
&lt;/h2&gt;

&lt;p&gt;A routing table is a set of rules instructing a router on how to behave when an IP packet is received. Routing rules can be as simple as "route all packets addressed to B to C" or as complex as "route all packets from subnet A sent to host B to host C via interface eth3 with a priority of 100 and a cost of 17", or even more complex. Normally, simple rules are perfectly sufficient.&lt;br&gt;
A routing table usually has a Default Route. It's a fallback route, used when no other routing rule matches the packet. The default route has its destination set to 0.0.0.0 with a 0.0.0.0 mask, meaning it catches all the traffic that's left after traversing the routing table.&lt;/p&gt;

&lt;p&gt;Multiple rules can match the same packet - that is perfectly legal. In fact, one will hardly ever find a device with just one rule matching a packet, because the default route matches ALL packets, while other rules match just a subset of them. The narrower the rule, the higher its priority in the table. For example, consider 3 rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.2.1     0.0.0.0         UG    100    0        0 enp6s0
192.168.2.3     0.0.0.0         255.255.255.255 U       0    0        0 enp6s0
192.168.2.0     0.0.0.0         255.255.255.0   U       0    0        0 enp6s0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we were sending a message to 192.168.2.3, the second rule would be picked, because its mask is the strictest, i.e. it matches only 1 host. The third rule matches the whole subnet, while the first matches all IPs, which makes them looser than the second, so their priority is lower. Think of this priority mechanism as a chain of fallbacks: the most specific matching rule wins, and looser rules are tried only when the narrower ones don't match.&lt;/p&gt;
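&lt;p&gt;This "strictest mask wins" behaviour is longest-prefix matching, which is easy to reproduce in Python (a sketch over the 3 rules from the table above):&lt;/p&gt;

```python
import ipaddress

# The three rules above, as (destination network, gateway) pairs.
rules = [
    (ipaddress.ip_network("0.0.0.0/0"),      "192.168.2.1"),  # default route
    (ipaddress.ip_network("192.168.2.3/32"), "link"),         # exactly one host
    (ipaddress.ip_network("192.168.2.0/24"), "link"),         # the whole subnet
]

def lookup(dst):
    ip = ipaddress.ip_address(dst)
    # Try the narrowest (longest) prefix first; fall back to looser rules.
    for net, via in sorted(rules, key=lambda r: r[0].prefixlen, reverse=True):
        if ip in net:
            return net, via
    raise LookupError("empty routing table")

assert lookup("192.168.2.3")[0].prefixlen == 32  # the host rule wins
assert lookup("192.168.2.7")[0].prefixlen == 24  # the subnet rule
assert lookup("8.8.8.8")[1] == "192.168.2.1"     # default route catches the rest
```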

&lt;h2&gt;
  
  
  NAT router
&lt;/h2&gt;

&lt;p&gt;Normally routers do not alter packets much - during routing they only rewrite up to 2 fields of each frame, frame.src_addr and frame.dst_addr (both OSI-2 fields), and decrement the IP TTL. However, in larger and/or more isolated networks this might not be enough. Imagine if you sent this packet to google.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frame
  src_addr: aa:aa:aa:aa:aa:aa {ignore}
  dst_addr: bb:bb:bb:bb:bb:bb {ignore}
  ip_packet
    src_addr: 192.168.2.3
    dst_addr: 216.58.208.206
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AND I sent the same packet, and I happen to have the same LAN IP address as you: 192.168.2.3.&lt;br&gt;
Who would get the response? You or I? Or, perhaps, some server in google.com's LAN that happens to have the same IP as we do?&lt;/p&gt;

&lt;p&gt;This is wrong, confusing and susceptible to errors. A better approach is to set ip.src_addr to some IP address that google.com can identify uniquely, without any chance of confusion. This is where the separation between private and public IP addresses comes in handy.&lt;/p&gt;

&lt;p&gt;We know our ISP assigns us an IP address. Some get a static one (the same IP forever), others a dynamic one (the IP changes over time). But at any point in time, each public IP address is unique. If you have a NAT router (which you probably do), all the packets that originate in the LAN and leave the router through the WAN port, i.e. all the packets sent to the internet, get ip.src_addr set to your public IP address. This way, even though your LAN IP address is 192.168.2.3, your packets sent to google.com will be sent as if from 35.36.37.38 (or whatever your public IP address is). But there's a caveat. If google.com wanted to send you a packet, it would have to use 35.36.37.38 as the ip.dst_addr. NAT effectively hides your LAN IP address and allows you to be uniquely identified by the target. Any packets sent to your external IP address will arrive at your router rather than your Laptop/PC/SmartTV. &lt;/p&gt;

&lt;p&gt;This applies to new connections originating from the internet; it does not apply to connections made from within the LAN to the internet, i.e. if you connect to google.com and send an HTTP request - you will still receive a response. This is all thanks to connection tracking built into your router.&lt;/p&gt;
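&lt;p&gt;Conceptually, source NAT plus connection tracking boils down to two table operations. A toy sketch (hypothetical packet dicts, no real networking):&lt;/p&gt;

```python
PUBLIC_IP = "35.36.37.38"  # whatever the router's WAN address happens to be

# conntrack: (proto, remote ip, remote port, local port) -> private LAN IP
conntrack = {}

def nat_outbound(pkt):
    """LAN -> internet: hide the private source IP, remember the connection."""
    key = (pkt["proto"], pkt["dst_ip"], pkt["dst_port"], pkt["src_port"])
    conntrack[key] = pkt["src_ip"]
    pkt["src_ip"] = PUBLIC_IP
    return pkt

def nat_inbound(pkt):
    """internet -> LAN: only packets of a tracked connection get back in."""
    key = (pkt["proto"], pkt["src_ip"], pkt["src_port"], pkt["dst_port"])
    if key not in conntrack:
        return None          # unsolicited packet: no tracked connection, drop it
    pkt["dst_ip"] = conntrack[key]
    return pkt
```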
&lt;h2&gt;
  
  
  PAT router
&lt;/h2&gt;

&lt;p&gt;Just like NAT, PAT (Port Address Translation) also does some translation between public and local spaces.&lt;/p&gt;

&lt;p&gt;Consider sockets. A socket is a triplet of &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;protocol (tcp/udp/icmp/...)&lt;/li&gt;
&lt;li&gt;IP address&lt;/li&gt;
&lt;li&gt;port&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A socket represents either end of a connection: local or remote. A socket is a cup in a cups-and-string telephone. You speak into one cup (socket), the data is transferred through the wire/string, and it's emitted through the cup (socket) on the other end. In a connection, you have one cup and google has the other. Each socket on either device must have a unique triplet. That is, if you're sending an HTTP request to google, your Laptop will create a socket [216.58.208.206:443:tcp], and the google server will create a socket [{your_external_ip}:{your_port_number}:tcp]. NAT takes care of the {your_external_ip} part. But what about the {your_port_number}?&lt;/p&gt;

&lt;p&gt;When you create a TCP socket for HTTPS communication, your OS (Windows/Linux/macOS/other) will use port 443 as the remote port number and pick a random number between 0 and 65535 (in reality the range is a lot narrower, typically 49152 to 65535) as the local port. This says "I will be sending data to port 443 and expecting to receive data at port {random_number}".&lt;/p&gt;
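&lt;p&gt;You can watch the OS do this assignment with two loopback sockets: bind and connect with port 0, and the kernel picks a free ephemeral port for you:&lt;/p&gt;

```python
import socket

# A listener on 127.0.0.1; port 0 means "pick any free port for me".
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
server_port = listener.getsockname()[1]

# Connecting without binding first: the OS assigns the local (ephemeral) port.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", server_port))
local_ip, local_port = client.getsockname()
print(f"talking to :{server_port}, listening for replies on :{local_port}")

client.close()
listener.close()
```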

&lt;p&gt;Consider the same topology in the ASCII picture above. You have a PC, Laptop and SmartTV governed by RouterB. If you are sending a packet to google from your Laptop, RouterB might apply NAT rules and rewrite your packet's ip.src_addr to your router's external IP address, leaving the ip.src_port your laptop set untouched. Then it will forward your request to RouterC, which, if it's NAT-enabled, will do the same thing. And this continues on and on until your packet reaches the google server you were aiming for. The remote server creates a socket using TCP as the protocol, your public IP address as the IP address and your Laptop's assigned private (local) port number as the port number. That's all fine while you only use the laptop for communications with google. You might not even notice anything wrong if you used all 3 devices simultaneously: laptop, PC and SmartTV. However, what would happen if you had 70'000 devices in your network, sharing the same public IP address? They would all NAT to the same public IP address. And they would all generate private port numbers in the range [0;65535]. With 70'000 devices, naturally, some of those private port numbers will overlap. And this introduces a problem - several devices will be using the same protocol, IP address and port number to connect to google.com. The google server (or a firewall or LB or any other device in the way) will get confused: "hey, I already have this triplet in an established connection. I cannot register another one identical to it! ECONNREFUSED"&lt;/p&gt;

&lt;p&gt;The problem is that several devices in the same network might generate the same random private port number. The larger the private network, the higher the probability of private port number collisions. This might be a problem in your own network as well, not just at the remote server: if your internal routers are NATting and governing large networks, your packets might run into a collision in any of your network devices too - routers, firewalls, etc. - because your network devices also track connections (sockets) internally to maintain conversations.&lt;/p&gt;
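&lt;p&gt;How likely are such collisions? A quick birthday-problem estimate (assuming each device holds a single connection with a uniformly random port from the common 49152-65535 ephemeral range, i.e. 16384 values):&lt;/p&gt;

```python
import math

EPHEMERAL_PORTS = 16384  # ports 49152..65535

def collision_probability(devices):
    """P(at least two devices picked the same ephemeral source port)."""
    # Sum of log(1 - i/N) avoids underflow that a plain product would hit.
    log_p_all_unique = sum(math.log1p(-i / EPHEMERAL_PORTS) for i in range(devices))
    return 1.0 - math.exp(log_p_all_unique)
```

&lt;p&gt;Under those assumptions, around 150 devices already give roughly a coin-flip chance of at least one collision, and at 500 devices a collision is practically certain - long before anything close to 70'000.&lt;/p&gt;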

&lt;p&gt;A solution is PAT. A PAT-capable router will generate its own private port number for each connection originating in your network, rewrite the TCP/UDP packet's src_port with it and send the packet out. It will also keep a mapping between the original port number and the newly generated one, so that incoming packets of the same connection get their tcp.dst_port rewritten back accordingly. This way, even if devices in your network happen to generate the same private port number, the outer world will not see collisions, because the PAT router makes sure the port numbers it emits are unique.&lt;/p&gt;
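&lt;p&gt;A toy model of that bookkeeping (hypothetical packet dicts; a real PAT implementation also keys on the remote end and times entries out):&lt;/p&gt;

```python
import itertools

PUBLIC_IP = "35.36.37.38"
_port_pool = itertools.count(20000)  # hypothetical router-owned port range

# pat_table: router-side port -> (private IP, original private port)
pat_table = {}

def pat_outbound(pkt):
    """Replace the LAN host's source port with one the router owns."""
    router_port = next(_port_pool)
    pat_table[router_port] = (pkt["src_ip"], pkt["src_port"])
    pkt["src_ip"], pkt["src_port"] = PUBLIC_IP, router_port
    return pkt

def pat_inbound(pkt):
    """Map the router-side port back to the original private socket."""
    pkt["dst_ip"], pkt["dst_port"] = pat_table[pkt["dst_port"]]
    return pkt

# Two devices that picked the SAME private port no longer clash outside:
a = pat_outbound({"src_ip": "192.168.2.2", "src_port": 50000, "dst_ip": "g", "dst_port": 443})
b = pat_outbound({"src_ip": "192.168.2.3", "src_port": 50000, "dst_ip": "g", "dst_port": 443})
assert a["src_port"] != b["src_port"]
```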
&lt;h2&gt;
  
  
  Port Forwarding
&lt;/h2&gt;

&lt;p&gt;Now, what if you want to allow google.com to connect to your Laptop or SmartTV? What if you are having trouble using GMail and you call google and ask a technician to connect to your Laptop using RDP (Remote Desktop (Windows))? You will give them your public IP address - that's for sure. But how does your external router know that it should redirect the google technician's RDP request to your laptop, and not to your SmartTV or PC - or, perhaps, drop the packet? This is where the OSI-4 features of a router come in handy.&lt;/p&gt;

&lt;p&gt;Most routers can also operate at OSI-4, i.e. they can interpret TCP, UDP and other OSI-4 packets (segments). TCP and UDP are wrapped inside IP packets, and they too have protocol-specific identifiers. Picture it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Identifiers:
- Ethernet: src_mac, dst_mac
- IP:       src_ip,  dst_ip
- TCP/UDP:  src_port, dst_port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows a high level of multiplexing, where higher OSI layers grant us even more contexts within the same TCP/UDP communication. It also allows more fine-grained routing to be set up. In this case, we could set up our router so that it accepts traffic at 35.36.37.38:3389/tcp and forwards it to 192.168.2.3, which is your Laptop. This way, a google support person would connect to 35.36.37.38:3389/tcp and your router would know that this traffic is meant for your Laptop. This is called &lt;strong&gt;Port Forwarding&lt;/strong&gt;, a form of &lt;strong&gt;Port Address Translation&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Port Forwarding even allows us to remap port numbers. Suppose you are running an Nginx server on your laptop at 80/tcp, and you want to show your website to your friend remotely. You are in different home networks, so you set up a rule in your router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From port      To port      Local address      Protocol
80             80           192.168.2.3        TCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is all standard and simple, no remapping is being done. Simply all the 80/tcp traffic to your public IP address is forwarded to your laptop's LAN IP.&lt;/p&gt;

&lt;p&gt;Now imagine you have another version of that website, just on your PC, also running on Nginx at 80/tcp. The router is already instructed to route all the :80/tcp traffic to your laptop - how do you also tell it to route other traffic to your PC? It's quite simple: you remap the port number. You add another rule, and now your port forwarding table looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From port      To port      Local address      Protocol
80             80           192.168.2.3        TCP
81             80           192.168.2.2        TCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that your friend can go to 35.36.37.38:80/tcp and be forwarded to your laptop (192.168.2.3:80/tcp), or go to 35.36.37.38:81/tcp and be forwarded to your PC (192.168.2.2:80/tcp). You are remapping external (From) and internal (To) port numbers.&lt;/p&gt;
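&lt;p&gt;Router-side, the table above is just a lookup keyed by public port and protocol. A minimal sketch:&lt;/p&gt;

```python
# The two rules above: (public "From" port, protocol) -> (LAN address, "To" port)
forwards = {
    (80, "tcp"): ("192.168.2.3", 80),  # the laptop's site
    (81, "tcp"): ("192.168.2.2", 80),  # the PC's site, remapped 81 -> 80
}

def forward_inbound(pkt):
    """Rewrite an incoming packet according to the port forwarding table."""
    rule = forwards.get((pkt["dst_port"], pkt["proto"]))
    if rule is None:
        return None  # no rule: the router drops (or handles) the packet itself
    pkt["dst_ip"], pkt["dst_port"] = rule
    return pkt
```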

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lartc.org/howto/lartc.rpdb.html"&gt;https://lartc.org/howto/lartc.rpdb.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wiki.wlug.org.nz/SourceBasedRouting"&gt;http://wiki.wlug.org.nz/SourceBasedRouting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.coursehero.com/file/p12iiks/returns-to-the-router-it-uses-the-connection-tracking-data-it-stored-during-the/"&gt;https://www.coursehero.com/file/p12iiks/returns-to-the-router-it-uses-the-connection-tracking-data-it-stored-during-the/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.wikiwand.com/en/Default_gateway"&gt;https://www.wikiwand.com/en/Default_gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ciscozine.com/nat-and-pat-a-complete-explanation/"&gt;https://www.ciscozine.com/nat-and-pat-a-complete-explanation/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.sciencedirect.com/topics/computer-science/registered-port#:%7E:text=Port%20numbers%20can%20run%20from,application%20processes%20on%20other%20hosts"&gt;https://www.sciencedirect.com/topics/computer-science/registered-port#:~:text=Port%20numbers%20can%20run%20from,application%20processes%20on%20other%20hosts&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Written with &lt;a href="https://stackedit.io/"&gt;StackEdit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>networks</category>
      <category>basics</category>
      <category>routing</category>
    </item>
  </channel>
</rss>
