Forem: Anmol Sarma

File Creation Time in Linux

Anmol Sarma — Sun, 23 Jun 2019 13:54:22 +0000

The stat utility can be used to retrieve the Unix file timestamps, namely atime, ctime and mtime. Of these, the benefit of mtime, which records the last time the file was modified, is immediately apparent. On the other hand, atime¹, which records the last time the file was accessed, has been called “perhaps the most stupid Unix design idea of all times”. Intuitively, one might expect ctime to record the creation time of a file. However, ctime records the last time the metadata of a file was changed.

Typically, Unices do not record file creation times. While some individual filesystems do record file creation times², until recently Linux lacked a common interface to actually expose them to userspace applications. As a result, the output of stat (GNU coreutils v8.30) on an ext4 filesystem (which does record creation times) looks something like this:

$ stat .
  File: .
  Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 803h/2051d Inode: 3588416 Links: 18
Access: (0775/drwxrwxr-x) Uid: ( 1000/ anmol) Gid: ( 1000/ anmol)
Access: 2019-06-23 10:49:04.056933574 +0000
Modify: 2019-05-19 13:29:59.609167627 +0000
Change: 2019-05-19 13:29:59.609167627 +0000
 Birth: -

The “Birth” field, meant to show the creation time, sports a depressing “-”.

The fact that ctime does not mean creation time but change time coupled with the absence of a real creation time interface does lead to quite a bit of confusion. The confusion seems so pervasive that the msdos driver in the Linux kernel happily clobbers the FAT creation time with the Unix change time!

The limitations of the current stat() system call have been known for some time. A new system call providing extended attributes was first proposed in 2010 with the new statx() interface finally being merged into Linux 4.11 in 2017. It took so long at least in part because kernel developers quickly ran into one of the hardest problems in Computer Science: naming things. Because there was no standard to guide them, each filesystem took to calling creation time by a different name. Ext4 and XFS called it crtime while Btrfs and JFS called it otime. Implementations also have slightly different semantics with JFS storing creation time only with the precision of seconds.

Glibc took a while to add a wrapper for statx() with support landing in version 2.28 which was released in 2018. Fast forward to March 2019 when GNU coreutils 8.31 was released with stat finally gaining support for reading the file creation time:

$ stat .
  File: .
  Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 803h/2051d Inode: 3588416 Links: 18
Access: (0775/drwxrwxr-x) Uid: ( 1000/ anmol) Gid: ( 1000/ anmol)
Access: 2019-06-23 10:49:04.056933574 +0000
Modify: 2019-05-19 13:29:59.609167627 +0000
Change: 2019-05-19 13:29:59.609167627 +0000
 Birth: 2019-05-19 13:13:50.100925514 +0000

¹ The impact of atime on disk performance is mitigated by the use of relatime on modern Linux systems.
² For ext4, one can get the crtime of a file using the stat subcommand of the confusingly named debugfs utility.

Network Redirections in Bash

Anmol Sarma — Sat, 04 May 2019 15:29:15 +0000

A few months ago, while reading the man page for recvmmsg(), I came across this snippet:

$ while true; do echo $RANDOM > /dev/udp/127.0.0.1/1234;
     sleep 0.25; done

And as advertised, it sends a UDP datagram containing a random number to port 1234 every 250 ms. I didn’t recall ever seeing a /dev/udp and so was a bit surprised that it worked. And as it happens, ls was not able to access the file that I had just written to:

ls: cannot access '/dev/udp/127.0.0.1/1234': No such file or directory

Puzzled and intrigued, I echoed Foo Bar Baz to /dev/udp/127.0.0.1/1337 and reached for strace:

...
12423 socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
12423 connect(4, {sa_family=AF_INET, sin_port=htons(1337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
12423 fcntl(1, F_GETFD) = 0
12423 fcntl(1, F_DUPFD, 10) = 10
12423 fcntl(1, F_GETFD) = 0
12423 fcntl(10, F_SETFD, FD_CLOEXEC) = 0
12423 dup2(4, 1) = 1
12423 close(4) = 0
12423 fstat(1, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
12423 write(1, "Foo Bar Baz\n", 12) = 12
...

Seemingly, a normal UDP socket was being created and written to using the regular syscall interface. That refuted my initial suspicion that some kind of a special file backed by the kernel was involved. But who was actually creating the socket?

A peek at Bash’s code answered that question:

redir.c:

/* A list of pattern/value pairs for filenames that the redirection
   code handles specially. */
static STRING_INT_ALIST _redir_special_filenames[] = {
#if !defined (HAVE_DEV_FD)
  { "/dev/fd/[0-9]*", RF_DEVFD },
#endif
#if !defined (HAVE_DEV_STDIN)
  { "/dev/stderr", RF_DEVSTDERR },
  { "/dev/stdin", RF_DEVSTDIN },
  { "/dev/stdout", RF_DEVSTDOUT },
#endif
#if defined (NETWORK_REDIRECTIONS)
  { "/dev/tcp/*/*", RF_DEVTCP },
  { "/dev/udp/*/*", RF_DEVUDP },
#endif
  { (char *)NULL, -1 }
};

So, redirection involving /dev/udp/ is handled specially by Bash¹ and it uses the BSD Sockets API to create a socket:

lib/sh/netopen.c:

/*
 * Open a TCP or UDP connection to HOST on port SERV. Uses the
 * traditional BSD mechanisms. Returns the connected socket or -1 on error.
 */
static int
_netopen4(host, serv, typ)
     char *host, *serv;
     int typ;
{
  struct in_addr ina;
  struct sockaddr_in sin;
  unsigned short p;
  int s, e;

  if (_getaddr(host, &ina) == 0)
    {
      internal_error (_("%s: host unknown"), host);
      errno = EINVAL;
      return -1;
    }

  if (_getserv(serv, typ, &p) == 0)
    {
      internal_error(_("%s: invalid service"), serv);
      errno = EINVAL;
      return -1;
    }

  memset ((char *)&sin, 0, sizeof(sin));
  sin.sin_family = AF_INET;
  sin.sin_port = p;
  sin.sin_addr = ina;

  s = socket(AF_INET, (typ == 't') ? SOCK_STREAM : SOCK_DGRAM, 0);
  if (s < 0)
    {
      sys_error ("socket");
      return (-1);
    }

  if (connect (s, (struct sockaddr *)&sin, sizeof (sin)) < 0)
    {
      e = errno;
      sys_error("connect");
      close(s);
      errno = e;
      return (-1);
    }

  return(s);
}

Which means we can actually make HTTP requests using Bash:

exec 3<> /dev/tcp/checkip.amazonaws.com/80
printf "GET / HTTP/1.1\r\nHost: checkip.amazonaws.com\r\nConnection: close\r\n\r\n" >&3
tail -n1 <&3

No curl needed! /jk

Apart from Bash, in the versions and configurations packaged in Ubuntu 18.04, only ksh supports network redirections – ash, csh, dash, fish, and zsh do not.

I don’t think I will actually have any use for network redirections but this was a fun little rabbit hole to dive into.

NOTE: Code snippets from Bash are licensed under GPLv3, the snippet from the man page is licensed differently

¹ At least on Linux, the other special patterns handled by Bash like /dev/fd and /dev/stdin actually are special files backed by the kernel. The Bash manual notes that it may emulate them internally on platforms that do not support them.

Single-stepping through the Kernel

Anmol Sarma — Sun, 03 Feb 2019 13:27:45 +0000

There may come a time in a system programmer’s life when she needs to leave the civilized safety of the userland and confront the unspeakable horrors that dwell in the depths of the Kernel space. While higher beings might pour scorn on the very idea of a Kernel debugger, us lesser mortals may have no other recourse but to single-step through Kernel code when the rivers begin to run dry. This guide will help you do just that. We hope you never actually have to.

Ominous sounding intro-bait notwithstanding, setting up a virtual machine for Kernel debugging isn’t really that difficult. It only needs a bit of preparation. If you just want a copypasta, skip to the end. If you’re interested in the predicaments involved and how to deal with them, read on.

N.B.: “But which kernel are you talking about?”, some heathens may invariably ask when it is obvious that Kernel with a capital K refers to the One True Kernel.

Building the Kernel

Using a minimal Kernel configuration instead of the kitchen-sink one that distributions usually ship will make life a lot easier. You will first need to grab the source code for the Kernel you are interested in. We will use the latest Kernel release tarball from kernel.org, which at the time of writing is 4.20.6. Inside the extracted source directory, invoke the following:

make defconfig
make kvmconfig
make -j4

This will build a minimal Kernel image that can be booted in QEMU like so:

qemu-system-x86_64 -kernel linux-4.20.6/arch/x86/boot/bzImage

This should bring up an ancient-looking window with a cryptic error message:

You could try pasting the error message into ~~Google~~ a search engine, except for the fact that you can’t select the text in the window. And frankly, the window just looks annoying! So, ignoring the actual error for a moment, let’s try to get QEMU to print to the console instead of spawning a new graphical window:

qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage

QEMU spits out a single line:

qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]

Htop tells me QEMU is using 100% of a CPU and my laptop fan agrees. But there is no output whatsoever and Ctrl-c doesn’t work! What does work, however, is pressing Ctrl-a and then hitting x:

QEMU: Terminated

Turns out that by passing -nographic, we have unplugged QEMU’s virtual monitor. Now, to actually see any output, we need to tell the Kernel to write to a serial port:

qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage -append "console=ttyS0"

It worked! Now we can read the error message in all its glory:

[1.333008] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6
[1.334024] Please append a correct "root=" boot option; here are the available partitions:
[1.335152] 0b00 1048575 sr0 
[1.335153] driver: sr
[1.335996] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[1.337104] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.6 #1
[1.337901] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[1.339091] Call Trace:
[1.339437] dump_stack+0x46/0x5b
[1.339888] panic+0xf3/0x248
[1.340295] mount_block_root+0x184/0x248
[1.340838] ? set_debug_rodata+0xc/0xc
[1.341357] mount_root+0x121/0x13f
[1.341837] prepare_namespace+0x130/0x166
[1.342378] kernel_init_freeable+0x1ed/0x1ff
[1.342965] ? rest_init+0xb0/0xb0
[1.343427] kernel_init+0x5/0x100
[1.343888] ret_from_fork+0x35/0x40
[1.344526] Kernel Offset: 0x1200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[1.345956] ---[end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)]---

So, the Kernel didn’t find a root filesystem to kick off user mode and panicked. Let’s fix that by creating a root filesystem image.

Creating a Root Filesystem

Start by creating an empty image:

qemu-img create rootfs.img 1G

And then format it as ext4 and mount it:

mkfs.ext4 rootfs.img
mkdir mnt
sudo mount -o loop rootfs.img mnt/

Now we can populate it using debootstrap:

sudo debootstrap bionic mnt/

This will create a root filesystem based on Ubuntu 18.04 Bionic Beaver. Of course, feel free to replace bionic with any release that you prefer.

And unmount the filesystem once we’re done. This is important if you want to avoid corrupted images!

sudo umount mnt

Now boot the Kernel with our filesystem. We need to tell QEMU to use our image as a virtual hard drive and we also need to tell the Kernel to use the hard drive as the root filesystem:

qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage -hda rootfs.img -append "root=/dev/sda console=ttyS0"

This time the Kernel shouldn’t panic and you should eventually see a login prompt. We could have set up a user while creating the filesystem but it’s annoying to have to log in each time we boot up the VM. Let’s enable auto login as root instead.

Terminate QEMU (Ctrl-a, x), mount the filesystem image again and then create the configuration folder structure:

sudo mount -o loop rootfs.img mnt/
sudo mkdir -p mnt/etc/systemd/system/serial-getty@ttyS0.service.d

Add the following lines to mnt/etc/systemd/system/serial-getty@ttyS0.service.d/autologin.conf:

[Service]
ExecStart=
ExecStart=-/sbin/agetty --noissue --autologin root %I $TERM
Type=idle

Make sure to unmount the filesystem and then boot the Kernel again. This time you should be automatically logged in.

Gracefully shut down the VM:

halt -p

Attaching a debugger

Let’s rebuild the Kernel with debugging symbols enabled:

./scripts/config -e CONFIG_DEBUG_INFO
make -j4

Now, boot the Kernel again, this time passing the -s flag which will make QEMU listen on TCP port 1234:

qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage -hda rootfs.img -append "root=/dev/sda console=ttyS0" -s

Now, in another terminal start gdb and attach to QEMU:

gdb ./linux-4.20.6/vmlinux 
...
Reading symbols from ./linux-4.20.6/vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0xffffffff95a2f8f4 in ?? ()
(gdb)

You can set a breakpoint on a Kernel function, for instance do_sys_open():

(gdb) b do_sys_open 
Breakpoint 1 at 0xffffffff811b2720: file fs/open.c, line 1049.
(gdb) c
Continuing.

Now try opening a file in the VM, which should result in do_sys_open() getting invoked… And nothing happens?! The breakpoint in gdb is not hit. This is due to a Kernel security feature called KASLR, which loads the Kernel at a randomized address on every boot, so the addresses gdb reads from vmlinux no longer match the running code. KASLR can be disabled at boot time by adding nokaslr to the Kernel command line arguments. But let’s actually rebuild the Kernel without KASLR. While we are at it, let’s also disable loadable module support, which will save us the trouble of copying the modules to the filesystem.

./scripts/config -e CONFIG_DEBUG_INFO -d CONFIG_RANDOMIZE_BASE -d CONFIG_MODULES
make olddefconfig # Resolve dependencies
make -j4

Reboot the Kernel again, attach gdb, set a breakpoint on do_sys_open() and run cat /etc/issue in the guest. This time the breakpoint should be hit. But probably not where you expected:

Breakpoint 1, do_sys_open (dfd=-100, filename=0x7f96074ad428 "/etc/ld.so.cache", flags=557056, mode=0) at fs/open.c:1049
1049 {
(gdb) c
Continuing.

Breakpoint 1, do_sys_open (dfd=-100, filename=0x7f96076b5dd0 "/lib/x86_64-linux-gnu/libc.so.6", flags=557056, mode=0) at fs/open.c:1049
1049 {
(gdb) c
Continuing.

Breakpoint 1, do_sys_open (dfd=-100, filename=0x7ffe9e630e8e "/etc/issue", flags=32768, mode=0) at fs/open.c:1049
1049 {
(gdb)

Congratulations! From this point, you can single-step away to your heart’s content.

By default, the root filesystem is mounted read only. If you want to be able to write to it, add rw after root=/dev/sda in the Kernel parameters:

qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage -hda rootfs.img -append "root=/dev/sda rw console=ttyS0" -s

Bonus: Networking

You can create a point to point link between the QEMU VM and the host using a TAP interface.

First install tunctl and create a persistent TAP interface to avoid running QEMU as root:

sudo apt install uml-utilities
sudo tunctl -u $(id -u)
Set 'tap0' persistent and owned by uid 1000
sudo ip link set tap0 up

Now launch QEMU with a virtual e1000 interface connected to the host’s tap0 interface:

qemu-system-x86_64 -nographic -device e1000,netdev=net0 -netdev tap,id=net0,ifname=tap0 -kernel linux-4.20.6/arch/x86/boot/bzImage -hda rootfs.img -append "root=/dev/sda rw console=ttyS0" -s

Once the guest boots up, bring the network interface up:

ip link set enp0s3 up
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe12:3456/64 scope link 
       valid_lft forever preferred_lft forever

QEMU and the host can now communicate using their IPv6 Link-local addresses. After all, it is 2019.

Copypasta

# Building a minimal debuggable Kernel
make defconfig
make kvmconfig
./scripts/config -e CONFIG_DEBUG_INFO -d CONFIG_RANDOMIZE_BASE -d CONFIG_MODULES
make olddefconfig
make -j4

# Create root filesystem
qemu-img create rootfs.img 1G
mkfs.ext4 rootfs.img
mkdir mnt
sudo mount -o loop rootfs.img mnt/
sudo debootstrap bionic mnt/

sudo mkdir -p mnt/etc/systemd/system/serial-getty@ttyS0.service.d
# Add the following lines to mnt/etc/systemd/system/serial-getty@ttyS0.service.d/autologin.conf
# START
[Service]
ExecStart=
ExecStart=-/sbin/agetty --noissue --autologin root %I $TERM
Type=idle
# END

# Unmount the filesystem
sudo umount mnt

# Boot Kernel with root file system in QEMU
qemu-system-x86_64 -nographic -kernel linux-4.20.6/arch/x86/boot/bzImage -hda rootfs.img -append "root=/dev/sda rw console=ttyS0" -s

# Attach gdb
gdb ./linux-4.20.6/vmlinux 
(gdb) target remote :1234

DCCP: The socket type you probably never heard of

Anmol Sarma — Tue, 13 Dec 2016 17:40:50 +0000

TL;DR: DCCP is a relatively new transport layer protocol which draws from both TCP and UDP. Jump straight to the example C code.

Background

Historically, the majority of the traffic on the Internet has been over TCP which provides a reliable connection-oriented stream between two hosts. UDP has been mainly used by applications whose brief transfers would be unacceptably slowed by TCP’s connection establishment overhead or those for which timeliness is more important than reliability. However, the increasing use of UDP for applications such as internet telephony and streaming media which transfer a large amount of data can lead to significant network congestion. Since unlike TCP, UDP provides no inherent congestion control mechanism, an application can send UDP datagrams at a much higher rate than the available path capacity and cause congestion along the path. Increased congestion may lead to delays, packet loss and the degradation of the network’s quality of service.

Applications and protocols that choose to use UDP as their transport must, therefore, employ mechanisms to prevent congestion and to establish some degree of fairness with concurrent traffic so that the network remains usable. A prominent example of such a congestion control scheme is LEDBAT employed by BitTorrent. However, implementing a congestion control scheme is difficult, time-consuming and error-prone. Multiple non-standard implementations also make it difficult to reason about how applications would respond to network congestion. DCCP, the Datagram Congestion Control Protocol, is intended to mitigate this problem as a transport for unreliable datagrams with built-in congestion control.

From an application programmer’s perspective, DCCP differs from UDP by providing four additional features:

Explicit connection establishment between hosts
Selectable congestion control schemes
Path MTU discovery to avoid fragmentation
Service Codes for identifying applications

DCCP makes use of Explicit Congestion Notification but it is transparent to the application. DCCP is designed to leave additional functionality such as reliability or Forward Error Correction (FEC) to be layered on top, as and when required, rather than at the protocol level itself.

Explicit connection establishment

The connection establishment semantics of DCCP mirror those of TCP with a client that actively connects to a server that is passively listening on a port. DCCP connections are bidirectional. Logically, however, a DCCP connection consists of two separate unidirectional connections, called half-connections. Each half-connection is a one-way, unreliable datagram pipe. The rationale for this is explained in the next section.

Selectable congestion control schemes

TCP implements congestion control entirely transparently to the application. While it is possible to configure the host to use a specific variant, there is no way for the application to discover which congestion control scheme is in force, let alone negotiate one. DCCP, however, can cater to the different needs of applications by allowing applications to negotiate the congestion control schemes. In fact, each of the half-connections can use a different scheme, allowing for greater control.

Sending data faster than the slowest link between the endpoints can carry it will congest the network. This may lead to packet loss, leading to retransmissions which may, in turn, lead to further congestion. The solution to this problem is to start transmitting data at a slow rate on a new connection and to then ramp up the speed until packet loss is detected. The transmission rate may then be scaled back until no further packet loss occurs. The optimum speed at which to transfer data will change with network conditions over the life of the connection. Congestion control schemes differ in how packet loss is estimated and the rate at which the transmission speed is ramped up or scaled back. DCCP congestion control schemes are denoted by Congestion Control Identifiers - CCIDs. Currently, three CCIDs have been formally specified:

CCID 2 - TCP-like Congestion Control: A quick reacting scheme modelled after TCP which will rapidly ramp up speed to take advantage of available bandwidth and also rapidly scale back when congestion is detected. Suitable for applications that can handle large swings in transmission rates.

CCID 3 - TCP-Friendly Rate Control (TFRC): A slower reacting scheme intended to be friendly to concurrent TCP flows in the network. Provides a relatively smoother sending rate at the expense of possibly not utilising all available bandwidth. Suitable for media streaming applications that prefer to minimise abrupt changes in the sending rate.

CCID 4 - TCP-Friendly Rate Control for Small Packets (TFRC-SP): An experimental scheme for applications that use a small datagram size and those that change their sending rate by varying the datagram size.

In addition, the Linux kernel’s DCCP Test Tree contains an experimental implementation of a scheme modelled after TCP CUBIC. There is also a mode that disables congestion control altogether for UDP-like behaviour.

PMTU discovery

Data between two internet hosts is transmitted as a series of IP packets that pass through intermediate links. Each of these links has a maximum packet size or maximum transmission unit (MTU) that it can transmit without having to break it up into smaller fragments. The largest packet size that does not require fragmentation anywhere along a path is referred to as the path maximum transmission unit or PMTU. Applications can usually get better error tolerance by producing packets smaller than the PMTU. DCCP defines a maximum packet size (MPS) based on the PMTU and the congestion control scheme used for each connection. DCCP implementations will not send any packet bigger than the MPS and instead return an appropriate error to the application. The application can query the DCCP stack for the current MPS and restrict itself from sending datagrams larger than this value and thereby avoid fragmentation.

Service Codes

DCCP defines a 32-bit Service Code to disambiguate between multiple applications associated with a single server port. The client specifies the Service Code it wants to connect to and this is used to identify the intended service or application to process a DCCP connection request. Essentially, Service Codes provide an additional level of indirection for connection multiplexing. A server listening on a port may be associated with multiple Service Codes but a client may have only one Service Code, indicating the application it wishes to connect to.

Usage

The mainline Linux kernel has included DCCP support since 2.6.14 and mainstream distributions like Ubuntu enable it by default. However, to get the newer experimental features, you will have to build the kernel from the DCCP Test Tree. Alternatively, you can grab the latest stable kernel release merged with the experimental DCCP changes from here. Be sure to enable all the CCIDs in the kernel configuration in Networking Support –> Networking Options –> The DCCP Protocol –> DCCP CCIDs Configuration. Like the Debian Installation Guide says, “Don’t be afraid to try compiling the kernel. It’s fun and profitable.” For now, Linux is the only operating system supporting native DCCP, unless you count the patch for an ancient version of FreeBSD.

Example in C

The server and client look almost exactly the same as their TCP counterparts with the exception of the socket type and the setting of the service code. The client uses getsockopt() to read the current maximum packet size. Reading the available CCIDs on the host is shown in probe.c. As libc still doesn’t have a netinet/dccp.h header, you will have to get the required constants from the kernel sources or directly use the dccp.h header below.

server.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>

#include "dccp.h"

#define PORT 1337
#define SERVICE_CODE 42

int error_exit(const char *str)
{
    perror(str);
    exit(errno);
}

int main(int argc, char **argv)
{
    int listen_sock = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
    if (listen_sock < 0)
        error_exit("socket");

    struct sockaddr_in servaddr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(PORT),
    };

    if (setsockopt(listen_sock, SOL_SOCKET, SO_REUSEADDR, &(int) {
               1}, sizeof(int)))
        error_exit("setsockopt(SO_REUSEADDR)");

    if (bind(listen_sock, (struct sockaddr *)&servaddr, sizeof(servaddr)))
        error_exit("bind");

    // DCCP mandates the use of a 'Service Code' in addition to the port
    if (setsockopt(listen_sock, SOL_DCCP, DCCP_SOCKOPT_SERVICE, &(int) {
               htonl(SERVICE_CODE)}, sizeof(int)))
        error_exit("setsockopt(DCCP_SOCKOPT_SERVICE)");

    if (listen(listen_sock, 1))
        error_exit("listen");

    for (;;) {

        printf("Waiting for connection...\n");

        struct sockaddr_in client_addr;
        socklen_t addr_len = sizeof(client_addr);

        int conn_sock = accept(listen_sock, (struct sockaddr *)&client_addr, &addr_len);
        if (conn_sock < 0) {
            perror("accept");
            continue;
        }

        printf("Connection received from %s:%d\n",
               inet_ntoa(client_addr.sin_addr), ntohs(client_addr.sin_port));

        for (;;) {
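            char buffer[1024];
            // Each recv() will read only one individual message.
            // Datagrams, not a stream!
            // Leave room for a terminating NUL so the payload can be
            // printed as a string even if the sender did not include one.
            ssize_t ret = recv(conn_sock, buffer, sizeof(buffer) - 1, 0);
            if (ret > 0) {
                buffer[ret] = '\0';
                printf("Received: %s\n", buffer);
            } else {
                break;
            }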

        }

        close(conn_sock);
    }
}

client.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>

#include "dccp.h"

int error_exit(const char *str)
{
    perror(str);
    exit(errno);
}

int main(int argc, char *argv[])
{
    if (argc < 5) {
        printf("Usage: ./client <server address> <port> <service code> <message 1> [message 2] ... \n");
        exit(-1);
    }
    struct sockaddr_in server_addr = {
        .sin_family = AF_INET,
        .sin_port = htons(atoi(argv[2])),
    };

    if (!inet_pton(AF_INET, argv[1], &server_addr.sin_addr.s_addr)) {
        printf("Invalid address %s\n", argv[1]);
        exit(-1);
    }

    int socket_fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
    if (socket_fd < 0)
        error_exit("socket");

    if (setsockopt(socket_fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE, &(int) {htonl(atoi(argv[3]))}, sizeof(int)))
        error_exit("setsockopt(DCCP_SOCKOPT_SERVICE)");

    if (connect(socket_fd, (struct sockaddr *) &server_addr, sizeof(server_addr)))
        error_exit("connect");

    // Get the maximum packet size
    uint32_t mps;
    socklen_t res_len = sizeof(mps);
    if (getsockopt(socket_fd, SOL_DCCP, DCCP_SOCKOPT_GET_CUR_MPS, &mps, &res_len))
        error_exit("getsockopt(DCCP_SOCKOPT_GET_CUR_MPS)");
    printf("Maximum Packet Size: %u\n", (unsigned)mps);

    for (int i = 4; i < argc; i++) {
        if (send(socket_fd, argv[i], strlen(argv[i]) + 1, 0) < 0)
            error_exit("send");
    }

    // Wait for a while to allow all the messages to be transmitted
    usleep(5 * 1000);

    close(socket_fd);
    return 0;
}

probe.c

#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include "dccp.h"

int main(void)
{
    int sock_fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
    if (sock_fd < 0) {
        perror("socket");
        return 1;
    }

    // Check the congestion control schemes available.
    // On return, res_len holds the number of CCIDs written to ccids[].
    uint8_t ccids[6];
    socklen_t res_len = sizeof(ccids);
    if (getsockopt(sock_fd, SOL_DCCP, DCCP_SOCKOPT_AVAILABLE_CCIDS, ccids, &res_len)) {
        perror("getsockopt(DCCP_SOCKOPT_AVAILABLE_CCIDS)");
        return 1;
    }

    printf("%u CCIDs available:", (unsigned)res_len);
    for (socklen_t i = 0; i < res_len; i++)
        printf(" %d", ccids[i]);
    printf("\n");

    return 0;
}

dccp.h

/* This file only contains constants necessary for user space to call
 * into the kernel and thus, contains no copyrightable information. */

#ifndef DCCP_DCCP_H
#define DCCP_DCCP_H

// From the kernel's include/linux/socket.h
#define SOL_DCCP 269

// From kernel's include/uapi/linux/dccp.h
#define DCCP_SOCKOPT_SERVICE 2
#define DCCP_SOCKOPT_CHANGE_L 3
#define DCCP_SOCKOPT_CHANGE_R 4
#define DCCP_SOCKOPT_GET_CUR_MPS 5
#define DCCP_SOCKOPT_SERVER_TIMEWAIT 6
#define DCCP_SOCKOPT_SEND_CSCOV 10
#define DCCP_SOCKOPT_RECV_CSCOV 11
#define DCCP_SOCKOPT_AVAILABLE_CCIDS 12
#define DCCP_SOCKOPT_CCID 13
#define DCCP_SOCKOPT_TX_CCID 14
#define DCCP_SOCKOPT_RX_CCID 15
#define DCCP_SOCKOPT_QPOLICY_ID 16
#define DCCP_SOCKOPT_QPOLICY_TXQLEN 17
#define DCCP_SOCKOPT_CCID_RX_INFO 128
#define DCCP_SOCKOPT_CCID_TX_INFO 192

#endif //DCCP_DCCP_H

Caveats and Conclusion

DCCP is not mainstream. It is not widely deployed or even supported. Documentation is sparse. Although Linux DCCP NAT is functional, many intermediate boxes will probably just drop DCCP traffic. DCCP is the fixed-gear bicycle of Layer 4: the ultimate hipster transport.