<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Neven Miculinic</title>
    <description>The latest articles on Forem by Neven Miculinic (@nmiculinic).</description>
    <link>https://forem.com/nmiculinic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F117195%2F2e071310-a095-4c9e-8c66-7dcec50a94f6.jpg</url>
      <title>Forem: Neven Miculinic</title>
      <link>https://forem.com/nmiculinic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nmiculinic"/>
    <language>en</language>
    <item>
      <title>Kubernetes on ARM: a case study</title>
      <dc:creator>Neven Miculinic</dc:creator>
      <pubDate>Tue, 30 Apr 2019 13:01:47 +0000</pubDate>
      <link>https://forem.com/nmiculinic/kubernetes-on-arm-a-case-study-5ha4</link>
      <guid>https://forem.com/nmiculinic/kubernetes-on-arm-a-case-study-5ha4</guid>
      <description>

&lt;p&gt;At KrakenSystems we're working with various IoT devices. They are our main infrastructure for collecting data and sending it on to further aggregation pipelines. For now, they are BeagleBone Black devices: an armv7l hard-float AM335x 1GHz ARM® Cortex-A8 CPU with only 512MB of RAM. In this blog post we cover the use case and rationale for running kubernetes on such underpowered devices.&lt;/p&gt;

&lt;p&gt;These devices run simple services: reading Modbus registers, the XBee protocol, or OBD (On-Board Diagnostics for vehicles), parsing the data, serializing it into protobuf and sending it on the message bus. &lt;/p&gt;

&lt;h2&gt;
  
  
  Design criteria &amp;amp; implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployment format
&lt;/h3&gt;

&lt;p&gt;We want to deploy software as an immutable binary/container. Due to the arcane C++ build setup at KrakenSystems and a plethora of shared-library dependencies, containers make the most sense for this use case. A static binary is also a viable alternative, but that would require refactoring our current C++ build system, a collection of bash/Makefile scripts that runs for about 15 minutes from scratch and a couple of minutes on CI with caching. &lt;/p&gt;

&lt;p&gt;Another solution was deploying services on bare metal. In this legacy setup there was a dedicated shared-library folder per service, and we did &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; trickery for shared-library version management, defeating the point of having shared libraries in the first place. Yet given the build system's current state, producing a static binary was too (dev-)time-consuming.&lt;/p&gt;

&lt;p&gt;Kubernetes, with its container management, fits our use case perfectly. Nomad or plain old docker/cri-o/rkt would also satisfy this design criterion. Static binaries with systemd would be a satisfactory choice too, if they were simple to produce in the codebase's present state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;Node &amp;amp; service aliveness monitoring is critical. We require an agent running on the node that sends an "I'm alive" signal to some system, together with a mature alerting pipeline. Consul is one solution. Kubernetes has this out of the box, and together with Prometheus alerting rules it seemed like a natural fit. We also use Prometheus/Grafana/Alertmanager throughout our infrastructure, which made this option more appealing.&lt;/p&gt;

&lt;p&gt;Additionally, liveness and readiness health checks aren't particularly useful for edge devices, since a process crash already signals the issue. These are not server components that need to accept client connections. &lt;/p&gt;

&lt;p&gt;Nevertheless, in the future we plan to introduce liveness checks on the services as a failsafe mechanism, in case a service stops sending data on the message bus -- its main purpose. &lt;/p&gt;
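&lt;p&gt;Such a failsafe would be an ordinary pod liveness probe. A hypothetical sketch; the checker command and thresholds are illustrative, not our actual configuration:&lt;/p&gt;

```yaml
# Hypothetical probe: fail the container if the service hasn't
# published to the message bus recently.
livenessProbe:
  exec:
    command: ["/bin/check-last-publish", "--max-age=300s"]
  initialDelaySeconds: 60
  periodSeconds: 120
```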

&lt;p&gt;The remaining Ascalia infrastructure runs on kubernetes, so it made sense to reuse the same tools and setup for our edge devices. Fewer different moving parts leads to operational simplicity, even though kubernetes itself is not simple to operate. &lt;/p&gt;

&lt;h3&gt;
  
  
  Updates
&lt;/h3&gt;

&lt;p&gt;The edge devices aren't static islands forever resting in the Pacific Ocean. The code changes often, and the configuration even more frequently. &lt;/p&gt;

&lt;p&gt;The services are designed for simplicity. Their configuration is saved as a YAML file watched for changes via &lt;code&gt;inotify&lt;/code&gt;. This leaves any future update mechanism possible (a sidecar, for example) while keeping development complexity in check. Furthermore, it's easier to debug.&lt;/p&gt;
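&lt;p&gt;The same watch can be exercised from a shell with inotify-tools, assuming &lt;code&gt;inotifywait&lt;/code&gt; is installed and the config lives at a hypothetical path:&lt;/p&gt;

```shell
# Print an event each time the config file is rewritten in place;
# the service's own inotify watcher reacts to these same events.
inotifywait -m -e close_write,moved_to /etc/edge-svc/config.yaml |
while read -r dir events file; do
    echo "config changed: $events on $dir$file"
done
```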

&lt;p&gt;Per-device configuration is stored in an RDBMS, Postgres in this instance. Having hundreds or thousands of edge devices polling the RDBMS for simple key/value pairs wouldn't end well. Furthermore, there's no push-style notification from the RDBMS on key updates. Thus we need an additional layer in between. &lt;/p&gt;

&lt;p&gt;We're reusing the kubernetes API server and its backing key/value etcd store. We've defined each edge device as a CRD (custom resource definition) object carrying rich, domain-specific information. Kubernetes also serves as primitive inventory management, supplementing the real Django backend used for operations (i.e. I don't care what Django does as long as it updates the right REST endpoints in the kubernetes API).&lt;/p&gt;
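&lt;p&gt;A hypothetical &lt;code&gt;EdgeDevice&lt;/code&gt; object could look like the sketch below; the API group, version and spec fields are illustrative, not our actual schema:&lt;/p&gt;

```yaml
# Hypothetical CRD instance; applied with kubectl it is stored in etcd
# behind the API server like any built-in resource.
apiVersion: example.ascalia.io/v1alpha1
kind: EdgeDevice
metadata:
  name: bbb-test
spec:
  site: plant-a
  modbus:
    pollIntervalSeconds: 10
```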

&lt;p&gt;In the future, edge services may watch the backing key/value store themselves, whether that's the kube API server, etcd, consul, riak, redis, or any other common key/value implementation.&lt;/p&gt;

&lt;p&gt;Finally, we require async updates. Devices could be offline at the time an update is applied, which rules out all non-agent-based configuration management solutions. Ansible, our favorite configuration management tool for its simplicity and power, is only used for the initial setup, not the update procedure (service updates, that is).&lt;/p&gt;

&lt;h3&gt;
  
  
  Wireguard VPN setup
&lt;/h3&gt;

&lt;p&gt;Since we're using the wireguard VPN solution, we need to keep the client/server IP and public-key lists in sync asynchronously. This entails an additional agent on the edge device that you have to monitor, track and keep alive. &lt;/p&gt;

&lt;p&gt;We also need to store offline devices' public keys and to inspect those keys/settings easily. Kubernetes CRDs are a natural fit for this role: we reuse the etcd backing store, get nice RBAC on those objects, and we've defined custom printer columns for easier VPN node management. &lt;/p&gt;
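&lt;p&gt;Printer columns are declared on the CRD itself (apiextensions/v1beta1 syntax of that era; the JSONPaths here are illustrative, not our actual fields):&lt;/p&gt;

```yaml
# Excerpt of a CRD spec: extra columns that `kubectl get` displays,
# so VPN details are visible at a glance.
additionalPrinterColumns:
- name: PublicKey
  type: string
  JSONPath: .spec.publicKey
- name: Endpoint
  type: string
  JSONPath: .spec.endpoint
```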

&lt;p&gt;We used the following in-house open-source tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/KrakenSystems/wg-cni"&gt;wg-cni&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/KrakenSystems/wg-operator"&gt;wg-operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nmiculinic/wg-quick-go"&gt;wg-quick-go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long story short, we bootstrapped the wireguard VPN with the wg-cni ansible role. This also installed a wireguard-based CNI for use in our kubernetes cluster.&lt;/p&gt;

&lt;p&gt;The wg-cni role created our custom CRD manifests representing clients/servers in the wireguard VPN topology. &lt;/p&gt;

&lt;p&gt;After applying the manifests, we started the wireguard operator daemonset, which keeps nodes in sync through further additions/removals. &lt;/p&gt;
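&lt;p&gt;The operator's work can be verified on any node with the standard wireguard tooling (the interface name &lt;code&gt;wg0&lt;/code&gt; is an assumption):&lt;/p&gt;

```shell
# Peers, endpoints, allowed IPs and last-handshake times
# as currently configured on this node
sudo wg show wg0
```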

&lt;h3&gt;
  
  
  Initial deployment
&lt;/h3&gt;

&lt;p&gt;It wasn't without issues. We used &lt;a href="https://github.com/kubernetes-sigs/kubespray"&gt;kubespray&lt;/a&gt; as a mature kubernetes deployment solution; it's the only complete solution for bare-metal deployment. Being ansible-based, we're familiar with it and can easily extend it if necessary... and it was necessary.&lt;/p&gt;

&lt;p&gt;We encountered a myriad of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing ARM support in several components&lt;/li&gt;
&lt;li&gt;the default pause image not supporting ARM&lt;/li&gt;
&lt;li&gt;missing cpuset support (a kernel update to 4.19 LTS solved it)&lt;/li&gt;
&lt;li&gt;running into disk-space issues a few times&lt;/li&gt;
&lt;li&gt;flannel missing multi-arch support in kubespray (before we transitioned to the wireguard CNI for good)&lt;/li&gt;
&lt;li&gt;... &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of these are tracked in the following issue/PRs:&lt;/p&gt;

&lt;p&gt;Issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/kubespray/issues/4294"&gt;Track the gaps when porting to ARM (arm7l)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/kubespray/issues/4065"&gt;Kubeadm download fails on armv7l&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PRs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/kubespray/pull/4261"&gt;Add support for arm images for hyperkube, kubeadm and cni_binary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/kubespray/pull/4299"&gt;Flannel cross platform support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After successfully applying the default container runtime, docker, it was time for basic performance analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial performance analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;eMMC is mounted without atime&lt;/li&gt;
&lt;li&gt;using armhf binaries (&lt;code&gt;readelf -A $(which kubelet) | grep Tag_ABI_VFP_args&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;checking the CPU governor (&lt;code&gt;cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
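&lt;p&gt;Spelled out, those checks look roughly like this:&lt;/p&gt;

```shell
# armhf check: hard-float binaries carry the Tag_ABI_VFP_args attribute
readelf -A "$(which kubelet)" | grep Tag_ABI_VFP_args

# mount flags: look for noatime/relatime on the eMMC partitions
mount | grep mmcblk

# current CPU frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```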

&lt;h3&gt;
  
  
  USE method (utilization, saturation, errors)
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debian@bbb-test:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0   6088  19916 271064    0    0    25    20   17   32 18 12 69  0  0
 0  0      0   6120  19916 271064    0    0     0     0 1334 4098 13 20 67  0  0
 0  0      0   6120  19916 271064    0    0     0     0 1554 4046 13 19 68  0  0
 0  0      0   6120  19924 271056    0    0     0    16  929 2443 10  8 81  1  0
 0  0      0   6120  19924 271064    0    0     0     0 1611 4128 24 20 56  0  0
 0  0      0   6120  19924 271064    0    0     0     0  919 2443  6 11 83  0  0
 0  0      0   5996  19924 271064    0    0     0     0 1240 3312 29 28 42  0  0
 0  0      0   5996  19924 271064    0    0     0     0  958 2417 13  9 77  0  0
 3  0      0   5996  19924 271064    0    0     0     0 1915 5693 28 25 46  0  0
 0  0      0   5996  19924 271064    0    0     0     0 1089 3296 12 18 70  0  0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debian@bbb-test:~$ pidstat 30 1
Linux 4.19.9-ti-r5 (bbb-test)   02/25/2019      _armv7l_        (1 CPU)

04:59:26 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command

04:59:56 PM     0     26749    3.54    1.62    0.00    5.16     0  dockerd
04:59:56 PM     0     26754    0.44    0.37    0.00    0.81     0  docker-containe
04:59:56 PM     0     26784    0.00    0.07    0.00    0.07     0  kworker/u2:2-flush-179:0
04:59:56 PM     0     28814   10.08   10.79    0.00   20.88     0  kubelet
04:59:56 PM     0     29338    0.51    1.15    0.00    1.65     0  kube-proxy
04:59:56 PM   997     29734    1.42    0.37    0.00    1.79     0  consul
04:59:56 PM     0     30867    0.03    0.00    0.00    0.03     0  docker-containe
04:59:56 PM     0     30885    0.47    0.67    0.00    1.15     0  flanneld
04:59:56 PM  1000     31776    0.30    0.07    0.00    0.37     0  mosh-server
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;About 30% of the CPU goes to kubernetes without any meaningful work being done. &lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iostat -xz 1
sar -n DEV 1
sar -n TCP,ETCP 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;These don't show significant disk or network pressure. speedtest-cli shows 30 Mbit download/upload speeds, which is more than sufficient for our use case.&lt;/p&gt;

&lt;p&gt;In summary, there's high CPU usage with low disk, memory and network usage. &lt;/p&gt;

&lt;h2&gt;
  
  
  Performance analysis
&lt;/h2&gt;

&lt;p&gt;Stracing the kubelet shows about 66% of syscall time is spent in futex locks:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.66    1.671983        5006       334        60 futex
 11.77    0.299775         967       310           epoll_wait
  9.24    0.235263         364       647           nanosleep
  2.58    0.065766          31      2136           clock_gettime
  1.75    0.044623          38      1180        68 read
...
------ ----------- ----------- --------- --------- ----------------
100.00    2.546516                 10290       356 total
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
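&lt;p&gt;A per-syscall summary like the one above comes from strace's counting mode (the PID lookup is illustrative):&lt;/p&gt;

```shell
# -c: tally time and calls per syscall, -f: follow threads;
# interrupt with Ctrl-C to print the table
sudo strace -c -f -p "$(pidof kubelet)"
```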



&lt;p&gt;The pprof CPU and trace profiles, though, surfaced more useful information:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debian@bbb-test:~$ wget http://127.0.0.1:10248/debug/pprof/profile?seconds=120
debian@bbb-test:~$ wget http://127.0.0.1:10248/debug/pprof/trace?seconds=120
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
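&lt;p&gt;The downloaded files are then inspected with the standard Go tooling (the kubelet binary path is illustrative):&lt;/p&gt;

```shell
# Interactive CPU profile analysis; `top` and `web` are handy commands
go tool pprof /usr/local/bin/kubelet 'profile?seconds=120'

# The execution trace opens in a browser UI
go tool trace 'trace?seconds=120'
```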



&lt;p&gt;From there we concluded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;about 25% of kubelet time is spent in housekeeping&lt;/li&gt;
&lt;li&gt;changing &lt;code&gt;--housekeeping-interval&lt;/code&gt; to 10m from the default 10s helps&lt;/li&gt;
&lt;li&gt;increasing the node status update period didn't considerably affect CPU usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This housekeeping is mostly for container metrics, which we don't really need every 10s; once in a while is perfectly fine for our use case.&lt;/p&gt;
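&lt;p&gt;How the flag reaches the kubelet depends on the deployment; with a systemd-managed kubelet, a drop-in is one option (paths and variable name are a sketch, kubespray's own files may differ):&lt;/p&gt;

```shell
# Hypothetical systemd drop-in adding the kubelet flag
sudo mkdir -p /etc/systemd/system/kubelet.service.d
printf '[Service]\nEnvironment="KUBELET_EXTRA_ARGS=--housekeeping-interval=10m"\n' |
  sudo tee /etc/systemd/system/kubelet.service.d/20-housekeeping.conf
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```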



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GODEBUG=gctrace=1,schedtrace=1000
gc 80 @1195.750s 0%: 0.070+217+0.19 ms clock, 0.070+57/63/59+0.19 ms cpu, 24-&amp;gt;24-&amp;gt;12 MB, 25 MB goal, 1 P

SCHED 1196345ms: gomaxprocs=1 idleprocs=1 threads=18 spinningthreads=0 idlethreads=6 runqueue=0 [0]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are no big issues with Go's GC or scheduler in the kubelet process, so we haven't analyzed this further.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debian@bbb-test:~$ sudo perf stat -e task-clock,cycles,instructions,branches,branch-misses,instructions,cache-misses,cache-references
^C
 Performance counter stats for 'system wide':

         16,203.12 msec task-clock                #    1.000 CPUs utilized
     4,332,572,389      cycles                    # 267393.223 GHz                    (71.42%)
       911,023,486      instructions              #    0.21  insn per cycle           (71.40%)
        98,098,648      branches                  # 6054350.923 M/sec                 (71.41%)
        30,116,184      branch-misses             #   30.70% of all branches          (71.44%)
       885,259,275      instructions              #    0.20  insn per cycle           (71.45%)
         6,967,361      cache-misses              #    1.836 % of all cache refs      (57.16%)
       379,417,471      cache-references          # 23416495.155 M/sec                (57.14%)

      16.202385758 seconds time elapsed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We observe a 30+% branch misprediction rate in the kubelet process. Further analysis showed this is system-wide: this cheap ARM processor simply has poor branch prediction. &lt;/p&gt;

&lt;h3&gt;
  
  
  Improvements
&lt;/h3&gt;

&lt;p&gt;We made the following improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ditched docker and replaced it with a CRI plugin, concretely containerd&lt;/li&gt;
&lt;li&gt;increased the housekeeping interval from 10s to 10m&lt;/li&gt;
&lt;li&gt;threw away flannel in favor of the wireguard CNI (which is mostly native routing)&lt;/li&gt;
&lt;/ul&gt;
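&lt;p&gt;Switching the kubelet from docker to containerd boils down to the remote CRI flags (Kubernetes 1.13-era syntax; the socket path may vary by distro):&lt;/p&gt;

```shell
# Kubelet flags selecting the containerd CRI endpoint
--container-runtime=remote
--container-runtime-endpoint=unix:///run/containerd/containerd.sock
```

&lt;p&gt;The active runtime then shows up in the CONTAINER-RUNTIME column of &lt;code&gt;kubectl get nodes -o wide&lt;/code&gt;.&lt;/p&gt;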



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Average:        0         9    0.00    0.14    0.00    0.24    0.14     -  ksoftirqd/0
Average:        0        10    0.00    0.17    0.00    0.35    0.17     -  rcu_preempt
Average:        0       530    0.00    0.03    0.00    0.00    0.03     -  jbd2/mmcblk1p1-
Average:        0       785    0.00    0.07    0.00    0.21    0.07     -  haveged
Average:        0       818    0.03    0.00    0.00    0.00    0.03     -  connmand
Average:        0       821    4.64    4.47    0.00    0.00    9.12     -  kubelet
Average:        0      1416    0.14    0.07    0.00    0.03    0.21     -  fail2ban-server
Average:        0      1760    0.42    0.69    0.00    0.35    1.11     -  kube-proxy
Average:        0      3436    1.70    0.90    0.00    0.00    2.60     -  containerd
Average:        0      4274    0.07    0.03    0.00    0.07    0.10     -  systemd-journal
Average:        0     17442    0.00    0.38    0.00    0.17    0.38     -  kworker/u2:2-events_unbound
Average:        0     19070    0.00    0.03    0.00    0.00    0.03     -  kworker/0:2H-kblockd
Average:        0     26772    0.00    0.24    0.00    0.28    0.24     -  kworker/0:1-wg-crypt-wg0
Average:        0     28212    0.00    0.31    0.00    0.28    0.31     -  kworker/0:3-events_power_efficient
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And in the steady state, we have ~15% CPU usage overhead for the monitoring benefits. &lt;br&gt;
Still quite a bit, though livable. Maybe cri-o would have lower overhead, though containerd's is pretty slim too.&lt;br&gt;
We'll investigate how we can optimize the kubelet for even lower resource consumption by turning off unneeded features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize everything: is running kubernetes on edge devices a sane choice? Maybe.&lt;/p&gt;

&lt;p&gt;For us, so far so good: everything works, with considerable though livable overhead. &lt;/p&gt;

&lt;p&gt;Even installing just the Prometheus node_exporter, for example, spikes the CPU on every scrape and slows everything to a crawl for a few hundred milliseconds. &lt;/p&gt;
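&lt;p&gt;That cost is easy to reproduce by hand against node_exporter's default port:&lt;/p&gt;

```shell
# Each scrape walks /proc and /sys; time it on the device itself
time curl -s -o /dev/null http://localhost:9100/metrics
```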

&lt;p&gt;This hardware is quite underpowered, and its bad branch prediction makes any software running on it perform worse than on comparable armv8 or x86_64 architectures. &lt;/p&gt;

&lt;p&gt;In the future we'll try to optimize things even further, hopefully reducing kubelet CPU overhead to a more reasonable percentage. We've tried Rancher's k3s without a big difference (actually worse performance, since we couldn't change the housekeeping interval).&lt;/p&gt;

&lt;p&gt;There's also the &lt;a href="https://kubernetes.io/blog/2019/03/19/kubeedge-k8s-based-edge-intro/"&gt;KubeEdge&lt;/a&gt; project, which looks promising for kubernetes on IoT.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cnx-software.com/2013/04/22/how-to-detect-if-an-arm-elf-binary-is-hard-float-armhf-or-soft-float-armel/"&gt;https://www.cnx-software.com/2013/04/22/how-to-detect-if-an-arm-elf-binary-is-hard-float-armhf-or-soft-float-armel/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2018/05/24/kubernetes-containerd-integration-goes-ga/"&gt;https://kubernetes.io/blog/2018/05/24/kubernetes-containerd-integration-goes-ga/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/rancher/k3s"&gt;https://github.com/rancher/k3s&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/"&gt;https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>kubernetes</category>
      <category>arm</category>
      <category>casestudy</category>
      <category>devops</category>
    </item>
    <item>
      <title>Golang race detection</title>
      <dc:creator>Neven Miculinic</dc:creator>
      <pubDate>Thu, 28 Mar 2019 14:26:36 +0000</pubDate>
      <link>https://forem.com/nmiculinic/golang-race-detection-3j5h</link>
      <guid>https://forem.com/nmiculinic/golang-race-detection-3j5h</guid>
<description>&lt;p&gt;Race conditions are pretty nasty bugs: hard to trace, reproduce and isolate, often occurring under unusual circumstances, and often leading to hard-to-debug issues. Thus it's best to prevent &amp;amp; detect them early. Thankfully, there's a production-ready algorithm to tackle the challenge - ThreadSanitizer v2, a battle-proven library with compiler support for Go, Rust, C &amp;amp; C++. &lt;/p&gt;

&lt;p&gt;First, let's display a typical race condition. We have a simple global count variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we count the number of events happening; whether we're counting HTTP requests, likes or Tinder matches is irrelevant. What's relevant is how we do it. We call the function &lt;code&gt;Run&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;fun&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Cnt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that's not all. We're calling it from multiple goroutines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 
  &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we'd expect our result to be &lt;code&gt;10000&lt;/code&gt;. Yet it's improbable this will happen on a multicore, multithreaded system. You're most likely to get results like &lt;code&gt;5234&lt;/code&gt; on one run, &lt;code&gt;1243&lt;/code&gt; on a second, or even &lt;code&gt;1521&lt;/code&gt; on the last, without any determinism or repeatability. This is a typical data race!&lt;/p&gt;

&lt;p&gt;Let's not stop on this toy example. Instead, let's evaluate some famous real-world race conditions with serious consequences.&lt;/p&gt;

&lt;h1&gt;
  
  
  Famous examples
&lt;/h1&gt;

&lt;h2&gt;
  
  
  DirtyCOW
&lt;/h2&gt;

&lt;p&gt;This is a race condition in the Linux kernel. It involves a tricky interplay between memory pages and the &lt;code&gt;mmap&lt;/code&gt; and &lt;code&gt;madvise&lt;/code&gt; system calls, which allows for a privilege-escalation exploit. That is, you could &lt;code&gt;mmap&lt;/code&gt; a root-owned file copy-on-write, which is a valid operation (writes to the &lt;code&gt;mmap&lt;/code&gt;-ed region shall not be written to the underlying file), yet under certain conditions a write would propagate to the underlying file, even though we're an unprivileged user.&lt;/p&gt;

&lt;p&gt;Further &lt;a href="https://dirtycow.ninja/" rel="noopener noreferrer"&gt;info&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Therac-25
&lt;/h2&gt;

&lt;p&gt;Another famous example is the Therac-25 radiotherapy machine. It had a race condition between the machine settings and the display settings. If the operator typed instructions and changed them too quickly, the machine could end up delivering the maximum radiation dosage while displaying false information to the operator. This led to multiple accidents and deaths. &lt;br&gt;
Further &lt;a href="https://hownot2code.com/2016/10/22/killer-bug-therac-25-quick-and-dirty/" rel="noopener noreferrer"&gt;info&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Definitions
&lt;/h1&gt;

&lt;p&gt;Before continuing let's briefly iterate over required definitions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Race condition&lt;/strong&gt; - A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly.&lt;/p&gt;

&lt;p&gt;Because race conditions are so general in scope, and because judging what is &lt;em&gt;correct&lt;/em&gt; behavior requires domain knowledge, for the rest of this post we focus on data races: a much narrower and more objective notion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data race&lt;/strong&gt; - Concurrent accesses to a memory location, at least one of which is a write. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent access&lt;/strong&gt; - Access where the event ordering isn't known: there's no &lt;code&gt;happens before&lt;/code&gt; relation.&lt;/p&gt;

&lt;p&gt;The happens before relation requires some further explanation. Each goroutine has its own logical clock, incremented on each operation. Within a goroutine there's a strict ordering: events happen sequentially (or at least we observe them as if they happened sequentially; compiler optimizations and out-of-order execution might interfere). Between goroutines there's no ordering unless they synchronize. &lt;/p&gt;

&lt;p&gt;The following image depicts this visually: &lt;code&gt;happens before&lt;/code&gt; relations are highlighted with arrows. Remember that the &lt;code&gt;happens before&lt;/code&gt; relation is transitive. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoo69l7zxaz2w9naxuat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoo69l7zxaz2w9naxuat.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Common mistakes
&lt;/h1&gt;

&lt;p&gt;Let's observe a few common mistakes in Go programs. The examples have similar equivalents in other major languages; they're in Go since I quite like it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Race on the loop counter
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// Not the 'i' you are looking for.&lt;/span&gt;
            &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Try it out. You might get &lt;code&gt;0 4 4 4 4&lt;/code&gt; as output depending on the event ordering. Certainly not the output you'd expect. The fix is quite simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Unprotected global
&lt;/h2&gt;

&lt;p&gt;This one is similar to the introductory example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution is to synchronize access to the global &lt;code&gt;Cnt&lt;/code&gt; variable, since increment operations aren't atomic (unless performed via the &lt;code&gt;sync/atomic&lt;/code&gt; package).&lt;/p&gt;

&lt;p&gt;Thus we use a mutex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or an atomic variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddInt64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Cnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Violating go memory model
&lt;/h2&gt;

&lt;p&gt;Go memory &lt;a href="https://golang.org/ref/mem" rel="noopener noreferrer"&gt;model&lt;/a&gt; specifies what guarantees you have from the compiler. If you breach that contract, you're in for a bad time. The compiler is free to optimize away your code or do unpredictable things with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello, world"&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example Go's memory model doesn't guarantee that the write to &lt;code&gt;done&lt;/code&gt; in one goroutine is visible to other goroutines, since there is no synchronization between them. The compiler is also free to optimize away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into a simpler construct that doesn't load the &lt;code&gt;done&lt;/code&gt; variable on each iteration. Furthermore, there are no guarantees that write buffers between CPU cores and L1 caches will be flushed between concurrent reads of the &lt;code&gt;done&lt;/code&gt; variable. This is all due to the nasty complexity of out-of-order execution and varying memory guarantees across architectures; e.g. x86 gives stronger memory-ordering guarantees at the assembly level than the ARM architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronization primitives
&lt;/h2&gt;

&lt;p&gt;In Go we have the following synchronization primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;channels, where each send and its corresponding receive form a synchronization point&lt;/li&gt;
&lt;li&gt;sync package

&lt;ul&gt;
&lt;li&gt;Mutex&lt;/li&gt;
&lt;li&gt;atomics&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;For further details look those up; they are not concepts unique to Go or to race detection. The important point is that they introduce happens-before ordering into the program.&lt;/p&gt;

&lt;h1&gt;
  
  
  Detecting race conditions
&lt;/h1&gt;

&lt;p&gt;In Go, this is simply done with the &lt;code&gt;-race&lt;/code&gt; compiler flag. We can even run our tests with it and fail on any race condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go install -race
go build -race
go test -race
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As said in the beginning, it uses the ThreadSanitizer library under the hood. You should expect a runtime slowdown of 2-20x and a 5-10x increase in memory consumption. The other requirements are CGO enabled and a 64-bit operating system. It detects race conditions reliably, without false positives. It has uncovered 1000+ bugs in Chromium, 100+ in Go's standard library, and more in other projects.&lt;/p&gt;

&lt;p&gt;What the &lt;code&gt;-race&lt;/code&gt; flag does is instrument your code with additional instructions. For example, this simple function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gets compiled into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;movq "".x+8(SP), AX
incq (AX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, upon turning on &lt;code&gt;-race&lt;/code&gt; you get a bunch of assembly &lt;a href="https://go.godbolt.org/z/IWw4hk" rel="noopener noreferrer"&gt;code&lt;/a&gt;, the interesting bits being:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
call runtime.raceread(SB)
...
call runtime.racewrite(SB)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is, the compiler adds read and write barriers for each concurrently reachable memory location. The compiler is smart enough not to instrument local variables, since instrumentation incurs quite a performance penalty.&lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm
&lt;/h1&gt;

&lt;p&gt;First the bad news:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Determining if an arbitrary program contains potential data races is NP-hard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="//ftp://ftp.cs.wisc.edu/paradyn/technical_papers/RaceComplexity-ICPP1989pdf"&gt;Netzer&amp;amp;Miller 1990&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore our algorithm shall have some tradeoffs.&lt;/p&gt;

&lt;p&gt;First, how and when do we collect our data? Our choices are dynamic or static analysis. The main drawback of static analysis is the time required to properly annotate the source code, so the dynamic approach is used: it requires no programmer intervention beyond turning it on. Data could be collected on the fly or dumped somewhere for post-mortem analysis. ThreadSanitizer uses the on-the-fly approach for performance reasons; otherwise, we could pile up huge amounts of unnecessary data.&lt;/p&gt;

&lt;p&gt;There are multiple approaches to dynamic race detection: pure happens-before based, lockset based, and hybrid models. ThreadSanitizer uses pure happens-before. Now we'll go over three algorithms, each a pure happens-before dynamic race detector, each improving upon the last. We'll see how ThreadSanitizer evolved and understand how it works from its humble origins:&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector Clocks
&lt;/h3&gt;

&lt;p&gt;First, let's explain the concept of vector clocks. Instead of each goroutine remembering only its own logical clock, it remembers the logical clock of every other goroutine as of the last time it heard from it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8csesl2v8wxnl8s22zy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8csesl2v8wxnl8s22zy9.png" alt="vector-clocks" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vector clocks form a partially ordered set: if two events' clocks have a strictly &lt;code&gt;greater&lt;/code&gt; or &lt;code&gt;less&lt;/code&gt; than relation between them, there's a &lt;code&gt;happens before&lt;/code&gt; relation between the events. Otherwise, they are concurrent.&lt;/p&gt;

&lt;h3&gt;
  
  
  DJIT+
&lt;/h3&gt;

&lt;p&gt;DJIT+ is an application of vector clocks to a pure happens-before data race detector. &lt;/p&gt;

&lt;p&gt;We remember vector clocks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each lock $m$ release $$L_m= (t_0, \ldots , t_n)$$&lt;/li&gt;
&lt;li&gt;Last memory read on location x $$R_x = (t_0,\ldots, t_n)$$&lt;/li&gt;
&lt;li&gt;Last memory write on location x $$W_x = (t_0, \ldots, t_n)$$&lt;/li&gt;
&lt;li&gt;Each goroutine's own vector clock $$C_t = (t_0, \ldots, t_n)$$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see an example where there are no races:&lt;/p&gt;

&lt;p&gt;Each row represents a single event as our race detector sees it. First, we write to location $x$ from goroutine 0, and it's remembered in $W_x$. Afterward, we release the lock $m$, and the goroutine 0 field is updated in $L_m$, our lock-release vector clock. On acquiring the lock, we take the element-wise max of the vector clocks $L_m$ and $C_1$, because we know the lock's acquire happens after the lock's release. We then perform the write in goroutine 1 and check whether our own vector clock is concurrent with $W_x$ and $R_x$. It's strictly ordered, thus everything is good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jsd7jykthb8ucgzd0a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jsd7jykthb8ucgzd0a3.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now the same example, but without goroutine synchronization. There is no lock acquire and release. When we compare, goroutine 1's vector clock $(0, 8)$ is concurrent with the last write $W_x = (4,0)$. Thus we have detected the data race. This is the most important concept to understand; the rest is optimizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ejl1052jkla5o2rsst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ejl1052jkla5o2rsst.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FastTrack
&lt;/h3&gt;

&lt;p&gt;FastTrack introduces the first optimization. For full details I recommend reading the original &lt;a href="https://users.soe.ucsc.edu/~cormac/papers/pldi09.pdf" rel="noopener noreferrer"&gt;paper&lt;/a&gt;; it's quite well written!&lt;/p&gt;

&lt;p&gt;We observe the following property: if there are no data races on writes, all writes to a location are totally ordered. That is, it's sufficient to remember only the last write's goroutine and its logical clock. Thus we create the shadow word representing the last memory write: a logical clock and goroutine id, in the format &lt;code&gt;&amp;lt;logical clock&amp;gt;@&amp;lt;goroutine id&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw78r58z164i74evehs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw78r58z164i74evehs6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For reads, it's a bit of a different story. It's perfectly valid to have multiple concurrent reads, as long as reads and writes are strictly ordered. Thus we use the shadow word representation as long as a single goroutine reads the data, and fall back to the expensive full vector clock in the concurrent-read scenario:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdpufv1zq17ns9nyog7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdpufv1zq17ns9nyog7l.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ThreadSanitizer v2
&lt;/h3&gt;

&lt;p&gt;ThreadSanitizer further improved on FastTrack. Instead of keeping separate $R_x$ and $W_x$, we keep a fixed-size shadow word pool for each 8-byte quadword. That is, we approximate the full vector clock with partial shadow words. When the shadow word pool is full, we randomly evict an entry. &lt;/p&gt;

&lt;p&gt;By introducing this trade-off, we trade false negatives for speed and performance. The fixed shadow word pool is memory mapped, allowing cheap access with additions and byte shifts, compared to more expensive hashmaps or variable-sized array access.&lt;/p&gt;

&lt;p&gt;Let's go over an example. Each shadow word is 8 bytes wide and consists of the goroutine's logical clock, the goroutine id, a write/read flag, and the position within the 8-byte word we're reading/writing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgr5buuovu13l7tivs5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgr5buuovu13l7tivs5k.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything's good so far. Let's make another operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj74rehztcp2ct9zejp2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj74rehztcp2ct9zejp2m.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's introduce a data race. Goroutine 3's vector clock entry for goroutine 1 is 5, which is smaller than the &lt;code&gt;10@1&lt;/code&gt; shadow word written in the pool. Thus we're certain a data race happened. (We assume events for each goroutine arrive in order; otherwise, we couldn't be sure whether the &lt;code&gt;10@1&lt;/code&gt; entry is bigger or smaller than the current vector clock's entry for goroutine 1. This is enforced by the algorithm's design and implementation.)&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkh1of0uekm9gwka2v19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkh1of0uekm9gwka2v19.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;To summarize: we covered the basics of logical and vector clocks and how they are used in a happens-before data race detector, and we built it up to ThreadSanitizer v2, which is what Go's &lt;code&gt;-race&lt;/code&gt; uses.&lt;/p&gt;

&lt;p&gt;We observed the tradeoffs in the algorithm design: it traded a higher false negative rate for speed; however, it forbids false positives. This property builds trust. With that trust, we know for certain we have a race whenever it screams at us, not some weird edge case in the algorithm. No flaky race tests; only true positives are reported. &lt;/p&gt;

&lt;p&gt;Though, keep in mind this is only a data race detector; it's easy to circumvent. For example, the following code passes the data race detector despite being horribly wrong in a concurrent setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;Cnt&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadInt64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="c"&gt;// incrementing Cnt isn't atomic, we just load and store the value atomically.&lt;/span&gt;
        &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StoreInt64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Cnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/presentation/d/19mLkWBlniCXHxG-DobTSpX_Z-xJZOxMOYMsmxqNjcGs/" rel="noopener noreferrer"&gt;Slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://golang.org/doc/articles/race_detector.html" rel="noopener noreferrer"&gt;https://golang.org/doc/articles/race_detector.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://golang.org/doc/articles/race_detector.html#Options" rel="noopener noreferrer"&gt;Race detector options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm" rel="noopener noreferrer"&gt;https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://golang.org/ref/mem" rel="noopener noreferrer"&gt;The Go memory model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=5erqWdlhQLA" rel="noopener noreferrer"&gt;""go test -race" Under the Hood" by Kavya Joshi talk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.golang.org/race-detector" rel="noopener noreferrer"&gt;https://blog.golang.org/race-detector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://danluu.com/concurrency-bugs/" rel="noopener noreferrer"&gt;https://danluu.com/concurrency-bugs/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35604.pdf" rel="noopener noreferrer"&gt;ThreadSanitizer – data race detection in practice google paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://users.soe.ucsc.edu/~cormac/papers/pldi09.pdf" rel="noopener noreferrer"&gt;FastTrack: Efficient and Precise Dynamic Race Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://events.static.linuxfound.org/images/stories/slides/lfcs2013_vyukov.pdf" rel="noopener noreferrer"&gt;AddressSanitizer/ThreadSanitizer for Linux Kernel and userspace.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/citation.cfm?id=1228969" rel="noopener noreferrer"&gt;DJIT+ algo&lt;/a&gt; MultiRace: efficient on-the-fly data race detection in multithreaded C++ programs&lt;/li&gt;
&lt;li&gt;&lt;a href="//ftp://ftp.cs.wisc.edu/paradyn/technical_papers/RaceComplexity-ICPP1989pdf"&gt;On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions (Netzer, Miller, 1990)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>algorithms</category>
    </item>
    <item>
      <title>Pragmatic tracing</title>
      <dc:creator>Neven Miculinic</dc:creator>
      <pubDate>Fri, 23 Nov 2018 11:53:09 +0000</pubDate>
      <link>https://forem.com/nmiculinic/pragmatic-tracing-cnf</link>
      <guid>https://forem.com/nmiculinic/pragmatic-tracing-cnf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;



&lt;p&gt;&lt;code&gt;A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable - Leslie Lamport&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The fact of life is that we're building bigger, more complex, more distributed systems. What used to be a single server serving old-style personal home pages turned into medium.com-mediated blog posts and various other ecosystems. The simple bulletin-board marketplace turned into the enormous systems of Amazon, eBay, and other online retailers. This evolution is akin to life itself: as we progressed from single-celled bacteria to the humans we are today, things got complicated. More powerful, but complicated. &lt;/p&gt;

&lt;p&gt;Now, let's talk about your current system. Chances are not everything is new. Today almost every system uses numerical routines written in Fortran a few decades ago. Similarly, in your body, basic cell metabolism hasn't changed in the last hundreds of millions of years. &lt;/p&gt;

&lt;p&gt;Though with added complexity, some things did change. Aside from a superior feature set, e.g. sight, speech, and consciousness, we evolved advanced monitoring equipment. A single neuron and a proto-eye don't cut it, but a full-fledged eye does. &lt;/p&gt;

&lt;p&gt;A vast nerve network almost instantaneously detects issues in your body and alerts the control center. Imagine if you had to check every second whether you stubbed your toe or are bleeding.&lt;/p&gt;

&lt;p&gt;Messages are sent and received via hormones on a gigantic message queue called the circulatory system, i.e. blood. Much bigger complexity than a simple single-celled organism and its osmosis-diffused operations.&lt;/p&gt;

&lt;p&gt;Therefore, your new systems cannot be monitored the way they were in the lower-complexity era. Logs aren't enough anymore; neither are metrics, nor pprof and similar tools. In this article, I'm presenting tracing as another tool in your toolbox, both in a monolith application and in a distributed setting. A very useful tool for specific use cases. &lt;/p&gt;

&lt;h2&gt;
  
  
  Observability toolset
&lt;/h2&gt;

&lt;p&gt;In this chapter, I'll briefly cover the basic observability tools at our disposal. I'll also use a football analogy as a metaphor for clearer explanation. For those familiar, feel free to skip to the last section, tracing. &lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;



&lt;p&gt;&lt;code&gt;log - an official record of events during the voyage of a ship or aircraft.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This is the simplest and most basic tool at our disposal. &lt;br&gt;
One log represents some event happening with associated metadata.&lt;/p&gt;

&lt;p&gt;The most common metadata is when the event happened, along with which application generated it, on which host, and the level of the log event.&lt;/p&gt;

&lt;p&gt;Usually log event levels are modeled after syslog levels, with DEBUG, INFO, WARNING, ERROR, and CRITICAL being used the most in software. &lt;/p&gt;

&lt;p&gt;On the application and system level, you usually have log sources and log sinks. Each source generates log data and ships it to sinks. Each sink can apply some filtering, e.g. only ERROR level or higher. &lt;br&gt;
In practical terms, this means you're writing log level INFO or greater to &lt;code&gt;stderr&lt;/code&gt;, but dumping all logs to a file. &lt;/p&gt;

&lt;p&gt;There are log management systems for gathering and searching logs. One of the simplest is a logfile with grep; more complex is the systemd journal; and for production distributed systems you're usually using the ELK stack or Graylog. &lt;/p&gt;

&lt;p&gt;In the football analogy, a player scoring a goal would be a great log. Something like:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;msg="Ivan scored the goal" time="2018-01-12" game_minute="23"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Log data is useful for unique, rare, or otherwise meaningful events. Frequent events need to be sampled down so as not to kill your log management system. For example, do you want to log every IP packet received on the server? &lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;



&lt;p&gt;&lt;code&gt;metric - a system or standard of measurement.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;In the observability context, a metric is a scalar value changing over time. This scalar value is usually a counter (e.g. number of goals scored, HTTP requests), a gauge (e.g. temperature, CPU utilization), a histogram (e.g. counts of 1xx, 2xx, 3xx, 4xx, 5xx HTTP response codes), or a rank estimation (e.g. 95th-percentile response latency).&lt;/p&gt;

&lt;p&gt;Metrics are great for identifying bottlenecks, spotting unusual behavior, and setting SLOs (Service Level Objectives). Usually alarms are tied to some metric: whenever the metric falls outside a given bound, perform an action -- auto-heal or notify the operator. &lt;/p&gt;

&lt;p&gt;For example, humans have a really important metric -- blood glucose level. In a healthy human, a value that is too high triggers an auto-healing operation: releasing more insulin into the bloodstream. Another human metric would be the pain level in your left leg. Usually it's near zero, but over a certain threshold you become vividly aware of it -- that is, an alarm is raised. &lt;/p&gt;

&lt;p&gt;For computer systems, the usual metrics relate to throughput, error rate, latency, and resource (CPU/GPU/network/...) utilization. Commonly mentioned systems are StatsD, Graphite, Grafana, and Prometheus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pprof &amp;amp; Flamegraphs
&lt;/h3&gt;

&lt;p&gt;This tool is best for profiling CPU-intensive applications. Before explaining what it offers, I want to cover how it works: X times per second it stops your program and records which line is currently executing on the machine. It collects all execution samples into buckets per line/function and later reconstructs what percentage of the time each function/line was executing. It's as if you snapshotted a football match every 10 seconds to see who has the ball, and from that reconstructed each player's ball-possession percentage.&lt;/p&gt;
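&lt;p&gt;The sampling idea itself fits in a few lines of Go (a toy illustration, not pprof's actual implementation): given which function was on-CPU at each sample, reconstruct the per-function time share.&lt;/p&gt;

```go
package main

import "fmt"

// shareOf reconstructs per-function time share from periodic samples
// of which function was executing.
func shareOf(samples []string) map[string]float64 {
	counts := map[string]int{}
	for _, fn := range samples {
		counts[fn]++
	}
	share := map[string]float64{}
	for fn, c := range counts {
		share[fn] = float64(c) / float64(len(samples))
	}
	return share
}

func main() {
	// Four "snapshots of who has the ball": parse was on-CPU 3 times out of 4.
	samples := []string{"parse", "parse", "send", "parse"}
	fmt.Println(shareOf(samples)["parse"]) // 0.75
}
```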

&lt;h3&gt;
  
  
  Traces
&lt;/h3&gt;

&lt;p&gt;If you remember anything from this blog post, remember this comet image:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmphsq5yfhv531dhymis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmphsq5yfhv531dhymis.jpg" alt="commet" width="630" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As it burns through Earth's atmosphere, the comet leaves a &lt;em&gt;trace&lt;/em&gt; along its path. From this &lt;em&gt;trace&lt;/em&gt; we can deduce where it has been and how much time it spent there. The situation is similar within our programs. &lt;/p&gt;

&lt;p&gt;A trace represents a single execution of a request/operation. A trace is composed of multiple spans. A span is a meaningful unit of work in your system -- e.g. a database query, an RPC call, a calculation. Spans can be related, e.g. as parent/child. Thus a trace forms a tree of spans, or more generally a DAG (if you introduce complex follows-from relations and other gimmicks).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m564hunxdgmy32iqool.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m564hunxdgmy32iqool.png" alt="trace example" width="443" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each span can have useful metadata attached to it -- both indexed key/value tags such as userId, http.statuscode, or hostname, and additional log data, e.g. the exact database query. The tracing backend provides the expected search capabilities: sorting by time, filtering by tags, etc.&lt;/p&gt;

&lt;p&gt;In the football example, we could have a trace representing the scoring of a goal. It consists of three spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ivan kicking the ball&lt;/li&gt;
&lt;li&gt;The ball rolling to the goal&lt;/li&gt;
&lt;li&gt;Ball entering the goal post&lt;/li&gt;
&lt;/ul&gt;
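&lt;p&gt;The parent/child structure above can be sketched as plain data (field names are my own, not any particular tracing library's):&lt;/p&gt;

```go
package main

import "fmt"

// Span is one unit of work; ParentID links it into the trace's tree.
type Span struct {
	TraceID  string // shared by every span in the same trace
	SpanID   string
	ParentID string // empty for the root span
	Name     string
}

// childOf derives a child span inside the parent's trace.
func childOf(parent Span, spanID, name string) Span {
	return Span{TraceID: parent.TraceID, SpanID: spanID, ParentID: parent.SpanID, Name: name}
}

func main() {
	root := Span{TraceID: "t1", SpanID: "s1", Name: "score a goal"}
	kick := childOf(root, "s2", "Ivan kicks the ball")
	fmt.Println(kick.TraceID, kick.ParentID) // t1 s1
}
```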
&lt;h2&gt;
  
  
  Common use cases
&lt;/h2&gt;

&lt;p&gt;In this section, I'm going to cover the top use cases for tracing. Compared to other techniques like pprof, tracing can detect when your program was put to sleep due to IO waiting, resource contention, or other reasons.&lt;/p&gt;
&lt;h3&gt;
  
  
  Overall request overview
&lt;/h3&gt;

&lt;p&gt;Tracing lets you make a case study out of an individual trace. Metrics aggregate, while a trace focuses the story on the individual request. Which services did this request touch? How much time did it spend in each? What's the time breakdown across CPU, GPU, network calls, etc.? &lt;/p&gt;

&lt;p&gt;If you're debugging your application, searching through your traces for a specific edge case and analyzing that one is golden. The same goes for analyzing performance outliers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Big slow bottleneck
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs3.amazonaws.com%2Fmedia-p.slid.es%2Fuploads%2F714579%2Fimages%2F5280805%2FUntitled_Diagram.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs3.amazonaws.com%2Fmedia-p.slid.es%2Fuploads%2F714579%2Fimages%2F5280805%2FUntitled_Diagram.svg" width="481" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you have an obvious bottleneck, you know where to drill down. You've narrowed your search space to this particular operation. &lt;/p&gt;

&lt;p&gt;The causes of a big slow bottleneck vary. Perhaps you're overusing locks and the program is waiting on one. Or a database query is under-optimized or missing an index. Or your algorithm worked for 10 users, but after growing to 10 000 users it's just too slow.&lt;/p&gt;

&lt;p&gt;Find it, observe it, analyze it, and fix it. &lt;/p&gt;
&lt;h3&gt;
  
  
  Fanout
&lt;/h3&gt;

&lt;p&gt;Fan-out is the number of outgoing requests a service makes for each incoming request. The bigger the fan-out, the bigger the latency. Sometimes fan-out is deliberate and useful; in this context, however, we're primarily talking about calling the same service over and over again where a bulk operation would be more suitable. &lt;/p&gt;

&lt;p&gt;It's the difference between:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;versus looping over some &lt;code&gt;id&lt;/code&gt; list and querying for each &lt;code&gt;id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can happen inadvertently, e.g. when an ORM framework decides to do it for you. You've fetched your ids and are now happily looping over them, not knowing you're issuing a new database query each time. &lt;/p&gt;

&lt;p&gt;Other times it's an API design issue. If you have an internal API endpoint such as &lt;code&gt;/api/v1/user/{id}&lt;/code&gt; for fetching user data but need a bulk export, you hit the same issue.&lt;/p&gt;

&lt;p&gt;In the trace, you will see many small requests to the same service. Even if they run in parallel (and they don't necessarily), you will hit the tail latency problem. &lt;/p&gt;

&lt;p&gt;The probability that every sub-request stays below the p-th percentile latency drops exponentially with the fan-out degree. Here's a simple figure illustrating it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzojox05vgqbq470wwfs5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzojox05vgqbq470wwfs5.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chrome tracing format
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview" rel="noopener noreferrer"&gt;This&lt;/a&gt; is simple JSON format specification for single process tracing. The &lt;a href="https://github.com/catapult-project/catapult" rel="noopener noreferrer"&gt;catapult&lt;/a&gt; project includes the rendered for this specification. The same renderer available in chrome under &lt;code&gt;chrome://tracing/&lt;/code&gt; URL. There are various support for spitting out this format, e.g. tensorflow execution, golang tracing, chrome rendering itself, etc. And it's easy to include it into your application if there's no need for distributed tracing and your requirements are simple.&lt;/p&gt;

&lt;p&gt;For example, this simple file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Asub"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PERF"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ph"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22630&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22630&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;829&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Asub"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PERF"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ph"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"E"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22630&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22630&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;833&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renders as:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2nxv42sgnptoacqhiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2nxv42sgnptoacqhiw.png" alt="chrome tracing rendering" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in more, I recommend at least skimming the specification. The biggest downside is the visualizer. As a newcomer, I had a hard time figuring out how to filter data points by category or name, or generally any advanced use beyond the basic scroll-and-see. &lt;/p&gt;
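&lt;p&gt;Emitting the format from your own code is straightforward -- here's a sketch in Go (the struct fields mirror the event entries shown above; timestamps are in microseconds):&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"os"
)

// TraceEvent mirrors one entry in Chrome's trace-event JSON array.
type TraceEvent struct {
	Name string `json:"name"`
	Cat  string `json:"cat"`
	Ph   string `json:"ph"` // "B" = begin, "E" = end
	Pid  int    `json:"pid"`
	Tid  int    `json:"tid"`
	Ts   int64  `json:"ts"` // microseconds since trace start
}

// spanEvents produces the begin/end event pair for one operation.
func spanEvents(name string, pid, tid int, beginUs, endUs int64) []TraceEvent {
	return []TraceEvent{
		{Name: name, Cat: "PERF", Ph: "B", Pid: pid, Tid: tid, Ts: beginUs},
		{Name: name, Cat: "PERF", Ph: "E", Pid: pid, Tid: tid, Ts: endUs},
	}
}

func main() {
	// The resulting JSON array can be loaded directly into chrome://tracing.
	json.NewEncoder(os.Stdout).Encode(spanEvents("Asub", 22630, 22630, 829, 833))
}
```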
&lt;h2&gt;
  
  
  Distributed tracing
&lt;/h2&gt;

&lt;p&gt;All's fine and dandy on a single node, but trouble starts with distributed systems. The problem is how to connect/correlate traces coming from multiple nodes. Which spans belong to which trace and how are those spans related? &lt;/p&gt;

&lt;p&gt;Today most solutions take their cues from Google's Dapper &lt;a href="https://ai.google/research/pubs/pub36356" rel="noopener noreferrer"&gt;paper&lt;/a&gt;. Each trace has its own unique traceID and each span a unique spanID, and both are propagated across node boundaries. &lt;/p&gt;

&lt;p&gt;There are, though, multiple ideas about how this context should be propagated and whether to include additional data during that propagation (i.e. baggage). Each backend also has its own ideas about how you should deliver trace/span data to it. &lt;/p&gt;

&lt;p&gt;The first available backend was Zipkin; nowadays Uber's Jaeger is a CNCF incubating project and a good place to start. Various cloud providers also have their own in-house SaaS tracing (Google's Stackdriver, AWS X-Ray, etc.)&lt;/p&gt;

&lt;p&gt;Here's a screenshot from Jaeger frontend for searching and looking at your traces:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Feng.uber.com%2Fwp-content%2Fuploads%2F2017%2F02%2F7-Screen-Shot-Search-Results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Feng.uber.com%2Fwp-content%2Fuploads%2F2017%2F02%2F7-Screen-Shot-Search-Results.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since writing against a specific backend would mean hard vendor lock-in, two vendor-neutral client APIs have emerged -- OpenTracing and OpenCensus.&lt;/p&gt;
&lt;h3&gt;
  
  
  Client side vendor-neutral APIs
&lt;/h3&gt;

&lt;p&gt;In this subsection, I'm going to compare the OpenCensus and OpenTracing standards. Both are evolving projects with high GitHub star counts, multi-language support, and various middleware implementations for databases, HTTP, gRPC, etc. &lt;/p&gt;

&lt;p&gt;OpenTracing is currently a CNCF incubating project, while OpenCensus emerged from Google's internal tracing tool. Nowadays OpenCensus has its own vendor-neutral organization.&lt;/p&gt;

&lt;p&gt;As for the feature set, OpenCensus includes metrics in its API, while OpenTracing covers tracing only.&lt;/p&gt;

&lt;p&gt;In OpenCensus, you add trace exporters to a global set, while in OpenTracing you have to specify the tracer at each instrumentation site. OpenTracing has a global-tracer concept, but at least in Go you don't get it by default -- you have to invoke it explicitly. Overall, the OpenTracing API feels clunkier and less polished than OpenCensus. &lt;/p&gt;

&lt;p&gt;For the propagation format, OpenCensus specifies the standard. OpenTracing, on the other hand, only specifies the API, and each supported backend must implement the propagation itself. &lt;/p&gt;

&lt;p&gt;What OpenTracing does have is the baggage concept, i.e. forcing some data to be propagated to each downstream service alongside the span propagation context. &lt;/p&gt;
&lt;h3&gt;
  
  
  Open census example
&lt;/h3&gt;

&lt;p&gt;This subsection describes the basic OpenCensus building blocks. For complete examples see the official documentation -- it's quite good! &lt;br&gt;
They even feature great &lt;a href="https://github.com/census-instrumentation/opencensus-go/tree/master/examples" rel="noopener noreferrer"&gt;examples&lt;/a&gt;. My examples are in Go, but OpenCensus supports multiple other languages. For brevity, import statements are omitted. &lt;/p&gt;

&lt;p&gt;First we start by defining a trace exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;jaeger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jaeger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Endpoint&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"http://localhost:14268"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ServiceName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"opencensus-tracing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can easily start Jaeger using Docker, following their getting started &lt;a href="https://www.jaegertracing.io/docs/1.7/getting-started/" rel="noopener noreferrer"&gt;page&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;jaeger&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="n"&gt;COLLECTOR_ZIPKIN_HTTP_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;9411&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;5775&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5775&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;udp&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;6831&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;6831&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;udp&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;6832&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;6832&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;udp&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;5778&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5778&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;16686&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;16686&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;14268&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;14268&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="m"&gt;9411&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;9411&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="n"&gt;jaegertracing&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;1.7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we start some span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"some/useful/name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ctx&lt;/code&gt; is an instance of the standard library's &lt;code&gt;context.Context&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We can attach some indexed tagged key/value metadata to this span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenTracing actually specifies some standardized keys (e.g. error, http.statuscode, ...). I recommend using them where you can, despite OpenCensus not specifying any.&lt;/p&gt;

&lt;p&gt;Or we can add some log data, that is, annotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Annotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"some useful annotation"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Annotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attribute&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BoolAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
    &lt;span class="s"&gt;"some useful log data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we've decided to use HTTP, let's inject the OpenCensus middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ochttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{}}&lt;/span&gt;  &lt;span class="c"&gt;// client&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ochttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c"&gt;//server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the client side, we only have to include the right context in our request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the server side, we grab the context from the request and go on with our lives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;HandleXXX&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This chapter could be summarized as an OpenCensus crash course/cheat sheet/in 5 minutes. &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize this blog post, I recommend tracing most for these use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning about overall performance overview&lt;/li&gt;
&lt;li&gt;Detecting big slow operations&lt;/li&gt;
&lt;li&gt;Fan out detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here I presented various tools such as the Chrome tracing format, OpenTracing, and OpenCensus. Of those, I recommend starting with OpenCensus and running Jaeger in a Docker container. Use middleware wherever possible.&lt;/p&gt;

&lt;p&gt;Finally, reap the benefits! See how tracing helps you and how you can best leverage it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Reference &amp;amp; links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=kN9PjsDhH-I" rel="noopener noreferrer"&gt;My Webcamp 2018 talk for this blogpost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview" rel="noopener noreferrer"&gt;Chrome tracing format specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;https://www.jaegertracing.io/docs/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/catapult-project/catapult" rel="noopener noreferrer"&gt;https://github.com/catapult-project/catapult&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e" rel="noopener noreferrer"&gt;https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opencensus.io/" rel="noopener noreferrer"&gt;https://opencensus.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://opentracing.io/" rel="noopener noreferrer"&gt;http://opentracing.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=kb-m2fasdDY" rel="noopener noreferrer"&gt;GOTO 2016 • What I Wish I Had Known Before Scaling Uber to 1000 Services • Matt Ranney:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU" rel="noopener noreferrer"&gt;How NOT to Measure Latency by Gil Tene:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf" rel="noopener noreferrer"&gt;So, you want to trace your distributed system? Key design insights from years of practical experience&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
