Forem: Tal Hoffman

Software Transactional Memory: a stairway to lock-free programming heaven?

Tal Hoffman — Fri, 26 Mar 2021 18:48:26 +0000

When it comes to synchronization of shared state and intermediacy, developer-controlled locking has always been a double-edged sword. Although effective, it's been proven to be deadly when done wrong.

Deadlocks, livelocks, and a lack of composability are all way too common, and frankly - hard to avoid - especially when dealing with large, complex applications.

Enter Software Transactional Memory.

One alternative approach to this pain in the arse problem is moving the locking part inside the runtime, basically freeing the developer from locking decision concerns. It works such that each critical section access is done using an "atomic transaction".

Before we proceed any further, one has to understand the difference between "lock-less programming" and "lock-less programs" in regard to this blog post. STM is meant to be a solution to the first one by taking care of locking for us, the developers, hence "lock-less programming". This means no deadlocks, less livelocks, and much better composability.

We'll also be using a specific STM algorithm which is lock-free, in the sense of not using traditional locking primitives, but rather bounded spinlocks which if already acquired - do not block. Therefore, we're also considered a "lock-less program".

Now let's proceed…

STM was first introduced by Shavit and Touitou back in 1995. It was an exciting improvement of an earlier concept called Hardware Transactional Memory, in which hardware support was used to achieve the same goal, only now it could be done at software level - either entirely or as an hybrid software-hardware solution.

It works by isolating a set of reads and writes to shared memory locations in a construct called "a transaction". The runtime executes the user code as if it were to run alone, with no other threads interfering. It attempts to commit all of the transaction's writes into memory, and aborts execution if it notices a conflict with other threads. The runtime would keep retrying to run the transaction, until it is able to successfully commit all shared memory modifications.

However, as previously stated, it is important to understand that STM by itself doesn't necessarily mean lock-free concurrency. Some implementations are indeed lock-free, while some are not. Software Transactional Memory is merely an abstraction which frees the developer from typical locking concerns.

Credit: https://researchgate.net*

Sounds familiar?

Many of you might have already heard of this modus operandi. A similar form of this concurrency control is used in databases and version control systems. It is known as optimistic concurrency.

Transactional Locking II (with a slight touch)

There are several approaches and algorithms for implementing STM, one of them being "Transactional Locking II (TL2)" proposed by Dive, Shalev, and Shavit, on which we'll focus here.

The algorithm is pretty straightforward, and can be split into three parts: reading a variable, writing to a variable, and committing the transaction.

The secret to it all is "versioning". Basically, each time a value is read or a commit needs to go through, the runtime makes sure that versions are up-to-date and that there is no other thread in the middle of messing up with that memory location at the time of inspection. Otherwise, it starts over until each transaction is looking at a consistent view of things.

It is, in a sense, a form of double-checked locking integrated into the runtime, if you wish.

This works by maintaining a global version "clock" and wrapping each globally shareable variable with its own (shared) version and its own lock. Lets call it a "versioned lock" from now on.

That versioned lock is a 64-bit word (uint64), for that matter, where the first bit is the lock and the other 63 bits hold the version. Kind of like seqlocks which are being used in the Linux Kernel:

// VersionedLock consists of a lock bit and a version number.
// Note that this lock doesn't enforce ownership!
type VersionedLock uint64

// Tries to acquire lock.
// Non-blocking.
func (vl *VersionedLock) TryAcquire() error {
   currentlyLocked, currentVersion, currentLock := vl.Sample()
   if currentlyLocked {
      return ErrAlreadyLocked
   }

   // Lock = true; Version = current
   return vl.tryCompareAndSwap(true, currentVersion, currentLock)
}

// Releases lock.
func (vl *VersionedLock) Release() error {
   currentlyLocked, currentVersion, currentLock := vl.Sample()
   if !currentlyLocked {
      return ErrAlreadyReleased
   }

   // Lock = false; Version = current
   return vl.tryCompareAndSwap(false, currentVersion, currentLock)
}

// Atomically updates lock version and releases it.
func (vl *VersionedLock) VersionedRelease(newVersion uint64) error {
   currentlyLocked, _, currentLock := vl.Sample()
   if !currentlyLocked {
      return ErrAlreadyReleased
   }

   // Lock = false; Version = new
   return vl.tryCompareAndSwap(false, newVersion, currentLock)
}

// Retrieves lock state - whether it is locked, its version, and its raw form.
func (vl *VersionedLock) Sample() (bool, uint64, uint64) {
   current := atomic.LoadUint64((*uint64)(vl))
   locked, version := vl.parse(current)
   return locked, version, current
}

func (vl *VersionedLock) tryCompareAndSwap(doLock bool, desiredVersion uint64, compareTo uint64) error {
   newLock, err := vl.serialize(doLock, desiredVersion)
   if err != nil {
      return errors.WithMessage(err, "try compare and swap")
   }

   if swapped := atomic.CompareAndSwapUint64((*uint64)(vl), compareTo, newLock); !swapped {
      return ErrLockModified
   }
   return nil
}

func (vl *VersionedLock) serialize(locked bool, version uint64) (uint64, error) {
   if (version >> versionOffset) == 1 { // Version mustn't override our lock bit.
      return 0, ErrVersionOverflow
   }

   if locked {
      return (1 << versionOffset) | version, nil
   }
   return version, nil
}

func (vl *VersionedLock) parse(serialized uint64) (bool, uint64) {
   version := (1<<versionOffset - 1) & serialized
   lockedBit := serialized >> versionOffset
   return lockedBit == 1, version
}

Reads

Prior to each read we make sure of 3 things:

That no other thread is currently trying to modify that memory location (lock bit should be off)
That the shared version of that memory location is older or equals the global version, meaning we hold an up-to-date version of it
That the relevant memory location wasn't modified beneath our feet in case a context switch occurred before our read operation (i.e, variable version pre-read == variable version post-read)

If this read-set validation fails - the transaction will restart.

type StmContext struct {
   readLog      map[*StmVariable]interface{}
   writeLog     map[*StmVariable]interface{}
   restart      bool
   readVersion  uint64
   writeVersion uint64
}

// ...

func (sc *StmContext) Read(stmVariable *StmVariable) interface{} {
   if newVal, foundInWriteLog := sc.writeLog[stmVariable]; foundInWriteLog { // Short road to success...
      return newVal
   }

   _, preReadVersion, _ := stmVariable.lock.Sample()
   readVal := stmVariable.val.Load()
   locked, postReadVersion, _ := stmVariable.lock.Sample()

   // Fail transaction if:
   // 1. Variable is currently being changed by some other goroutine; or if
   // 2. Variable was changed before/after being read; or if
   // 3. Variable is too new meaning our read version is outdated
   sc.restart = locked || preReadVersion != postReadVersion || preReadVersion > sc.readVersion

   return readVal
}

Writes

When it comes to writing to shared locations, each update is cached in-memory, waiting for the transaction to end so it can be committed.

func (sc *StmContext) Write(stmVariable *StmVariable, newVal interface{}) {
   sc.writeLog[stmVariable] = newVal
}

Commits

We will implement a slightly modified version of TL2 in terms of committing transactions.

When user code is done executing and commit time is due, we try to "lock" each write log variable and each read set variable*.

The next step is to atomically increment-and-fetch the global clock and making it be our transaction's write version. This would become the official version of each write-log variable in case everything goes well and our transactions commits successfully.

Next up we need to once again validate our read-set, making sure nothing had changed between us executing the user's code and our attempt to commit the transaction. If validation fails - you know the drill - the transaction restarts. A slight optimization here is to avoid running this validation if we are the only transaction (that is, the transaction's read version + 1 == transaction's write version that we've just set).

Otherwise, modifications are committed one-by-one in an atomic manner (say compare-and-swap?) and seqlocks are released.

This entire process repeats itself until the transaction finally looks at a consistent memory layout and is able to commit itself successfully.

Please note that the original TL2 specification doesn't mention anything about locking the read-set. This left me thinking: what would happen if a context switch occurs right after the commit-phase's read-set validation, but before actually updating the write-set memory? In that case, a second thread might modify variables which were already validated by the first one, causing it to have a false sense of memory consistency.

Nonetheless, despite me trying to desperately get good explanations as to why this isn't really an issue - I could not find any. None of the official papers seemed to address this concern either.

Therefore, I decided to go what I believe is the safer way (until being proven otherwise) and lock the read-set as well, taking into account additional performance costs.

func StmAtomic(block func(*StmContext) interface{}) interface{} {
   for {
      ctx := &StmContext{
         readLog:      make(map[*StmVariable]interface{}, 0),
         writeLog:     make(map[*StmVariable]interface{}, 0),
         restart:      false,
         readVersion:  versionClock.Load(),
         writeVersion: 0,
      }

      retVal := block(ctx)
      if ctx.restart {
         continue
      }

      // "And she's buying a stairway to heaven..."
      if len(ctx.writeLog) == 0 {
         return retVal
      }

      lockSet := make(map[*StmVariable]int, 0)
      if err := tryAcquireSets(ctx, lockSet); err != nil {
         if fatal := isFatalAcquireErr(err); fatal { // Avoid a panic if lock is already acquired.
            panic(fatal)
         }
         continue
      }

      ctx.writeVersion = versionClock.Increment()

      // Now that our read and write sets are locked, we need to ensure that nothing has changed in terms of our
      // read set, in-between running the user's code and locking everything.
      // However, if no other concurrent actors were involved (readVersion == writeVersion - 1), there is no need to
      // validate anything cause we were all alone.
      if ctx.readVersion != ctx.writeVersion-1 {
         if validated := validateReadSet(ctx, lockSet); !validated {
            continue
         }
      }

      commitTransaction(ctx, lockSet)

      return retVal
   }
}

The entire repository code can be found here: https://github.com/talhof8/kashmir.

Caveats

No rose without a thorn, though. STM indeed has some very noticeable downsides and thus was considered a research toy for a long time. First off, STM rises the inevitable question of what do you do about non-idempotent operations. That is, I/O, network, print statements, etc…? Such operations cannot be simply undone when a memory conflict is detected. The default answer to this question is as simple (and perhaps conveniently looks the other way) as just to avoid putting side-effect causing operations inside STM atomic blocks. Another approach is to queue all side-effects causing operations inside a buffer and running them only on successful commits, outside the transaction.

Another potential issue - which is more implementation-dependent than an overall STM issue - is livelocks. Obviously threads could still conflict, causing them to keep retrying the transaction thus none progressing. It could be avoided by using some backoff mechanism (be it a random backoff, exponential one, etc…) in-between transaction attempts. This will obviously cause a performance hit, raising the question of how important is performance in this regard?

In addition, a major concern is what to do about languages which do not have garbage collection. Unless given official implementation-level support, some STM algorithms would conceptually allow threads to free heap-allocated variables while other threads are de-referencing them - causing segmentation faults. Have a look here for a deeper explanation.

Over and above that, in our implementation we use a form of spin-locking, meaning that long-running commit attempts might blow up CPU usage (unless preempted by the runtime/OS). This means that performance-wise our STM implementation is most performant in scenarios where transactions succeed on first attempt. Overall STM introduces quite a good performance (also depending on the implementation), albeit slightly less good than classic fine-grained locking.

Moreover, long-running programs could obviously make the global version clock overflow. This case can quite easily be resolved by the runtime, by making it reset to zero, for example, and syncing all relevant transactions to use the new, reset version clock, at a slight performance cost, when such cases occur.

Applications

Those of you who are familiar with the likes of CPython and Ruby MRI have probably heard of the infamous GIL. It is basically a mutex which prevents some interpreters from running threads in parallel, though still concurrently, in order to support specific memory management methods and for calling unsafe native extensions safely. Attempts have been made to get rid of Global Interpreter Lock (GIL) using our dear friend - Transactional Memory. For instance, a fork of pypy called pypy-stm has been created - https://doc.pypy.org/en/latest/stm.html - but seemed to be abandoned and is no longer maintained.

In addition, several languages such as Clojure and Haskell support Software Transactional Memory out-of-the-box as part of the core language/runtime, offering an alternative approach to both standard locking and actor model concurrency. There's also an official proposal and implementation by core Ruby committer Koichi Sasada to add STM to Ruby's parallelism feature named "Ractors" - https://bugs.ruby-lang.org/issues/17261.

Although STM officially being an integral part of both Clojure and Haskell, overall adoption of STM has been lacking to say the least. Attempts to get rid of the GIL by using STM has not come into fruition yet, as well.

If you'd like to read some reflections on that topic, please take a look at https://dl.acm.org/doi/abs/10.1145/3359619.3359747.

STM is often compared to Garbage Collection. Both operate on memory in runtime. Whilst the first manages the state of memory, the latest manages references to memory. Please do keep in mind that garbage collection used to be doubted and back-slashed a lot when first introduced to the world, and has suffered significant performance costs. It has been vastly improved over the years, to say the least, and has become a major, integral part of many programming languages. The community's hope in regard to this matter is to put our knowledge of garbage collection and how we have improved it, to use with transactional memory - making it much more mature, performant, and production-ready.

Originally posted on my personal website

Introduction to CRIU and Live Migration

Tal Hoffman — Thu, 21 Jan 2021 09:07:01 +0000

In this blog post I will try and explain what CRIU is and how it works, what Live Migration is, and how those two are related.

So what the heck is CRIU?

Checkpoint-Restore in Userspace (or CRIU) is a really (really) cool open-source project started by virtualization software company — Virtuozoo — also known for being the creator of
OpenVZ.

What CRIU does is letting you freeze running Linux processes and checkpoint their state to the disk as a collection of files. Those files can later be used to restore a process right from the point it’d been freezed, multiple times, on
any other CRIU-supported Linux machine!

Among its many usage scenarios, CRIU can be used for slow-boot service speed up, remote debugging, snapshots, process duplication, and for what is our main topic today — live migrations.

Take a look at
https://criu.org/Usage_scenarios for more

CRIU is now integrated as part of Podman, Docker (experimental), OpenVZ, LXC/LXD, and can also be used independently.

The mechanics (in a nutshell)

Let’s take a little dive into the internals of how this magic happens.

Checkpoint

The first step when checkpointing a process is walking recursively through its tree, and freezing it so it will not change its state while CRIU
needs to dump it.

CRIU supports two different methods for freezing the state of the process and its sub tasks.

By default, CRIU makes use of ptrace to stop the process. For those not familiar with ptrace:

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

See https://man7.org/linux/man-pages/man2/ptrace.2.html.

In this method, CRIU first lists and goes through the relevant /proc/$pid entries. Thread ids are collected through /proc/$pid/task, whereas sub-processes are recursively collected through reading /proc/$pid/task/$tid/children files.

Each task it encounters — parent process itself, sub-processes, and threads — is being attached to CRIU’s tracer process by dispatching a PTRACE_SEIZE request, after which a PTRACE_INTERRUPT request is also dispatched in order to
stop that task.

The second method for freezing the process tree is using Linux’s Cgroup Freezer — available through CRIU’s --freeze-cgroup flag. Cgroup Freezer is a subsystem supported by the Linux Kernel which lets us start and stop a set of
tasks (i.e, processes and threads), by defining a control group.

Here is a simple example of usage:

# Create a cgroup freezer directory
mkdir /sys/fs/cgroup/freezer

# Mount directory against a cgroup filesystem of type 'freezer'  
mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer

# Create a child cgroup directory
mkdir /sys/fs/cgroup/freezer/whatever

# Put a task into this cgroup
echo $some_pid > /sys/fs/cgroup/freezer/whatever/tasks

# Freeze cgroup
echo FROZEN > /sys/fs/cgroup/freezer/whatever/freezer.state

# cat /sys/fs/cgroup/freezer/whatever/freezer.state
FREEZING

# cat /sys/fs/cgroup/freezer/whatever/freezer.state
FROZEN

# Thaw (unfreeze) it
echo THAWED > /sys/fs/cgroup/freezer/whatever/freezer.state

Read more at
https://man7.org/linux/man-pages/man7/cgroups.7.html
and
https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt.

Now that the process’s tasks tree is all frozen, CRIU needs to collect all relevant tasks’ resources and write them to dump files, which should later be used for restore.

The first set of resources being dumped are collected simply by reading procfs.
These resources are VMAs (Virtual Memory Areas), memory-mapped files, and opened file descriptors. In addition, registers and other core task parameters are collected using ptrace and parsing /proc/$pid/stat.

Afterwards, CRIU injects a parasite code into each task’s address space whose job is to collect some more information such as credentials and actual memory
contents.

More information about how the parasite code injection is done can be found
here.

The final step in the checkpointing process is cleaning up CRIU’s parasite code, restoring original code, detaching ptrace, and then resuming tasks from where they’ve been stopped (actually this is optional and can be turned-off by specifying -R|--leave-running).

Restore

The restore process is pretty straightforward. During restore CRIU gradually morphs itself into the target process.

First off, the restorer process reads all dumped image files and finds out which processes share which resources. Next, it re-creates all processes in the tree by calling fork(). Note that threads are not restored here but rather on
the last stage.

The PID Dance

Each forked process is supposed to be assigned with its original pid. But how
exactly?

Well, in order to do so, CRIU utilizes a feature introduced in Kernel v3.3 used by the kernel for keeping track of the last pid it has assigned.

It is accessible through the sysctl file /proc/sys/kernel/ns_last_pidand is basically an incrementing counter of process ids. It requires CONFIG_CHECKPOINT_RESTORE to be set and it is enabled by default in most of
Linux distributions.

So in order to fork a process with a desired pid, say 3214, CRIU does the following:

1. Opens /proc/sys/kernel/ns_last_pid
2. Locks file
3. Sets ns_last_pid's new value to pid-1 (3214 - 1)
4. Closes /proc/sys/kernel/ns_last_pid
5. Clones process so that the child process is supposed to have pid 3124
6. Calls getpid() inside the child process to validate desired pid
7. Unlocks file
8. Voilà

Very simple and yet cool!

Bear in mind that if any pid already exists when trying to restore, then the
restore will fail. The solution is to restore the process inside a different pid namespace (and mount namespace — see
https://criu.org/CR_in_namespace).

There are a few caveats to this approach, the main ones being it too slow due to multiple syscalls required for each such clone, and that it is open to race conditions.

As explained in
https://lisas.de/~adrian/criu-and-the-pid-dance-article.pdf:

It can always happen that between setting the desired PID via ns_last_pid and the actual clone() another process, independent of the restore, is created, which means that getpid() will not return the desired PID and CRIU will abort.

Enter clone3() and set_tid.

Using Kernel v5.3’s clone3() (note there’s no matching glibc wrapper yet), and set_tid array available from v5.5, we can now explicitly select specific process ids for a cloned process, in some or all of the PID namespaces where it is present, directly when we call clone3(). This essentially eliminates the race conditions and saves us from multiple syscalls requirement of the ns_last_pid method. It is currently supported by CRIU.

Now that all of the required processes were created, CRIU will carry on and restore resources such as opening file descriptors, preparing namespaces, opening anonymous shared mappings, opening file mappings, opening & pre-mapping
private memory areas, opening sockets, and more…

For the final step, CRIU will switch to the restorer context — cleaning up its own memory mappings — and restore all other resources left: threads, timers (so they will fire as late as possible), credentials & security settings (so they won’t limit us during the restore process), private memory areas re-mappings (using mremap), file mappings (using mmap), and anonymous shared mappings (also using mmap).

From now on, the process is restored and will continue to run from where it was originally checkpointed.

If you’d like to dive even deeper inside the internals, have a look at
https://criu.org/Category:Under_the_hood
:)

Live Migration

Live migration is the process of moving a running Virtual Machine or Application between two different nodes while keeping clients connected. Memory, relevant storage and network connectivity should all be transfered.

CRIU is a perfect match for these kind of tasks, and is actually used in production by some big companies. For instance, Google uses CRIU for live migrating containers inside its Borg clusters.

Lets use CRIU ourselves to demonstrate the migration of a simple loop script from our local machine to a virtual machine.

Our script looks like this (test.sh):

#!/bin/sh
while :; do
    sleep 1
    date
done

In order to checkpoint the script’s process we run:

sudo criu dump -t <pid> --images-dir ~/demo/images --shell-job && echo OK

--images-dir indicates where to dump the image files
--shell-job tells CRIU that our process was spawned from a shell

We then scp (i.e, transfer) both the script (so CRIU is able to restore its fd) and the
dumped images to our VM, and finally restore the process using:

 sudo criu restore -D <path-to-images-dir> -vvv --shell-job -d

-vvv for higher verbosity
--shell-job again to let CRIU known it was spawned from a shell
-d so that the restored process will run in background

And that’s it!

Some more cool CRIU tutorials can be found
here.

Originally posted on my personal blog: https://www.talhoffman.com/introduction-to-criu-and-live-migration