Lock-free reads through data replication

Introduction

Seqlocks and reader-writer locks are two techniques used to control access to shared, read-mostly data in multithreaded programs. Briefly, seqlocks work by counting modifications to a data structure so that readers can see if a modification occurred during their read (so that they know their read was hazardous, and they should retry). Reader-writer locks work by allowing exclusive access to a data structure to either a single writer, or many readers.

These approaches share a common problem: a writer that stalls (due to a page fault, descheduling, or a long-running computation, for instance) inside a critical section can block the readers indefinitely. We'd like to eliminate this restriction; to allow readers to continue even while a writer is blocked, even if that writer is blocked inside a critical section. That is to say, readers should be lock-free with respect to writers. We seek a solution that allows readers to continue even if a writer is blocked inside a critical section.

To accomplish this, we'll keep two copies of our data structure. The writer needs to update both copies of the data, one after the other, in order to logically complete a write. If the writer blocks, it will be blocked inside an update to only one of the copies. The readers can then proceed by reading from the other copy, giving us the lock-freedom property we sought.

For simplicity, we'll consider only the case of a single writer thread (there can be many reader threads). This limitation can be easily removed by adding an additional mutex to any data structures we consider, which a writer thread must acquire before and release after invoking the functions we describe. I've thought hard about the code here but haven't tested it, so be cautious before using it.

Background

Reader-writer locks

Reader-writer locks provide safe access to shared data by changing the control flow of threads attempting to access the data. When a writer owns the lock, it owns it uniquely, and readers block if they attempt to access it. Alternatively, some number of readers may share ownership of the lock; they are then allowed to read but not modify the data structure, and any writers block while readers are present.

Reader-writer locks are standardized in C++14 via the std::shared_timed_mutex class. Below is a class that encapsulates the protected data and the protection mechanism. Readers and the writer pass in a callback function that performs the desired operations. For simplicity, we'll ignore things like exception safety.

template <typename T>
class ReaderWriterLock {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    mu_.lock_shared();
    reader_fn(data_);
    mu_.unlock_shared();
  }

  void write(std::function<void (T&)> writer_fn) {
    mu_.lock();
    writer_fn(data_);
    mu_.unlock();
  }

 private:
  std::shared_timed_mutex mu_;
  T data_;
};

Seqlocks

RW locks often perform poorly in practice. Readers must update lock metadata in order to indicate their presence to a potential future writer. In typical implementations, this involves obtaining exclusive write access to the cacheline containing the lock, which leads to a cache miss for subsequent readers. Even with fast modern memory systems, cache misses can take longer than a short critical section, and, what's worse, can be sequentialized in the memory system. Updates to the lock itself then become a bottleneck. To avoid this, we'd like a concurrency primitive in which readers can be truly read-only. The seqlock is one such primitive.

Seqlocks (short for sequence locks, for reasons which will soon become clear) allow readers to access data concurrently with writers. Since readers proceed without blocking writers, the readers might see an inconsistent state for the protected data (imagine a reader that reads half the protected data before the writer modifies it, then half after). The seqlock provides a mechanism to detect such a conflict, indicating that the reader should retry.

Associated with a seqlock is a sequence number. The sequence number starts at 0, and is incremented before and after each write. This implies that at any time between writes (and therefore, when the data is in a consistent state), the sequence number is even, since the number of iniitial increments will equal the number of concluding ones. Readers repeatedly read the sequence number, then the data structure, then the sequence number again. If the sequence numbers are even (indicating that they weren't read during a writer critical section) and equal (indicating that no writer critical section started and finished in between the two reads), then the reader can infer that the writer did not interfere with its read. Otherwise, the reader should retry.

The code below encapsulates some seqlock-protected data. Because reads and writes may occur on the object simultaneously, we have to represent the stored data using std::atomic variables; in turn, this requires that we can treat the protected data as trivially copyable. For simplicity, we won't enforce this. Similarly, we'll ignore overflow of sequence_number_, use the default memory order for all operations (giving sequential consistency), and ignore the inevitable reads from an uninitialized data_ array.

template <typename T>
class SeqLocked {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    T value;
    int seq0;
    int seq1;
    do {
      seq0 = sequence_number_.load();
      raw_read_into(value);
      seq1 = sequence_number_.load();
    } while (seq0 != seq1 || seq0 % 2 == 1);

    reader_fn(value);
  }

  void write(std::function<void (T&)> writer_fn) {
    // Remember: we are assuming there is only one writer;
    // this read is safe no matter what.
    T value;
    raw_read_into(value);
    writer_fn(value);
    sequence_number_.fetch_add(1);
    raw_write_from(value);
    sequence_number_.fetch_add(1);
  }

 private:
  void raw_read_into(T& output) {
    for (int i = 0; i < sizeof(T); ++i) {
      *((char*)&output + i) = data_[i].load();
    }
  }

  void raw_write_from(const T& input) {
    for (int i = 0; i < sizeof(T); ++i) {
      data_[i].store(*((char*)&input + i));
    }
  }

  std::atomic<int> sequence_number_;
  std::atomic<char> data_[sizeof(T)];
};

The problem: blocked writers

Suppose the writer is modifying some of the data protected by a seqlock. It increments the sequence number, begins writing to the data, but is preempted by the operating system before it completes. Then, all readers are blocked until the writer is rescheduled and completes the write. The readers, therefore, are not lock-free with respect to the writer. An analogous situation can occur using a reader-writer lock.

A solution: replicate the data

The trick we'll use to solve this problem is old, but not widely known. We'll keep two copies of the data being protected. If a writer is blocked midway through a write to one copy of the data, the reader can still continue its read through the other copy.

More specifically, a reader will attempt to complete its read on one copy of the data. If it succeeds, great; the reader is done. If it fails, it will try to read from the other copy of the data. This process repeats until a read attempt succeeds. The writer updates one copy of the data structure, and then the other.

Here is how this technique looks for the reader-writer lock (again, we'll ignore trivialities like counter overflow and exceptions):

template <typename T>
class ReplicatedReaderWriterLock {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    int version_to_use = 0;
    for (;; version_to_use++) {
      if (mu_[version_to_use % 2].try_lock_shared()) {
        break;
      }
    }
    reader_fn(data_[version_to_use % 2]);
    mu_[version_to_use % 2].unlock_shared();
  }

  void write(std::function<void (T&)> writer_fn) {
    for (int i = 0; i < 2; ++i) {
      mu_[i].lock();
      writer_fn(data_[i]);
      mu_[i].unlock();
    }
  }

 private:
  std::shared_timed_mutex mu_[2];
  T data_[2];
};

Note: the reader-writer variant we present is not technically lock-free in our setting; a try_lock_shared call may fail spuriously and return false even if the lock is available in C++14. This is a technical detail and can be remedied, but we won't do so here; such a problem showing up in practice is vanishingly unlikely.

Decreasing reader failure rates through further replication

The technique in the previous section is enough to provide a form of lock-freedom, but readers may still be starved for access if writers continue much more quickly than them (a writer modifies both copies of a seqlock'd data instance before a reader is able to read one, for instance). Here, we'll consider an extension that gives us reduced chances of writer interference, at the cost of additional copies of a data structure.

The key idea is to increase the amount of replication beyond two copies of a data object. In order for the writer to block readers, it must update all the objects before the reader reads any of them. Increasing the amount of replication increases the likelihood of readers succeeding.

Here's how this looks for the seqlock:

template <typename T, int num_replicas>
struct ReplicatedSeqLocked {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    T value;
    int seq0;
    int seq1;
    do {
      int version = current_version_.load();
      seq0 = data_[version].sequence_number_.load();
      data_[version].raw_read_into(value);
      seq1 = data_[version].sequence_number_.load();
    } while (seq0 != seq1 || seq0 % 2 == 1);

    reader_fn(value);
  }

  void write(std::function<void (T&)> writer_fn) {
    int version = current_version_.load();
    int new_version = (version + 1) % num_replicas;
    T value;
    data_[version].raw_read_into(value);
    writer_fn(value);
    data_[new_version].sequence_number_.fetch_add(1);
    data_[new_version].raw_write_from(value);
    data_[new_version].sequence_number_.fetch_add(1);
    current_version_.store(new_version);
  }

 private:
  struct SeqLockData {
    void raw_read_into(T& output) {
      for (int i = 0; i < sizeof(T); ++i) {
        *((char*)&output + i) = data_[i].load();
      }
    }

    void raw_write_from(const T& input) {
      for (int i = 0; i < sizeof(T); ++i) {
        data_[i].store(*((char*)&input + i));
      }
    }

    std::atomic<int> sequence_number_;
    std::atomic<char> data_[sizeof(T)];
  };

  SeqLockData data_[num_replicas];
  std::atomic<int> current_version_;
};

Some concluding notes on our implementation

There are a few differences between the replicated seqlock and replicated reader-writer lock.

  • The seqlock version updates the entire data structure on every modification. The reader-writer one can make small in-place updates. This is a consequence of the API we chose for our implementations, so it could potentially be eliminated, but this would complicate the writer interface (the readers and writers would have to deal with atomic-wrapped data directly, and ask the protecting data structure whether its reads were safe).

  • The converse to this is that the replicated seqlock need only modify one version of the data, whereas a replicated reader-writer lock must modify all copies of the data.

  • An N-replicated reader-writer lock enables another optimization: a thread can hash its thread-id or CPU number to get an index to read from: thread_id % N. This spreads around the contention on lock-metadata, reducing the performance overhead of the lock. In the limit, we have a data instance per CPU, and can avoid all non-local writes in the reader path.

Additionally, note that we could have used a single sequence counter in the ReplicatedSeqLock, shared between all data instances. Readers would check it to see that the range of modified instances does not include the version that it read from. This results in a modest space savings, at the cost of losing some simplicity of implementation.

Notes

Reader-writer locks were introduced by Courtois et al., and are a well-known synchronization tool.

Seqlocks were popularized by their introduction into the Linux kernel by Stephen Hemminger (building on the work of Andrea Arcangeli), but were introduced by Leslie Lamport, or arguably even earlier, in the work of William B. Easton. This fact is not widely known.

The idea of using multiple copies of a data structure to improve the blocking properties of readers goes back to Peterson, who introduced a mechanism which is wait-free for the writer as well as readers. These ideas were explored most fully by Chen and Burns. Larsson et al. provide a comprehensive survey. Lamport uses a clever variant of a similar idea as well.

The specific cases of replicating data for reader-writer locks and seqlocks is a well-known trick, but I can't find a cite for the first use. It is used, for instance, in the Linux kernel, and explored by Ramalhete and Correia in what they call "double-instance locking".

Using more than two copies of a data structure to improve reader seqlock throughput when writers are common is a trick from Dmitry Vyukov.

social