Introduction

In the last post, we looked at ways in which replicating data allowed readers to proceed in a lock-free manner. In this post, we'll extend this trick further, and show how to make readers wait-free.

Like last time, we'll make a number of simplifying assumptions throughout: counters never overflow, all variables are 0-initialized (unless otherwise specified), readers and writers don't throw exceptions, and writers are externally synchronized, so we may assume only one writer at a time. These limitations are trivial to eliminate. We'll also leave all atomic memory operations with the default sequentially consistent memory order. This is always correct, but can be inefficient. A warning: I've thought hard about but not tested the included code. As always, check and test before using.

The goal

To make it clear what we're after, we seek an implementation of the following interface:

template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn);
  void write(std::function<void (T&)> writer_fn);
};

In which the read() method always completes in a finite number of steps, regardless of any writer activity. The writer might start modifying the protected data, get interrupted by the operating system and never rescheduled, and readers should still be able to complete their reads. This should be enough to convince us that maintaining multiple copies of the data structure is necessary: otherwise, readers would have no copy to read from while the writer is blocked (they can't read from the copy the writer is modifying without seeing inconsistent state). The writer will have to have some mechanism of directing readers to one copy or another of the protected data.

A solution

Inspiration

The interface above looks similar to that of data protected by a reader-writer lock, so let's start there. Below is a simple, reader-preference, busy-waiting reader-writer lock. The low-order bit is used to indicate the presence of a writer, and all other bits are a count of the number of readers blocking the writer from proceeding.

class ReaderWriterLock {
 public:
  void lock() {
    bool swap_succeeded;
    do {
      int expected = 0;
      swap_succeeded = lock_word_.compare_exchange_weak(expected, kWriterPresent);
    } while (swap_succeeded == false);
  }

  void unlock() {
    lock_word_.fetch_sub(kWriterPresent);
  }

  void lock_shared() {
    lock_word_.fetch_add(kReaderIncrement);
    while (lock_word_.load() & kWriterPresent) {
      // Busy loop
    }
  }

  void unlock_shared() {
    lock_word_.fetch_sub(kReaderIncrement);
  }

 private:
  static const int kWriterPresent = 1 << 0;
  static const int kReaderIncrement = 1 << 1;
  std::atomic<int> lock_word_;
};

We'll take two ideas from this lock:

Maintaining a count of active readers, which the writer waits to drop to zero.
Letting the writer advertise its presence to readers, who then know that reading is unsafe.

We'll combine them with our realizations from the last section: we need multiple copies of the data structure, which the writer will move readers between. Since we'll have multiple copies, we'll need to apply the writer's modifications to each of the copies.

An incorrect approach

Here is a first pass at a solution using reader counts:

template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    int version = version_for_readers_.load();
    reader_count_[version].fetch_add(1);
    reader_fn(data_[version]);
    reader_count_[version].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    for (int i = 0; i < 2; ++i) {
      // Move readers away from version i
      version_for_readers_.store(1 - i);
      // Wait for them to leave
      while (reader_count_[i].load() > 0) {
        // Busy wait
      }
      // Do the write
      writer_fn(data_[i]);
    }
  }

 private:
  std::atomic<int> version_for_readers_;
  std::atomic<int> reader_count_[2];
  T data_[2];
};

This approach doesn't work, because of a race between the reader and writer. To see how this race manifests, suppose there is only 1 reader. It reads version_for_readers_ to be 0. Then the writer appears, stores 1 into version_for_readers_, and waits for reader_count_[0] to become 0. Since the reader has not yet incremented reader_count_[0], the writer may exit its loop and begin its write to data_[0]. The reader then increments reader_count_[0] and proceeds to read from data_[0]. We now have a writer and a reader concurrently accessing data_[0], violating our interface.

Fixing the race

To address the problem from the last section, we need to be more precise about the meaning of an increment to reader_count_[i]. It is a mechanism for a reader to prevent the writer from beginning a new modification to data_[i]. This has two consequences:

If the reader increments reader_count_[i] and subsequently observes the writer to be absent from version i, then no write may proceed on version i until the reader undoes its increment; reading is safe between these points.
If the reader has incremented both counters, a writer is prevented from switching versions.

This suggests the following strategy for readers: read version_for_readers_, and block writers from beginning a new modification to the indicated version. Then, check to see if a writer may already be present in version (i.e. check to see if the race from the previous section occurred). If not; great: reading is safe on the version until the reader unblocks the writer from beginning a modification to it. If the writer may be present, then the reader should try reading from the other version; it blocks updates to that one (so at this point, the writer is prevented from starting a modification to either version). Then, the reader can recheck for writer presence, safe in the knowledge that whichever version is safe to read from will remain safe until it unblocks the writer from it. Whichever version the reader is not reading from, it can then unblock the writer from.

Here's how this looks in code:

template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    // We'll fill this in later.
    int version_to_read = -1;

    int version_1 = version_for_readers_.load();
    reader_count_[version_1].fetch_add(1);
    int version_2 = version_for_readers_.load();
    if (version_1 == version_2) {
      // We stopped new modifications to version version_1, and subsequently
      // observed that the writer wants us to use version version_1,
      // indicating that no write is happening there. So, no write is
      // happening to version version_1, and no write will start there until
      // we decrement reader_count_[version_1].
      version_to_read = version_1;
    } else {
      // The version changed out from under us; a writer might have already
      // begun modifying version version_1 before we blocked modifications
      // to it. Block modifications to version_2, and check if it's safe.
      reader_count_[version_2].fetch_add(1);
      // At this point, no new modifications may begin on either version.
      int version_3 = version_for_readers_.load();
      // Version version_3 must be safe; the writer told us to use it, so
      // the wasn't modifying it at the time of the previous statement.
      // Moreover, modifications to either version aren't allowed to be
      // started. So no modification started on version version_3 between
      // the previous statement (when reading was safe), and will not start
      // until we unblock writes to version version_3. We can read it
      // safely.
      version_to_read = version_3;
      // Unblock the *other* version
      reader_count_[1 - version_3].fetch_sub(1);
    }

    // We've ensured reading is safe at this point.
    reader_fn(data_[version_to_read]);

    // Unblock the writer from the version we read.
    reader_count_[version_to_read].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    for (int i = 0; i < 2; ++i) {
      version_for_readers_.store(1 - i);
      while (reader_count_[i].load() > 0) {
        // Busy wait
      }
      writer_fn(data_[i]);
    }
  }

 private:
  std::atomic<int> version_for_readers_;
  std::atomic<int> reader_count_[2];
  T data_[2];
};

Exercise

Why shouldn't a reader always block writes to both versions before checking version_for_readers_, and then unblock the one not in use?

Exercise

Give a similar strategy that detects writer presence by using the lock_word_ representation of the reader-writer lock above, and eliminating version_for_readers_.

Exercise

Give a similar strategy that allows readers to block writers by writing to a per-reader data structure, which the writer may observe. Note: as a consequence, we can see that no atomic compare-and-swap or fetch-and-add or other RMW primitives are necessary to implement the interface and progress guarantees we sought. This is a rather interesting fact in and of itself.

Increasing writer-performance

Lazy updates

The careful reader might have noticed that one round of busy looping in the write method above could have been avoided: since the writer waits for all the readers to leave the second instance before modifying it, on the next call to write, there are no readers reading the second instance. So, if the writer changed its iteration order to always start on version 1 - version_for_readers_.load(), the busy wait loop will complete on the first iteration.

We can go a step further, however. Once we've performed the write on one instance of the data structure, and indicated to readers that they should begin using it, there's no need for the writer to wait for the readers to leave the other copy of the data structure, since no new readers are going to use it. Instead, we can store the std::function that perform the modification of the data structure, and leave the work of actually performing the modification for the next writer (note then that the passed in function can't e.g. hold references to data on the writing thread's stack unless it is sure it will outlive the ReaderWriterData). The next time a write occurs, the writer will apply both modifications to the old copy of the data structure, which hopefully all the readers will have already left. If the time between write calls is longer than the time it takes to perform a read, then the writer will never have to wait for readers to leave a version of the data structure.

Increased data replication

Even using lazy updates, it is still possible for a writer to have to wait for readers to leave. This can be particularly problematic if, for instance, the number of readers is higher than the number of CPUs allocated to the process. In that case, there will always be some reader which is descheduled, probably for a time on the order of milliseconds. Since operations on most data structures are much faster (on the order of nano- or micro- seconds), this can lead to significant writer slowdowns and busy waiting.

To avoid this, we can increase the number of copies of the data structure we keep. If slow readers are stuck on version 0 of the data structure and version 1 is the current one, then a writer may proceed on version 2 without interfering with readers or writers. This doubles the amount of time we allow for read operations to take before they begin to block writers. This parameter is tunable; we can decrease the odds of the writer blocking, at the cost of increasing storage consumption and the number of total data structure modifications per logical write.

An implementation

The combination of the previous two performance improvements causes some amount of subtlety; when we have more than two copies of the data structure, which one do we update in response to a new write?

It's clear that we can't modify the most recently updated version (that's the one new readers will read from). We could proceed in a round-robin manner, but that means that the writer might busy-loop waiting for readers to leave one version, even if other versions are empty of readers. We will adopt a middle-ground: picking the oldest version of the data structure that does not have any readers reading from it. This will require us to explicitly keep track of the number of modifications each version of the data structure has undergone, and to store as many of the std::functions passed in to write as is necessary to bring the oldest version up to date.

Similarly, with more than 2 copies of the data structure, we can't infer the version being written using only the version readers should read. Put another way, version_for_readers_ in the previous section really served two purposes:

Letting readers know where to read from
Warning readers where a write might be happening.

In the 2-version case, the version being written was trivially determinable from the version readers were reading, so we could handle this with only one variable. In the multiply-replicated case, we'll have to split this variable in two: current_version_ will be the most up-to-date version of the data structure, and tells readers where they should read from, while version_being_written_ will indicate where a modification might be underway. version_being_written_ functions as sort of an inversion of the "hazard-pointer" technique, if you're familiar with such strategies; it's a way for writers to indicate dangers to readers.

Here is a version of ReaderWriterData with the optimizations described:

template <typename T, int num_versions>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    int version_to_read = -1;

    int version_1 = current_version_.load();
    reader_count_[version_1].fetch_add(1);
    int written_1 = version_being_written_.load();
    if (version_1 != written_1) {
      // Great; we blocked new modifications to version version_1, and then
      // observed that no in-flight modifications were already happening. We
      // can proceed on it.
      version_to_read = version_1;
    } else {
      // The writer moved on, and started a modification on version
      // version_1. That means current_version_ must have changed. Try again
      // on that.
      int version_2 = current_version_.load();
      reader_count_[version_2].fetch_add(1);
      // Now, new modifications are blocked to versions version_1 and
      // version_2. At least one of them is not being modified, and so is
      // safe to read from.
      int written_2 = version_being_written_.load();
      if (version_2 == written_2) {
        // Have to use version version_1.
        reader_count_[version_2].fetch_sub(1);
        version_to_read = version_1;
      } else {
        // Can use version version_2.
        reader_count_[version_1].fetch_sub(1);
        version_to_read = version_2;
      }
    }

    reader_fn(data_[version_to_read]);

    reader_count_[version_to_read].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    in_flight_modifications_.push_back(writer_fn);

    int version_to_write = -1;
    do {
      version_to_write = oldest_version_with_no_readers();
      version_being_written_.store(version_to_write);
    } while (reader_count_[version_to_write].load() > 0);

    // When we exit the loop, we'll have warned any possible future readers
    // that they should avoid version version_to_write, and then observed
    // that no readers are present in that version. Therefore, any future
    // readers who try to enter it will observe our write to
    // version_being_written_.

    // Now, apply all necessary pending modifications to version
    // version_to_write.
    int index_of_first_unapplied_modification =
        modification_count_[version_to_write] - modification_num_of_front_;
    for (int i = index_of_first_unapplied_modification;
         i < in_flight_modifications_.size();
         ++i) {
      in_flight_modifications_[i](data_[version_to_write]);
      ++modification_count_[version_to_write];
    }

    // Update readers to use the new version.
    version_being_written_.store(-1);
    current_version_.store(version_to_write);

    // Clear out any modifications that have been applied to every version.
    int min_modification_count = std::min_element(modification_count_,
                                                  modification_count_
                                                      + num_versions);
    while (modification_num_of_front_ < min_modification_count) {
      in_flight_modifications_.pop_front();
      ++modification_num_of_front_;
    }
  }

 private:
  std::array<int, num_versions> versions_ordered_by_age() {
    std::array<int, num_versions> result;
    for (int i = 0; i < num_versions; ++i) {
      result[i] = i;
    }
    std::sort(result.begin(), result.end(); [&](int i1, int i2) {
      return modification_count[i1] < modification_count_[i2];
    });
    return result;
  }

  int oldest_version_with_no_readers() {
    std::array<int, num_versions> versions = versions_ordered_by_age();
    while (true) {
      for (int i = 0; i < num_versions; ++i) {
        if (versions[i] == current_version_.load()) {
          // Don't want try to update the current version!
          continue;
        }
        if (reader_count_[versions[i]].load() == 0) {
          return versions[i];
        }
      }
    }
  }

  // The version readers should try to read from
  std::atomic<int> current_version_;

  // The version that a writer is currently modifying.
  // For simplicity, assume this is initially -1.
  std::atomic<int> version_being_written_;

  // All modifications that some version has not yet had applied to it
  std::deque<std::function<void (T&)>> in_flight_modifications_;

  // The modification number of the front of in_flight_modifications_
  int modification_num_of_front_;

  // The modification counts of each version
  int modification_count_[num_versions];

  // The number of readers reading each version
  std::atomic<int> reader_count_[num_versions];

  // The versions themselves.
  T data_[num_versions];
};

Exercise

In the last section, there was an exercise to use a per-reader data structure instead of atomic fetch-and-adds. Generalize the solution to use the optimizations described, using no more than O(num_readers * num_versions) atomic integer variables in the helper data structures.

Exercise

Reduce the space usage from the previous exercise to be only O(num_readers + num_versions).

Going to the limit - getting a wait-free writer

With the optimizations above, we can reduce wait times for writers significantly. This leads us to wonder -- can we reduce them all the way to 0, and have a writer that can't be blocked by readers? Perhaps surprisingly, the answer is yes, so long as we can bound the number of readers.

Consider what happens when we set num_versions to 2 * num_readers + 2. The writer is wait free so long as there is some unblocked version other than the current version. Each reader blocks at most 2 versions at a time, so there are at most 2 * num_readers versions blocked at a time. There are therefore at most 2 * num_readers + 1 versions the writer cannot proceed on, and so there is at least one version that the writer can proceed on. All loops in the writer will therefore terminate in at most 1 iteration, and the writer is wait free.

Exercise

Come up with a scheme that requires only num_readers + 2 copies of a data structure to provide wait-freedom.

Hint: extend the strategy, developed in the previous exercises, of per-reader state. Introduce a reader state corresponding to "about to perform a read". Readers set their state to "about to perform a read", and then compare and swap their state to "reading on version X". Writers try to compare and swap all reader states from "about to perform a read" to "reading on version Y" before writing to a version other than Y.

Possible extensions

There are a number of extensions that could be applied to the strategies discussed in this post. Two that stand out as being of particular practical performance are adding a try-write method, and adopting these techniques to use futexes or a similar technique.

Try-write is straightforward: an attempt to write is made, but the writer does not proceed if it might block. This is a straightforward: instead of looping in the write method, the writer only tries to write once. If it fails, it returns a status indicating that the write did not occur.

A futex is a scheduling primitive used to efficiently implement mixed user-level/kernel-level locking on Linux. Fast-path, contention-free lock acquisitions happen entirely in user-space, only entering the kernel in the case where a lock is contended, and then blocking, allowing other threads to use the CPU time that would otherwise be spent spinning. This mechanism could be used to avoid the spinning required by write. The readers would keep a count of the number of blocked versions of the data structure, waking up the writer when the count becomes nonzero.

Notes

The reader-preference reader-writer lock is a well-known algorithm due to Mellor-Crummey and Scott.

Many of the ideas here a well-known in the garbage collection and database literature; in particular:

Maintaining multiple copies of a data structure to allow readers to avoid waiting for writers
Maintaining versions with those copies so that writers can proceed while readers are reading from an old version
Keeping a log of modifications to the data, which gets applied in batch.

Instead of detailing all of these, I'll note only some of the literature that focuses on strategies for in-memory data structures on shared-memory multiprocessors with simple version reclamation semantics (i.e. which do not require the use of garbage collection).

The first algorithm that allows for wait-free readers in the presence of a writer goes back to Peterson, whose scheme is also wait-free for the writer. It comes at the cost of maintaining num_readers + 2 copies of the protected data, which he shows to be both necessary and sufficient. His scheme is only applicable to arrays of data which allow concurrent reads and writes (a la std::atomic<char>[N]), and requires the reader to execute up to 3 physical reads per logical read.

Chen and Burns show how to use a compare-and-swap primitive to implement a scheme that requires only 1 read of the data structure per logical read operation, and does not require concurrent access to a data structure by both a reader and a writer. writer to write to a copy of the data structure concurrently with a reader reading from it. It also uses the theoretical minimum of num_readers + 2 copies of the data.

Ramalhete and Correia show how to reduce the number of copies of the data to only 2, at the cost of no longer being wait-free for the writer. They provide several variants of their technique, which they call the "left-right" algorithm. The two-copy strategy presented here is a variant of this algorithm.

The fetch-and-add based concurrency-control strategy used here is, as far as I know, novel, but it is reminiscent of the "Epoch-based reclamation" strategy used by Fraser to provide exclusion between readers and deleters of nodes in lock-free data structures. Similarly, I believe the lazy-update scheme and the generalization to multiply-replicated data are new, though lazy-updates bears similarity to Shalev and Shavit's Predictive Log-Synchronization technique. In the same paper, they present a strategy that allows concurrent reads by switching writers between two copies of a data structure. Their solution, however, is not wait-free for readers.

Futexes were introduced to the linux kernel by Franke, Russell, and Kirkwood, though they bear similarities to BeOS "benaphores".

David Goldblatt's website

Going from lock-free to wait-free

Introduction

The goal

A solution

Inspiration

An incorrect approach

Fixing the race

Exercise

Exercise

Exercise

Increasing writer-performance

Lazy updates

Increased data replication

An implementation

Exercise

Exercise

Going to the limit - getting a wait-free writer

Exercise

Possible extensions

Notes

Introduction

The goal

A solution

Inspiration

An incorrect approach

Fixing the race

Exercise

Exercise

Exercise

Increasing writer-performance

Lazy updates

Increased data replication

An implementation

Exercise

Exercise

Going to the limit - getting a wait-free writer

Exercise

Possible extensions

Notes

social