Introduction
In the last post, we
looked at ways in which replicating data allowed readers to proceed in a
lock-free manner. In this post, we'll extend this trick further, and show how to
make readers wait-free.
Like last time, we'll make a number of simplifying assumptions throughout:
counters never overflow, all variables are 0-initialized (unless otherwise
specified), readers and writers don't throw exceptions, and writers are
externally synchronized, so we may assume only one writer at a time. These
limitations are trivial to eliminate. We'll also leave all atomic memory
operations with the default sequentially consistent memory order; this is
always correct, but can be inefficient. A warning: I've thought hard about the
included code, but I haven't tested it. As always, check and test before using.
The goal
To make it clear what we're after, we seek an implementation of the following
interface:
#include <functional>

template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn);
  void write(std::function<void (T&)> writer_fn);
};
The read() method must always complete in a finite number of steps, regardless
of any writer activity. The writer might start modifying the protected data,
get interrupted by the operating system, and never be rescheduled, and readers
should still be able to complete their reads. This should be enough to convince
us that maintaining multiple copies of the data structure is necessary:
otherwise, readers would have no copy to read from while the writer is blocked
(they can't read from the copy the writer is modifying without seeing
inconsistent state). The writer will therefore need some mechanism for
directing readers to one copy or another of the protected data.
A solution
Inspiration
The interface above looks similar to that of data protected by a reader-writer
lock, so let's start there. Below is a simple, reader-preference, busy-waiting
reader-writer lock. The low-order bit is used to indicate the presence of a
writer, and all other bits are a count of the number of readers blocking the
writer from proceeding.
#include <atomic>

class ReaderWriterLock {
 public:
  void lock() {
    bool swap_succeeded;
    do {
      int expected = 0;
      swap_succeeded =
          lock_word_.compare_exchange_weak(expected, kWriterPresent);
    } while (!swap_succeeded);
  }

  void unlock() {
    lock_word_.fetch_sub(kWriterPresent);
  }

  void lock_shared() {
    lock_word_.fetch_add(kReaderIncrement);
    while (lock_word_.load() & kWriterPresent) {
      // Busy loop.
    }
  }

  void unlock_shared() {
    lock_word_.fetch_sub(kReaderIncrement);
  }

 private:
  static const int kWriterPresent = 1 << 0;
  static const int kReaderIncrement = 1 << 1;

  std::atomic<int> lock_word_;
};
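To make the semantics concrete, here's a minimal usage sketch (the guarded
integer and the thread functions are illustrative, not part of the lock):

ReaderWriterLock rw_lock;
int guarded_value;  // Illustrative data guarded by rw_lock.

void reader() {
  rw_lock.lock_shared();
  int copy = guarded_value;  // Safe: no write can begin while we hold this.
  rw_lock.unlock_shared();
  // ... use copy ...
}

void writer(int new_value) {
  rw_lock.lock();  // Spins until all readers have left.
  guarded_value = new_value;
  rw_lock.unlock();
}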
We'll take two ideas from this lock:
- Maintaining a count of active readers, which the writer waits to drop to zero.
- Letting the writer advertise its presence to readers, who then know that reading is unsafe.
We'll combine them with our realizations from the last section: we need multiple
copies of the data structure, which the writer will move readers between. Since
we'll have multiple copies, we'll need to apply the writer's modifications to
each of the copies.
An incorrect approach
Here is a first pass at a solution using reader counts:
template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    int version = version_for_readers_.load();
    reader_count_[version].fetch_add(1);
    reader_fn(data_[version]);
    reader_count_[version].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    for (int i = 0; i < 2; ++i) {
      // Move readers away from version i.
      version_for_readers_.store(1 - i);
      // Wait for them to leave.
      while (reader_count_[i].load() > 0) {
        // Busy wait.
      }
      // Do the write.
      writer_fn(data_[i]);
    }
  }

 private:
  std::atomic<int> version_for_readers_;
  std::atomic<int> reader_count_[2];
  T data_[2];
};
This approach doesn't work because of a race between the reader and the writer.
To see how this race manifests, suppose there is only one reader. It reads
version_for_readers_
to be 0. Then the writer appears, stores 1 into
version_for_readers_
, and waits for reader_count_[0]
to become 0. Since
the reader has not yet incremented reader_count_[0]
, the writer may exit its
loop and begin its write to data_[0]
. The reader then increments
reader_count_[0]
and proceeds to read from data_[0]
. We now have a writer
and a reader concurrently accessing data_[0]
, violating our interface.
Fixing the race
To address the problem from the last section, we need to be more precise about
the meaning of an increment to reader_count_[i]
. It is a mechanism for a
reader to prevent the writer from beginning a new modification to data_[i]
.
This has two consequences:
- If the reader increments reader_count_[i] and subsequently observes the writer to be absent from version i, then no write may proceed on version i until the reader undoes its increment; reading is safe between these points.
- If the reader has incremented both counters, a writer is prevented from switching versions.
This suggests the following strategy for readers: read version_for_readers_
,
and block writers from beginning a new modification to the indicated version.
Then, check whether a writer may already be present in that version (i.e.,
check whether the race from the previous section occurred). If not, great:
reading is safe on that version until the reader unblocks the writer from
beginning a modification to it. If the writer may be present, then the reader
should try reading from the other version; it blocks updates to that one as
well (so at this point, the writer is prevented from starting a modification to
either version). Then, the reader can recheck for writer presence, safe in the
knowledge that whichever version is safe to read from will remain safe until it
unblocks the writer from that version. The reader can then unblock the writer
from whichever version it is not reading from.
Here's how this looks in code:
template <typename T>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    // We'll fill this in later.
    int version_to_read = -1;

    int version_1 = version_for_readers_.load();
    reader_count_[version_1].fetch_add(1);
    int version_2 = version_for_readers_.load();
    if (version_1 == version_2) {
      // We stopped new modifications to version version_1, and subsequently
      // observed that the writer wants us to use version version_1,
      // indicating that no write is happening there. So, no write is
      // happening to version version_1, and no write will start there until
      // we decrement reader_count_[version_1].
      version_to_read = version_1;
    } else {
      // The version changed out from under us; a writer might have already
      // begun modifying version version_1 before we blocked modifications
      // to it. Block modifications to version_2, and check if it's safe.
      reader_count_[version_2].fetch_add(1);
      // At this point, no new modifications may begin on either version.
      int version_3 = version_for_readers_.load();
      // Version version_3 must be safe; the writer told us to use it, so
      // it wasn't modifying it at the time of the previous statement.
      // Moreover, modifications to either version can't have started since
      // then. So no modification began on version version_3 after the
      // previous statement (when reading was safe), and none will start
      // until we unblock writes to version version_3. We can read it
      // safely.
      version_to_read = version_3;
      // Unblock the *other* version.
      reader_count_[1 - version_3].fetch_sub(1);
    }
    // We've ensured reading is safe at this point.
    reader_fn(data_[version_to_read]);
    // Unblock the writer from the version we read.
    reader_count_[version_to_read].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    for (int i = 0; i < 2; ++i) {
      version_for_readers_.store(1 - i);
      while (reader_count_[i].load() > 0) {
        // Busy wait.
      }
      writer_fn(data_[i]);
    }
  }

 private:
  std::atomic<int> version_for_readers_;
  std::atomic<int> reader_count_[2];
  T data_[2];
};
Exercise
Why shouldn't a reader always block writes to both versions before
checking version_for_readers_
, and then unblock the one not in use?
Exercise
Give a similar strategy that detects writer presence by using the
lock_word_
representation of the reader-writer lock above, and eliminating
version_for_readers_
.
Exercise
Give a similar strategy that allows readers to block writers by
writing to a per-reader data structure, which the writer may observe. Note: as a
consequence, we can see that no atomic compare-and-swap or fetch-and-add or
other RMW primitives are necessary to implement the interface and progress
guarantees we sought. This is a rather interesting fact in and of itself.
Increasing writer performance
Lazy updates
The careful reader might have noticed that one round of busy looping in the
write
method above could have been avoided: since the writer waits for all the
readers to leave the second instance before modifying it, on the next call
to write
, there are no readers reading the second instance. So, if the writer
changed its iteration order to always start on version
1 - version_for_readers_.load()
, the busy wait loop will complete on the first
iteration.
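Here's a sketch of write() with this reordering (untested, like the rest of the
code in this post):

// Sketch: visit the version readers are *not* currently using first; the
// previous call to write() already drained readers from it, so the first
// busy wait should finish immediately.
void write(std::function<void (T&)> writer_fn) {
  int first = 1 - version_for_readers_.load();
  for (int k = 0; k < 2; ++k) {
    int i = (k == 0) ? first : 1 - first;
    version_for_readers_.store(1 - i);
    while (reader_count_[i].load() > 0) {
      // Busy wait (expected to be a no-op when i == first).
    }
    writer_fn(data_[i]);
  }
}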
We can go a step further, however. Once we've performed the write on one
instance of the data structure, and indicated to readers that they should begin
using it, there's no need for the writer to wait for the readers to leave the
other copy of the data structure, since no new readers are going to use it.
Instead, we can store the std::function that performs the modification of the
data structure, and leave the work of actually performing the modification to
the next writer (note that the passed-in function can't, e.g., hold references
to data on the writing thread's stack unless it is sure to outlive the
ReaderWriterData). The next time a write occurs, the writer will
apply both modifications to the old copy of the data structure, which
hopefully all the readers will have already left. If the time between write
calls is longer than the time it takes to perform a read, then the writer will
never have to wait for readers to leave a version of the data structure.
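Here's a sketch of lazy updates for the two-version class; pending_fn_ is a new
private member I'm introducing (empty initially) that holds the modification
not yet applied to the copy readers were directed away from:

std::function<void (T&)> pending_fn_;

void write(std::function<void (T&)> writer_fn) {
  int target = 1 - version_for_readers_.load();
  // Readers were directed away from `target` by the *previous* write, so
  // they have usually drained by the time this write begins.
  while (reader_count_[target].load() > 0) {
    // Busy wait.
  }
  if (pending_fn_) {
    pending_fn_(data_[target]);  // First catch this copy up with the last write.
  }
  writer_fn(data_[target]);
  version_for_readers_.store(target);  // Publish the freshly updated copy.
  pending_fn_ = writer_fn;             // Defer updating the other copy.
}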
Increased data replication
Even using lazy updates, it is still possible for a writer to have to wait for
readers to leave. This can be particularly problematic if, for instance, the
number of readers is higher than the number of CPUs allocated to the process. In
that case, there will always be some reader which is descheduled, probably for a
time on the order of milliseconds. Since operations on most data structures are
much faster (on the order of nanoseconds or microseconds), this can lead to
significant writer slowdowns and busy waiting.
To avoid this, we can increase the number of copies of the data structure we
keep. If slow readers are stuck on version 0 of the data structure and version 1
is the current one, then a writer may proceed on version 2 without interfering
with readers or writers. This doubles the amount of time we allow for read
operations to take before they begin to block writers. This parameter is
tunable; we can decrease the odds of the writer blocking, at the cost of
increasing storage consumption and the number of total data structure
modifications per logical write.
An implementation
The combination of the previous two performance improvements introduces some
subtlety: when we have more than two copies of the data structure, which one do
we update in response to a new write?
It's clear that we can't modify the most recently updated version (that's the
one new readers will read from). We could proceed in a round-robin manner, but
then the writer might busy-loop waiting for readers to leave one version even
while other versions are empty of readers. We will adopt a middle ground:
picking the oldest version of the data structure that does not have any readers
reading from it. This will require us to explicitly keep track of the number of
modifications each version of the data structure has undergone, and to store as
many of the std::functions passed in to write as are necessary to bring the
oldest version up to date.
Similarly, with more than 2 copies of the data structure, we can't infer the
version being written using only the version readers should read from. Put
another way, version_for_readers_ in the previous section really served two
purposes: it told readers which version to read, and it implied which version
might be being written. In the 2-version case, the version being written was
trivially determinable from the version readers were reading, so we could
handle both purposes with one variable. In the multiply-replicated case, we'll
have to split this variable in two: current_version_ will be the most
up-to-date version of the data structure, and tells readers where they should
read from, while version_being_written_ will indicate where a modification
might be underway.
version_being_written_
functions as sort of an inversion of the
"hazard-pointer" technique, if you're familiar with such strategies; it's a way
for writers to indicate dangers to readers.
Here is a version of ReaderWriterData
with the optimizations described:
#include <algorithm>
#include <array>
#include <atomic>
#include <deque>
#include <functional>

template <typename T, int num_versions>
class ReaderWriterData {
 public:
  void read(std::function<void (const T&)> reader_fn) {
    int version_to_read = -1;
    int version_1 = current_version_.load();
    reader_count_[version_1].fetch_add(1);
    int written_1 = version_being_written_.load();
    if (version_1 != written_1) {
      // Great; we blocked new modifications to version version_1, and then
      // observed that no in-flight modifications were already happening. We
      // can proceed on it.
      version_to_read = version_1;
    } else {
      // The writer moved on, and started a modification on version
      // version_1. That means current_version_ must have changed. Try again
      // on that.
      int version_2 = current_version_.load();
      reader_count_[version_2].fetch_add(1);
      // Now, new modifications are blocked to versions version_1 and
      // version_2. At least one of them is not being modified, and so is
      // safe to read from.
      int written_2 = version_being_written_.load();
      if (version_2 == written_2) {
        // Have to use version version_1.
        reader_count_[version_2].fetch_sub(1);
        version_to_read = version_1;
      } else {
        // Can use version version_2.
        reader_count_[version_1].fetch_sub(1);
        version_to_read = version_2;
      }
    }
    reader_fn(data_[version_to_read]);
    reader_count_[version_to_read].fetch_sub(1);
  }

  void write(std::function<void (T&)> writer_fn) {
    in_flight_modifications_.push_back(writer_fn);
    int version_to_write = -1;
    do {
      version_to_write = oldest_version_with_no_readers();
      version_being_written_.store(version_to_write);
    } while (reader_count_[version_to_write].load() > 0);
    // When we exit the loop, we'll have warned any possible future readers
    // that they should avoid version version_to_write, and then observed
    // that no readers are present in that version. Therefore, any future
    // readers who try to enter it will observe our write to
    // version_being_written_.
    // Now, apply all necessary pending modifications to version
    // version_to_write.
    int index_of_first_unapplied_modification =
        modification_count_[version_to_write] - modification_num_of_front_;
    for (int i = index_of_first_unapplied_modification;
         i < static_cast<int>(in_flight_modifications_.size());
         ++i) {
      in_flight_modifications_[i](data_[version_to_write]);
      ++modification_count_[version_to_write];
    }
    // Update readers to use the new version.
    version_being_written_.store(-1);
    current_version_.store(version_to_write);
    // Clear out any modifications that have been applied to every version.
    int min_modification_count = *std::min_element(
        modification_count_, modification_count_ + num_versions);
    while (modification_num_of_front_ < min_modification_count) {
      in_flight_modifications_.pop_front();
      ++modification_num_of_front_;
    }
  }

 private:
  std::array<int, num_versions> versions_ordered_by_age() {
    std::array<int, num_versions> result;
    for (int i = 0; i < num_versions; ++i) {
      result[i] = i;
    }
    std::sort(result.begin(), result.end(), [&](int i1, int i2) {
      return modification_count_[i1] < modification_count_[i2];
    });
    return result;
  }

  int oldest_version_with_no_readers() {
    std::array<int, num_versions> versions = versions_ordered_by_age();
    while (true) {
      for (int i = 0; i < num_versions; ++i) {
        if (versions[i] == current_version_.load()) {
          // Don't try to update the current version!
          continue;
        }
        if (reader_count_[versions[i]].load() == 0) {
          return versions[i];
        }
      }
    }
  }

  // The version readers should try to read from.
  std::atomic<int> current_version_;
  // The version that a writer is currently modifying.
  // For simplicity, assume this is initially -1.
  std::atomic<int> version_being_written_;
  // All modifications that some version has not yet had applied to it.
  std::deque<std::function<void (T&)>> in_flight_modifications_;
  // The modification number of the front of in_flight_modifications_.
  int modification_num_of_front_;
  // The modification counts of each version.
  int modification_count_[num_versions];
  // The number of readers reading each version.
  std::atomic<int> reader_count_[num_versions];
  // The versions themselves.
  T data_[num_versions];
};
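For concreteness, here's a hypothetical harness exercising the class (not from
the post; it relies on the post's assumptions that everything is 0-initialized
and that version_being_written_ starts at -1):

#include <cassert>
#include <thread>
#include <vector>

int main() {
  ReaderWriterData<int, 4> data;  // num_versions = 4 is arbitrary.
  // A single writer thread, per the external-synchronization assumption.
  std::thread writer([&] {
    for (int i = 0; i < 1000; ++i) {
      data.write([](int& value) { ++value; });
    }
  });
  std::vector<std::thread> readers;
  for (int r = 0; r < 3; ++r) {
    readers.emplace_back([&] {
      for (int i = 0; i < 1000; ++i) {
        data.read([](const int& value) {
          // Every read sees some consistently published value.
          assert(value >= 0 && value <= 1000);
        });
      }
    });
  }
  writer.join();
  for (auto& t : readers) t.join();
}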
Exercise
In the last section, there was an exercise to use a per-reader data
structure instead of atomic fetch-and-adds. Generalize the solution to use the
optimizations described, using no more than O(num_readers * num_versions)
atomic integer variables in the helper data structures.
Exercise
Reduce the space usage from the previous exercise to be only
O(num_readers + num_versions)
.
Going to the limit - getting a wait-free writer
With the optimizations above, we can reduce wait times for writers
significantly. This leads us to wonder -- can we reduce them all the way to 0,
and have a writer that can't be blocked by readers? Perhaps surprisingly, the
answer is yes, so long as we can bound the number of readers.
Consider what happens when we set num_versions
to 2 * num_readers + 2
.
The writer is wait-free so long as there is some unblocked version other than
the current version. Each reader blocks at most 2 versions at a time, so at
most 2 * num_readers versions are blocked at any moment. There are therefore at
most 2 * num_readers + 1 versions the writer cannot proceed on, and so there is
always at least one version it can proceed on. All loops in the writer will
therefore terminate in at most 1 iteration, and the writer is wait-free.
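In code, the bound might be baked into the type like so (a sketch; kMaxReaders
is an application-supplied bound I'm assuming):

// With at most kMaxReaders concurrent readers, 2 * kMaxReaders + 2 versions
// make write() wait-free.
constexpr int kMaxReaders = 8;  // Assumed application-specific bound.

template <typename T>
using WaitFreeReaderWriterData = ReaderWriterData<T, 2 * kMaxReaders + 2>;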
Exercise
Come up with a scheme that requires only num_readers + 2
copies of a
data structure to provide wait-freedom.
Hint: extend the strategy, developed in the previous exercises, of per-reader
state. Introduce a reader state corresponding to "about to perform a read".
Readers set their state to "about to perform a read", and then compare and swap
their state to "reading on version X
". Writers try to compare and swap all
reader states from "about to perform a read" to "reading on version Y
" before
writing to a version other than Y
.
Possible extensions
There are a number of extensions that could be applied to the strategies
discussed in this post. Two that stand out as being of particular practical
value are adding a try-write method, and adapting these techniques to use
futexes or a similar mechanism.
Try-write is straightforward: an attempt to write is made, but the writer does
not proceed if it might block. Instead of looping in the write method, the
writer tries only once; if it fails, it returns a status indicating that the
write did not occur.
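Built on the lazy-update variant sketched earlier, a try_write method (the name
is mine) might look like this:

// Sketch: fail fast instead of busy waiting if readers are still present.
bool try_write(std::function<void (T&)> writer_fn) {
  int target = 1 - version_for_readers_.load();
  if (reader_count_[target].load() > 0) {
    return false;  // A reader could block us; report that no write occurred.
  }
  if (pending_fn_) {
    pending_fn_(data_[target]);
  }
  writer_fn(data_[target]);
  version_for_readers_.store(target);
  pending_fn_ = writer_fn;
  return true;
}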
A futex is a scheduling primitive used to efficiently implement mixed
user-level/kernel-level locking on Linux. Fast-path, contention-free lock
acquisitions happen entirely in user-space, only entering the kernel in the case
where a lock is contended, and then blocking, allowing other threads to use the
CPU time that would otherwise be spent spinning. This mechanism could be used to
avoid the spinning required by write
. The readers would keep a count of the readers blocking each version of the
data structure, waking up the writer when the count of the version it is
waiting on drops to zero.
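As a sketch of the flavor, C++20's std::atomic wait/notify (implemented with
futexes on Linux in the common standard libraries) could replace the spin in
the two-version write():

void write(std::function<void (T&)> writer_fn) {
  for (int i = 0; i < 2; ++i) {
    version_for_readers_.store(1 - i);
    int readers = reader_count_[i].load();
    while (readers > 0) {
      reader_count_[i].wait(readers);  // Sleep until the value changes.
      readers = reader_count_[i].load();
    }
    writer_fn(data_[i]);
  }
}

// Readers would pair this with a notify when they leave a version:
//   if (reader_count_[version].fetch_sub(1) == 1) {
//     reader_count_[version].notify_one();
//   }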
Notes
The reader-preference reader-writer lock is a well-known algorithm due to
Mellor-Crummey and Scott.
Many of the ideas here are well-known in the garbage collection and database
literature; in particular:
- Maintaining multiple copies of a data structure to allow readers to avoid waiting for writers
- Maintaining versions with those copies so that writers can proceed while readers are reading from an old version
- Keeping a log of modifications to the data, which gets applied in batch.
Instead of detailing all of these, I'll note only some of the literature that
focuses on strategies for in-memory data structures on shared-memory
multiprocessors with simple version reclamation semantics (i.e. which do not
require the use of garbage collection).
The first algorithm that allows for wait-free readers in the presence of a
writer goes back to Peterson, whose
scheme is also wait-free for the writer. It comes at the cost of maintaining
num_readers + 2
copies of the protected data, which he shows to be both
necessary and sufficient. His scheme is only applicable to arrays of data which
allow concurrent reads and writes (a la std::atomic<char>[N]
), and requires
the reader to execute up to 3 physical reads per logical read.
Chen and
Burns show how
to use a compare-and-swap primitive to implement a scheme that requires only 1
read of the data structure per logical read operation, and does not require a
writer to write to a copy of the data structure concurrently with a reader
reading from it. It also uses the theoretical minimum of num_readers + 2
copies of the data.
Ramalhete and
Correia
show how to reduce the number of copies of the data to only 2, at the cost of no
longer being wait-free for the writer. They provide several variants of their
technique, which they call the "left-right" algorithm. The two-copy strategy
presented here is a variant of this algorithm.
The fetch-and-add based concurrency-control strategy used here is, as far as I
know, novel, but it is reminiscent of the "Epoch-based reclamation" strategy
used by Fraser to
provide exclusion between readers and deleters of nodes in lock-free data
structures. Similarly, I believe the lazy-update scheme and the generalization
to multiply-replicated data are new, though lazy updates bear similarity
to Shalev and Shavit's Predictive
Log-Synchronization technique. In the same paper, they present a strategy that
allows concurrent reads by switching writers between two copies of a data
structure. Their solution, however, is not wait-free for readers.
Futexes were introduced to the Linux kernel by
Franke, Russell, and Kirkwood,
though they bear similarities to BeOS "benaphores".