carlosgaldino · home

Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System

Bayou [1] is a replicated storage system providing weakly consistent guarantees designed for mobile computing environments. The system and paper were published in 1995. At that time PDA's (Personal Digital Assistant) were common and played a big influence in Bayou's design1.

Bayou only requires occasional, pair-wise communication between computers. This is because the collaborating computers were not all connected simultaneously, disconnections happened with certain frequency and connection time was expensive. Even though nowadays the price of connection for mobile devices has dropped, allowing "full connectivity", the characteristics mentioned by the authors can't simply be left out when developing systems today.

Introduction

Based on this notion that devices will be offline for a good portion of time, Bayou doesn't have a "disconnected" mode. Instead, it considers that several "connectedness" levels exist. One well known problem in distributed systems are network partitions. Systems with stronger consistency guarantees might (or probably must) prohibit one side of the partition from making progress. That is not the case with Bayou, as the authors state:

Supporting disconnected workgroups is a central goal of the Bayou system.

The normal mode of operation in Bayou is pair-wise communication so when devices are partitioned there is no difference.

Since weak connectivity is taken into account, the system provides weakly consistent, replicated data. Clients can read and write to any replica without explicit coordination from other replicas. The changes are propagated through the replicas via pair-wise interactions meaning that eventually all servers will store the same data.

Applications must be aware that they might read inconsistent data and Bayou provides support for application-specific conflict detection and resolution. Clients can even read conflicting data while the conflict is not yet resolved, either because human intervention is required or because conflicting operations are still being spread through the system.

The authors cited several examples of applications that could be built using Bayou: shared calendars, mail and bibliographic databases, and others.

Bayou's Basic System Model

In the Bayou system, the data is replicated in full at a number of servers. The applications acting as clients interact with the Bayou system via an API. The API provide two operations: Read and Write. Read allows queries over a data collection and Write can insert, update, or delete a number of data items in a collection.

Only one server is required for clients to perform some work. Clients can read and write data in the server and don't need to wait the Write to propagate. Clients don't need to interact with a single server, and Bayou even provides session guarantees to reduce client-observed inconsistencies when interacting with multiple servers2.

The Write also carries information that lets the receiving server to detect if there is a conflict and how to resolve it.

The storage system can be thought as an ordered log of the Write operations and the data resulting from the execution of these operations. Each server performs the operations locally, the conflict detection and resolution are also performed as they are encountered during execution of the Write operations. As mentioned before, the data after all known Write ops are immediately available to any client.

When the pair-wise connections are made, the servers exchange information about the Write ops that each one knows about. The information includes which operations they are, and the order they should be executed. This contact between servers is called anti-entropy session.

The following figure illustrates the Bayou System Model:

                                        Server
                                   ┌────────────────┐
┌───────────────────┐              │  ┌───────────┐ │
│                   │              │  │  Storage  │ │
│    Application    │              │  │  System   │ │
│   ┌──────────┐    │              │  └───────────┘ │
├───┤Bayou API ├────┤     Read     │                │                             ┌────────────────┐
│   └──────────┘    │──────or─────▶│   Server State │                             │  ┌───────────┐ │
│   ┌───────────┐   │    Write     │ ┌─────────────┐│                             │  │  Storage  │ │
│   │Client Stub│   │              │ │█████████████││                             │  │  System   │ │
│   └───────────┘   │              │ │█████████████││                             │  └───────────┘ │
└───────────────────┘              │ └─────────────┘│                             │                │
       Client                      └────────────────┘             ┌──────────────▶│   Server State │
                                            ▲                     │               │ ┌─────────────┐│
                                            │                     ▼               │ │█████████████││
                                            │            ┌────────────────┐       │ │█████████████││
                                            │            │  ┌───────────┐ │       │ └─────────────┘│
                                            │            │  │  Storage  │ │       └────────────────┘
                                            │            │  │  System   │ │                ▲
                                      Anti-entropy       │  └───────────┘ │                │
                                            │            │                │   Servers      │
                                            └───────────▶│   Server State │                │
                                                         │ ┌─────────────┐│                ▼
                                                         │ │█████████████││       ┌────────────────┐
                                                         │ │█████████████││       │  ┌───────────┐ │
                                                         │ └─────────────┘│       │  │  Storage  │ │
                                                         └────────────────┘       │  │  System   │ │
                                                                  ▲               │  └───────────┘ │
                                                                  │               │                │
                                                                  └──────────────▶│   Server State │
                                                                                  │ ┌─────────────┐│
                                                                                  │ │█████████████││
                                                      ┌───────────────────┐       │ │█████████████││
                                                      │                   │       │ └─────────────┘│
                                                      │    Application    │       └────────────────┘
                                                      │   ┌──────────┐    │                ▲
                                                      ├───┤Bayou API ├────┤        Read    │
                                                      │   └──────────┘    │◀────────or─────┘
                                                      │   ┌───────────┐   │       Write
                                                      │   │Client Stub│   │
                                                      │   └───────────┘   │
                                                      └───────────────────┘
                                                             Client

Conflict Detection and Resolution

One of the goals of the system is to not restrict how conflicts are detected and resolved in a single way for all applications. What works for one application might not work for a different one. The intention is to support arbitrary applications, with different forms of conflict detection and resolution.

In Bayou, the application specifies its notion of conflicts and also the policy for their resolution, and the system implements the mechanisms for reliably detecting conflicts, and resolving them as the application specified. The mechanisms included in Bayou are: dependency checks and merge procedures.

Dependency checks

Application-specific conflict detection is done via the usage of dependency checks. Each Write operation includes a dependency check that consists of an application-supplied query and its expected result. A conflict is detected if when running the query the expected result is not returned at the server that the query was executed. If a conflict is detected the Write operation is not executed and the Bayou system will then invoke the mechanism for resolving the conflict.

You probably have heard about the bank transaction example where a transaction from A to B is being attempted. Let's say the amount to be transferred is $100. Now let's say that the before the application issues the Write operation, it has read that account A has $150. In a system with a more traditional optimistic concurrency control, it would check if account A still had $150 before applying the Write. If between issuing and applying the Write, a different process removed $50 from account A the $100 Write would be aborted. Note that account A would still have sufficient funds to complete the operation successfully, according to the application semantics. In Bayou because the application is responsible for defining if there is a conflict or not, the application could simply provide a dependency check that would require that account A had $100 before applying the write, even though that prior to issuing the Write the application has read that A had $150.

Merge procedures

When a conflict is detected, Bayou executes another procedure provided by the application to resolve the conflict in question. This procedure is called a merge procedure, and it is given by the application along with the Write operation and the dependency check. These merge procedures are written in a high-level, interpreted language. They can have embedded data, and can perform Read ops on the current replica's state. The merge procedure is expected to resolve any conflicts detected by its dependency check and provide a revised update to apply. The whole process of detecting a conflict, resolving it is done atomically at each server as part of executing the Write operation.

If a conflict cannot be resolved by the merge procedure, it is still expected to run to completion, and log that the detected conflict couldn't be resolved. This allows that a person could later resolve the conflict.

Bayou doesn't forbid other clients from issuing operations when a conflict is detected. The replicas can always remain accessible. A potential problem is that new operations might depend on data that is in conflict and this could lead to a cascade of conflicts.

Replica Consistency

As mentioned before, Bayou guarantees that all servers eventually receive all Write ops via the pair-wise anti-entropy process and that two servers holding the same set of Write ops will have the same data contents. To achieve eventual consistency Bayou has two features: Write ops are performed in the same, well-defined order at all servers, and the conflict detection and merge procedures are deterministic so that all servers resolve the conflicts in the same manner.

Write operations can have two types: tentative and committed. When Bayou accepts a Write from a client it characterizes the operation as tentative. These operations are ordered according to the timestamps of the server that first accepted them. Eventually, each Write will become committed, and they are ordered according to the time at which they commit and before any tentative Write.

Servers are not required to maintain synchronized clocks, although it is desired to keep them reasonably close so that the induced order is more or less the same order the user sees when submitting operations. The timestamps at each server must be monotonically increasing so that the pair <timestamp, Id of server that assigned it> produces a total order on the operations. To timestamp the Write operations the Bayou system maintains a logical clock which it is generally synchronized with its wall clock, but to preserve the causal ordering of Write ops, the server may need to advance its logical clock when receiving new Write ops during the anti-entropy process.

One question you might be asking now is if each server applies the Write operations as soon as they receive them, how the operations will then be eventually applied in the same order at each server?

The answer is that the servers are able to undo the effects of a previously executed tentative Write and re-execute it in a different order, and a single server is responsible for defining the total order for execution.

Write Stability and Commitment

Because Write ops may be executed multiple times, Bayou characterizes a Write as stable when it has been executed at that server for the last time. This happens when the set of operations before the Write in the log is fixed.

Bayou uses a commit procedure to mark the operations as stable. That is, a Write becomes stable when it is explicitly committed. As mentioned before, committed Write ops are placed ahead of any tentative Write operations in each server's log.

The commit system used in Bayou is called primary commit. This is, a single server is responsible for committing the operations. The information about which Write operations are committed is then spread during anti-entropy. Each data collection can have a different designated primary server.

Considering the context from 1995, presented in the beginning of the paper, where most of the time the devices were disconnected from each other, you can imagine that a commit protocol that requires the majority of servers to be connected at the same time to decide on committed operations would be unreasonable. The protocol used by Bayou allows maximizing the rate of committed Write ops when the primary is chosen to be the main point originating updates.

Although Bayou uses a protocol that has a primary server in charge of committing operations, its design doesn't impose restrictions when the primary is unavailable. The other servers still accept updates, allowing the system to continue functioning. The Bayou API provides means for inquiring about the stability of operations. Clients can ask if a given Write operation (using the operation unique identifier) is stable at the server. Note that the answer may vary depending on which server was contacted. The API also makes possible for clients to work only with stable data.

Another thing to keep in mind is that the order in which the operations are committed depend on the order the servers make contact with the primary. So, if a server stays disconnected for a long time with the primary it could have operations committed after operations (from different servers) with later timestamps. Operations for a single server will commit in timestamp order since they are ordered in the server that accepted them and exchanged in that order during anti-entropy.

Storage System Implementation Issues

The following figure illustrates how the storage system in Bayou looks like:

                                   ┌────────────────────┐
                                   │████████████████████│O
                                ┌──│████████████████████│
                                │  └────────────────────┘
       Timestamp Vectors        │            Write Log
     ┌────────────────────┐     └───────────▶┌───────┐◀─┐               Tuple Store (checkpoint)
     │████████████████████│C                 ├───────┤  │         ┌─────────────────────────────────┐
     │████████████████████│────────┐         ├───────┤  │         │   ┌───────┐                     │
     └────────────────────┘        │         ├───────┤Committed   │   │Table 1│  ┌───────────┐      │
     ┌────────────────────┐        │         ├───────┤  │         │   ├─┬─┬─┬─┤  │  Table 2  │      │
     │████████████████████│F       │         ├───────┤◀─┼──┐      │   ├─┼─┼─┼─┤  ├─┬─┬─┬─┬─┬─┤      │
     │████████████████████│─┐      └────────▶├───────┤◀─┘  │      │   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │
     └────────────────────┘ │  Undo Log      ├───────┤◀─┐  │      │   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │
                            │ ┌───────┐◀─────├───────┤  │  │      │   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │
                            │ ├───────┤      ├───────┤  │  │      │   ├─┼─┼─┼─┤  └─┴─┴─┴─┴─┴─┘      │
                            │ ├───────┤      ├───────┤  │  │      │   ├─┼─┼─┼─┤                     │
                            │ ├───────┤      ├───────┤  │  └─────▶│   ├─┼─┼─┼─┤                     │
                            │ ├───────┤      ├───────┤  │         │   └─┴─┴─┴─┘                     │
                            │ ├───────┤      ├───────┤Tentative   │     ┌───────────────────┐       │
                            │ ├───────┤      ├───────┤  │         │     │      Table 3      │       │
                            │ ├───────┤      ├───────┤  │         │     ├─┬─┬─┬─┬─┬─┬─┬─┬─┬─┤       │
                            │ ├───────┤      ├───────┤  │         │     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
                            │ ├───────┤      ├───────┤  │         │     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
                            │ ├───────┤      ├───────┤  │         │     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
                            │ └───────┘◀────┐├───────┤  │         │     └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘       │
                            └───────────────▶└───────┘◀─┘         └─────────────────────────────────┘
                                                 ▲
             Tuple Store                         │
┌─────────────────────────────────┐              │
│   ┌───────┐                     │              │
│   │Table 1│  ┌───────────┐      │              │
│   ├─┬─┬─┬─┤  │  Table 2  │      │              │
│   ├─┼─┼─┼─┤  ├─┬─┬─┬─┬─┬─┤      │              │
│   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │              │
│   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │              │
│   ├─┼─┼─┼─┤  ├─┼─┼─┼─┼─┼─┤      │              │
│   ├─┼─┼─┼─┤  └─┴─┴─┴─┴─┴─┘      │              │
│   ├─┼─┼─┼─┤                     │              │
│   ├─┼─┼─┼─┤                     │◀─────────────┘
│   └─┴─┴─┴─┘                     │
│     ┌───────────────────┐       │
│     │      Table 3      │       │
│     ├─┬─┬─┬─┬─┬─┬─┬─┬─┬─┤       │
│     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
│     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
│     ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤       │
│     └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘       │
└─────────────────────────────────┘

There are three main components: the Write Log, the Tuple Store, and the Undo Log.

The Write Log stores Write that have been received by the server. They are ordered by their global committed order or tentative order. Once a Write is committed the server may discard that operation from the Log since it will not be re-executed anymore, so the Write Log will actually contain a tail of committed operations and all known tentative operations following them.

To keep track of which Write operations the server has received and were already discarded, each server maintains a timestamp vector, in the figure represented as the "O vector" (O for "ommitted"). Each server stores the timestamps of the latest Write it has received (and discarded) from a given server as a way to prevent re-accepting the same operation in a future anti-entropy.

The Tuple Store is a database where the updates will act on and provide the necessary data for Read operations. It was implemented as an in-memory relational database, something that was considered a practical limitation by the authors3, but not something that is intrinsic to the overall Bayou design. As pointed out earlier, clients may ask to work only with stable data so the Tuple Store needs to maintain two views of its data: a full view and a committed view. This is done by having a 2-bit characteristic vector in each tuple identifying the set of views that contain the tuple in question.

The other two vectors shown in the figure above, vectors C and F, store the timestamps for committed and tentative operations, respectively. They are used to quickly identify the sets of Write operations that need to be exchanged during anti-entropy.

The Undo Log, as the name suggests, is used to store tentative Write that need to have their effects rolled back. This is the case when a tentative Write has been executed but later the server discovers new committed operations that need to be executed. The server needs to leave the Tuple Store state the same as when the newly received operation was inserted.

For crash recovery purposes, the full Write Log and a checkpoint of the Tuple Store are stored in stable storage. The checkpoint Tuple Store only reflects a prefix of Write operations and it contains the effects of any Write ops that have already been discarded. In the figure you see two Tuple Stores and two Write Logs because the current Tuple Store and the Write Log are stored in memory for performance reasons. The system also records in stable storage the unique id of the last Write that is reflected in the checkpoint Tuple Store. With this information the system can recover the state of the server prior to a crash by recovering the recorded Tuple Store and replaying a suffix of Write operations present in the Write Log.

Conclusion

This paper was an interesting read because many of the challenges that influenced Bayou's design still exist today, if not identical but similar. The idea of pushing the conflict detection and resolution out of the system and closer to the application can broaden the range of detected conflicts and also resolve them in a better way than simply rejecting them. The key is that the application can offer more context with its semantics and work together with the system. The servers always make progress locally which in turn gives high availability since all operations are accepted. At some point these servers need to agree on the total order of execution of these operations and the commit protocol in Bayou is responsible for providing the eventual consistency claimed by the system. Clients can even ask for different consistency levels via the session guarantees presented in [2].

Since many of the challenges still exist you might recognize the solutions being applied in newer systems. This is a good thing too keep in mind, learning from older systems can help and inspire new ideas and new systems for the future.

References

  • [1] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser, “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” 1995, pp. 1–12.
  • [2] D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M. Theimer, and B. B. Welch, “Session Guarantees for Weakly Consistent Replicated Data,” 1994, pp. 1–10.

Notes


  1. The authors mention the usage of laptops, and cell phones as well. 

  2. The guarantees are the following: read your writes, monotonic reads, writes follow reads, and monotonic writes. For more information about these session guarantees you can consult [2]. 

  3. Remember, this was 1995. 

Carlos Galdino
@carlosgaldino
github.com/carlosgaldino