FaRM: Fast Remote Memory

Jul 26, 2016

FaRM is a main memory distributed computing platform that provides distributed transactions with strict serializability, high performance, durability, and high availability.

To scale out, FaRM distributes objects across machines in a data center and also allows transactions to span any number of machines. To reduce CPU overhead it uses one-sided RDMA (Remote Direct Memory Access) operations.

Everything is stored in DRAM and it provides durability by attaching batteries to power supply units and writing the contents of DRAM to SSD when the power fails.

Programming model and architecture

FaRM provides the abstraction of a global shared address space that spans machines in a cluster. Each machine runs application threads and stores objects in the address space. The FaRM API provides transparent access to local and remote objects within transactions. Transactions can be started at any time and the coordinator will be whoever started it. The thread that started the transaction can execute arbitrary logic as well as read, write, allocate, and free objects. At the end of execution, the thread invokes FaRM to commit the transaction.

FaRM uses optimistic concurrency control. The operations are buffered locally and are only made visible to other transactions on successful commit.

FaRM provides strict serializability of all successfully committed transactions. Individual reads are atomic, reading only committed data, successive reads of the same object return the same data, and reading objects written by the transaction return the latest value written. Reads of different objects are not atomic but the transaction will not commit, thus the strict serializability property still holds.

Other features provided by the FaRM API is lock-free reads, which are optimized single-object read only transactions, and locality hints, enabling programmers to co-locate related objects on the same set of machines.

Each machine runs FaRM in a user process with a kernel thread pinned to each hardware thread. Each kernel thread runs an event loop that executes application code and polls the RDMA completion queues.

The configuration can change over time as machines fail or are added to the cluster. FaRM represent the configuration via a tuple, {i, S, F, CM}, where:

i: unique, monotonically increasing 64-bit configuration identifier;
S: set of machines in the configuration;
F: mapping from machines to failure domains that are expected to fail independently (e.g., different racks);
CM: configuration manager. CM ∈ S.

ZooKeeper is used as the coordination service to ensure that machines agree on the current configuration and store it. Lease management, failure detection and coordination recovery do not use ZooKeeper.

Regions of 2 GB are used as the global address space in FaRM and each region is replicated in f + 1 machines, where f is the desired fault tolerance. Objects are always read from the primary copy of the containing region, and it uses local memory access if the region is on the local machine, otherwise one-sided RDMA read is used.

The CM is contacted when a machine wants to allocate a new region. The CM assigns a region identifier from a monotonically increasing counter and will select replicas for the region. Replica selection takes into account the number of regions stored on each machine, the available storage, and each replica is placed in a different failure domain. The region might be co-located with a target region if the application specifies a locality constraint. A two-phase protocol is used to coordinate the allocation in all replicas. The CM sends a prepare message to all replicas and they report if they allocated the region successfully. If all reply with success the CM sends a commit message to all of them.

Each machine also stores ring buffers that implement FIFO queues. The queues are used either as transaction logs or message queues.

The figure below shows the FaRM architecture:

                   ┌──────────────────────────────────┐
                   │                                  │
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐     lease
                   │                Machine A      renewals
│                  ▼        CPU               │       │            ┌ ─ ─ ─ ─ ─ ─
           ┌───────────────┐                          │              Machine D  │
│       ┌──│  Application  │◀───────┐         │       └────────────│    (CM)
        │  ├───────────────┤        │                               ─ ─ ─ ─ ─ ─ ┘
│       │  │     FARM      │        │         │
     local └───────────────┘        │
│    reads       ▲   ▲              │         │
        │        │   └────┐         │
│       ▼        │        │  NVRAM  │         │
   ┌────────┬────────┬────────┬───────────┐
│  │ Region │ Tx log │ Tx log │ Msg queue │   │
   └────────┴────────┴────────┴───────────┘
│       ▲        ▲          ▲        ▲        │                         ┌─┐
        │        │          │        │                            ┌─────│ │─────┐
└ ─ ─ ─ ┼ ─ ─ ─ ─│─ ─ ─ ─ ─ ┼ ─ ─ ─ ─│─ ─ ─ ─ ┘                   │     └─┘     │
        │        │          │        │                           ┌─┐           ┌─┐
      remote  tx records    │      messages                      │ │           │ │
      reads      │          │        │                           └─┘           └─┘
        │        │      tx records   │                            │             │
        │        │          │        │                            │ ┌─┐     ┌─┐ │
        │        │          │        │                            └─│ │─────│ │─┘
        │        │          │        │                              └─┘     └─┘
      ┌ ─ ─ ─ ─ ─ ─       ┌ ─ ─ ─ ─ ─ ─                             Coordination
        Machine B  │        Machine C  │                              service
      └ ─ ─ ─ ─ ─ ─       └ ─ ─ ─ ─ ─ ─                             (ZooKeeper)

Distributed transactions and replication

To improve performance FaRM integrates the transaction and replication protocols. Both data and transaction logs use primary-backup replication in non-volatile DRAM, and it uses unreplicated transaction coordinators that communicate directly with primaries and backups. As mentioned before, FaRM uses optimistic concurrency control with read validation.

The timeline for a FaRM transaction is shown below:

During the execution phase, transactions use one-sided RDMA to read objects and they buffer the writes locally. Versions and addresses of all accessed objects are recorded by the coordinator. For primaries and backups on the same machine as the coordinator, object reads and writes to the log use local memory accesses rather than RDMA. At the end of the execution, FaRM attempts to commit the transaction by performing the following steps:

Lock: The coordinator writes a LOCK record to the log of each machine that is a primary for any written object. This record contains the versions and new values of all written objects on that primary, as well as the list of all regions with written objects. The primaries process the records by attempting to lock the objects at the specified versions using compare-and-swap, and send back a message reporting if all locks were successfully taken. If any object version changed since it was read by the transaction, or if the object is currently locked by another transaction then the locking fails and the coordinator aborts the transaction. In this case, the coordinator writes an abort record to all primaries and returns an error to the application.
Validate: All objects that were read but not written by the transaction are then validated by the coordinator by reading their versions from their primaries. The validation will fail and the transaction will abort if any object has changed.
Commit backups: The coordinator writes a COMMIT-BACKUP record to the non-volatile logs at each backup and then waits for an ack from the NIC (Network Interface Controller) hardware without interrupting the backup’s CPU. The COMMIT-BACKUP log record has the same payload as a LOCK record.
Commit primaries: The coordinator writes a COMMIT-PRIMARY record to the log at each primary after all COMMIT-BACKUP writes have been acked. The coordinator then reports completion to the application after receiving at least one hardware ack for a COMMIT-PRIMARY record, or if the coordinator wrote one locally. The primaries process these records by updating the objects, incrementing their versions, and unlocking them, which exposes the writes committed by the transaction.
Truncate: Both backups and primaries keep the records in their logs until they are truncated. Truncation is done lazily by the coordinator after it receives acks from all primaries. Backups apply the updates to their copies of the objects at truncation time.

Correctness

The serialization of committed read-write transactions occurs when the locks are acquired, for committed read-only transactions it occurs at the point of their last read. This is because the versions of all read and written objects at the serialization point are the same as the versions seen during execution.

Serializability in FaRM is also strict¹: the serialization point is always between the start of execution and the completion being reported to the application.

Even in case of failures FaRM ensures serializability. This is achieved by waiting acks for all backups before writing COMMIT-PRIMARY. Imagine that the coordinator does not receive an ack for a backup and writes COMMIT-PRIMARY. The application will receive a confirmation that the transaction was successful. If all replicas that stored the modification fail and the only backup surviving is the one that didn’t acked because it never received COMMIT-BACKUP, this would result in losing the modification that was once confirmed.

For COMMIT-PRIMARY a single ack is enough for reporting back success to the application. This is true because FaRM assumes up to f failures for a f + 1 shard. Then at least one machine will have a record attesting the commit. If not a single one COMMIT-PRIMARY was written and the primaries and backups failed, the LOCK records would survive but there wouldn’t be any record attesting that validation succeeded.

Failure recovery

Durability and high availability are provided by replicating the data. FaRM assumes that machines can fail by crashing but can recover (fail-recovery) without losing the contents of non-volatile DRAM.

FaRM provides durability for all committed transactions even if the entire cluster fails or loses power: the committed state is recovered from regions and logs stored in non-volatile DRAM. It ensures durability even if at most f replicas per object lose the contents of non-volatile DRAM. FaRM can maintain availability with failures and network partitions provided that a majority of machines remain connected to each other and to a majority of replicas in the ZooKeeper service, and the partition contains at least one replica of each object.

Failure detection

FaRM uses leases to detect failures. Every machine other than the CM holds a lease at the CM and the CM holds a lease at every other machine. If any of these leases expire, failure recovery is then triggered.

FaRM uses dedicated queue pairs for leases to avoid having lease messages delayed in a shared queue behind other message types. It also uses a dedicated lease manager thread that runs at the highest user-space priority. This thread is not pinned to any hardware thread and it uses interrupts instead of polling to avoid starving critical OS tasks that are executed periodically on every hardware thread.

Reconfiguration

Reconfiguration happens when a failure occurs but there is a protocol that is followed to move a FaRM instance to the next configuration.

FaRM implements something they called precise membership. This guarantees that machines will only talk to other machines in the same configuration. Messages from machines outside the configuration are ignored.

The reconfiguration steps are as follows:

Suspect: If a lease for a machine at the CM expires, the CM suspects that a failure occurred for that machine and reconfiguration is initiated. The CM then starts blocking all external client requests. If a non-CM machine suspects that the CM failed due to a lease expiry, it will first ask a small number of “backup CMs” to initiate reconfiguration. If the configuration is unchanged after a timeout period it attempts the reconfiguration itself.
Probe: An RDMA read is issued by the new CM to all machines in the configuration except the machine that is suspected. Any machine that the read fails is also suspected. To avoid having the new CM in a minority in case of a network partition, the new CM only proceeds with reconfiguration if it received responses from a majority of the probes.
Update configuration: After receiving replies to the probes, the new CM attempts to update the configuration data stored on ZooKeeper.
Remap regions: Regions that were previously mapped to failed machines are reassigned so that the number of replicas come back to f + 1. If a primary failed, a surviving backup is promoted to be the new primary. If the CM detects that regions lost all their replicas or if there is no space to re-replicate regions, it signals an error.
Send new configuration: A NEW-CONFIG message is sent to all machines in the new configuration after the CM has remapped all regions. NEW-CONFIG also resets the lease protocol if the CM has changed.
Apply new configuration: When a machine receives a NEW-CONFIG message with a configuration identifier that is greater than its own, it updates its current configuration identifier and its copy of the region mapping. It also allocates space to hold any new region replicas assigned to it. From this point on, the machine starts rejecting requests from machines that are not part of this new configuration and it does not issue new requests to those machines. Requests from external clients are also blocked. The machine will send a NEW-CONFIG-ACK message to CM as well.
Commit new configuration: After receiving NEW-CONFIG-ACK messages from all machines in the configuration, the CM waits to ensure that any leases granted in previous configurations to machines not present in the new configuration have expired. Then it sends a NEW-CONFIG-COMMIT message to all configuration members. All members now unblock external requests and initiate transaction recovery.

Transaction state recovery

After reconfiguration, FaRM recovers transaction state using the logs distributed across the replicas of objects modified by a transaction. This involves recovering the state both at the replicas of objects modified by the transaction and at the coordinator to determine the outcome of the transaction. This recovery is done by the following steps:

Block access to recovering regions: When the primary of a region fails, the backup is promoted to be the new primary. Access to this region is blocked until all transactions that updated it have been reflected at the new primary. Local access and RDMA references for the region in question are blocked until step 4 which is when the write locks have been acquired by all recovering transactions that updated the region.
Drain logs: When trying to commit a transaction, the coordinators reply successfully to the application after receiving acks from NICs for COMMIT-BACKUP and COMMIT-PRIMARY messages. This just guarantees that the messages have been written to the logs but they might not have been processed yet. During transaction state recovery, the machines then drain the logs to ensure that all relevant records are processed during recovery. All records in the logs are processed in order after receiving a NEW-CONFIG-COMMIT message. At the end of it a LastDrained variable that stores the configuration identifier is updated. This LastDrained variable is used to reject log records for transactions with configuration identifiers that are less than or equal to LastDrained.
Find recovering transactions: A recovering transaction is one whose commit phase spans configuration changes, and for which some replica of a written object, some primary of a read object, or the coordinator has changed due to reconfiguration. Only recovering transactions² go through transaction recovery at primaries and backups. Records for a recovering may be distributed over the logs of different primaries and backups updated by the transaction. Each backup of a region sends a NEED-RECOVERY message to the primary with the configuration identifier, the region identifier, and the identifiers of recovering transactions that updated the region.
Lock recovery: The primary of each region waits until the local machine logs have been drained and NEED-RECOVERY messages have been received from each backup, to build the complete set of recovering transactions that affect the region. The transactions are then sharded by identifier across its threads such that each thread t recovers the state of transactions with coordinator thread identifier t. In parallel, the threads from the primary fetch any transaction log records from backups that are not already stored locally and then lock any objects modified by the recovering transactions. The region is active when lock recovery is complete. Then local and remote coordinators can obtain local pointers and RDMA references, allowing them to read and commit updates to this region in parallel with subsequent recovery steps.
Replicate log records: The threads in the primary replicates log records by sending backups the REPLICATE-TX-STATE message for any transactions that they are missing.
Vote: The coordinator for a recovering transaction decides whether to commit or abort the transaction based on votes from each region updated by the transaction. The votes are sent by the primaries of each region.The threads in the primary send RECOVERY-VOTE messages to their peer threads in the coordinator for each recovering transaction that modified the region. The rules for voting are the following: * If any replica saw COMMIT-PRIMARY or COMMIT-RECOVERY then vote commit-primary. * If any replica saw COMMIT-BACKUP and did not see ABORT-RECOVERY then vote commit-backup. * If any replica saw LOCK and no ABORT-RECOVERY then vote lock. * Otherwise vote abort. If the primaries do not have any log records for the transaction they vote truncated if the transaction has already been truncated or unknown if it has not.
Decide: The coordinator decides whether or not the transaction should commit. It decides to commit if it receives a commit-primary vote from any region. Otherwise, it waits for all regions to vote and commits if at least one region voted commit-backup and all other regions voted lock, commit-backup, or truncated. Otherwise it decides to abort. Then the coordinator sends COMMIT-RECOVERY or ABORT-RECOVERY to all participant replicas. A TRUNCATE-RECOVERY message is sent once the coordinator receives acks from all primaries and backups.

Recovering data

To ensure that it can tolerate f replica failures in the future for a given region, FaRM must re-replicate data at new backups. Each machine sends a REGIONS-ACTIVE message to the CM when all regions for which it is primary become active. The CM after receiving all REGIONS-ACTIVE messages sends a ALL-REGIONS-ACTIVE message to all machines in the configuration. At this point, FaRM begins data recovery for new backups in parallel with foreground operations.

A new backup for a region initially has a freshly allocated and zeroed local region replica. The regions are divided across worker threads that recover them in parallel. Each thread issue one-sided RDMA operations to read a block at a time from the primary. Before being copied to the backup the recovered object is examined to see if a modification (by a new transaction) has been made while the backup was receiving the object.

Recovering allocator state

The FaRM allocator splits regions into blocks of 1 MB that are used as slabs for allocating small objects. Two pieces of meta-data re kept: block headers, which contain the object size, and slab free lists. The block headers are replicated to backups when a new block is allocated. Even if the primary fails they will be available at the new primary (which will be one of the backups). To avoid inconsistencies when the old primary fails while replicating the block header, the new primary sends all block headers to all backups immediately after receiving NEW-CONFIG-COMMIT.

The slab free lists are kept only at the primary but each object has a bit in its header that is set by an allocation and cleared by a free during transaction execution. After a failure, the free lists are recovered in the new primary by scanning the objects in the region.

Conclusion

In the paper’s introduction the authors say that prior attempts to implement the abstraction of “a single machine that never fails and that executes one transaction at a time in an order consistent with real time” in a distributed system resulted in poor performance. They then cite several systems that sacrificed some characteristic; like strong consistency guarantees, transactions, etc; to have better performance.

That is what led them to build FaRM. The paper is a demonstration that new systems don’t need to compromise those characteristics in order to have more performance. The ideas presented in the paper reflect this desire: reducing the number of messages exchanged, using one-sided RDMA to avoid using the CPU, storing the data in non-volatile DRAM, etc.

The performance results presented in the paper are interesting. FaRM achieves a peak throughput of 140 million TATP transactions per second on 90 machines, it recovers from a failure under 50 milliseconds, and it achieved 4.5 million TPC-C “new order” transactions per second. You can consult the paper for more numbers.

References

Aleksandar Dragojević, Dushyanth Narayanan, Edmund B Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles, October 04-07, 2015, Monterey, California.

Notes

Linearizability + Serializability = Strict Serializability ↩︎
The paper can be consulted for looking at how exactly it is decided whether a transaction is a recovering transaction or not. ↩︎