paxos.pl
that implements a replicating key-value
store of Prolog terms on top of SWI-Prolog
library(broadcast)
libraries and its TIPC or UDP based
extension that allow broadcasting outside the process using networking.
This module provides a replicated data store that is coordinated using a variation on Lamport's Paxos concensus protocol. The original method is described in his paper entitled, "The Part-time Parliament", which was published in 1998. The algorithm is tolerant of non-Byzantine failure. That is late or lost delivery or reply, but not senseless delivery or reply. The present algorithm takes advantage of the convenience offered by multicast to the quorum's membership, who can remain anonymous and who can come and go as they please without effecting Liveness or Safety properties.
Paxos' quorum is a set of one or more attentive members, whose processes respond to queries within some known time limit (< 20ms), which includes roundtrip delivery delay. This property is easy to satisfy given that every coordinator is necessarily a member of the quorum as well, and a quorum of one is permitted. An inattentive member (e.g. one whose actions are late or lost) is deemed to be "not-present" for the purposes of the present transaction and consistency cannot be assured for that member. As long as there is at least one attentive member of the quorum, then persistence of the database is assured.
Each member maintains a ledger of terms along with information about when they were originally recorded. The member's ledger is deterministic. That is to say that there can only be one entry per functor/arity combination. No member will accept a new term proposal that has a line number that is equal-to or lower-than the one that is already recorded in the ledger.
Paxos is a three-phase protocol:
1: A coordinator first prepares the quorum for a new proposal by broadcasting a proposed term. The quorum responds by returning the last known line number for that functor/arity combination that is recorded in their respective ledgers.
2: The coordinator selects the highest line number it receives,
increments it by one, and then asks the quorum to finally accept the new
term with the new line number. The quorum checks their respective
ledgers once again and if there is still no other ledger entry for that
functor/arity combination that is equal-to or higher than the specified
line, then each member records the term in the ledger at the specified
line. The member indicates consent by returning the specified line
number back to the coordinator. If consent is withheld by a member, then
the member returns a nack
instead. The coordinator requires
unanimous consent. If it isn't achieved then the proposal fails and the
coordinator must start over from the beginning.
3: Finally, the coordinator concludes the successful negotiation by
broadcasting the agreement to the quorum in the form of a
paxos_changed(Key,Value)
event. This is the only event that
should be of interest to user programs.
For practical reasons, we rely on the partially synchronous behavior (e.g. limited upper time bound for replies) of broadcast_request/1 over TIPC to ensure Progress. Perhaps more importantly, we rely on the fact that the TIPC broadcast listener state machine guarantees the atomicity of broadcast_request/1 at the process level, thus obviating the need for external mutual exclusion mechanisms.
Note that this algorithm does not guarantee the rightness of the value proposed. It only guarantees that if successful, the value proposed is identical for all attentive members of the quorum.
paxos_initialize([])
is executed lazily as part
of the first paxos operation. Defined options:
NodeID must be a small non-negative integer as these identifiers are used in bitmaps.
paxos_key(Term,Key)
, pasox_set(Key,Term)
.
I.e., Term is a ground compound term and its key is the
name/arity pair. This version provides compatibility with older versions
of this library.On success, paxos_set/1 will
also broadcast the term
paxos_changed(Key,Value)
, to the quorum.
Options processed:
max_sets
(20).
response_timeout
(0.020, 20ms).
Term | is a compound that may have unbound variables. |
response_timeout
.
paxos_key(Term,Key)
, pasox_get(Key,Term)
.
I.e., Term is a compound term and its key is the name/arity
pair. This version provides compatibility with older versions of this
library.Options processed:
max_gets
(5).
response_timeout
(0.020, 20ms).
Term | is a compound. Any unbound variables are unified with those provided in the ledger entry. |
response_timeout
(0.020, 20ms).
paxos_changed(Key,Value)
notifications for Key,
which are emitted as the result of successful paxos_set/3
transactions. When one is received for Key, then Goal
is executed in a separate thread of execution.
Term | is a compound, identical to that used for paxos_get/1. |
Goal | is one of:
|