% This LaTeX document was generated using the LaTeX backend of PlDoc, % The SWI-Prolog documentation system \section{library(paxos): A Replicated Data Store} \label{sec:paxos} \begin{tags} \tag{author} Jeffrey Rosenwald (JeffRose@acm.org) \tag{See also} \file{tipc_broadcast.pl}, \file{udp_broadcast.pl} \tag{license} BSD-2 \end{tags} This module provides a replicated data store that is coordinated using a variation on Lamport's Paxos concensus protocol. The original method is described in his paper entitled, "The Part-time Parliament", which was published in 1998. The algorithm is tolerant of non-Byzantine failure. That is late or lost delivery or reply, but not senseless delivery or reply. The present algorithm takes advantage of the convenience offered by multicast to the quorum's membership, who can remain anonymous and who can come and go as they please without effecting Liveness or Safety properties. Paxos' quorum is a set of one or more attentive members, whose processes respond to queries within some known time limit ($<$ 20ms), which includes roundtrip delivery delay. This property is easy to satisfy given that every coordinator is necessarily a member of the quorum as well, and a quorum of one is permitted. An inattentive member (e.g. one whose actions are late or lost) is deemed to be "not-present" for the purposes of the present transaction and consistency cannot be assured for that member. As long as there is at least one attentive member of the quorum, then persistence of the database is assured. Each member maintains a ledger of terms along with information about when they were originally recorded. The member's ledger is deterministic. That is to say that there can only be one entry per functor/arity combination. No member will accept a new term proposal that has a line number that is equal-to or lower-than the one that is already recorded in the ledger. Paxos is a three-phase protocol: 1: A coordinator first prepares the quorum for a new proposal by broadcasting a proposed term. The quorum responds by returning the last known line number for that functor/arity combination that is recorded in their respective ledgers. 2: The coordinator selects the highest line number it receives, increments it by one, and then asks the quorum to finally accept the new term with the new line number. The quorum checks their respective ledgers once again and if there is still no other ledger entry for that functor/arity combination that is equal-to or higher than the specified line, then each member records the term in the ledger at the specified line. The member indicates consent by returning the specified line number back to the coordinator. If consent is withheld by a member, then the member returns a \const{nack} instead. The coordinator requires unanimous consent. If it isn't achieved then the proposal fails and the coordinator must start over from the beginning. 3: Finally, the coordinator concludes the successful negotiation by broadcasting the agreement to the quorum in the form of a \verb$paxos_changed(Key,Value)$ event. This is the only event that should be of interest to user programs. For practical reasons, we rely on the partially synchronous behavior (e.g. limited upper time bound for replies) of \predref{broadcast_request}{1} over TIPC to ensure Progress. Perhaps more importantly, we rely on the fact that the TIPC broadcast listener state machine guarantees the atomicity of \predref{broadcast_request}{1} at the process level, thus obviating the need for external mutual exclusion mechanisms. \textit{Note that this algorithm does not guarantee the rightness of the value proposed. It only guarantees that if successful, the value proposed is identical for all attentive members of the quorum.}\vspace{0.7cm} \begin{description} \predicate[det]{paxos_initialize}{1}{+Options} Initialize this Prolog process as a paxos node. The initialization requires an initialized and configured TIPC, UDP or other broadcast protocol. Calling this initialization may be omitted, in which case the equivant of \verb$paxos_initialize([])$ is executed lazily as part of the first paxos operation. Defined options: \begin{description} \termitem{node}{?NodeID} When instantiated, this node rejoins the network with the given node id. A fixed node idea should be used if the node is configured for persistency and causes the new node to receive updates for keys that have been created or modified since the node left the network. If \arg{NodeID} is a variable it is unified with the discovered \arg{NodeID}. \arg{NodeID} must be a small non-negative integer as these identifiers are used in bitmaps. \end{description} \predicate{paxos_property}{1}{?Property} True if \arg{Property} is a current property for the paxos network. Defined properties are: \begin{shortlist} \item node(?NodeID) \item quorum(?NodeBitmap) \item failed(?NodeBitmap) \end{shortlist} \predicate[semidet]{paxos_set}{1}{+Term} Equivalent to \verb$paxos_key(Term,Key)$, \verb$pasox_set(Key,Term)$. I.e., \arg{Term} is a ground compound term and its key is the name/arity pair. This version provides compatibility with older versions of this library. \predicate[semidet]{paxos_set}{2}{+Key, +Value} \nodescription \predicate[semidet]{paxos_set}{3}{+Key, +Value, +Options} negotiates to have \arg{Key}-\arg{Value} recorded in the ledger for each of the quorum's members. This predicate succeeds if the quorum unanimously accepts the proposed term. If no such entry exists in the Paxon's ledger, then one is silently created. \predref{paxos_set}{1} will retry the transaction several times (default: 20) before failing. Failure is rare and is usually the result of a collision of two or more writers writing to the same term at precisely the same time. On failure, it may be useful to wait some random period of time, and then retry the transaction. By specifying a retry count of zero, \predref{paxos_set}{2} will succeed iff the first ballot succeeds. On success, \predref{paxos_set}{1} will also broadcast the term \verb$paxos_changed(Key,Value)$, to the quorum. \arg{Options} processed: \begin{description} \termitem{retry}{Retries} is a non-negative integer specifying the number of retries that will be performed before a set is abandoned. Defaults to the \textit{setting} \verb$max_sets$ (20). \termitem{timeout}{+Seconds} Max time to wait for the forum to reply. Defaults to the \textit{setting} \verb$response_timeout$ (0.020, 20ms). \end{description} \begin{arguments} \arg{Term} & is a compound that may have unbound variables. \\ \end{arguments} \begin{tags} \tag{To be done} If the \arg{Value} is already current, should we simply do nothing? \end{tags} \predicate{paxos_quorum_ask}{4}{?Template, +Message, -Result, +Options} Ask the paxos forum for their opinion. This predicate is not really part of the paxos protocol, but reuses notably the quorum maintenance mechanism of this library for asking questions to the quorum (cluster). \arg{Message} is the message to be asked. \arg{Result} is a list of copies of \arg{Template} from the quorum. \arg{Options}: \begin{description} \termitem{timeout}{+Seconds} Max time to wait for a reply. Default is the setting \verb$response_timeout$. \termitem{node}{?Node} Can be used to include the replying node into \arg{Template}. \termitem{quorum}{?Quorum} Set/query the interviewed quorum as a bitmask \end{description} \predicate[semidet]{paxos_get}{1}{?Term} Equivalent to \verb$paxos_key(Term,Key)$, \verb$pasox_get(Key,Term)$. I.e., \arg{Term} is a compound term and its key is the name/arity pair. This version provides compatibility with older versions of this library. \predicate[semidet]{paxos_get}{2}{+Key, +Value} \nodescription \predicate[semidet]{paxos_get}{3}{+Key, +Value, +Options} unifies Term with the entry retrieved from the Paxon's ledger. If no such entry exists in the member's local cache, then the quorum is asked to provide a value, which is verified for consistency. An implied \predref{paxos_set}{1} follows. This predicate succeeds if a term with the same functor and arity exists in the Paxon's ledger, and fails otherwise. \arg{Options} processed: \begin{description} \termitem{retry}{Retries} is a non-negative integer specifying the number of retries that will be performed before a set is abandoned. Defaults to the \textit{setting} \verb$max_gets$ (5). \termitem{timeout}{+Seconds} Max time to wait for the forum to reply. Defaults to the \textit{setting} \verb$response_timeout$ (0.020, 20ms). \end{description} \begin{arguments} \arg{Term} & is a compound. Any unbound variables are unified with those provided in the ledger entry. \\ \end{arguments} \predicate[det]{paxos_replicate_key}{3}{+Nodes:bitmap, ?Key, +Options} Replicate a \arg{Key} to \arg{Nodes}. If \arg{Key} is unbound, a random key is selected. \begin{description} \termitem{timeout}{+Seconds} Max time to wait for the forum to reply. Defaults to the \textit{setting} \verb$response_timeout$ (0.020, 20ms). \end{description} \predicate[det]{paxos_on_change}{2}{?Term, :Goal} \nodescription \predicate[det]{paxos_on_change}{3}{?Key, ?Value, :Goal} Executes the specified \arg{Goal} when \arg{Key} changes. \predref{paxos_on_change}{2} listens for \verb$paxos_changed(Key,Value)$ notifications for \arg{Key}, which are emitted as the result of successful \predref{paxos_set}{3} transactions. When one is received for \arg{Key}, then \arg{Goal} is executed in a separate thread of execution. \begin{arguments} \arg{Term} & is a compound, identical to that used for \predref{paxos_get}{1}. \\ \arg{Goal} & is one of: \begin{itemize} \item a callable atom or term, or \item the atom \const{ignore}, which causes monitoring for \arg{Term} to be discontinued. \end{itemize} \\ \end{arguments} \predicate[multifile]{paxos_ledger_hook}{5}{+Action, ?Key, ?Gen, ?Value, ?Holders} Hook called for all operations on the ledger. Defined actions are: \begin{description} \termitem{current}{} Enumerate our ledger content. \termitem{get}{} Get a single value from our ledger. \termitem{create}{} Create a new key in our ledger. \termitem{accept}{} Accept a new newly proposed value for a key. Failure causes the library to send a \textit{NACK} message. \termitem{set}{} Final acceptance of Ken@\arg{Gen}, providing the holders that accepted the new value. \termitem{learn}{} Accept new keys in a new node or node that has been offline for some time. \end{description} \end{description}