transactor

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
USAGE
CLIENT COMMUNICATION PROTOCOL
DEBUG FLAGS AND DEBUGMASK
NETWORKING
SECURITY
FILES
DIAGNOSTICS
NOTES
FAILURE RECOVERY
SEE ALSO
BUGS
COPYRIGHT
AUTHOR

NAME

transactor - a coordinator of distributed transactions

SYNOPSIS

transactor -c configfile [OPTION]...

DESCRIPTION

This manual page describes transactor, a general purpose transaction coordinator that implements a distributed consensus protocol.
Transactor can synchronize transactions between multiple redundant databases that run on different computers (nodes), possibly spread across the world and connected by the Internet.
Transactor does not offer database functionality itself. On top of a database, it can be used to implement a failsafe, redundant and distributed system that will continue to work even if some nodes experience failure.
The system will continue to work as long as a majority of nodes remains alive; a majority is more than half of all nodes, i.e. two out of three, or three out of five. After a network outage or hardware failure/replacement, a failed node can join the system through a recovery process.
There are no restrictions on the underlying database, since transactor does not interpret the transaction data in any way beyond safely storing it, communicating it to the peers and reaching a consensus on the next transaction. Therefore, while transactions are usually found in the realm of databases, transactor can be used in any situation where a distributed application must reach a consensus across all nodes.
Since transactor implements the basic service of transaction coordination, it needs to be used in conjunction with an application/database-specific server that operates the database.

OPTIONS

-c filename, --config filename
Use this configuration file. See the man page for transactor.conf(5). A configuration file must be specified.
-nd, --nodetach
Do not run as a daemon: remain attached to the terminal, and do not fork. This option will also set a few debug flags, unless the option -dm is specified. Debug information will be printed on stderr or to the debug log file specified with -d.
-d filename, --debug filename
Write debug information to this debug log file. The amount of debug information is determined by the debug flags set in the debug mask; if no debug mask is specified, all debug flags will be set.
-dm value, --debugmask value
Set the debug mask to value. The debug mask is an unsigned 32-bit number. Each bit that is set causes specific debug information to be emitted to the debug file (or to stderr, if -nd is specified). See the section DEBUGMASK below for a description of the bits. The value can be specified as a hex value, preceded by '0x', or as an octal value, preceded by '0'.
--speedtest
Allow running a speed test. The admin interface (see the section USAGE below) must be used to turn on the speed test. This option should not be set for production use.
-h, --help
Print a list of commandline options.

USAGE

Application Server (Client)
An application utilizing transactor needs an application server that communicates with transactor, operates the database, and serves data/takes requests from the rest of the application. The application server acts as a client to transactor.
The application server should do the following:
1.
Open the database and connect to transactor.
2.
Subscribe to new transactions from transactor.
3.
Whenever there is a request from the application to perform some operation on the database, encode the request in a chunk of binary data (a transaction specification) and submit (post) that data to transactor.
4.
Whenever the transactor sends a transaction specification (tlog_tspec), decode it and perform the equivalent operations on the database.
If the transaction specification does not match the one posted on behalf of the local application, it has been submitted on a remote node. Probably the result of the database operation should be discarded. (That depends on the application).
If the transaction specification matches the specification posted, return the results of the database operation to the application.
Transactor will only consider one locally submitted transaction specification for any particular transaction.
If two or more local application servers compete for placing their transaction specifications in the next transaction, transactor will handle that situation correctly but not very efficiently. In particular, bandwidth may be wasted by distributing a specification to all peer transactors that will be immediately replaced in the next transaction.
For best performance, only one local application server should be used, or the transaction servers should employ some kind of locking scheme to ensure only one is posting transactions at any point in time.
Transaction Specification
A transaction specification (tspec) is a chunk of binary data that has been prepared by the application server and is intended to become a transaction.
It is completely opaque to transactor; the contents of the chunk will not be interpreted in any way.
Specification Id: sid
When transactor receives such a transaction specification, it marks it with a unique stamp, a specification id (sid). The specification id is of the form node:number, where node is the 16-bit index of the node in the cluster, and number is a 64-bit number.
With this stamp, the tspec is distributed among the peers, and considered for the next transaction. The transactor client (i.e. the application server) is given the sid for further reference.
Transaction Id: tid, current tid, tlog_tid
All transactions in the system are numbered consecutively by a 64-bit number, starting with 1 for the first one. This number is called the transaction id (tid). With this number, a transaction can be looked up in the transaction log (tlog).
There are two important tid numbers at any time: the ``current tid'' is the next transaction to be completed by transactor; it has either started or will be started as soon as one of the transactor nodes in the cluster receives a transaction specification.
The ``tlog_tid'' is the last transaction in the local transaction log files; up to that number, all transactions can be retrieved by the application server.
If the local transactor is up-to-date, the current tid will be one above the tlog_tid. If the transactor is recovering, the difference between both numbers will be larger.
Transactions and the Consensus Protocol
Once a transaction specification has been submitted to any transactor node in the system, the cluster of transactors will start the consensus protocol to decide on the transaction specification to place in the current transaction. If multiple specifications have been submitted concurrently, one will be picked.
While some nodes may fail in that process, the overall system will proceed as long as a majority of all transactors (more than half of the total number) is alive. (If there is no majority of nodes participating, the process hangs.)
The final outcome of this process is a decision on the next transaction specification, which is known by all non-failed nodes in the cluster, Those nodes also have the specification available.
Each transactor stores the transaction in the tlog, and informs its application servers if so requested.
Transactor as a Server
Transactor runs as a server process and offers its service via a TCP network port. An application server using transactor is a client to the transactor service, connecting to this port and issuing commands.
Transactor binds to another port that is intended for administration purposes. It offers essentially the same functionality as the main service port.
See the section on the CLIENT COMMUNICATION PROTOCOL below for a description of those network ports.
Transactor reads those port specifications, as well as the list of its peers, from the configuration file (see man transactor.conf(5)).
It communicates with its peers over TCP and UDP. See the NETWORKING section below.
Transactor saves all completed transactions in a perpetual transaction log on the file system. It also keeps a message log to recover operations when the server has been shut down. Those are described in the FILES section below.

CLIENT COMMUNICATION PROTOCOL

After connecting to transactor's service port, a client (i.e. an application server) can issue commands.
Transactor also offers an admin port, to which an operator (a human) can connect with telnet, netcat, or socat.
On both interfaces, the same commands are available; the difference is that the admin interface will not be closed after syntactic errors.
If a keepalive timeout is set in the configuration file, transactor will send a pingn message to the client after some time without communication, which must be answered by a corresponding pongn by the client. Both messages consist in four characters, followed by a newline. Those ping and pong messages are not commands, therefore they are not listed below.
Command Syntax
Each command must be on a single line. Whitespace before and after the command is discarded. If a command has arguments, those are on the same line, separated by whitespace.
The only exception from this rule is run-length encoded data in the post command and in the tlog_tspec response.
Commands with syntax errors are answered by a line starting with error: and describing the error. On the regular service port, transactor will disconnect after the error to avoid confusing an application server that expects a different response. The admin interface will remain open.
Binary Data Encoding
In the post command, binary data can be passed to transactor in two possible encodings.
The first encoding is a C-style-encoded string where all non-printable or non-ascii characters are encoded as \ooo (three octal digits), and the characters " and ' are encoded as " and '. The string is enclosed by double quotes " on each side.
Note that encoding the data as a string does not mean it will be interpreted in any way by transactor; in particular, transactor does not know (and does not want to know) about character sets, Unicode, multibyte characters, 0 characters and the likes. Transactor treats the string as an opaque sequence of bytes.
The string cannot span multiple lines; the opening and closing double-quote must be on the same line as the command word.
The second encoding is a run-length encoded binary transfer. The client first sends the length in bytes as a textual number, then a newline character, then all bytes of the binary data, and finally another newline character.
In the tlog_tspec response to the gettspec or the subscribe command, transactor sends binary data to the client. This data is always run-length encoded, i.e. in the second method mentioned above.
Client Commands and Responses: Admin Commands
The first group of commands are primarily intended for the admin interface. They should not be used by a regular client.
quit
exit
The quit and exit commands cause the server to close the connection to the client. The client need not use these commands; it can simply close the connection on its own.
terminate
The terminate command causes the server to shut down operations gracefully and exit. All allocated memory is freed before exiting. Transactor should be able to restart even if it has not been gracefully shut down (i.e. with kill -9).
help
Prints a short overview of all commands.
help command
Prints details about command.
start
The start command causes transactor to start serving clients on the regular interface, after that has been suspended by the stop command. This command is not implemented yet.
stop
The stop command causes transactor to stop serving clients on the regular interface. This command is not implemented yet.
reset_node nodename
The reset_node command causes transactor to discard all messages ever sent to or received from the node named nodename. The nodename argument must be the name of a peer in the configuration file, encoded as a C-style string. See section FAILURE RECOVERY for details. This command is not implemented yet.
info
Prints info about the current transactor status in a human-readable format that might also be suitable for automatic parsing.
dump
Prints a detailed debug dump of the internal data structures to the client.
dump filename
Prints a detailed debug dump of the internal data structures to file filename.
debugdump
Prints a detailed debug dump of the internal data structures to the debug-file. A debug dump can also be triggered by the signals SIGUSR1, SIGUSR2, and SIGHUP; for those signals, the debug dump will be written to the signal_dumpfile specified in the configuration file.
debugfile filename
Sets or changes the name of the debug file.
debugmask all
Sets all debug flags (prints all available debug messages).
debugmask none
Unsets all debug flags (prints no debug messages).
debugmask +flag
Sets the specified debug flag (turns on printing the corresponding debug messages to the debug log file). For possible values of flag, see the section DEBUG FLAGS AND DEBUGMASK below.
debugmask -flag
Unsets the specified debug flag (turns off printing the corresponding debug messages). For possible values of flag, see the section DEBUG FLAGS AND DEBUGMASK below.
debugmask
Show the debug mask.
debug on
Switch on debugging.
debug off
Switch off debugging and close the debugfile.
debug
Show whether debug is on or off.
speedtest size
Start an endless loop of random transactions of the specified size in bytes.
speedtest off
Stop the speed test.
Client Commands and Responses: Regular Commands
The second group of commands is intended for use by the regular client. They are required for the client to interact with and properly use the transactor.
subscribe tid
Subscribe to all transactions with transaction id from tid upwards.
After issuing this command, the client will be fed the binary data of all transactions, starting from tid up to the last transaction completed. As new transactions are being completed, the client will be fed those as well.
The client is also sent an initial status message, and will be sent further status messages for all status changes of the transactor. See the getstatus command for an explanation of the status response.
Subscriptions can be changed by the same subscribe command; a newer command supersedes an older one.
Responses (not necessarily instantaneous):
status n_peers iself majority_alive tlog_status tlogtid tid started
The status message is sent as an immediate response to the subscribe request. Until the subscription is turned off, a new status message will be sent whenever the majorit_alive or tlog_status flags change their states.
See the getstatus command for an explanation of the status response.
tlog_tspec tid tid sid sid size sizendata...n
The data of the specification for transaction tid in the tlog.
This message is sent once for all transactions between the subscribed tid and the current tlog end tid. As the transactor makes progress and additional transactions are added to the transaction log, new messages will be sent to keep the client updated, until the subscription is turned off.
See the gettspec command for an explanation of the tlog_tspec response.
subscribe off
Turn off the subscription.
post rqnum tid "data_string"
Post the data in the C-style quoted data_string. This form of the post command is simply newline-terminated. Except for the data encoding, it behaves like the run-length encoded form below.
post rqnum tid sizendata...n
Post the run-length encoded data of size bytes length. The data starts immediately after the n that ends the post line, and is immediately followed by another n character.
A positive 32-bit number rqnum can be chosen by the client; the number will be used in the response to this post command. The number is not otherwise used in transactor.
The post command causes transactor to try and place the transaction specification data in transaction tid.
Responses:
post_response rqnum tid_old
It is too late. The transaction id tid given by the client is outdated; internally, the transactor is ahead and has already started or even completed this transaction.
The client should obtain the current transaction id with the getstatus command, unless it knows it through subscribe. If the current transaction id is the same as used by the client in the request, the transaction is already started and the client must wait until this transaction is completed, before trying to post again.
post_response rqnum tid_new
It is too early. The transaction id tid given by the client is ahead of the end of the local transaction log (see the FILES section for a description of the transaction log).
Note that the transaction log (tlog) may lag back behind the current transaction id of the transactor, if transactor is recovering from a disconnected or failed phase. In this case, there is no possible tid for the client to use that is not either tid_old or tid_new. The client should wait until the tlog_tid as shown by the getstatus command is one behind the current transaction id of the transactor; then the transactor is up-to-date again.
post_response rqnum other_sid
The transactor currently holds another block of binary data in a transaction specification that is part of an ongoing transaction and cannot be overridden.
The client should retry posting when the current transaction is completed.
post_response rqnum tid_started
The tid given by the client is the current transaction id of transactor, and also just one ahead of the last tid in the transaction log (the tlog_tid), but the transaction is already started and it's too late.
The client should retry posting when the current transaction is completed.
post_response rqnum post_error
An unspecified error occurred. This should not happen, and the client could react by terminating.
post_response rqnum propose_error
An unspecified error occurred. This should not happen, and the client could react by terminating.
post_response rqnum posted sid tid
Success! (At least, for the moment). The request data was successfully posted and made into a transaction specification with the specification id sid. The transactor will try to place it in transaction tid. The specification id sid is unique to this particular transaction data and will not be used by another transactor in the same cluster.
From now on, the decisions are out of the hands of the client. It's too late for the client to withdraw the transaction specification, but transactor will not guarantee that this ends up as the next transaction. Another client on another computer could win the race.
Note that there is no notification of the client of the final outcome for transaction id tid as a specific response to this request. The client can either poll transactor with getstatus until tid is finished, then retrieve the result with gettspec and compare the sid to find out whether it has been successful, or it can subscribe and be notified of the completion automatically.
repost tid sid
Try to place the previously posted data with specification id sid in transaction tid.
This is the same as the post command, but avoids retransmission of the binary transaction data from the client to transactor and, more importantly, redistribution of that data to the other transactors in the cluster.
In order for this to succeed, the transaction specification must not have become a transaction yet; so, the previous post must have been successful but another transaction specification must have been selected by the cluster of transactors for the previously used tid.
Also, no other client to the local transactor must have had a successful post in the meantime, since the cluster of transactors will only consider one particular transaction specification from each transactor at one single point in time.
If one of these conditions is violated, the repost will fail and the client must use a plain post command instead.
Responses:
repost_response tid sid tid_old
Same meaning as the similar post_response above.
repost_response tid sid tid_new
Same meaning as the similar post_response above.
repost_response tid sid other_sid
The sid given by the client is no longer stored in the transactor, or never was. The client should use post to obtain a new specification id for the data.
repost_response tid sid tid_started
Same meaning as the similar post_response above.
repost_response tid sid propose_error
Same meaning as the similar post_response above.
repost_response tid sid posted sid tid
Same meaning as the similar post_response above.
repost_response tid sid sid_done
The transaction specification with specification id sid has already been placed in a transaction. It is no longer available.
The client should use post to obtain a new specification id for the data.
gettid
The gettid command asks the server for the transaction id of the last completed transaction that it knows of, which is the tlog_tid of the local transaction log.
If the local transactor is fully up-to-date, this will be one below the current transaction id of transactor (which is the transaction in progress or yet to be started). This value can lag behind if the local transactor is not up to date.
Response:
tlog_tid tlog_tid
The last transaction id stored in the local transaction log (tlog).
getstatus
The getstatus command asks the server for current transaction status information. The resulting information can be used by the client to decide how to proceed.
Response:
status n_peers iself majority_alive tlog_status tlogtid tid started
The status message communicates n_peers, the total number of peers in the cluster of transactors; iself, the 0-based index of the local transactor in the cluster-wide list of peers; the majority_alive flag, which is majalive when the majority of peers is alive, and majnotalive otherwise; the tlog_status flag, which is uptodate when the local transaction log is up-to-date, and backlag otherwise; tlog_tid, the last transaction id in the local transaction log (same as in the gettid reponse); tid, the transaction id of the next transaction coordinated in the cluster; and the started flag, which is started when the next transaction has already started, and not_started otherwise.
gettspec tid
The gettspec requests asks the server for the transaction specification data of the completed transaction with transaction id tid.
Responses:
tlog_tid tlog_tid
The transaction id of the last transaction in the local transaction log (tlog). This response is sent if the requested tid is too high.
tlog_tspec tid tid sid sid size sizendata...n
The data of the transaction specification in the local transaction log (tlog). The specification id sid is sent so that the client can compare it against any specification id's it has saved from the last post or repost request; otherwise, it is of no use.
The data of length size is run-length encoded; it starts immediately after the n that ends the tlog_tspec line, and is immediately followed by another n character.
tlog_tspec_not_available tid
The transaction data for the historic transaction id tid is no more available in the local transaction log.

DEBUG FLAGS AND DEBUGMASK

The following list describes the debug flags and their meaning, as well as the corresponding bits of the debug mask.
Debug flags for the interface to transactor clients:
se_tasks (01)
Print information on transactions that are sent to subscribed clients.
se_requests (02)
Print information on requests received by clients.
se_connections (04)
Print information on clients connecting to transactor.
Debug flags for the transactor state machine that implements the distributed consensus protocol:
ta_state (010)
Print the state of the transactor state machine.
ta_arrivals (020)
Print messages relevant to the transactor state as they arrive from peer transactors.
ta_posts (040)
Print messages relevant to the transactor state as they are posted to peer transactors.
ta_moreposts (0100)
Print more information on messages relevant to the transactor state as they are posted to peer transactors.
ta_acks (0200)
Print message acknowledgements as they arrive or are posted to peer transactors.
ta_trns (0400)
Print information on transactions being completed.
ta_part (01000)
Print information on how the transactor participates in the transaction amongst its peers.
ta_coord (02000)
Print information on how the transactor coordinates a transaction amongst its peers.
ta_trntime (04000)
Print information on the total time required to complete each transaction.
Debug flags for the communication/routing layer:
select (010000)
Print information on the calls to select().
peerdi (020000)
Print information on the routing paths of messages between peers, and changes to the local peer distance information.
msg_arrived (040000)
Print messages arriving in the communication/routing layer from peer transactors.
queueing (0100000)
Print messages being queued to other peers by the communication/routing layer.
msg_ackobs (0200000)
Print 'acknowledge' and 'obsolete' messages received and queued by the communication/routing layer.
msg_payload (0400000)
Print payload messages (i.e. messages that make up the distributed transaction protocol) received and queued by the communication/routing layer.
slot (01000000)
Print the changes to the state of all message slots in the communication/routing layer.
connectivity (02000000)
Print changes to the direct connectivity of peers, i.e. peers becoming connected or disconnected.
peer_index (04000000)
Print internal information on the indices of the peers that are currently worked on by the communication/routing layer.
Debug flags for the peerlink layer:
pl_connection (010000000)
Print information on the establishment or loss of connections to a peer in the peerlink layer.
pl_queueing (020000000)
Print messages being queued in the peerlink layer by the communication/routing layer.
pl_sendreceive (040000000)
Print TCP and UDP messages being sent or received in the peerlink layer.
pl_readwrite (0100000000)
Print information on TCP data being received in the peerlink layer.
pl_status (0200000000)
Print the internal status of the peerlink layer for all peers.
pl_listeninfo (0400000000)
Print information on TCP sockets listening for connections from peers.
pl_nosupperrors (01000000000)
Do not suppress multiple error messages from the peerlink layer. This flag does not have any effect.
pl_packets (02000000000)
Print information on messages being encoded, encrypted, decrypted, and decoded before/after network transport.

NETWORKING

Service and Admin Port
The regular service port and the admin port of transactor listen on a standard TCP port each.
Low Level Networking: UDP
Networking between the nodes of the cluster (between the transactors) is more sophisticated, since transaction negotiation consists in multiple small messages and it is desirable to use a lean protocol with very little overhead.
Transactor therefore tries to send all messages as UDP packets. On a network with a reasonably low rate of packet failures, this works quite well. Transactor uses checksums and deals with duplicate or lost packets. UDP messaging also allows transactor to use quite short timeouts to detect node or connection failures; this is desirable since unreachable nodes block the consensus process until they are considered failed.
Path MTU
Transactor also does the regular path MTU discovery for UDP. In an attempt to treat connections with asymmetric and arbitrary MTU sizes correctly, transactor defines five standard UDP packet sizes that are tested specifically on connection establishment, and falls back to sending messages in those standard sizes whenever it experiences problems with UDP packets getting through.
Low Level Networking: TCP
Transactor also opens a TCP connection between each pair of nodes. In situations where UDP is unstable because the network does not offer good UDP transport, the user should limit UDP message size explicitely in the configuration file, or turn off UDP altogether. Packets that cannot be sent over UDP will be switched over to the TCP connection.
Listening Sockets
For each pair of nodes, the TCP connection requires a listening socket on one side and a connecting one on the other. The connecting node is that with the alphabetically lower name.
The connecting node will use an ephemeral TCP port for connecting, as it is typical for the TCP protocol. The TCP port specified in the configuration file will be ignored.
Multiple Transactors On Same IP Address
If the user runs multiple nodes of transactor, belonging to the same cluster, on the same computer and using the same IP address, there will be problems with TCP connection establishment. The reason is that a node with a listening socket usually determines the peer that connects to the socket by matching the IP address. If two peers have the same address, the mismatch will result in some fruitless attempts at connection establishment.
Usually, the connections can be established after a few hiccups, but the process of connection establishment is not so predictable.
Those problems can be avoided by using a different listening socket for each connection in each node.
This problem is mainly relevant when testing transactor on a single computer, it doesn't occur in real-life distributed systems.
Communication/Routing Layer
In order to fulfill the requirements on the failure detector in the consensus algorithm, routing of messages is required. Transactor has a communication layer that implements high level message passing, error and failure detection, and routing.
This is a precondition for the failure detector to deal with unreliable networking correctly.
This means that a cluster of nodes may continue to work even though many nodes are not directly connected.
The user/administrator is responsible for keeping the graph of nodes in the cluster as well connected as possible, to avoid single points of failure.
In the ideal case, each pair of nodes is directly connected.
Timeouts
Some of the relevant timeouts of transactor are:
1 min
Time between connection attempts of failed connections.
20 s
Time between attempts to bind local sockets.
20 s
Total time for connection establishment.
10 s
Timeout to close a broken connection. Since the system will attempt to route around the broken connection, twice this timeout will pass before the system declares a node unreachable.
Note that broken connections will only be detected when the consensus protocol is being run; transactor does not use idle pings to monitor network connectivity when there is no work to do.
Therefore, typically the first transaction after some time of stagnation will encounter the unreachable-node-timeout of 20 s, before the transaction is completed. The next transactions will run regularly.

SECURITY

Transactor operates under the assumption that the local computer is trustworthy, and the peer transactors are trustworthy, but the network inbetween is not.
Transactor does not handle byzantine failures, i.e. malicious destructive operation by peer transactors or their application servers.
Network messages can (and should) be encrypted by specifying symmetric keys in the configuration file. See the transactor.conf(5) manpage. Encryption is done with AES in ECB mode. Transactor's UDP message handling will cope with replay attacks and faked random messages, since each packet includes a 64 bit checksum under the AES.
However, it is possible that repeated transaction data patterns show as repeated ciphertext in the packets. If that type of message content leakage is a concern, a VPN tunnel should be used.

FILES

/etc/transactor.conf
The configuration file. See transactor.conf(5) for the format. There is no default path for the configuration file. It must be specified on the command line.
/var/transactor/tlog The local transaction log. This is a directory; see transactor_tlog(5) for the structure and contents. There is no default path. The full absolute path of the directory is specified in the configuration file. The transaction log directory must be initially empty, but existing.
/var/transactor/mlog The local message log. This file serves as a journal of messages sent to and received from peer transactors. It is automatically shrunk from time to time; its size may grow to a few 100 transactions. See transactor_mlog(5) for a description. There is no default path. The full absolute path of the file is specified in the configuration file. The message log file must be initially empty, but existing.
/var/run/transactor.pid The pid file for use by the start-stop-scripts. This file contains the process id of transactor in textual form, followed by a single newline. There is no default path. The full absolute path of the file is specified in the configuration file.

DIAGNOSTICS

Transactor logs important events, particularly hard errors and reasons for program termination, through the syslog() service. The log facility can be adjusted in the config file.
Debug output can be sent to a debug log file; see the commandline options and the client commands debug, debugmask, debugfile, and debugdump in the section CLIENT COMMUNICATION PROTOCOL above. Be warned that this may generate large amounts of data.
The SIGUSR1, SIGUSR2, and SIGHUP signals cause transactor to write a debug dump of the current status to the signal_dumpfile specified in the configuration file.
The administator can telnet to the administration interface and use the commands info, and dump to get information.

NOTES

Some notes on the algorithm/protocol of transactor are in order.
Literature
There is some classic literature relevant to the problems of distributed consensus/atomic broadcast.
[1]
M. J. Fischer, N. A. Lynch, and M. S. Paterson
Impossibility of distributed consensus with one faulty process

Journal of the ACM 32 (1985), pp. 374-382
[2]
T. D. Chandra and S. Toueg
Unreliable failure detectors for reliable distributed systems

Journal of the ACM 43 (1996), pp. 225-267
[3]
T. D. Chandra, V. Hadzilacos, and S. Toueg
The weakest failure detector for solving consensus

Journal of the ACM 43 (1996), pp. 685-722
The algorithm on p.243 of [2] shows the principle of consensus with error detectors.
Transactor Protocol
As a real-world application, transactor uses a much more elaborate protocol that incorporates the following features:
o
unreliable real-world messaging
o
small message (packet) size, large transaction data size
o
detection of network failure
o
detection of peer failure
o
recovery of nodes
o
handling of a continuing sequence of transactions
o
automatic start of the consensus process
o
optimization of the best case scenario
In the best case scenario (good network connection, all nodes up, small transaction data, no concurrent transaction requests by two or more nodes), transactor may need just three UDP messages per node per transaction.
Exactly When Does The Transaction Happen?
If you're used to the world of two-phase commits, you will undoubtedly think about the exact point of no return, where the transaction actually is decided.
In the world of distributed consensus, however, that simple notion needs to be revisited. For each transaction, there is such a point where it has been decided; however, the single node that has only an incomplete picture of the full system cannot determine that point.
As an omniscient global observer, one can note that the transaction happens when the number of nodes that have fulfilled the coordinator's request (to settle on the transaction specification chosen by the coordinator) reaches a majority, i.e. is more than half the number of the total nodes. I.e. the transaction happens when the last node to make that majority has safely stored the coordinator's message.
While that point in time is clearly defined, the fascinating fact is that, at the time, none of the nodes knows that it's happening, including the coordinator and the node that stores the message.
Limits and Miscellaneous Notes
Number of nodes
The memory requirement of transactor for the communication/routing layer grows with the square of the number of nodes. The number of nodes should therefore be kept reasonably small.
An odd number is obviously preferrable, so 3, 5, or 7 nodes seem reasonable. The theoretical limit is 16000.
Transaction specification size
The theoretical limit of the transaction data is 4 GB (on a 64-bit machine). For practical considerations, up to 100 transactions are stored in one file of the transaction log, and up to five such files are kept in memory of transactor, plus one transaction specification from each node plus some overhead.
This gives maximum limit for the transaction size of about 4 MB. However, transactor does not use optimum network transport for such large amounts of data; it maintains only a small number of outstanding unacknowledged packets at a time.
Therefore, large transaction sizes should probably be handled by the application servers outside transactor, and transactor should only be used to negotiate the transaction with a symbolic handle.
Up to some hundred thousand bytes of data, transactor will behave very well.
Number of application servers per transactor
For maximum throughput, the number of application servers should be limited to one per transactor; see the subsection Application Server above.
Number of transactions
While the number of transactions is virtually unlimited (a 64 bit unsigned value), old transactions cannot be safely removed.

FAILURE RECOVERY

When a node fails or is taken out of the cluster, this can be due to several reasons. Depending on the reason, the recovery procedure is different.
Network Outage
If the network connection to the node is temporarily unavailable, there is nothing to do. Once the node can be reached again, it will automatically join the cluster and catch up eventually. (There are some timeouts relevant to this process, usually on the scale of minutes).
Network Address Change
If a node changes its network address, re-integration in the cluster should be done in those steps:
1.
Stop transactor on the failed node.
2.
Change the configuration file of the failed node to reflect its new network address.
3.
Start transactor on the failed node.
4.
For each other node in the cluster, stop transactor, change the configuration file to reflect the new peer address, and start it again. Wait for the failed node to be integrated before proceeding to the next node.
Note that for the cluster to continue operating all the way through this process, there must be at least five nodes in total, since in step 3, when performed for the first peer node, an additional node will be down without the failed node being up yet.
Hardware Change
If the hardware of a node is to be changed, transactor should be stopped. Then the config file, transaction log and message log must be copied over to the new hardware. See the next point if those logs are not available any more. Finally, transactor should be started on the new hardware.
Hardware Failure
In the event of a hardware failure where either the transaction log or the message log are not fully recoverable, proceed like this:
1.
Shut down all failed nodes, not just those with hardware failure, but also nodes that are not reachable or down for other reasons.
2.
On all failed nodes, not just those with hardware failure, empty the message log file, i.e. truncate it to zero length.
3.
On all nodes with hardware failure, create or copy the config file and those parts of the transaction log that are still recoverable to the new hardware.
If some files are unrecoverable, do not copy any further transaction files. Do not create any empty directories of the new transaction log; create only those directories necessary to hold the recovered files.
The transaction log must hold a contiguous sequence of transactions starting with tid 1.
If in doubt, leave the transaction log completely empty.
4.
On all running nodes of the cluster, telnet to the admin port of transactor and give the command reset_node nodename, for each failed node, not just those with hardware failure.
This must be done while the failed nodes are still down.
5.
On all failed nodes of the cluster, start transactor.
The reason for this procedure is that each pair of nodes uses message sequence numbers for their communication. When a node's message log file is damaged, all peers communicating with this peer must have those sequence numbers reset. For the running nodes, the command reset_node can be used. For failed nodes, the message log file must be cleared, but that also resets the sequence numbers for those nodes' communication in turn.
Recovery or Reconfiguration With Stopped Cluster
If the total cluster operation may be stopped, recovery or even changing the number of nodes should be done in those steps:
1.
Stop transactor on all nodes.
2.
Clear the message log file on all nodes.
3.
Change the configuration files on all nodes as desired.
4.
Copy the transaction log directory from a non-failed node to all nodes with corrupt transaction log.
5.
Start transactor on all nodes.

SEE ALSO

transactor.conf(5), transactor_mlog(5), transactor_tlog(5)
syslog.conf(5), telnet(1), socat(1)

BUGS

Transactor is slow, due to the protocol overhead. The best case is three network packets per peer per transaction; however, those three are serialized. The regular case is five packets plus the transaction data packets, and the worst case with nodes failing during consensus adds a few rounds (of another four packets each).
The number of nodes, or the names of the nodes, cannot be changed in a running system.
Transactor should come with start-stop-scripts for the popular architectures.
Transactor should come with a C library to access it from an application server.
Old transactions cannot be deleted from the transaction log currently. It is unsafe to do so.
Running multiple transactors of the same cluster on the same network interface is bound for trouble; see NETWORKING above.
The handling of MTU path discovery for UDP packets relies on correct ICMP 'fragmentation required' responses by the network. There should be heuristics in the code to work around bad networks.
The overall UDP handling could be better, with round trip time detection etc.
Transactor might crash if it encounters partially written files after a hardware failure or unclean OS shutdown. This will affect the local node only, but will require manual failure recovery.
Transactor does not handle byzantine failures, i.e. malicious peers or application servers. The system is based on trust. (Malicious network attacks should be handled properly, though).
Recovery of failed nodes is too complicated.
The reset_node command is not implemented yet.
There is no authentication of the application server or admin user.
There is no encryption of the transaction log and message log.
Some of the network timeouts, particularly the retry after failed connections, are probably too big.
One transactor should handle multiple application servers better.
The syslog messages should be reduced somewhat.

COPYRIGHT

Transactor is Copyright 2003, 2004 Claus Fischer

AUTHOR

Transactor was written by Claus Fischer <transactor@clausfischer.com>.