# Replication subsystem enhancements draft
Before describing the Current issues of Eventuate's event replication subsystem, a brief description of the underlying System model is given. This sets the context for the solution proposals in section Extensions.
## System model

Eventuate applications store events in local event logs that can be connected to each other to form a replicated event log. The owner of a local event log is called a location, and event replication occurs across locations that share a replicated event log. Event replication is reliable and preserves the causal ordering of events, which is tracked with vector clocks. From a consumer perspective, a replicated event log provides a causal reliable broadcast (CRB) of events across a group of locations.
Event-sourced components may consume events from and produce events to a local event log. Events produced at one location can be consumed at other locations if these locations share a replicated event log. Event-sourced components that consume from a local event log receive events in an order that is consistent with causal order, i.e. they will never see an effect before its cause. Event-sourced components that collaborate over a shared replicated event log can therefore provide causal consistency for replicated mutable state. A special example is operation-based CRDTs, which use causality metadata for global state convergence.
In the following, it is assumed that a location has only a single local event log that is connected to the local event logs of other locations to form a replicated event log (see also replication networks). Of course, multiple local event logs per location are supported too, but this is not relevant in the context of this discussion.
Two locations, `A` and `B`, can be connected by a uni-directional replication connection, `A -> B` or `A <- B`, where the arrow indicates the event replication direction. Bi-directional event replication, `A <-> B` (or `A - B`), is realized by two independent, uni-directional replication connections in opposite directions. With replication connections, locations can be connected to a replication network which may also contain cycles, as shown in the following example:
```
A --- B
 \   /
   C
   |
   D
```
Strictly speaking, a bi-directional replication connection between two locations is also a cycle, but this is not relevant in the context of the system model. When discussing replication network cycles, bi-directional replication connections need not be considered. Also, if there is a uni- or bi-directional replication connection between two locations, these two locations are said to be directly connected.
Connecting locations to a replication network means connecting their local event logs to a replicated event log. Hence, the terms replication network and replicated event log are used interchangeably here, assuming there is only one local event log per location. Also, the terms location and local event log are used interchangeably.
### Storage order

Each local event log maintains a vector clock. The size of the clock scales with the number of local event logs in the replication network. An event that is written to a local event log is assigned a vector timestamp taken from the current time of the vector clock of that event log. A vector clock generates a partial order of events, the happened-before relation `→` or potential causality of events. Vector timestamps can be used to determine whether any two events have a potential causal relationship or are concurrent: `e1 → e2` means that `e1` causally precedes `e2` whereas `e1 ↔ e2` means that `e1` and `e2` are concurrent and don't have a causal relationship.
The storage order of events in a local event log is consistent with the potential causality of events:
- if `e1 → e2` then the sequence number of `e2` is greater than the sequence number of `e1` in all local event logs of a replicated event log.
- if `e1 ↔ e2` then the relative order of `e1` and `e2` in a local event log is not defined: `e1` may have a lower sequence number than `e2` in one local event log but a higher sequence number than `e2` in another local event log.
More formally, a given local event log is one of several possible linear extensions of the partial order of events. To preserve the linear extension invariant during replication, replicated events that are in the causal past of a target event log are excluded from being written to that event log.
The causal past of an event log is determined by its current version vector (CVV) which is the least upper bound of the vector timestamps of all events stored in that log. If the vector timestamp of a replicated event causally precedes or is equal to the CVV of the target event log, it is excluded from being written. This mechanism is referred to as causality filter.
In order to reduce network bandwidth usage, CVVs of target event logs are also transferred to source event logs during replication so that most events can already be filtered there.
The progress of event replication from a source log `A` to a target log `B` is recorded at `B` as an `(A, snr)` tuple where `snr` is the sequence number of the last event that has been read from `A`. The progress is stored in a separate transaction, after the events from `A` have been written to `B`.

When `B` crashes after the events have been written but before the progress has been stored, and `B` later recovers, it will retry replication from a previously stored progress. Events from this retry, however, will be detected as duplicates by the causality filter and dropped. This makes event replication reliable and idempotent.
Causality filters allow for dynamic changes in a replication network. A location may connect to different other locations over time without breaking the linear extension invariant. Please note that this feature is not yet available in the public API, but the technical basis (= causality filters) already exists.
## Current issues

Eventuate provides some features that are not yet compatible with the above system model. These are:
- Replication filters. These filters can be installed by applications to exclude some events from being replicated over a replication connection. They do not compose with causality filters in cyclic replication networks and lead to situations where more events than expected are excluded from replication.
- Event deletion. Event deletion at one location may lead to situations where causally preceding events at other locations are erroneously excluded from replication.
- Disaster recovery. Disaster recovery works only over unfiltered replication connections. When used in combination with filtered replication connections, disaster recovery results are not deterministic. Furthermore, disaster recovery, as currently implemented, can only be used in static replication networks.
The Appendix gives some examples of these issues. The next section describes required extensions and restrictions to the system model, needed to make it compatible with these features.
## Extensions

### Replication network history

When discussing replication network topologies, not only the current topology of a replication network must be considered, but also the complete history of its changes. For example, an acyclic replication network with locations `A`, `B` and `C` and the following topology at time `t1`

```
A --- B
     /
    C
```

and another topology at time `t2`

```
A     B
 \   /
   C
```

must actually be described as a cyclic replication network when projecting its history:

```
A --- B
 \   /
   C
```
For reconstructing a replication network topology from its history, replication connection changes must be recorded at the involved locations. Replication network examples in the following sections always assume topologies that include the complete history of changes.
### Replication filters

Replication filters are not allowed on replication network cycles (an exception being redundant filtered connections). For example, given a cycle `ABC`, a location `D` attached to `C` and another location `E` attached to `B`

```
A --- B --- E
 \   /
   C
   |
   D
```

replication filters are only allowed between locations `C` and `D` as well as `B` and `E`. Now, consider a replacement of location `E` with a cyclic sub-replication network `XYZ`:
```
A --- B --- X --- Z
 \   /       \   /
   C           Y
   |
   D
```
In this case, replication filters are still allowed between `B` and `X` because they are not part of a cycle. Removing the replication connections between `Y` and `Z`

```
A --- B --- X --- Z
 \   /       \
   C           Y
   |
   D
```

would additionally allow replication filters between `X` and `Y` as well as `X` and `Z`.
In addition to replication network changes, the history of replication filters must be recorded too. A replication connection that has been filtered in the past will still be considered as filtered in the future, even if the current version of the replication connection has no replication filter.
### Event deletion

Event deletion distinguishes logical deletion from physical deletion as explained in Deleting events. Event deletion is always performed up to a given position in a local event log. For physical deletion, a deletion version vector (`DVV`) must be introduced. It is the least upper bound of the vector timestamps of all physically deleted events. There is one `DVV` per local event log. The `DVV` of an event log `A` always causally precedes or is equal to that event log's current version vector (`CVV`, see also Storage order), i.e. `DVV-A → CVV-A` or `DVV-A = CVV-A`.

Events may be deleted from a local event log `A` only if for all local event logs `i` that are directly connected to `A` the following condition holds after deletion: `DVV-A → CVV-i` or `DVV-A = CVV-i`. In other words, all events that should be physically deleted from a log `A` must have already been replicated to all directly connected logs `i`. This does not only ensure that all events are replicated to directly connected event logs but also prevents the replication anomaly described in Cyclic replication networks and event deletion (see Appendix).
### Disaster recovery

A disaster is defined as total or partial event loss at a given location. In contrast to event deletion, event loss always starts from a given position in a local event log (see also section Disaster recovery in the user documentation). The goal of disaster recovery is to recover local event log state from directly connected locations in the replication network. Disaster recovery has a mandatory metadata recovery phase and an optional event recovery phase.

With a disaster, not only events but also the local time of the local vector clock is lost. Without properly recovering the clock, it may start running again in the past, i.e. at a time before the disaster happened. This may lead to local time values that are perceived as duplicates in the replication network, which must be prevented because it breaks causality and interferes with causality filters.

Therefore, during metadata recovery, the local time of the local vector clock must be set to the maximum of the values seen at all other directly connected locations. Furthermore, the replication progress recorded at directly connected locations must be reset to the sequence number of the last event in the local event log, or to zero if the local event log is completely lost.
Event recovery means replicating events back from locations that are directly connected to the location to be recovered. These directly connected locations must have unfiltered bi-directional replication connections with the location to be recovered. For example, a location `A` can only be recovered from a location `B` if both replication connections `A -> B` and `A <- B` are unfiltered.

Event recovery over filtered connections is only supported for terminal locations, i.e. locations that are connected to only one other location. When replicating events to the terminal location during recovery, application-specific replication filters must be OR-ed with a filter that accepts events that have been initially emitted at that terminal location.
Restrictions also apply to event recovery from locations with deleted events. Assuming that location `A` should be recovered from location `B`, event recovery is only allowed if the deletion version vector of `B` causally precedes or is equal to the current version vector of `A`, i.e. `DVV-B → CVV-A` or `DVV-B = CVV-A`.

Assuming that location `A` can be recovered from locations `B` and `C` but only `B` meets this condition, a first round of event recovery must be attempted from `B`. After recovery from `B` has completed, `CVV-A` has an updated value and the condition must be evaluated for `C` again. If it is met, event recovery must be attempted from `C` in a second round.
Hint: Applications that want to delete events older than `n` days from their local event logs should consider local event log backups at intervals of `m` days with `m < n` in order to meet the conditions for event recovery from locations with deleted events.
### Location addition

Location addition requires additional rules, as a new location may introduce a new cycle which may conflict with ongoing event deletion as outlined in Cyclic replication networks and event deletion (see Appendix). For example, consider an acyclic replication network `ABC`:

```
A --- B
      |
      C
```
`A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3` which are replicated to `B` but not yet to `C`. Then `A` deletes events `e1` and `e2`, which is allowed by the rules established so far. Now, location `D` is added

```
A --- B
|     |
D --- C
```
and two bi-directional replication connections, `A - D` and `C - D`, are established. This may lead to a situation where `e3` is replicated from `A` to `D` and then to `C`. As a consequence, the causality filter at `C` will later reject events `e1` and `e2` from `B`.
To prevent this anomaly, further rules must be defined for event replication over new replication connections. This doesn't only apply to connections to and from new locations but also to new replication connections between existing locations: event replication over a new replication connection from location `X` to location `Y` may only start if the deletion version vector of `X` (`DVV-X`) causally precedes or is equal to the current version vector of `Y` (`CVV-Y`), i.e. `DVV-X → CVV-Y` or `DVV-X = CVV-Y`.

Applying this rule to the above example, replication between `C` and `D` would start immediately in both directions, and between `A` and `D` only in direction from `D` to `A`. Replication from `A` to `D` only starts after `D` has received events `e1` and `e2` from `C`, preventing the anomaly that `C` rejects events `e1` and `e2`.
For the special case that location `D` is not even interested in getting all past events from `A` and `C` but only wants to receive new events, it must first set both its `CVV` and its `DVV` to the least upper bound of the `CVV`s of locations `A` and `C`. More generally, a new location `X` that only wants to receive new events from all new directly connected locations `Y1` - `Yn` must additionally set its `CVV-X` and `DVV-X` to `LUB(CVV-Y1, ..., CVV-Yn)` before applying the previous rule.
### Location removal

Locations can be permanently removed from a replication network. After having been permanently removed, they are referred to as retired locations. Retired locations do not contribute to a replication network topology even if they have been part of its history. The identifiers of retired locations can also be removed from vector clocks.
In order to enforce the constraints of the extended system model, each location must have a global view of the replication network topology including its full history. Assuming that each location emits system events with information about topology changes, replication filter changes and location retirements, each location can construct such a view with eventual consistency.
## Appendix

The following subsections give some examples of the replication anomalies described in section Current issues.
### Cyclic replication networks with filtered connections

Context:

- Cyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between all of them
- Replication in direction `A -> C` is filtered so that only event `e3` is replicated.

```
A --- B
 \   /
   C
```
Scenario:

- `A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e3` is replicated from `A` to `C`
- `e3` is replicated from `C` to `B`
- Event replication from `A` to `B` starts

Problem:

- `B` will never store `e1` and `e2` in its event log because they causally precede `e3`
### Cyclic replication networks and event deletion

Context:

- Cyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between all of them

```
A --- B
 \   /
   C
```
Scenario:

- `C` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e1`, `e2` and `e3` are replicated to `B`
- `B` deletes `e1` and `e2`
- `e3` is replicated from `B` to `A`
- Event replication from `C` to `A` starts

Problem:

- `A` will never store `e1` and `e2` in its event log because they causally precede `e3`
### Disaster recovery over filtered connections

Context:

- Acyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between `A` and `B` as well as `A` and `C`
- Replication in direction `A -> B` is filtered so that only events `e1` and `e2` are replicated.
- Replication in direction `A -> C` is filtered so that only event `e3` is replicated.

```
A --- B
 \
   C
```
Scenario:

- `A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e1` and `e2` are replicated to `B`
- `e3` is replicated to `C`
- `A` loses all events (disaster)
- `A` is recovered by first replicating `e3` from `C`
- Then, replication of events `e1` and `e2` from `B` is attempted

Problem:

- `A` will never store `e1` and `e2` in its local event log because they causally precede `e3`
## Replication partner fail-over

An additional requirement, not yet covered by the above system model and its extensions, is the possibility to switch to another replication partner over a filtered replication connection. A typical use case is fail-over to another replication partner if the current replication partner remains unavailable for too long. For example, location `D` is by default connected to location `A` and the connection is filtered.

```
  D
 /
A --- B
 \   /
   C
```
With the above system model, `D` is not allowed to have an additional replication connection to `B`, for example, neither at the same time nor in the past, as it would introduce a cycle with a filtered connection. However, if switching to another location is subject to further constraints, event loss similar to that explained in Cyclic replication networks with filtered connections can be prevented.

Assuming that with every event replication batch (even if empty) transmitted from `A` to `D`, `A`'s `CVV-A` is transmitted too and cached at `D`, it can be used by `B` as a condition for switching from `A` to `B`. Switching to `B` is only allowed if `CVV-A` causally precedes or is equal to `CVV-B` when a connection attempt is made. If `B` doesn't meet the condition, `D` could retry later or try connecting to `C` by comparing `CVV-A` to `CVV-C`.
Connections that have been established under these conditions over time don't need to be projected onto a current replication network view and therefore don't introduce a cycle in which `D` participates. Only the current connection needs to be considered.
A generalization of fail-over is having filtered replication connections from `D` to `A` and `B` at the same time, with additional constraints for writing events to `A` and `B`.

```
  D
 / \
A --- B
 \   /
   C
```
In this case, it is important that both connections, `D - A` and `D - B`, use the same replication filter, although the filters in different directions may differ, i.e. filters from `D` may differ from those to `D`. The cycle `ABC` is unfiltered.

An event `e1` replicated from `D` to `A`, for example, must not only pass the causality filter at `A` but also needs to be validated: the `ABC` part of its vector timestamp, `vts(e1, ABC)`, must causally precede or be equal to `CVV-A`. If this is not the case, replication is rejected and must be retried later; otherwise the event is written to `A`.
Although there is now a cycle `ABD` containing replication filters, this additional validation, together with the identical filter constraint, ensures that replication anomalies similar to those explained in Cyclic replication networks with filtered connections cannot occur. The scenario can be extended with an additional replication connection to `C` using the same replication filter. A filtered replication connection within the cycle `ABC` is not allowed.

A further generalization is possible. Assuming `D` is a composite location, being an unfiltered cycle `DEF`, any location of `DEF` can be connected to any other location in `ABC` with a filtered replication connection, provided that all filtered replication connections from `ABC` to `DEF` use the same replication filter (and vice versa). The following example connects `D` with `A` and `E` with `B`:
```
  F
 / \
D --- E
|     |
A --- B
 \   /
   C
```
Events replicated from `D` to `A` and from `E` to `B` must be validated as described above. For an event `e2` replicated in the opposite direction, for example from `A` to `D`, the `DEF` part of its vector timestamp, `vts(e2, DEF)`, must be compared to `CVV-D` and evaluated accordingly.
Assuming that the

- cycle `ABC` is a replicated application A1 where `A` and `C` run in data center 1 and `B` runs in data center 2, and
- cycle `DEF` is a replicated application A2 where `D` and `F` run in data center 1 and `E` runs in data center 2,

then a network partition between data center 1 and data center 2 still allows these two applications to communicate within their respective data centers.