# Replication subsystem enhancements draft
Before describing the Current issues of Eventuate's event replication subsystem, a brief description of the underlying System model is given. This sets the context for the solution proposals in section Extensions.
## System model

Eventuate applications store events in local event logs that can be connected to each other to form a replicated event log. The owner of a local event log is called a location, and event replication occurs across locations that share a replicated event log. Event replication is reliable and preserves the causal ordering of events, which is tracked with vector clocks. From a consumer perspective, a replicated event log provides a causal reliable broadcast (CRB) of events across a group of locations.
Event-sourced components may consume events from and produce events to a local event log. Events produced at one location can be consumed at other locations if these locations share a replicated event log. Event-sourced components that consume from a local event log receive events in an order that is consistent with causal order, i.e. they will never see an effect before its cause. Event-sourced components that collaborate over a shared replicated event log can therefore provide causal consistency for replicated mutable state. A special example is operation-based CRDTs, which use causality metadata for global state convergence.
In the following, it is assumed that a location has only a single local event log that is connected to the local event logs of other locations to form a replicated event log (see also replication networks). Of course, multiple local event logs per location are supported too, but this is not relevant in the context of this discussion.
Two locations, `A` and `B`, can be connected by a uni-directional replication connection, `A -> B` or `A <- B`, where the arrow indicates the event replication direction. Bi-directional event replication, `A <-> B` (or `A - B`), is realized by two independent, uni-directional replication connections in opposite directions. With replication connections, locations can be connected to a replication network which may also contain cycles, as shown in the following example:
```
A --- B
 \   /
   C
   |
   D
```
Strictly speaking, a bi-directional replication connection between two locations is also a cycle, but this is not relevant in the context of the system model. When discussing replication network cycles, bi-directional replication connections need not be considered. Also, if there is a uni- or bi-directional replication connection between two locations, these two locations are said to be directly connected.
Connecting locations to a replication network means connecting their local event logs to a replicated event log. Hence, the terms replication network and replicated event log are used interchangeably here, assuming there is only one local event log per location. Also, the terms location and local event log are used interchangeably.
### Storage order

Each local event log maintains a vector clock. The size of the clock scales with the number of local event logs in the replication network. An event that is written to a local event log is assigned a vector timestamp taken from the current time of the vector clock of that event log. A vector clock generates a partial order of events, the happened-before relation `→` or potential causality of events. Vector timestamps can be used to determine whether any two events have a potential causal relationship or are concurrent: `e1 → e2` means that `e1` causally precedes `e2` whereas `e1 ↔ e2` means that `e1` and `e2` are concurrent and don't have a causal relationship.
The storage order of events in a local event log is consistent with the potential causality of events:
- if `e1 → e2` then the sequence number of `e2` is greater than the sequence number of `e1` in all local event logs of a replicated event log.
- if `e1 ↔ e2` then the relative order of `e1` and `e2` in a local event log is not defined: `e1` may have a lower sequence number than `e2` in one local event log but a higher sequence number than `e2` in another local event log.
More formally, a given local event log is one of several possible linear extensions of the partial order of events. To preserve the linear extension invariant during replication, replicated events that are in the causal past of a target event log are excluded from being written to that event log.
The causal past of an event log is determined by its current version vector (CVV) which is the least upper bound of the vector timestamps of all events stored in that log. If the vector timestamp of a replicated event causally precedes or is equal to the CVV of the target event log, it is excluded from being written. This mechanism is referred to as causality filter.
In order to reduce network bandwidth usage, CVVs of target event logs are also transferred to source event logs during replication so that most events can already be filtered there.
The progress of event replication from a source log `A` to a target log `B` is recorded at `B` as an `(A, snr)` tuple where `snr` is the sequence number of the last event that has been read from `A`. The progress is stored in a separate transaction, after the events from `A` have been written to `B`.

When `B` crashes after the events have been written but before the progress has been stored, and `B` later recovers, it will retry replication from a previously stored progress. Events from this retry, however, will be detected as duplicates by the causality filter and dropped. This makes event replication reliable and idempotent.
Causality filters allow for dynamic changes in a replication network. A location may connect to different other locations over time without breaking the linear extension invariant. Please note that this feature is not yet available in the public API, but the technical basis (= causality filters) already exists.
## Current issues

Eventuate provides some features that are not yet compatible with the above system model. These are:
- Replication filters. These filters can be installed by applications to exclude some events from being replicated over a replication connection. They do not compose with causality filters in cyclic replication networks and lead to situations where more events than expected are excluded from replication.
- Event deletion. Event deletion at one location may lead to situations where causally preceding events at other locations are erroneously excluded from replication.
- Disaster recovery. Disaster recovery works only over unfiltered replication connections. When used in combination with filtered replication connections, disaster recovery results are not deterministic. Furthermore, disaster recovery, as currently implemented, can only be used in static replication networks.
The Appendix gives some examples of these issues. The next section describes required extensions and restrictions to the system model, needed to make it compatible with these features.
## Extensions

### Replication network history

When discussing replication network topologies, not only the current topology of a replication network must be considered, but also the complete history of its changes. For example, an acyclic replication network with locations `A`, `B` and `C` and the following topology at time `t1`

```
A --- B
     /
    C
```

and another topology at time `t2`

```
A     B
 \   /
   C
```

must actually be described as a cyclic replication network when projecting its history:

```
A --- B
 \   /
   C
```
For reconstructing a replication network topology from its history, replication connection changes must be recorded at the involved locations. Replication network examples in the following sections always assume topologies that include the complete history of changes.
### Replication filters

Replication filters are not allowed on replication network cycles (an exception being redundant filtered connections). For example, given a cycle `ABC`, a location `D` attached to `C` and another location `E` attached to `B`

```
A --- B --- E
 \   /
   C
   |
   D
```

replication filters are only allowed between locations `C` and `D` as well as `B` and `E`. Now, consider a replacement of location `E` with a cyclic sub-replication network `XYZ`:
```
A --- B --- X --- Z
 \   /       \   /
   C           Y
   |
   D
```
In this case, replication filters are still allowed between `B` and `X` because they are not part of a cycle. Removing the replication connections between `Y` and `Z`

```
A --- B --- X --- Z
 \   /       \
   C           Y
   |
   D
```

would additionally allow replication filters between `X` and `Y` as well as `X` and `Z`.
In addition to replication network changes, the history of replication filters must be recorded too. A replication connection that has been filtered in the past will still be considered as filtered in the future, even if the current version of the replication connection has no replication filter.
### Event deletion

Event deletion distinguishes logical deletion from physical deletion as explained in Deleting events. Event deletion is always performed up to a given position in a local event log. For physical deletion, a deletion version vector (`DVV`) must be introduced. It is the least upper bound of the vector timestamps of all physically deleted events. There is one `DVV` per local event log. The `DVV` of an event log `A` always causally precedes or is equal to that event log's current version vector (`CVV`, see also Storage order), i.e. `DVV-A → CVV-A` or `DVV-A = CVV-A`.

Events may be deleted from a local event log `A` only if for all local event logs `i` that are directly connected to `A` the following condition holds after deletion: `DVV-A → CVV-i` or `DVV-A = CVV-i`. In other words, all events that should be physically deleted from a log `A` must have already been replicated to all directly connected logs `i`. This does not only ensure that all events are replicated to directly connected event logs but also prevents the replication anomaly described in Cyclic replication networks and event deletion (see Appendix).
### Disaster recovery

A disaster is defined as total or partial event loss at a given location. In contrast to event deletion, event loss always starts from a given position in a local event log (see also section Disaster recovery in the user documentation). The goal of disaster recovery is to recover local event log state from directly connected locations in the replication network. Disaster recovery has a mandatory metadata recovery phase and an optional event recovery phase.

With a disaster, not only events but also the local time of the local vector clock is lost. Without properly recovering the clock, it may start running again in the past, i.e. at a time before the disaster happened. This may lead to local time values that are perceived as duplicates in the replication network, which must be prevented because it breaks causality and interferes with causality filters.

Therefore, during metadata recovery, the local time of the local vector clock must be set to the maximum of the values seen at all other directly connected locations. Furthermore, the replication progress recorded at directly connected locations must be reset to the sequence number of the last event in the local event log, or to zero if the local event log is completely lost.
Event recovery means replicating events back from locations that are directly connected to the location to be recovered. These directly connected locations must have unfiltered bi-directional replication connections with the location to be recovered. For example, a location `A` can only be recovered from a location `B` if both replication connections `A -> B` and `A <- B` are unfiltered.

Event recovery over filtered connections is only supported for terminal locations, i.e. locations that are connected to only one other location. When replicating events to the terminal location during recovery, application-specific replication filters must be OR-ed with a filter that accepts events that have been initially emitted at that terminal location.
Restrictions also apply to event recovery from locations with deleted events. Assuming that location `A` should be recovered from location `B`, event recovery is only allowed if the deletion version vector of `B` causally precedes or is equal to the current version vector of `A`, i.e. `DVV-B → CVV-A` or `DVV-B = CVV-A`.

Assuming that location `A` can be recovered from locations `B` and `C` but only `B` meets this condition, a first round of event recovery must be attempted from `B`. After recovery from `B` has completed, `CVV-A` has an updated value and the condition must be evaluated for `C` again. If it is met, event recovery must be attempted from `C` in a second round.
Hint: Applications that want to delete events older than `n` days from their local event logs should consider local event log backups at intervals of `m` days with `m < n` in order to meet the conditions for event recovery from locations with deleted events.
### Location addition

Location addition requires additional rules, as a new location may introduce a new cycle which may conflict with ongoing event deletion as outlined in Cyclic replication networks and event deletion (see Appendix). For example, consider an acyclic replication network `ABC`:

```
A --- B
      |
      C
```
`A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3` which are replicated to `B` but not yet to `C`. Then `A` deletes events `e1` and `e2`, which is allowed by the rules established so far. Now, location `D` is added

```
A --- B
|     |
D --- C
```
and two bi-directional replication connections, `A - D` and `C - D`, are established. This may lead to a situation where `e3` is replicated from `A` to `D` and then to `C`. As a consequence, the causality filter at `C` will later reject events `e1` and `e2` from `B`.
To prevent this anomaly, further rules must be defined for event replication over new replication connections. This doesn't only apply to connections to and from new locations but also to new replication connections between existing locations: event replication over a new replication connection from location `X` to location `Y` may only start if the deletion version vector of `X` (`DVV-X`) causally precedes or is equal to the current version vector of `Y` (`CVV-Y`), i.e. `DVV-X → CVV-Y` or `DVV-X = CVV-Y`.

Applying this rule to the above example, replication between `C` and `D` would start immediately in both directions, and between `A` and `D` only in direction from `D` to `A`. Replication from `A` to `D` only starts after `D` has received events `e1` and `e2` from `C`, preventing the anomaly that `C` rejects events `e1` and `e2`.
For the special case that location `D` is not even interested in getting all past events from `A` and `C` but only wants to receive new events, it must first set both its `CVV` and its `DVV` to the least upper bound of the `CVV`s of locations `A` and `C`. More generally, a new location `X` that only wants to receive new events from all new directly connected locations `Y1` - `Yn` must additionally set its `CVV-X` and `DVV-X` to `LUB(CVV-Y1, ..., CVV-Yn)` before applying the previous rule.
### Location removal

Locations can be permanently removed from a replication network. After having been permanently removed, they are referred to as retired locations. Retired locations do not contribute to a replication network topology even if they have been part of its history. The identifiers of retired locations can also be removed from vector clocks.
In order to enforce the constraints of the extended system model, each location must have a global view of the replication network topology including its full history. Assuming that each location emits system events with information about topology changes, replication filter changes and location retirements, each location can construct such a view with eventual consistency.
## Appendix

The following subsections give some examples of the replication anomalies described in section Current issues.
### Cyclic replication networks with filtered connections

Context:

- Cyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between all of them
- Replication in direction `A -> C` is filtered so that only event `e3` is replicated.

```
A --- B
 \   /
   C
```
Scenario:

- `A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e3` is replicated from `A` to `C`
- `e3` is replicated from `C` to `B`
- Event replication from `A` to `B` starts

Problem:

- `B` will never store `e1` and `e2` in its event log because they causally precede `e3`
### Cyclic replication networks and event deletion

Context:

- Cyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between all of them

```
A --- B
 \   /
   C
```
Scenario:

- `C` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e1`, `e2` and `e3` are replicated to `B`
- `B` deletes `e1` and `e2`
- `e3` is replicated from `B` to `A`
- Event replication from `C` to `A` starts

Problem:

- `A` will never store `e1` and `e2` in its event log because they causally precede `e3`
### Disaster recovery over filtered connections

Context:

- Acyclic replication network with locations `A`, `B` and `C`
- Bi-directional replication between `A` and `B` as well as `A` and `C`
- Replication in direction `A -> B` is filtered so that only events `e1` and `e2` are replicated.
- Replication in direction `A -> C` is filtered so that only event `e3` is replicated.

```
A --- B
 \
   C
```
Scenario:

- `A` emits events `e1`, `e2` and `e3` with `e1 → e2 → e3`
- `e1` and `e2` are replicated to `B`
- `e3` is replicated to `C`
- `A` loses all events (disaster)
- `A` is recovered by first replicating `e3` from `C`
- Then, replication of events `e1` and `e2` from `B` is attempted

Problem:

- `A` will never store `e1` and `e2` in its local event log because they causally precede `e3`
## Replication partner fail-over

An additional requirement, not yet covered by the above system model and its extensions, is the possibility to switch to another replication partner over a filtered replication connection. A typical use case is fail-over to another replication partner if the current replication partner remains unavailable for too long. For example, location `D` is by default connected to location `A` and the connection is filtered.

```
  D
 /
A --- B
 \   /
   C
```
With the above system model, `D` is not allowed to have an additional replication connection to `B`, for example, neither at the same time nor in the past, as it would introduce a cycle with a filtered connection. However, if switching to another location is subject to further constraints, event loss similar to that explained in Cyclic replication networks with filtered connections can be prevented.

Assuming that with every event replication batch (even if empty) transmitted from `A` to `D`, `A`'s `CVV-A` is transmitted too and cached at `D`, it can be used by `B` as a condition for switching from `A` to `B`. Switching to `B` is only allowed if `CVV-A` causally precedes or is equal to `CVV-B` when a connection attempt is made. If `B` doesn't meet the condition, `D` could retry later or try connecting to `C` by comparing `CVV-A` to `CVV-C`.
Connections that have been established under these conditions over time don't need to be projected onto a current replication network view and therefore don't introduce a cycle in which `D` participates. Only the current connection needs to be considered.
A generalization of fail-over is having filtered replication connections from `D` to `A` and `B` at the same time, with additional constraints for writing events to `A` and `B`.

```
  D
 / \
A --- B
 \   /
   C
```
In this case, it is important that both connections, `D - A` and `D - B`, use the same replication filter, although the filters in different directions may differ, i.e. filters from `D` may differ from those to `D`. The cycle `ABC` is unfiltered.

An event `e1` replicated from `D` to `A`, for example, must not only pass the causality filter at `A` but also needs to be validated: the `ABC` part of its vector timestamp, `vts(e1, ABC)`, must causally precede or be equal to `CVV-A`. If this is not the case, replication is rejected and must be retried later; otherwise the event is written to `A`.
Although there is now a cycle `ABD` containing replication filters, this additional validation, together with the identical filter constraint, ensures that replication anomalies similar to those explained in Cyclic replication networks with filtered connections cannot occur. The scenario can be extended with an additional replication connection to `C` using the same replication filter. A filtered replication connection within the cycle `ABC` is not allowed.

A further generalization is possible. Assuming `D` is a composite location, being an unfiltered cycle `DEF`, any location of `DEF` can be connected to any other location in `ABC` with a filtered replication connection, provided that all filtered replication connections from `ABC` to `DEF` use the same replication filter (and vice versa). The following example connects `D` with `A` and `E` with `B`:
```
  F
 / \
D --- E
|     |
A --- B
 \   /
   C
```
Events replicated from `D` to `A` and from `E` to `B` must be validated as described above. For an event `e2` replicated in the opposite direction, for example from `A` to `D`, the `DEF` part of its vector timestamp, `vts(e2, DEF)`, must be compared to `CVV-D` and evaluated accordingly.
Assuming that the

- cycle `ABC` is a replicated application A1 where `A` and `C` run in data center 1 and `B` runs in data center 2, and
- cycle `DEF` is a replicated application A2 where `D` and `F` run in data center 1 and `E` runs in data center 2,

then a network partition between data center 1 and data center 2 still allows these two applications to communicate within their respective data centers.