The process of coupling involves running each couple
to synchronize the
content of the associated Couple Source Table
and a Target Table
. The
process is identical irrespective of how the Couple Source Table
is actually
constructed. A function is automatically produced within Theon by
derivation for the TheonCoupler to implement each couple
based on its
configuration. The coupling process is where the actual synchronisation of
data from the Stream Source Table
(again irrespective of construction) is
made against a final target table. Most stream configurations would have more
than one couple
configured in order to push the data into multiple target
tables, but need not have any when just transferring external data into the
database with no further processing.
The basic synchronisation mechanism is pretty standard and is illustrated in Figure 20.1, “Synchronisation Mechanism”.
On the far left are the three records in the stream (external data). They have
three fields called H
, C1
and C2
. The first, H
is the couple Handle
and is used as the key value to tie records between the stream data and the
local data. The Handle
(or all the columns that make it up) must be unique
across the Couple Source Table
being used (not necessarily unique across
the underlying Stream Source Table
if any). The other fields, C1
and C2
, contain the actual
data to be synchronised. Note that the stream data may have many more fields,
however its only these particular fields being used in this illustrated
couple
. Next the local table also has three records. The fields labelled
C3
and C4
are not part of the synchronisation and so are unaffected by the
process. The fields C1
and C2
are being maintained by the synchronisation
process. The H
field is the local copy of the stream handle field. The PK
field is the local primary key for the table. Note that this is not usually the
same as the synchronisation process handle. While some configurations may use
PK
and H
interchangeably, with external data this is rare. This is because
it is unlikely that the value of H
from external data is compatible with the
local record structure and primary key generation, so for example consider how
new records and primary keys be created consistent with the value of H
locally, particularly if the full extent of H
values is unknown and also if
local records need to be created that are not part of the synchronisation
process (so have no need to use a H
value). This is an important concept to
re-iterate, the value used as the synchronisation key (H
) is not usually the
same as the local table primary key.
After the couple
has been run there is new version of local data. We can see
that in the second record C1
has been updated to match the value of C1
in
the corresponding stream data record. The third record which was not
originally present in the local data has been added - note that a new local
primary key has been created for it (by whatever normal process does this).
The unsynchronised fields are blank (or would default to whatever is defined).
The fourth record which is no longer in the stream data has been marked as
deleted. Note that by default at this stage the synchronisation process does
not delete any records, they are simply marked as deleted. Downstream
processes may need to explicitly refer to this status to exclude record no
longer in the stream data as they would otherwise still be included (in
reports and views based on the local data for example). Finally the metadata
held with the local data indicates the changes with a timestamp of when they
occured.
A single local table (couple
target table) can be striped vertically and
horizontally by more than one couple
(and ususally underlying stream
).
That is, different sets of columns within the table can be managed by
different streams and couples and different sets of records within the table
can be managed by different streams and couples. While it is possible for a
column to be shared by more than one couple this can be problematic (as there
is no explicit coordination or prioritisation across streams). More
commonly sets of records can be shared by different streams and couples,
although this can cause similar issues.
Vertical striping occurs when more than one couple
is maintaining a set of
columns in a single target table. Each couple
may all be associated with a
single stream
but it is much more likely each is associated with a different
stream
(the sources of data for each set of columns are originating from
different external systems for example). This is illustrated in
Figure 20.2, “Vertical Striping (Commoned)”.
Here there are two streams and in this example they have a common handle, that
is there is a unique key that both streams, whatever their external origin,
have to identify the same record in each. In line one Stream2
has caused an
update to C4
in the local table. In line two Stream1
has caused an update
to C1
in the local table. In line three Stream1
creates a new record after
which columns C3
and C4
would initially have no (or the default) value.
Then Stream2
updates columns C3
and C4
. Note that there is no explicit
stream ordering unless imposed by the external stream data and/or local site
management. Hence it is possible, in this example, for Stream2
to have
created the record and Stream1
then updated columns C1
and C2
. In
practice it makes no difference here. In line four Stream1
does not have
record with handle value D
anymore so it is marked for deletion. Note though
that if Stream2
did still have this handle value and it ran after Stream1
then this record would be immediately reinstated in the local table, except
that the previous column values maintained by Stream1
for C1
and C2
would be reset to their local table defaults.
When multiple streams (and/or couples) are updating different column sets in a
single target table it is important to consider how they will interact and the
desired behaviour. Consider, for example, the ordering of the streams, is it
determinate or indeterminate and does it matter which order they run in. It
is possible to configure a couple
as update-only. This means it will not add
records into the local table that are in the stream but missing in the local
table. Similarly it will not mark for deletion records no longer in the stream
data but still in the local table. In the example above Stream2
could have
been configured as update-only. Then in line three it will update the column
values for C3
and C4
only once Stream1
has created that record.
Similarly in line four if it still had a record with handle value D
that
Stream1
did not and had marked as deleted in the local table, then it would
not reinstate that record. This approach is commonly used to define one stream
as a master and other streams simply supplement the records if and only if the
master stream has created them. This can also be used to accommodate
propogation delays with upstream systems, where common data may be available
in one but not in another until some time later - without this local table
records could flip flop between being deleted and created. Other approaches
usually involve making a correlated couple source table - here Stream1
and
Stream2
would be joined (by using a view usually) in some way to ensure
consistency of handle values across both.
Sometimes the relevant streams will not have a common handle however, this is illustrated in Figure 20.3, “Vertical Striping (Independent)”.
Lines one and two work the same as before - the match for each stream is made
on a different handle column but the end result is the same. Lines three and
four show the default behaviour for creation, which is sometimes what is
wanted but often not. In line three Stream1
has a record with handle value
C
which is not in the local table (under H1
) so it is created. However it
does not have any information as to the handle H2
value when creating the
record so that is left blank (or local table definition default). Then in line
four Stream2
has a record with handle value Z
which is not in the local
table (under H2
) so it is also created. Again the handle value to use, for
H1
in this case, is unknown so left blank. The two independent records in
the local table will now be independently maintained by the two streams (this
has indirectly caused horizontal striping, see next section for more details).
If this is not the desired behaviour something needs to be done to prevent it.
The approach to take will depend on the context and how to access the relevent
stream handle values. One option is to correlate the streams to so that each
has the others handle value, which can just be treated as normal column data
to update. If this is possible it may also be possible to simply merge down
the streams anyway and use a single couple. Another option is to define a key
lookup for one of the streams and define the other as update-only. So for
example here Stream1
would create the record with H1
and then as part of
that process include a lookup to get the value for H2
to add at the same
time. Then Stream2
won’t do anything until Stream1
has created a record
with a value in H2
in the local table it can match against and then update.
This is variant of correlation and may or may not be a better approach
depending on the context. Another option is to define the local table default
value for H1
and H2
to do the lookup on a creation trigger. This is not
then something defined as part of TheonCoupler but part of the underlying
schema definition done in Theon. Finally in line five a record is
being deleted. As with previous example if the record is in one stream but not
the other it will flip flop between being deleted and being recreated - the
same approaches to address this apply here.
In both example cases above vertical striping only occurs if the streams (or
the component couples) share a common identity (tag) representing their data
set in the local target table, otherwise horizonatal striping will occur. This
is shown in the illustrations as the tag name mystream
underneath each. In
any stream data set records with an empty handle are ignored. This is how the
two indpendent records created in lines three and four in the illustration in
Figure 20.3, “Vertical Striping (Independent)” can be maintained in the common stream data set (without flip
flopping between being created and deleted). In effect the handle value is
acting as an additional horizontal striping key (as in using parameters).
There would be an issue here if one or other or both of H1
and H2
was
defined in the local target table as having a mandatory value - in which case
they could not be maintained independently in the same stream data set.
It does not normally make sense to have the same column maintained by more than one stream couple concurrently. Rather there are few scenarios where it is of any use, the column value will just flip flop to whatever the value is in the last run stream couple. Again correlated streams can be used to mitigate this effect, but is essentially redundant as only one stream is then needed to manage the value generally. There are however scenarios where "last stream wins" is applicable as the stream ordering is managed and manually varied as necessary. Similarly explicitly (and always) paramaterized stream couples may make it applicable in some circumstances.
However, two streams might maintain the same column sequentially. That is one stream may maintain it for a period of time and another stream maintain it for a different period of time, but they both never maintain it at the same time concurrently. This needs to be done with horizontal striping (either directly or indirectly within a common stream data set as above) and use of the handover mechanism, see section below.
Horizontal striping occurs when more than one couple
is maintaining a set of
records in a single target table. Each couple
may all be associated with a
single stream
but it is much more likely each is associated with a different
stream
(the sources of data for each set of records are originating from
different external systems for example). Basically this looks as illustrated
in Figure 20.4, “Horizontal Striping”.
The details of the couple
process for two separate streams is illustrated in
Figure 20.5, “Horizontal Striping (Distinct Handles)”.
Since each stream has a different handle in the local target table it is not
necessary to distinguish the streams by giving each a unique identity tag,
although it wouldn’t harm. Here each is called mystream
and the record
origin will be marked the same. This only works because handle values that are
empty in the local target table are ignored by that couple for matching
purposes. If this was not the case each stream would need a distinct name,
otherwise they will step on each others toes, flip flopping between
deleting/creating records.
You can see that Stream1
behaves as before. Note that records contain
metadata defining the "owning" stream. Here Stream2
works the same as
Stream1
but maintains a completely independent set of data. The correlation
of Stream2
is done on H
value matching H2
in the local target table, in
Stream1
the H
value is matching on H1
in the local target table. In
Stream1
columns C1
and C2
are mapped to C1
and C2
in the local
target table. In Stream2
columns C1
and C2
are mapped to C3
and C4
in the local target table. When a record is deleted from either stream the
metadata defining the owner is blanked (see Handover).
Sometimes multiple streams will want to use the same handle in the local target table and for operational reasons (often as the handle is subsequently exposed to the users and/or reporting systems) it is not practical to use a repeated column in the local target table (as this is otherwise as option). When this is the case each stream must have a different and unique identity tag to define the data set in the local target table it operates against. This is illustrated in Figure 20.6, “Horizontal Striping (Common Handles)”.
Here both streams correlate the match on the H
column in the local target
table. The values of H
would not be expected to overlap in Stream1
and
Stream2
, that is, each stream would have its own distinct set of values
which do not occur in the other stream. A distinct stream identity tag is
required otherwise when one stream is processed it will see non-empty handle
values in the local target table that are not its own stream data set and
correspondingly mark them as deleted. When the other stream is processed they
will get added back but others from the first stream may then get deleted.
This causes flip flopping.
In summary a stream data set in striping is identified (uniquely) by the stream identity tag and/or the specific handle columns. We will see later that it can be further constrained by handle parameters.
Illustrated in Figure 20.7, “Dynamic Stream Tag” is usage of a dynamic stream identity tag for
horizontal striping. This is where the tag of a stream included variable data,
potentially taken from the content of the stream data. In the example
illustrated here this is used to preserve historical data without
automatically marking it as deleted (since it may no longer be in the stream
data). In this example the stream data switches annually to only contain that
years data. The year is embedded in the tag of the stream. So while in 2012
and with 2012
data the stream maintains records in the local table with a
tag of mystream2012
. When the year changes to 2013
the tag name changes to
mystream2013
and all the records currently marked with mystream2012
are no
longer part of the streams (horizontal stripe) data set and are ignored and
left alone, and crucially not deleted even though those records are not part
of the stream data anymore. This is a very useful technique to minimise the
potential data sets that need to pulled from external upstream sources, by
effectively defining a working set window of data to use. It does however
interfere with subsequent purge operations as the records tagged in
mystream2012
are effectively now orphaned. This can be worked around by
using a dummy stream supplying the now archived tags which is empty and kicks
in after a certain delay and marks the records as deleted, with a purge
defined to kick in after a further delay.
Noted above is that when a record is marked as deleted the record ownership is
blanked (what was the stream identity tag). When a couple
process
identifies a correlating handle value and the local table record is marked as
deleted and the ownership is blank then the couple will claim ownership and
undelete the record. This is illustrated in Figure 20.8, “Stream Handover”.
This allows different streams to maintain the same records over time as data is supplied from different external systems, for example, in particular where there is a common handle value over the lifetime of the data element.
By default columns managed by a couple
have local update override enabled. This
adds per-column meta data to control. See the illustration in Figure 20.9, “Update Override”
The first line shows the state as at the end of the previous illustration, the
couple process has modified C1
to 3
to reflect the value in the stream
data and the Orgin
has been marked as being the stream. In the second line
the user has overridden that value locally to 8
and the Origin
is now
marked as local (in practice the user who made the change). Now on the next
coupling process despite the stream value being 3
because the meta data has
marked the Origin
as local the overridden value is left unchanged. On the
final line the value in the stream changes to 8
matching the locally
modified value. As a result the couple "reclaims" ownership of the value and
the Origin
is reverted to being the stream. Consequently if the value in the
stream data changed again, back to 3
for example, then the local value would
also be changed. Local overrides are only "held" while they differ from the
value in the stream data. Any column can optionally have this "reclaim"
disabled meaning that the local value will be "held" locally even if the
stream data value changes to match. Once locally overridden such values will
never be reclaimed are updated by the sync process again (without manual low
level intervention to alter the corresponding meta data). Any column can also
optionally have the local override facility removed, often the case where
there is no requirement to do so in internal only streams. In this case the
overridden value would have been immediately reverted back to 3
on the next
sync following the local change.
Locally records can be created that are not part of the sync process. Like
local override for update, by default if these records are given a value for
H
and that value eventually matches a value in the stream data then the
stream will "reclaim" the record and update the field values to match
(although by default if the values differ they will be held as local update
overrides as above). For this to work the records must be given a value of H
that is valid and will at some point appear in the stream data. If not they
will never be reclaimed and will be left as local insert overrides. This is
illustrated in Figure 20.10, “Insert Override”. In the first line a new record is created. In
the second line despite it not appearing in the stream data it is retained
(not marked as deleted). In the third line the locally added records handle
value appears in the stream data is reclaimed, but its values for fields are
left as is since they constitute a local update override. Finally in the
fourth record they stream data field values have changed to match and the
field origin is reverted to stream.
As for local update override the default behaviour can be changed so that new
locally created records are added "out of sync" meaning they will never be
reclaimed, even if they have a value of H
that matches a value in the stream
data - in such a situation an additional record will be created with that
value of H
. This will be a problem if H
is being maintained as unique
across the local table for example.
Locally added records can be deleted as normal. However sync added records can
not be simply deleted. If they were the following sync would simply re-instate
them. However, a local delete override facility is available but this type is
not enabled by default. The reason for this is that it traps the normal
database behaviour for the SQL delete
command. On a delete
the record is
deleted as expected but a copy of the relevant couple handles is made into an
axuiliary table (automtically created by TheonCoupler if needed). Stream
processing uses this auxiliary table to filter the couple source table,
excluding records that match, thus preventing the record from being re-instated
again. Since this affects the standard operation of the SQL delete
command
other applications may need to be carefully written to account for it. This
behaviour can also have complex knock on consequences (in particular with
constraint handling) so should be used sparingly only where necessary. As for
the other update overrides, if the record also then disappears from the stream
data then the record will be marked as a normal deletion and the local deletion
override status lost (hence it could be reinstated if it reappeared in the
stream).
See illustration in Figure 20.11, “Delete Override”. In the first line a record is locally deleted. Although the illustration shows no record it (or the handle part) will have been transferred to the auxiliary table. Note that there may be more than one auxiliary table as one is defined for each stream. In the second line despite the record appearing in the stream data it is not re-instated. In the third line it disappears from the stream data, so this results in the meta data being changed to reflect the stream being the origin for the deletion. Finally in the fourth line the record reappears in the stream data and so is dutifully reinstated into the local data. Again it is optionally possible to change this behaviour so the local deletion is "held" effectively for all time (even to the exclusion of a purge operation, see later).
Often the column values provided in the stream data cannot be directly mapped into the local target table. This is where their value is effectively a handle onto a record in another table which has a foreign key relationship with the local target table.
This can be addressed in a couple of ways. An intermediate couple source table
can be constructed to resolve the necessary lookup (a view for example). Or
the native couple
configuration can be used to apply the relationship
directly. The latter approach is usually simpler as it uses the relationship
data which is part of Theon anyway and avoids defining a specific view
just for this purpose. It can’t be used for multiple hops (e.g. more than one
relationship has to be followed to derive the actual column value) or where
the mapping is not based on any schema defined constraint.
The lookup configuration for a couple
can be defined as mandatory or
non-mandatory (the constraint can not be changed though, so a non-mandatory
constraint can be made mandatory for the purpose of the coupler
process but
not vice-versa). It can be applied to a normal column value or the Handle
column value if necessary.
Sometimes one couple
will be populating a local target table which holds
the primary key for a foreign key constraint on another local target
table of a couple
. In this case the two couple
processes must be run in
the correct order to ensure the primary key is in place before the key lookup
of a couple
is used to populate records in the child table. This can be done
by giving the couples in a stream a specific order. This is done automatically
where the couple
configuration has been used to define the key lookup, but
would have to be done manually otherwise (and also where there is a dependency
not based on any schema defined constraint).
The couple
processes are normally executed by TheonCoupler as part of
processing the stream data (when an updated set has been received, or a
scheduled fetch has taken place). It is also possible to run the couple
sync
process manually as a function call in psql
, or preferably via the toolkit
commands. In this context it is possible to pass parameters. Every couple
process has one or more parameters, one for each of the columns used in the
Handle
. These can be used to constrain the process to a subset of the full
data. In essence this adds an additional element to the horizontal striping
constraint. This is illustrated in Figure 20.12, “Paramaterized Couple Call”.
The normal behaviour is shown in the first set, when the couple
process is
called with no parameters (f()
). All records are applied against the local
target table and two have been updated to match. In the second set the
couple
process is called with a single handle value (f©
) and so only
that single record is applied against the local target table and only that one
is updated. If it did not exist in the local target table a record would be
created for it. Normally a couple
process is called with a Handle
value
that exists in the stream data. If it is called with one that does not and
a correlating record exists in the local target table then that single record
will be deleted. Records are not otherwise deleted outside of the data set
space defined by the parameters.
Only one single value match can be made with a parameter and it must be exact,
so with a single Handle
column the reduced data set can only ever be at most
one record. With a multi column Handle
then each parameter supplied that is
non-null must match the corresponding column for a match. In this case more
than one record can be chosen as a subset (provided each Handle
column used
is on its own not unique).
Paramaterized couple
calls are commonly used in event processing where one
event results in a change to the stream data which needs to be immediately
applied to the relevant local target tables. This is commonly much more
efficient than rerunning all the couple
processes with the full data set
(particularly if no other values are known to have changed, or if they have it
is not relevant for the immediate consequence of the event).
Sometimes the stream data does not have quite the right types or values for a direct column to column update. For example the column types may need to be cast, the values may need to be explicitly mapped to align properly with an enumerated column value, or a simple static value may be missing. All these cases could be corrected by creating a view and using that as the couple source table instead of the stream table itself. However for small adjustments like this the couple column to column mapping can be complemented by simple SQL expressions and literals to achieve the same result.
Any source column can be replaced by a single literal SQL expression, this
could include a full sub-select of course since that is allowed in an SQL
expression - however that is usually better done as a view. Where a source
column has been defined for the mapping then the literal SQL expression can
rewrite the value before mapping and an escape character can be used (@
) to
represent the source column value in the expression.
Records maintained by TheonCoupler are not by default deleted. They are just marked as "deleted". This has advantages in that bad upstream data cannot destroy a lot of data. It is also often beneficial when upstream data goes away before it needs to go away locally, in particular when it transitions across different upstream systems and it should not be locally deleted during the transition. This is often the case as the data pulled in externally will have a lot of local data attached to it which might be lost if records are automatically deleted (particularly if in error). A disadvantage is that downstream systems need to explicitly accommodate deleted records - for example by adding handling in reports and user applications to hide them.
The Purge
feature of TheonCoupler can be used to really delete records in
local target tables that are synchronised and no longer have a corresponding
instance in the stream data. This can be immediate or deferred. When a
couple
runs with an immediate purge enabled it will first mark records as
deleted as normal and then run a purge cycle which will actually delete the
records from the local target table. This is done atomically as a single
process along with the rest of the couple
and any other couples associated
with the stream. With a deferred purge enabled the records are marked as
deleted as before, but the corresponding timestamp of deletion is used as a
marker to start a deferred purge cycle. As long as the stream is still running
once the configured period from the original deletion date has been reached
then the records will be really deleted from the local target table.
Different records will be deleted at different times based on their original
deletion date (or most recent deletion date if the record has been recreated
and then deleted again).
A deferred purge can be set to an interval of days, weeks, months or years. It
is particularly useful for cleaning out old data that is no longer relevant
against an organisations set retention period policy for that data. Note
however that the purge will only take place while the stream with the
corresponding identity tag is still running, as it is done as part of the
normal couple
processing. If the stream stops running for any reason, the
identity tag changes or structural changes mean the couple
for the relevant
target table is no longer deployed then the purge will never take place
(automatically, of course meta data for TheonCoupler could still be used to
construct a manual purge in these cases). What local records are deleted along
with a purge depends on the local target table delete cascade rules as defined
through Theon.
A local target table for a couple
will have a number of additional columns
automatically added to it through mechanisms within the XSDDB and exported via
the SchemaTree
normal DDL derivation under Theon. These do not
generally need to be used or exposed as they are managed and used solely by
the relevant couple
and associated processes and events driving the
synchronisation process. So they can be safely ignored, however some are
useful in particular for analysing problems with any TheonCoupler process.
Created
is the timestamp when the record was first created by any couple
.
It is not set for manually added records (done by user outside of a couple
process), column name is th_createdat
Deleted
is the timestamp when the record was most recently marked for
deletion by a couple
, column name is th_deletedat
Increment
is the change index value and used to isolate couple
processes
from normal user processes as well as track the number of changes applied
(very much higher than average values indicate flip flopping for example); new
user created records will have this set automatically in a way that indicates
whether should be included in any future couple
process (a value of 1 means
they will be), but explicitly setting this column on manual creation of records
should be avoided except when specifically trying to influence behaviour,
column name is _th_increment
STREAM-Identity
is the stream identity tag, there is one of these columns
for each STREAM
with a couple
with this table as a local target table,
column name is _th_STREAM_changeset
; holds the stream specific handle value
(potentially this can be dynamically constructed) and used to separate sets of
data in one source table.
COLUMN-Origin
and COLUMN-Change
are added for each COLUMN
being
maintained by a couple
(may be shared by more than one STREAM
for
horizontal striping, but generally unwise without care if shared by more than
one STREAM
for vertical striping); the value is blank for Origin
if the
last change was made by the couple
and the Change
will have the timestamp;
the value will be the local session username of the local user that changed the
value and the timestamp otherwise, used to retain the local update override,
column names are _th_COLUMN_changedby
and _th_COLUMN_changedat
A local target table for a couple
will have an event automatically added
to it. This traps normal operations on the table so as to update the relevant
meta data fields. These should not normally be a concern, but they may
interfere with any other user events.