Chapter 20. Coupling the Stream Source Table

Prev		Next

The process of coupling involves running each couple to synchronize the content of the associated Couple Source Table and a Target Table. The process is identical irrespective of how the Couple Source Table is actually constructed. A function is automatically produced within Theon by derivation for the TheonCoupler to implement each couple based on its configuration. The coupling process is where the actual synchronisation of data from the Stream Source Table (again irrespective of construction) is made against a final target table. Most stream configurations would have more than one couple configured in order to push the data into multiple target tables, but need not have any when just transferring external data into the database with no further processing.

20.1. Mechanism

The basic synchronisation mechanism is pretty standard and is illustrated in Figure 20.1, “Synchronisation Mechanism”.

Figure 20.1. Synchronisation Mechanism

On the far left are the three records in the stream (external data). They have three fields called H, C1 and C2. The first, H is the couple Handle and is used as the key value to tie records between the stream data and the local data. The Handle (or all the columns that make it up) must be unique across the Couple Source Table being used (not necessarily unique across the underlying Stream Source Table if any). The other fields, C1 and C2, contain the actual data to be synchronised. Note that the stream data may have many more fields, however its only these particular fields being used in this illustrated couple. Next the local table also has three records. The fields labelled C3 and C4 are not part of the synchronisation and so are unaffected by the process. The fields C1 and C2 are being maintained by the synchronisation process. The H field is the local copy of the stream handle field. The PK field is the local primary key for the table. Note that this is not usually the same as the synchronisation process handle. While some configurations may use PK and H interchangeably, with external data this is rare. This is because it is unlikely that the value of H from external data is compatible with the local record structure and primary key generation, so for example consider how new records and primary keys be created consistent with the value of H locally, particularly if the full extent of H values is unknown and also if local records need to be created that are not part of the synchronisation process (so have no need to use a H value). This is an important concept to re-iterate, the value used as the synchronisation key (H) is not usually the same as the local table primary key.

After the couple has been run there is new version of local data. We can see that in the second record C1 has been updated to match the value of C1 in the corresponding stream data record. The third record which was not originally present in the local data has been added - note that a new local primary key has been created for it (by whatever normal process does this). The unsynchronised fields are blank (or would default to whatever is defined). The fourth record which is no longer in the stream data has been marked as deleted. Note that by default at this stage the synchronisation process does not delete any records, they are simply marked as deleted. Downstream processes may need to explicitly refer to this status to exclude record no longer in the stream data as they would otherwise still be included (in reports and views based on the local data for example). Finally the metadata held with the local data indicates the changes with a timestamp of when they occured.

20.2. Striping

A single local table (couple target table) can be striped vertically and horizontally by more than one couple (and ususally underlying stream). That is, different sets of columns within the table can be managed by different streams and couples and different sets of records within the table can be managed by different streams and couples. While it is possible for a column to be shared by more than one couple this can be problematic (as there is no explicit coordination or prioritisation across streams). More commonly sets of records can be shared by different streams and couples, although this can cause similar issues.

20.2.1. Vertical

Vertical striping occurs when more than one couple is maintaining a set of columns in a single target table. Each couple may all be associated with a single stream but it is much more likely each is associated with a different stream (the sources of data for each set of columns are originating from different external systems for example). This is illustrated in Figure 20.2, “Vertical Striping (Commoned)”.

Figure 20.2. Vertical Striping (Commoned)

Here there are two streams and in this example they have a common handle, that is there is a unique key that both streams, whatever their external origin, have to identify the same record in each. In line one Stream2 has caused an update to C4 in the local table. In line two Stream1 has caused an update to C1 in the local table. In line three Stream1 creates a new record after which columns C3 and C4 would initially have no (or the default) value. Then Stream2 updates columns C3 and C4. Note that there is no explicit stream ordering unless imposed by the external stream data and/or local site management. Hence it is possible, in this example, for Stream2 to have created the record and Stream1 then updated columns C1 and C2. In practice it makes no difference here. In line four Stream1 does not have record with handle value D anymore so it is marked for deletion. Note though that if Stream2 did still have this handle value and it ran after Stream1 then this record would be immediately reinstated in the local table, except that the previous column values maintained by Stream1 for C1 and C2 would be reset to their local table defaults.

When multiple streams (and/or couples) are updating different column sets in a single target table it is important to consider how they will interact and the desired behaviour. Consider, for example, the ordering of the streams, is it determinate or indeterminate and does it matter which order they run in. It is possible to configure a couple as update-only. This means it will not add records into the local table that are in the stream but missing in the local table. Similarly it will not mark for deletion records no longer in the stream data but still in the local table. In the example above Stream2 could have been configured as update-only. Then in line three it will update the column values for C3 and C4 only once Stream1 has created that record. Similarly in line four if it still had a record with handle value D that Stream1 did not and had marked as deleted in the local table, then it would not reinstate that record. This approach is commonly used to define one stream as a master and other streams simply supplement the records if and only if the master stream has created them. This can also be used to accommodate propogation delays with upstream systems, where common data may be available in one but not in another until some time later - without this local table records could flip flop between being deleted and created. Other approaches usually involve making a correlated couple source table - here Stream1 and Stream2 would be joined (by using a view usually) in some way to ensure consistency of handle values across both.

Sometimes the relevant streams will not have a common handle however, this is illustrated in Figure 20.3, “Vertical Striping (Independent)”.

Figure 20.3. Vertical Striping (Independent)

Lines one and two work the same as before - the match for each stream is made on a different handle column but the end result is the same. Lines three and four show the default behaviour for creation, which is sometimes what is wanted but often not. In line three Stream1 has a record with handle value C which is not in the local table (under H1) so it is created. However it does not have any information as to the handle H2 value when creating the record so that is left blank (or local table definition default). Then in line four Stream2 has a record with handle value Z which is not in the local table (under H2) so it is also created. Again the handle value to use, for H1 in this case, is unknown so left blank. The two independent records in the local table will now be independently maintained by the two streams (this has indirectly caused horizontal striping, see next section for more details). If this is not the desired behaviour something needs to be done to prevent it. The approach to take will depend on the context and how to access the relevent stream handle values. One option is to correlate the streams to so that each has the others handle value, which can just be treated as normal column data to update. If this is possible it may also be possible to simply merge down the streams anyway and use a single couple. Another option is to define a key lookup for one of the streams and define the other as update-only. So for example here Stream1 would create the record with H1 and then as part of that process include a lookup to get the value for H2 to add at the same time. Then Stream2 won’t do anything until Stream1 has created a record with a value in H2 in the local table it can match against and then update. This is variant of correlation and may or may not be a better approach depending on the context. Another option is to define the local table default value for H1 and H2 to do the lookup on a creation trigger. This is not then something defined as part of TheonCoupler but part of the underlying schema definition done in Theon. Finally in line five a record is being deleted. As with previous example if the record is in one stream but not the other it will flip flop between being deleted and being recreated - the same approaches to address this apply here.

In both example cases above vertical striping only occurs if the streams (or the component couples) share a common identity (tag) representing their data set in the local target table, otherwise horizonatal striping will occur. This is shown in the illustrations as the tag name mystream underneath each. In any stream data set records with an empty handle are ignored. This is how the two indpendent records created in lines three and four in the illustration in Figure 20.3, “Vertical Striping (Independent)” can be maintained in the common stream data set (without flip flopping between being created and deleted). In effect the handle value is acting as an additional horizontal striping key (as in using parameters). There would be an issue here if one or other or both of H1 and H2 was defined in the local target table as having a mandatory value - in which case they could not be maintained independently in the same stream data set.

It does not normally make sense to have the same column maintained by more than one stream couple concurrently. Rather there are few scenarios where it is of any use, the column value will just flip flop to whatever the value is in the last run stream couple. Again correlated streams can be used to mitigate this effect, but is essentially redundant as only one stream is then needed to manage the value generally. There are however scenarios where "last stream wins" is applicable as the stream ordering is managed and manually varied as necessary. Similarly explicitly (and always) paramaterized stream couples may make it applicable in some circumstances.

However, two streams might maintain the same column sequentially. That is one stream may maintain it for a period of time and another stream maintain it for a different period of time, but they both never maintain it at the same time concurrently. This needs to be done with horizontal striping (either directly or indirectly within a common stream data set as above) and use of the handover mechanism, see section below.

20.2.2. Horizontal

Horizontal striping occurs when more than one couple is maintaining a set of records in a single target table. Each couple may all be associated with a single stream but it is much more likely each is associated with a different stream (the sources of data for each set of records are originating from different external systems for example). Basically this looks as illustrated in Figure 20.4, “Horizontal Striping”.

Figure 20.4. Horizontal Striping

The details of the couple process for two separate streams is illustrated in Figure 20.5, “Horizontal Striping (Distinct Handles)”.

Figure 20.5. Horizontal Striping (Distinct Handles)

Since each stream has a different handle in the local target table it is not necessary to distinguish the streams by giving each a unique identity tag, although it wouldn’t harm. Here each is called mystream and the record origin will be marked the same. This only works because handle values that are empty in the local target table are ignored by that couple for matching purposes. If this was not the case each stream would need a distinct name, otherwise they will step on each others toes, flip flopping between deleting/creating records.

You can see that Stream1 behaves as before. Note that records contain metadata defining the "owning" stream. Here Stream2 works the same as Stream1 but maintains a completely independent set of data. The correlation of Stream2 is done on H value matching H2 in the local target table, in Stream1 the H value is matching on H1 in the local target table. In Stream1 columns C1 and C2 are mapped to C1 and C2 in the local target table. In Stream2 columns C1 and C2 are mapped to C3 and C4 in the local target table. When a record is deleted from either stream the metadata defining the owner is blanked (see Handover).

Sometimes multiple streams will want to use the same handle in the local target table and for operational reasons (often as the handle is subsequently exposed to the users and/or reporting systems) it is not practical to use a repeated column in the local target table (as this is otherwise as option). When this is the case each stream must have a different and unique identity tag to define the data set in the local target table it operates against. This is illustrated in Figure 20.6, “Horizontal Striping (Common Handles)”.

Figure 20.6. Horizontal Striping (Common Handles)

Here both streams correlate the match on the H column in the local target table. The values of H would not be expected to overlap in Stream1 and Stream2, that is, each stream would have its own distinct set of values which do not occur in the other stream. A distinct stream identity tag is required otherwise when one stream is processed it will see non-empty handle values in the local target table that are not its own stream data set and correspondingly mark them as deleted. When the other stream is processed they will get added back but others from the first stream may then get deleted. This causes flip flopping.

In summary a stream data set in striping is identified (uniquely) by the stream identity tag and/or the specific handle columns. We will see later that it can be further constrained by handle parameters.

Figure 20.7. Dynamic Stream Tag

Illustrated in Figure 20.7, “Dynamic Stream Tag” is usage of a dynamic stream identity tag for horizontal striping. This is where the tag of a stream included variable data, potentially taken from the content of the stream data. In the example illustrated here this is used to preserve historical data without automatically marking it as deleted (since it may no longer be in the stream data). In this example the stream data switches annually to only contain that years data. The year is embedded in the tag of the stream. So while in 2012 and with 2012 data the stream maintains records in the local table with a tag of mystream2012. When the year changes to 2013 the tag name changes to mystream2013 and all the records currently marked with mystream2012 are no longer part of the streams (horizontal stripe) data set and are ignored and left alone, and crucially not deleted even though those records are not part of the stream data anymore. This is a very useful technique to minimise the potential data sets that need to pulled from external upstream sources, by effectively defining a working set window of data to use. It does however interfere with subsequent purge operations as the records tagged in mystream2012 are effectively now orphaned. This can be worked around by using a dummy stream supplying the now archived tags which is empty and kicks in after a certain delay and marks the records as deleted, with a purge defined to kick in after a further delay.

20.2.3. Handover

Noted above is that when a record is marked as deleted the record ownership is blanked (what was the stream identity tag). When a couple process identifies a correlating handle value and the local table record is marked as deleted and the ownership is blank then the couple will claim ownership and undelete the record. This is illustrated in Figure 20.8, “Stream Handover”.

Figure 20.8. Stream Handover

This allows different streams to maintain the same records over time as data is supplied from different external systems, for example, in particular where there is a common handle value over the lifetime of the data element.

20.3. Local Override

By default columns managed by a couple have local update override enabled. This adds per-column meta data to control. See the illustration in Figure 20.9, “Update Override”

Figure 20.9. Update Override

The first line shows the state as at the end of the previous illustration, the couple process has modified C1 to 3 to reflect the value in the stream data and the Orgin has been marked as being the stream. In the second line the user has overridden that value locally to 8 and the Origin is now marked as local (in practice the user who made the change). Now on the next coupling process despite the stream value being 3 because the meta data has marked the Origin as local the overridden value is left unchanged. On the final line the value in the stream changes to 8 matching the locally modified value. As a result the couple "reclaims" ownership of the value and the Origin is reverted to being the stream. Consequently if the value in the stream data changed again, back to 3 for example, then the local value would also be changed. Local overrides are only "held" while they differ from the value in the stream data. Any column can optionally have this "reclaim" disabled meaning that the local value will be "held" locally even if the stream data value changes to match. Once locally overridden such values will never be reclaimed are updated by the sync process again (without manual low level intervention to alter the corresponding meta data). Any column can also optionally have the local override facility removed, often the case where there is no requirement to do so in internal only streams. In this case the overridden value would have been immediately reverted back to 3 on the next sync following the local change.

Figure 20.10. Insert Override

Locally records can be created that are not part of the sync process. Like local override for update, by default if these records are given a value for H and that value eventually matches a value in the stream data then the stream will "reclaim" the record and update the field values to match (although by default if the values differ they will be held as local update overrides as above). For this to work the records must be given a value of H that is valid and will at some point appear in the stream data. If not they will never be reclaimed and will be left as local insert overrides. This is illustrated in Figure 20.10, “Insert Override”. In the first line a new record is created. In the second line despite it not appearing in the stream data it is retained (not marked as deleted). In the third line the locally added records handle value appears in the stream data is reclaimed, but its values for fields are left as is since they constitute a local update override. Finally in the fourth record they stream data field values have changed to match and the field origin is reverted to stream.

As for local update override the default behaviour can be changed so that new locally created records are added "out of sync" meaning they will never be reclaimed, even if they have a value of H that matches a value in the stream data - in such a situation an additional record will be created with that value of H. This will be a problem if H is being maintained as unique across the local table for example.

Figure 20.11. Delete Override

Locally added records can be deleted as normal. However sync added records can not be simply deleted. If they were the following sync would simply re-instate them. However, a local delete override facility is available but this type is not enabled by default. The reason for this is that it traps the normal database behaviour for the SQL delete command. On a delete the record is deleted as expected but a copy of the relevant couple handles is made into an axuiliary table (automtically created by TheonCoupler if needed). Stream processing uses this auxiliary table to filter the couple source table, excluding records that match, thus preventing the record from being re-instated again. Since this affects the standard operation of the SQL delete command other applications may need to be carefully written to account for it. This behaviour can also have complex knock on consequences (in particular with constraint handling) so should be used sparingly only where necessary. As for the other update overrides, if the record also then disappears from the stream data then the record will be marked as a normal deletion and the local deletion override status lost (hence it could be reinstated if it reappeared in the stream).

See illustration in Figure 20.11, “Delete Override”. In the first line a record is locally deleted. Although the illustration shows no record it (or the handle part) will have been transferred to the auxiliary table. Note that there may be more than one auxiliary table as one is defined for each stream. In the second line despite the record appearing in the stream data it is not re-instated. In the third line it disappears from the stream data, so this results in the meta data being changed to reflect the stream being the origin for the deletion. Finally in the fourth line the record reappears in the stream data and so is dutifully reinstated into the local data. Again it is optionally possible to change this behaviour so the local deletion is "held" effectively for all time (even to the exclusion of a purge operation, see later).

20.4. Key Lookup

Often the column values provided in the stream data cannot be directly mapped into the local target table. This is where their value is effectively a handle onto a record in another table which has a foreign key relationship with the local target table.

This can be addressed in a couple of ways. An intermediate couple source table can be constructed to resolve the necessary lookup (a view for example). Or the native couple configuration can be used to apply the relationship directly. The latter approach is usually simpler as it uses the relationship data which is part of Theon anyway and avoids defining a specific view just for this purpose. It can’t be used for multiple hops (e.g. more than one relationship has to be followed to derive the actual column value) or where the mapping is not based on any schema defined constraint.

The lookup configuration for a couple can be defined as mandatory or non-mandatory (the constraint can not be changed though, so a non-mandatory constraint can be made mandatory for the purpose of the coupler process but not vice-versa). It can be applied to a normal column value or the Handle column value if necessary.

Sometimes one couple will be populating a local target table which holds the primary key for a foreign key constraint on another local target table of a couple. In this case the two couple processes must be run in the correct order to ensure the primary key is in place before the key lookup of a couple is used to populate records in the child table. This can be done by giving the couples in a stream a specific order. This is done automatically where the couple configuration has been used to define the key lookup, but would have to be done manually otherwise (and also where there is a dependency not based on any schema defined constraint).

20.5. Parameters

The couple processes are normally executed by TheonCoupler as part of processing the stream data (when an updated set has been received, or a scheduled fetch has taken place). It is also possible to run the couple sync process manually as a function call in psql, or preferably via the toolkit commands. In this context it is possible to pass parameters. Every couple process has one or more parameters, one for each of the columns used in the Handle. These can be used to constrain the process to a subset of the full data. In essence this adds an additional element to the horizontal striping constraint. This is illustrated in Figure 20.12, “Paramaterized Couple Call”.

Figure 20.12. Paramaterized Couple Call

The normal behaviour is shown in the first set, when the couple process is called with no parameters (f()). All records are applied against the local target table and two have been updated to match. In the second set the couple process is called with a single handle value (f©) and so only that single record is applied against the local target table and only that one is updated. If it did not exist in the local target table a record would be created for it. Normally a couple process is called with a Handle value that exists in the stream data. If it is called with one that does not and a correlating record exists in the local target table then that single record will be deleted. Records are not otherwise deleted outside of the data set space defined by the parameters.

Only one single value match can be made with a parameter and it must be exact, so with a single Handle column the reduced data set can only ever be at most one record. With a multi column Handle then each parameter supplied that is non-null must match the corresponding column for a match. In this case more than one record can be chosen as a subset (provided each Handle column used is on its own not unique).

Paramaterized couple calls are commonly used in event processing where one event results in a change to the stream data which needs to be immediately applied to the relevant local target tables. This is commonly much more efficient than rerunning all the couple processes with the full data set (particularly if no other values are known to have changed, or if they have it is not relevant for the immediate consequence of the event).

20.6. Literals

Sometimes the stream data does not have quite the right types or values for a direct column to column update. For example the column types may need to be cast, the values may need to be explicitly mapped to align properly with an enumerated column value, or a simple static value may be missing. All these cases could be corrected by creating a view and using that as the couple source table instead of the stream table itself. However for small adjustments like this the couple column to column mapping can be complemented by simple SQL expressions and literals to achieve the same result.

Any source column can be replaced by a single literal SQL expression, this could include a full sub-select of course since that is allowed in an SQL expression - however that is usually better done as a view. Where a source column has been defined for the mapping then the literal SQL expression can rewrite the value before mapping and an escape character can be used (@) to represent the source column value in the expression.

20.7. Purge

Records maintained by TheonCoupler are not by default deleted. They are just marked as "deleted". This has advantages in that bad upstream data cannot destroy a lot of data. It is also often beneficial when upstream data goes away before it needs to go away locally, in particular when it transitions across different upstream systems and it should not be locally deleted during the transition. This is often the case as the data pulled in externally will have a lot of local data attached to it which might be lost if records are automatically deleted (particularly if in error). A disadvantage is that downstream systems need to explicitly accommodate deleted records - for example by adding handling in reports and user applications to hide them.

The Purge feature of TheonCoupler can be used to really delete records in local target tables that are synchronised and no longer have a corresponding instance in the stream data. This can be immediate or deferred. When a couple runs with an immediate purge enabled it will first mark records as deleted as normal and then run a purge cycle which will actually delete the records from the local target table. This is done atomically as a single process along with the rest of the couple and any other couples associated with the stream. With a deferred purge enabled the records are marked as deleted as before, but the corresponding timestamp of deletion is used as a marker to start a deferred purge cycle. As long as the stream is still running once the configured period from the original deletion date has been reached then the records will be really deleted from the local target table. Different records will be deleted at different times based on their original deletion date (or most recent deletion date if the record has been recreated and then deleted again).

A deferred purge can be set to an interval of days, weeks, months or years. It is particularly useful for cleaning out old data that is no longer relevant against an organisations set retention period policy for that data. Note however that the purge will only take place while the stream with the corresponding identity tag is still running, as it is done as part of the normal couple processing. If the stream stops running for any reason, the identity tag changes or structural changes mean the couple for the relevant target table is no longer deployed then the purge will never take place (automatically, of course meta data for TheonCoupler could still be used to construct a manual purge in these cases). What local records are deleted along with a purge depends on the local target table delete cascade rules as defined through Theon.

20.8. Metadata

A local target table for a couple will have a number of additional columns automatically added to it through mechanisms within the XSDDB and exported via the SchemaTree normal DDL derivation under Theon. These do not generally need to be used or exposed as they are managed and used solely by the relevant couple and associated processes and events driving the synchronisation process. So they can be safely ignored, however some are useful in particular for analysing problems with any TheonCoupler process.

Created is the timestamp when the record was first created by any couple. It is not set for manually added records (done by user outside of a couple process), column name is th_createdat
Deleted is the timestamp when the record was most recently marked for deletion by a couple, column name is th_deletedat
Increment is the change index value and used to isolate couple processes from normal user processes as well as track the number of changes applied (very much higher than average values indicate flip flopping for example); new user created records will have this set automatically in a way that indicates whether should be included in any future couple process (a value of 1 means they will be), but explicitly setting this column on manual creation of records should be avoided except when specifically trying to influence behaviour, column name is _th_increment
STREAM-Identity is the stream identity tag, there is one of these columns for each STREAM with a couple with this table as a local target table, column name is _th_STREAM_changeset; holds the stream specific handle value (potentially this can be dynamically constructed) and used to separate sets of data in one source table.
COLUMN-Origin and COLUMN-Change are added for each COLUMN being maintained by a couple (may be shared by more than one STREAM for horizontal striping, but generally unwise without care if shared by more than one STREAM for vertical striping); the value is blank for Origin if the last change was made by the couple and the Change will have the timestamp; the value will be the local session username of the local user that changed the value and the timestamp otherwise, used to retain the local update override, column names are _th_COLUMN_changedby and _th_COLUMN_changedat

A local target table for a couple will have an event automatically added to it. This traps normal operations on the table so as to update the relevant meta data fields. These should not normally be a concern, but they may interfere with any other user events.