Get ID from slave server

The transform works by asking a hop server for a range of values.

In other words, the transform reserves a range of IDs for a certain given central hop server sequence.

It then increments by one until there are no more values in the range, asks another range, and so forth.

This means that the returned IDs can be sequential, however this is not guaranteed.

The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned.

The last value returned + the size of the range (increment) are stored in a database table to persist the value over several runs of a pipeline.

Context

When processing large amounts of data on multiple distinct systems (for example when running a clustered pipeline) you sometimes still need to generate a unique numeric ID for each one. Since a GUID is not guaranteed to be unique and consumes a lot more memory and space compared to a numeric integer (long) ID, you may prefer to use numerical IDs.

A unique ID identifies a piece of information, which is useful for: looking up data, support cases, incremental processing, error handling (recovering from where you left off by looking up the last inserted ID) and so on. Typically you would use a database to generate such an ID, for example using an Oracle sequence. However, there are cases where you don’t have a database available (such as when you add a unique ID to a text/log file), when the database doesn’t support sequences (column databases, typically), or when retrieving the ID from a database is slow (going back and forth to the database for each row severely limits performance). In all these situations, you need a high-performance way of providing unique IDs to a Hop pipeline. The Get ID From Slave Server transform is designed to fit this need.

Assuming you have (or are now configuring) Hop Server slave servers set up in Hop, this transform retrieves a unique ID from the Hop Server slave server of your choice. It works by asking a slave server for a range of values. In other words, the transform reserves a range of IDs for a certain given central slave server sequence. It then increments by one until there are no more values in the range, asks another range, and so forth. This means that the returned IDs can be sequential, however this is not guaranteed. The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned. The last value returned plus the size of the range (increment) are stored in a database table to persist the value over several runs of a pipeline.

Slave server configuration

The start tag is optional and will default to 0 if you leave it out of the definition. You can define as many sequences as you like.

Once the configuration files are changed as shown above, slave servers receive a new servlet that can be called as follows (authentication required):http://hostname:port/hop/nextSequence/?name=SequenceName&increment=NumberOfIDsToReserveIn case no increment is specified, 10000 IDs are reserved, for example: The servlet will return a simple piece of XML containing both the start of the range as well as the number of IDs reserved, or the increment:

Continuing with this example, the following row will be present in the SEQ_TABLE table:

mysql> select * from SEQ_TABLE where SEQ_NAME=’test’;

SEQ_NAME	SEQ_VALUE
test	580000

Automatic loading and creation

It can be a burden to maintain all your sequences in an XML file. Because of this, it is also possible to automatically load all the sequences from a database table. You can use the following construct to do it:


    <connection>MySQL</connection>
    <schema/>
    <table>SEQ_TABLE</table>
    <sequence_field>SEQ_NAME</sequence_field>

The <autocreate> tag allows any sequence name to be specified in the transform without error. In that case, the sequence with the name specified will be created automatically with the start value from the <autosequence> specification.