SatelliteCollections

    When doing joins in an ArangoDB cluster data has to be exchanged between different servers.

    Joins will be executed on a Coordinator. It will prepare an execution plan and execute it. When executing, the Coordinator will contact all shards of the starting point of the join and ask for their data. The DB-Servers carrying out this operation will load all their local data and then ask the cluster for the other part of the join. This again will be distributed to all involved shards of this join part.

    In sum this results in much network traffic and slow results depending of the amount of data that has to be sent throughout the cluster.

    SatelliteCollections are collections that are intended to address this issue.

    They will facilitate the synchronous replication and replicate all its data to all DB-Servers that are part of the cluster.

    This greatly improves performance for such joins at the costs of increased storage requirements and poorer write performance on this data.

    To create a SatelliteCollection set the replicationFactor of this collection to “satellite”.

    Using arangosh:

    Let’s analyse a normal join not involving SatelliteCollections:

    All shards involved querying the collection will fan out via the Coordinator to the shards of . In sum 8 shards will open 8 connections to the Coordinator asking for the results of the join. The Coordinator will fan out to the 8 shards of . So there will be quite some network traffic.

    In this scenario all shards of nonsatellite will be contacted. However as the join is a satellite join all shards can do the join locally as the data is replicated to all servers reducing the network overhead dramatically.

    Caveats

    The cluster will automatically keep all SatelliteCollections on all servers in sync by facilitating the synchronous replication. This means that write will be executed on the leader only and this server will coordinate replication to the followers. If a follower doesn’t answer in time (due to network problems, temporary shutdown etc.) it may be removed as a follower. This is being reported to the Agency.

    The follower (once back in business) will then periodically check the Agency and know that it is out of sync. It will then automatically catch up. This may take a while depending on how much data has to be synced. When doing a join involving the satellite you can specify how long the DB-Server is allowed to wait for sync until the query is being aborted.

    Check for details.

    During network failure there is also a minimal chance that a query was properly distributed to the DB-Servers but that a previous satellite write could not be replicated to a follower and the leader dropped the follower. The follower however only checks every few seconds if it is really in sync so it might indeed deliver stale results.