Colocation Join

    The original design, implementation and effect can be referred to ISSUE 245 (opens new window).

    The Colocation Join function has undergone a revision, and its design and use are slightly different from the original design. This document mainly introduces Colocation Join’s principle, implementation, usage and precautions.

    • FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
    • BE: Backend, Doris’s back-end node. Responsible for query execution and data storage.
    • Colocation Group (CG): A CG contains one or more tables. Tables within the same group have the same Colocation Group Schema and the same data fragmentation distribution.
    • Colocation Group Schema (CGS): Used to describe table in a CG and general Schema information related to Colocation. Including bucket column type, bucket number and copy number.

    Principle

    The Colocation Join function is to make a CG of a set of tables with the same CGS. Ensure that the corresponding data fragments of these tables will fall on the same BE node. When tables in CG perform Join operations on bucket columns, local data Join can be directly performed to reduce data transmission time between nodes.

    The data of a table will eventually fall into a barrel according to the barrel column value Hash and the number of barrels modeled. Assuming that the number of buckets in a table is 8, there are eight buckets Buckets’. We call such a sequence a Buckets Sequence. Each Bucket has one or more Tablets. When a table is a single partitioned table, there is only one Tablet in a Bucket. If it is a multi-partition table, there will be more than one.

    In order for a table to have the same data distribution, the table in the same CG must ensure the following attributes are the same:

    1. Barrel row and number of barrels

      Bucket column, that is, the column specified in `DISTRIBUTED BY HASH (col1, col2,…)’in the table building statement. Bucket columns determine which column values are used to Hash data from a table into different Tablets. Tables in the same CG must ensure that the type and number of barrel columns are identical, and the number of barrels is identical, so that the data fragmentation of multiple tables can be controlled one by one.

    2. Number of copies

      The number of copies of all partitions of all tables in the same CG must be the same. If inconsistent, there may be a copy of a Tablet, and there is no corresponding copy of other table fragments on the same BE.

    Tables in the same CG do not require consistency in the number, scope, and type of partition columns.

    After fixing the number of bucket columns and buckets, the tables in the same CG will have the same Buckets Sequence. The number of replicas determines the number of replicas of Tablets in each bucket, which BE they are stored on. Suppose that Buckets Sequence is [0, 1, 2, 3, 4, 5, 6, 7], and that BE nodes have [A, B, C, D]4. A possible distribution of data is as follows:

    The data of all tables in CG will be uniformly distributed according to the above rules, which ensures that the data with the same barrel column value are on the same BE node, and local data Join can be carried out.

    When creating a table, you can specify the attribute "colocate_with"="group_name" in PROPERTIES, which means that the table is a Colocation Join table and belongs to a specified Colocation Group.

    Examples:

    1. CREATE TABLE tbl (k1 int, v1 int sum)
    2. DISTRIBUTED BY HASH(k1)
    3. BUCKETS 8
    4. PROPERTIES(
    5. "colocate_with" = "group1"
    6. );

    If the specified group does not exist, Doris automatically creates a group that contains only the current table. If the Group already exists, Doris checks whether the current table satisfies the Colocation Group Schema. If satisfied, the table is created and added to the Group. At the same time, tables create fragments and replicas based on existing data distribution rules in Groups. Group belongs to a database, and its name is unique in a database. Internal storage is the full name of Group dbId_groupName, but users only perceive groupName.

    Delete table

    When the last table in Group is deleted completely (deleting completely means deleting from the recycle bin). Usually, when a table is deleted by the DROP TABLE command, it will be deleted after the default one-day stay in the recycle bin, and the group will be deleted automatically.

    View Group

    The following command allows you to view the existing Group information in the cluster.

    1. SHOW PROC '/colocation_group';
    2. +-------------+--------------+--------------+------------+----------------+----------+----------+
    3. | GroupId | GroupName | TableIds | BucketsNum | ReplicationNum | DistCols | IsStable |
    4. +-------------+--------------+--------------+------------+----------------+----------+----------+
    5. | 10005.10008 | 10005_group1 | 10007, 10040 | 10 | 3 | int(11) | true |
    6. +-------------+--------------+--------------+------------+----------------+----------+----------+
    • GroupId: The unique identity of a group’s entire cluster, with DB ID in the first half and group ID in the second half.
    • GroupName: The full name of Group.
    • Tablet Ids: The group contains a list of Tables’ID.
    • Buckets Num: Number of barrels.
    • Replication Num: Number of copies.
    • DistCols: Distribution columns,
    • IsStable: Is the group stable (for the definition of stability, see section `Collocation replica balancing and repair’).
    1. SHOW PROC '/colocation_group/10005.10008';
    2. +-------------+---------------------+
    3. | BucketIndex | BackendIds |
    4. +-------------+---------------------+
    5. | 0 | 10004, 10002, 10001 |
    6. | 1 | 10003, 10002, 10004 |
    7. | 2 | 10002, 10004, 10001 |
    8. | 3 | 10003, 10002, 10004 |
    9. | 4 | 10002, 10004, 10003 |
    10. | 5 | 10003, 10002, 10001 |
    11. | 6 | 10003, 10004, 10001 |
    12. | 7 | 10003, 10004, 10002 |
    13. +-------------+---------------------+
    • BucketIndex: Subscript to the bucket sequence.
    • Backend Ids: A list of BE node IDs where data fragments are located in buckets.

    You can modify the Colocation Group property of a table that has been created. Examples:

    ALTER TABLE tbl SET ("colocate_with" = "group2");

    • If the table has not previously specified a Group, the command checks the Schema and adds the table to the Group (if the Group does not exist, it will be created).
    • If other groups are specified before the table, the command first removes the table from the original group and adds a new group (if the group does not exist, it will be created).

    You can also delete the Colocation attribute of a table by following commands:

    ALTER TABLE tbl SET ("colocate_with" = "");

    When an ADD PARTITION is added to a table with a Colocation attribute and the number of copies is modified, Doris checks whether the modification violates the Colocation Group Schema and rejects it if it does.

    Colocation Duplicate Balancing and Repair

    Copy distribution of Colocation tables needs to follow the distribution specified in Group, so it is different from common fragmentation in replica repair and balancing.

    Group itself has a Stable attribute, when Stable is true, which indicates that all fragments of the table in the current Group are not changing, and the Colocation feature can be used normally. When Stable is false, it indicates that some tables in Group are being repaired or migrated. At this time, Colocation Join of related tables will degenerate into ordinary Join.

    Replica Repair

    Copies can only be stored on specified BE nodes. So when a BE is unavailable (downtime, Decommission, etc.), a new BE is needed to replace it. Doris will first look for the BE with the lowest load to replace it. After replacement, all data fragments on the old BE in the Bucket will be repaired. During the migration process, Group is marked Unstable.

    Doris will try to distribute the fragments of the Collocation table evenly across all BE nodes. For the replica balancing of common tables, the granularity is single replica, that is to say, it is enough to find BE nodes with lower load for each replica alone. The equilibrium of the Colocation table is at the Bucket level, where all replicas within a Bucket migrate together. We adopt a simple equalization algorithm, which distributes Buckets Sequence evenly on all BEs, regardless of the actual size of the replicas, but only according to the number of replicas. Specific algorithms can be referred to the code annotations in ColocateTableBalancer.java.

    Note 1: Current Colocation replica balancing and repair algorithms may not work well for heterogeneous deployed Oris clusters. The so-called heterogeneous deployment, that is, the BE node’s disk capacity, number, disk type (SSD and HDD) is inconsistent. In the case of heterogeneous deployment, small BE nodes and large BE nodes may store the same number of replicas.

    Note 2: When a group is in an Unstable state, the Join of the table in it will degenerate into a normal Join. At this time, the query performance of the cluster may be greatly reduced. If you do not want the system to balance automatically, you can set the FE configuration item disable_colocate_balance to prohibit automatic balancing. Then open it at the right time. (See Section Advanced Operations for details)

    The Colocation table is queried in the same way as ordinary tables, and users do not need to perceive Colocation attributes. If the Group in which the Colocation table is located is in an Unstable state, it will automatically degenerate to a normal Join.

    Examples are given to illustrate:

    Table 1:

    Table 2:

    1. `k1` datetime NOT NULL COMMENT "",
    2. `k2` int(11) NOT NULL COMMENT "",
    3. `v1` double SUM NOT NULL COMMENT ""
    4. ) ENGINE=OLAP
    5. AGGREGATE KEY(`k1`, `k2`)
    6. DISTRIBUTED BY HASH(`k2`) BUCKETS 8
    7. PROPERTIES (
    8. "colocate_with" = "group1"
    9. );

    View the query plan:

    1. DESC SELECT * FROM tbl1 INNER JOIN tbl2 ON (tbl1.k2 = tbl2.k2);
    2. +----------------------------------------------------+
    3. +----------------------------------------------------+
    4. | PLAN FRAGMENT 0 |
    5. | OUTPUT EXPRS:`tbl1`.`k1` | |
    6. | PARTITION: RANDOM |
    7. | |
    8. | RESULT SINK |
    9. | |
    10. | 2:HASH JOIN |
    11. | | join op: INNER JOIN |
    12. | | hash predicates: |
    13. | | colocate: true |
    14. | | `tbl1`.`k2` = `tbl2`.`k2` |
    15. | | tuple ids: 0 1 |
    16. | | |
    17. | |----1:OlapScanNode |
    18. | | TABLE: tbl2 |
    19. | | PREAGGREGATION: OFF. Reason: null |
    20. | | partitions=0/1 |
    21. | | rollup: null |
    22. | | buckets=0/0 |
    23. | | cardinality=-1 |
    24. | | avgRowSize=0.0 |
    25. | | numNodes=0 |
    26. | | tuple ids: 1 |
    27. | | |
    28. | 0:OlapScanNode |
    29. | TABLE: tbl1 |
    30. | PREAGGREGATION: OFF. Reason: No AggregateInfo |
    31. | partitions=0/2 |
    32. | rollup: null |
    33. | buckets=0/0 |
    34. | cardinality=-1 |
    35. | avgRowSize=0.0 |
    36. | numNodes=0 |
    37. | tuple ids: 0 |
    38. +----------------------------------------------------+

    If not, the query plan is as follows:

    1. +----------------------------------------------------+
    2. | Explain String |
    3. +----------------------------------------------------+
    4. | PLAN FRAGMENT 0 |
    5. | OUTPUT EXPRS:`tbl1`.`k1` | |
    6. | PARTITION: RANDOM |
    7. | |
    8. | RESULT SINK |
    9. | |
    10. | 2:HASH JOIN |
    11. | | hash predicates: |
    12. | | colocate: false, reason: group is not stable |
    13. | | `tbl1`.`k2` = `tbl2`.`k2` |
    14. | | tuple ids: 0 1 |
    15. | |----3:EXCHANGE |
    16. | | tuple ids: 1 |
    17. | | |
    18. | 0:OlapScanNode |
    19. | TABLE: tbl1 |
    20. | PREAGGREGATION: OFF. Reason: No AggregateInfo |
    21. | partitions=0/2 |
    22. | rollup: null |
    23. | buckets=0/0 |
    24. | cardinality=-1 |
    25. | avgRowSize=0.0 |
    26. | numNodes=0 |
    27. | tuple ids: 0 |
    28. | |
    29. | PLAN FRAGMENT 1 |
    30. | OUTPUT EXPRS: |
    31. | PARTITION: RANDOM |
    32. | |
    33. | STREAM DATA SINK |
    34. | EXCHANGE ID: 03 |
    35. | UNPARTITIONED |
    36. | |
    37. | 1:OlapScanNode |
    38. | TABLE: tbl2 |
    39. | PREAGGREGATION: OFF. Reason: null |
    40. | partitions=0/1 |
    41. | rollup: null |
    42. | buckets=0/0 |
    43. | cardinality=-1 |
    44. | avgRowSize=0.0 |
    45. | numNodes=0 |
    46. | tuple ids: 1 |
    47. +----------------------------------------------------+

    The HASH JOIN node displays the corresponding reason: colocate: false, reason: group is not stable. At the same time, an EXCHANGE node will be generated.

    Advanced Operations

    FE Configuration Item

    • disable_colocate_relocate

    Whether to close Doris’s automatic Colocation replica repair. The default is false, i.e. not closed. This parameter only affects the replica repair of the Colocation table, but does not affect the normal table.

    • disable_colocate_balance

    Whether to turn off automatic Colocation replica balancing for Doris. The default is false, i.e. not closed. This parameter only affects the replica balance of the Collocation table, but does not affect the common table.

    User can set these configurations at runtime. See HELP ADMIN SHOW CONFIG; and HELP ADMIN SET CONFIG;.

    • disable_colocate_join

    Whether to turn off the Colocation Join function or not. In 0.10 and previous versions, the default is true, that is, closed. In a later version, it will default to false, that is, open.

    • use_new_tablet_scheduler

    In 0.10 and previous versions, the new replica scheduling logic is incompatible with the Colocation Join function, so in 0.10 and previous versions, if disable_colocate_join = false, you need to set use_new_tablet_scheduler = false, that is, close the new replica scheduler. In later versions, use_new_tablet_scheduler will be equal to true.

    HTTP Restful API

    Doris provides several HTTP Restful APIs related to Colocation Join for viewing and modifying Colocation Group.

    The API is implemented on the FE side and accessed using fe_host: fe_http_port. ADMIN privileges are required.

    1. View all Colocation information for the cluster

    2. Mark Group as Stable or Unstable

      • Mark as Stable

        1. POST /api/colocate/group_stable?db_id=10005&group_id=10008
        2. Returns: 200
      • Mark as Unstable

        1. DELETE /api/colocate/group_stable?db_id=10005&group_id=10008
        2. Returns: 200
    3. Setting Data Distribution for Group

      The interface can force the number distribution of a group.

      1. POST /api/colocate/bucketseq?db_id=10005&group_id=10008
      2. Body:
      3. [[10004,10002],[10003,10002],[10002,10004],[10003,10002],[10002,10004],[10003,10002],[10003,10004],[10003,10004],[10003,10004],[10002,10004]]
      4. Returns: 200

      Body is a Buckets Sequence represented by a nested array and the ID of the BE where the fragments are distributed in each Bucket.

      Note that using this command, you may need to set the FE configuration and disable_colocate_balance to true. That is to shut down the system for automatic Colocation replica repair and balancing. Otherwise, it may be automatically reset by the system after modification.