Sequence Column
To solve this problem, Doris supported a sequence column by allowing the user to specify the sequence column when importing. Under the same key column, columns of the REPLACE aggregate type will be replaced according to the value of the sequence column, larger values can be replaced with smaller values, and vice versa. In this method, the order is determined by the user, and the user controls the replacement order.
Implemented by adding a hidden column , the type of the column is specified by the user while create the table, determines the specific value of the column on import, and replaces the REPLACE column with that value.
When you create the Uniq table, a hidden column __DORIS_SEQUENCE_COL__
is automatically added, depending on the type specified by the user
Import
When importing, fe sets the value of the hidden column during parsing to the value of the ‘order by’ expression (Broker Load and routine Load), or the value of the ‘function_column.sequence_col’ expression (stream load), and the value column will be replaced according to this value. The value of the hidden column __DORIS_SEQUENCE_COL__
can be set as a column in the source data or in the table structure.
The request with the value column needs to read the additional column of __DORIS_SEQUENCE_COL__
, which is used as a basis for the order of replacement aggregation function replacement under the same key column, with the larger value replacing the smaller value and not the reverse.
Cumulative Compaction
Cumulative Compaction works in the same way as the reading process
Base Compaction works in the same way as the reading process
Syntax
Create Table
When you create the Uniq table, you can specify the sequence column type
The sequence_type is used to specify the type of the sequence column, which can be integral and time
stream load
The syntax of the stream load is to add the mapping of hidden columns corresponding to source_sequence in the ‘function_column.sequence_col’ field in the header, for example
curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load
broker load
Set the source_sequence field for the hidden column map at ORDER BY
LOAD LABEL db1.label1
(
DATA INFILE("hdfs://host:port/user/data/*/test.txt")
INTO TABLE `tbl1`
COLUMNS TERMINATED BY ","
(k1,k2,source_sequence,v1,v2)
ORDER BY source_sequence
)
WITH BROKER 'broker'
(
"username"="user",
"password"="pass"
PROPERTIES
(
"timeout" = "3600"
);
routine load
The mapping method is the same as above, as shown below
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
COLUMNS(k1, k2, source_sequence, v1, v2),
WHERE k1 > 100 and k2 like "%doris%"
[ORDER BY source_sequence]
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,0,200"
);
If function_column.sequence_type
is set when creating a new table, then the sequence column will be supported. For a table that does not support sequence column, use the following statement if you would like to use this feature: ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")
to enable. If you want to determine if a table supports sequence column, you can set the session variable to display the hidden column SET show_hidden_columns=true
, followed by desc Tablename
, if the output contains the column __DORIS_SEQUENCE_COL__
, it is supported, if not, it is not supported
Let’s take the stream Load as an example to show how to use it
- Create a table that supports sequence column.
- Import data normally:
Import the following data
1 2020-02-22 1 2020-03-05 c
1 2020-02-22 1 2020-02-26 d
1 2020-02-22 1 2020-02-22 e
1 2020-02-22 1 2020-02-22 b
Take the Stream Load as an example here and map the sequence column to the modify_date column
curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load
The results is
MySQL > select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-05 | c |
+---------+------------+----------+-------------+---------+
In this import, the c is eventually retained in the keyword column because the value of the sequence column (the value in modify_date) is the maximum value: ‘2020-03-05’.
- Guarantee of substitution order
After the above steps are completed, import the following data
Query data
MySQL [test]> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-05 | c |
+---------+------------+----------+-------------+---------+
Because the sequence column for the newly imported data are all smaller than the values already in the table, they cannot be replaced Try importing the following data again
1 2020-02-22 1 2020-02-22 a
1 2020-02-22 1 2020-03-23 w
MySQL [test]> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
| 1 | 2020-02-22 | 1 | 2020-03-23 | w |
At this point, you can replace the original data in the table