Batch Delete

    For scenarios such as importing CDC (change data capture) data, inserted and deleted rows are usually interleaved in the data. The existing import methods are not sufficient here: even if the inserts and deletes are separated, the inserts can be imported, but the deletions still cannot be applied. The batch delete feature addresses these scenarios. There are three merge types for data import:

    1. APPEND: all data is appended to the existing data
    2. DELETE: delete all rows whose key column values are the same as those in the imported data
    3. MERGE: APPEND or DELETE each row according to the DELETE ON condition

    Batch delete is implemented by adding a hidden column, __DORIS_DELETE_SIGN__. Because batch delete only applies to the unique key model, the hidden column only needs to be of type bool with the REPLACE aggregate function. In BE, the various aggregation write paths handle it the same as a normal column. There are two schemes on the read path:

    When FE expands wildcards such as *, it removes __DORIS_DELETE_SIGN__ from the column list and adds the condition __DORIS_DELETE_SIGN__ != true by default. When BE reads, this column is added for evaluation, and the condition determines whether a row is treated as deleted.

    When importing, FE sets the value of the hidden column to the value of the DELETE ON expression during parsing. All other aggregation behavior is the same as for a REPLACE aggregate column.

    When reading, the condition __DORIS_DELETE_SIGN__ != true is added to every OlapScanNode on a table that has the hidden column; BE is not aware of this and executes as usual.
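    As a rough illustration of this rewrite (the table t and columns k1, v1 are hypothetical, and the rewritten statement only approximates the internal plan):

    -- what the user writes
    SELECT * FROM t;

    -- roughly what is executed: the hidden column is dropped from the
    -- select list and used only to filter out rows marked as deleted
    SELECT k1, v1 FROM t WHERE __DORIS_DELETE_SIGN__ != true;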

    In Cumulative Compaction, the hidden column is treated as a normal column, and the compaction logic remains unchanged.

    In Base Compaction, rows marked for deletion are removed to reduce the space occupied by the data.

    Stream Load

    Stream Load adds fields to the request header: merge_type sets the merge type, and delete sets the delete-mark condition on a column from the columns field. Example: -H "columns: k1, k2, label_c3" -H "merge_type: [MERGE|APPEND|DELETE]" -H "delete: label_c3=1"
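    A complete MERGE-mode request might look like the following sketch (the database test_db, table test_tbl, data file and FE address are placeholders):

    curl --location-trusted -u root: \
        -H "column_separator:," \
        -H "columns: k1, k2, label_c3" \
        -H "merge_type: MERGE" \
        -H "delete: label_c3=1" \
        -T ~/data.csv http://127.0.0.1:8030/api/test_db/test_tbl/_stream_load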

    Broker Load

    Set the field for the delete-mark column in PROPERTIES.
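    A sketch of what a MERGE-mode Broker Load might look like, based on the general Broker Load syntax (label, file path, broker name, column names and the DELETE ON condition are placeholders, and clause placement may differ between versions):

    LOAD LABEL example_db.label1
    (
        MERGE DATA INFILE("hdfs://host:8020/user/doris/data/file1.csv")
        INTO TABLE example_tbl
        COLUMNS TERMINATED BY ","
        (k1, k2, v1, label_c3)
        DELETE ON label_c3=1
    )
    WITH BROKER 'broker_name'
    (
        "username" = "user",
        "password" = "pass"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );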

    Routine Load

    Routine Load adds a mapping to the columns field. The mapping method is the same as above; an example follows:

    CREATE ROUTINE LOAD example_db.test1 ON example_tbl
    [WITH MERGE|APPEND|DELETE]
    COLUMNS(k1, k2, k3, v1, v2, label),
    WHERE k1 > 100 and k2 like "%doris%"
    [DELETE ON label=true]
    PROPERTIES
    (
        "desired_concurrent_number"="3",
        "max_batch_interval" = "20",
        "max_batch_size" = "209715200",
        "strict_mode" = "false"
    )
    FROM KAFKA
    (
        "kafka_topic" = "my_topic",
        "kafka_partitions" = "0,1,2,3",
        "kafka_offsets" = "101,0,0,200"
    );

    There are two ways of enabling batch delete support:

    1. Add enable_batch_delete_by_default=true to the FE configuration file; all tables created after FE is restarted then support batch delete. This option defaults to false.

    2. For existing tables, or tables created without the above FE configuration, batch delete can be enabled with the following statement: ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE".

    To check whether a table supports batch delete, set a session variable to show hidden columns with SET show_hidden_columns=true and then run desc tablename. If a __DORIS_DELETE_SIGN__ column appears in the output, the table supports batch delete; otherwise it does not.
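    A minimal check, assuming an existing table named example_tbl:

    SET show_hidden_columns=true;
    DESC example_tbl;
    -- batch delete is supported if the output contains a __DORIS_DELETE_SIGN__ column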

    1. Since import operations other than Stream Load may be executed out of order inside Doris, an import that uses the MERGE method but is not a Stream Load must be used together with load sequence. For the specific syntax, refer to the sequence column documentation.
    2. The DELETE ON condition can only be used together with MERGE.
    1. Import data normally:
    curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: APPEND" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load

    The APPEND condition can be omitted, which has the same effect as the following statement:

    curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load
    2. Delete all data with the same key as the imported data, using a DELETE-mode load as sketched below:
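    A sketch of such a DELETE-mode request (same table, data file and FE address as in the examples above):

    curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: DELETE" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load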

    Before load:

    +--------+----------+----------+------+
    | siteid | citycode | username | pv   |
    +--------+----------+----------+------+
    |      3 |        2 | tom      |    2 |
    |      4 |        3 | bush     |    3 |
    |      5 |        3 | helen    |    3 |
    +--------+----------+----------+------+

    Load data:

    3,2,tom,0

    After load:

    +--------+----------+----------+------+
    | siteid | citycode | username | pv   |
    +--------+----------+----------+------+
    |      4 |        3 | bush     |    3 |
    |      5 |        3 | helen    |    3 |
    +--------+----------+----------+------+
    3. Merge the imported data: rows where siteid=1 delete the existing rows with the same key, and the remaining rows are appended, using a MERGE-mode load as sketched below:
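    A sketch of such a MERGE-mode request, with the delete condition given in the delete header (same placeholders as above):

    curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: MERGE" -H "delete: siteid=1" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load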

    Before load:

    +--------+----------+----------+------+
    | siteid | citycode | username | pv   |
    +--------+----------+----------+------+
    |      4 |        3 | bush     |    3 |
    |      5 |        3 | helen    |    3 |
    |      1 |        1 | jim      |    2 |
    +--------+----------+----------+------+

    Load data:

    2,1,grace,2
    3,2,tom,2
    1,1,jim,2

    After load:

    +--------+----------+----------+------+
    | siteid | citycode | username | pv   |
    +--------+----------+----------+------+
    |      4 |        3 | bush     |    3 |
    |      2 |        1 | grace    |    2 |
    |      3 |        2 | tom      |    2 |
    |      5 |        3 | helen    |    3 |
    +--------+----------+----------+------+