In addition to the tasks covered on this page, you can use segment compaction to improve the layout of your existing data. Refer to Segment optimization to see if compaction will help in your environment. For an overview and steps to configure manual compaction tasks, see the compaction documentation.
Druid can insert new data into an existing datasource by appending new segments to the existing segment set. It can also add new data by merging an existing set of segments with the new data and overwriting the original set.
Druid does not support single-record updates by primary key.
Updating existing data
Once you ingest some data in a dataSource for an interval and create Apache Druid segments, you might want to make changes to the ingested data. There are several ways this can be done.
If you have a dimension where values need to be updated frequently, try first using lookups. A classic use case of lookups is when you have an ID dimension stored in a Druid segment and want to map it to a human-readable String value that may need to be updated periodically.
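As a rough illustration of the ID-to-name case, the sketch below builds a map-type lookup definition and a Druid SQL query that applies it at query time. The lookup name `country_name`, the datasource `events`, and the column `country_id` are hypothetical, and the JSON layout is an assumption based on the map lookup format, so verify it against your Druid version before use.

```python
import json

# Hypothetical ID -> human-readable name mapping. Updating this mapping
# changes query results without reingesting any segments.
COUNTRY_NAMES = {"1": "Canada", "2": "Mexico", "3": "Peru"}


def build_map_lookup(mapping, version="v1"):
    """Return a map-type lookup definition (assumed layout)."""
    return {
        "version": version,
        "lookupExtractorFactory": {"type": "map", "map": dict(mapping)},
    }


def build_lookup_query(datasource, id_column, lookup_name):
    """Return a Druid SQL query that maps stored IDs to names at query time."""
    return (
        f"SELECT LOOKUP({id_column}, '{lookup_name}') AS name, COUNT(*) AS cnt "
        f'FROM "{datasource}" GROUP BY 1'
    )


lookup_spec = build_map_lookup(COUNTRY_NAMES)
query = build_lookup_query("events", "country_id", "country_name")
print(json.dumps(lookup_spec, indent=2))
print(query)
```

Because the segment stores only the ID, changing a name means republishing the lookup rather than rewriting data.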
If lookup-based techniques are not sufficient, you will need to reingest data into Druid for the time chunks that you want to update. This can be done using one of the batch ingestion methods in overwrite mode (the default mode). It can also be done using streaming ingestion, provided you drop data for the relevant time chunks first.
If you do the reingestion in batch mode, Druid’s atomic update mechanism means that queries will flip seamlessly from the old data to the new data.
We recommend keeping a copy of your raw data around in case you ever need to reingest it.
Druid uses an inputSpec in the ioConfig to know where the data to be ingested is located and how to read it. For simple Hadoop batch ingestion, the static or granularity spec types allow you to read data stored in deep storage.
There are other types of inputSpec to enable reindexing and delta ingestion.
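The inputSpec variants mentioned above might look roughly like the following when built as Python dictionaries. This is a sketch under assumptions: the `dataSource` and `multi` spec types and their field names reflect the Hadoop inputSpec formats as commonly documented, and the datasource name, interval, and HDFS path are hypothetical placeholders.

```python
import json


def static_input_spec(paths):
    """Read raw input files from a comma-separated list of paths."""
    return {"type": "static", "paths": paths}


def datasource_input_spec(datasource, intervals):
    """Read existing Druid segments as input, which enables reindexing."""
    return {
        "type": "dataSource",
        "ingestionSpec": {"dataSource": datasource, "intervals": list(intervals)},
    }


def multi_input_spec(*children):
    """Combine several inputSpecs, e.g. old segments plus new files (delta ingestion)."""
    return {"type": "multi", "children": list(children)}


# Delta ingestion: merge one month of existing segments with a new raw file.
delta = multi_input_spec(
    datasource_input_spec("wikipedia", ["2020-01-01/2020-02-01"]),
    static_input_spec("hdfs://example/new-data/2020-01.json"),
)
io_config = {"type": "hadoop", "inputSpec": delta}
print(json.dumps(io_config, indent=2))
```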
This section assumes you understand how to do batch ingestion without Hadoop using native batch indexing. Native batch indexing uses an inputSource to know where and how to read the input data. You can use the Druid input source to read data from segments inside Druid. You can use the Parallel task (index_parallel) for all native batch reindexing tasks. Increase maxNumConcurrentSubTasks to accommodate the amount of data you are reindexing. See Capacity planning.
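A native batch reindexing task along these lines could be assembled as shown below. Treat it as a partial sketch, not a complete spec: the datasource name, interval, dimension list, segment granularity, and subtask count are hypothetical, and a real task will usually need a fuller dataSchema (metrics, transforms, and so on).

```python
import json


def build_reindex_spec(datasource, interval, dimensions, max_subtasks=4):
    """Build a Parallel task that rereads a datasource's own segments
    for one interval and overwrites them."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Existing segments expose their timestamp as __time in millis.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",  # assumption for this sketch
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # The Druid input source reads rows back out of existing segments.
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                "appendToExisting": False,  # overwrite rather than append
            },
            "tuningConfig": {
                "type": "index_parallel",
                "maxNumConcurrentSubTasks": max_subtasks,
            },
        },
    }


spec = build_reindex_spec(
    "wikipedia", "2020-01-01/2020-02-01", ["page", "user"], max_subtasks=8
)
print(json.dumps(spec, indent=2))
```

Raising `max_subtasks` only helps if the cluster has enough free task slots to run the subtasks concurrently.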
Druid supports permanent deletion of segments that are in an “unused” state (see the Architecture page).
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage.
For more information, please see Kill Task.
Permanent deletion of a segment in Apache Druid has two steps:
- The segment must first be marked as “unused”. This occurs when a segment is dropped by retention rules, or when a user manually disables a segment through the Coordinator API.
- After segments have been marked as “unused”, a Kill Task will delete any “unused” segments from Druid’s metadata store as well as deep storage.
For more information, see the documentation on retention rules.
A data deletion tutorial is available at Tutorial: Deleting data.
Kill Task
The kill task deletes all information about segments and removes them from deep storage. Segments to kill must be unused (used==0) in the Druid segment table.
A kill task spec names the dataSource and the interval to clean up, and can optionally set markAsUnused.
If markAsUnused is true (default is false), the kill task first marks any segments within the specified interval as unused, then deletes the unused segments within the interval.
WARNING! The kill task permanently removes all information about the affected segments from the metadata store and deep storage. These segments cannot be recovered after the kill task runs; this operation cannot be undone.
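The kill task described above can be sketched as a JSON payload built in Python. The datasource name and interval are hypothetical, and the field names are an assumption based on the task's description here (type, dataSource, interval, markAsUnused); confirm them against the Kill Task reference for your Druid version.

```python
import json


def build_kill_task(datasource, interval, mark_as_unused=False):
    """Build a kill task payload for one datasource and interval.

    With mark_as_unused=False (the default), only segments that are
    already unused are deleted.
    """
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": interval,
        "markAsUnused": mark_as_unused,
    }


kill_task = build_kill_task("wikipedia", "2020-01-01/2020-02-01")
print(json.dumps(kill_task, indent=2))
```

Leaving `mark_as_unused` at its default keeps the two-step safety net: segments still in use within the interval are not touched.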
Druid supports retention rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded.
Druid also supports separating Historical processes into tiers, and the retention rules can be configured to assign data for specific intervals to specific tiers.
These features are useful for performance/cost management; a common use case is separating Historical processes into a “hot” tier and a “cold” tier.
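A hot/cold setup of this kind might be expressed as an ordered rule list like the one below. The periods and replica counts are illustrative assumptions, and the tier names `hot` and `cold` must match tiers actually configured on your Historical processes. Rules are evaluated in order and the first matching rule applies.

```python
import json


def hot_cold_rules(hot_period="P1M", cold_period="P1Y"):
    """Return retention rules: recent data on the hot tier, older data on
    the cold tier, and everything older than that dropped."""
    return [
        # Most recent data: two replicas on the hot tier.
        {"type": "loadByPeriod", "period": hot_period, "tieredReplicants": {"hot": 2}},
        # Older data within the retention window: one replica on the cold tier.
        {"type": "loadByPeriod", "period": cold_period, "tieredReplicants": {"cold": 1}},
        # Anything not matched above is dropped (marked unused).
        {"type": "dropForever"},
    ]


rules = hot_cold_rules()
print(json.dumps(rules, indent=2))
```

Segments dropped by the final rule become unused, which is the state a kill task requires before it can delete them permanently.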
Learn more
See the following topics for more information:
- for information on how Druid handles segment versioning.