Segment size optimization
- Druid stores data in segments. If you're using best-effort roll-up mode, increasing the segment size might introduce further aggregation, which reduces the dataSource size.
It would be best to optimize the segment size at ingestion time, but this isn't always easy, especially for stream ingestion, because the amount of data ingested can vary over time. In this case, you can create segments with a sub-optimal size first and optimize them later using compaction; a sketch of such a compaction task follows the list below.
- Number of rows per segment: it's generally recommended for each segment to have around 5 million rows. This setting is usually more important than the segment byte size below, because Druid uses a single thread to process each segment; the number of rows per segment therefore directly controls how many rows each thread processes, which in turn determines how well query execution is parallelized.
- Segment byte size: it's recommended to keep segments between 300 and 700 MB. If this doesn't line up with the number of rows per segment, consider adjusting the number of rows per segment rather than this value.
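To make the targets above concrete, here is a minimal sketch of a manual compaction task that rewrites the segments of one interval with roughly 5 million rows per segment. It assumes a reasonably recent Druid version; the dataSource name and interval are placeholders.

```json
{
  "type": "compact",
  "dataSource": "your_dataSource",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2020-01-01/2020-02-01"
    }
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}
```

Such a task is submitted to the Overlord like any other ingestion task (POST /druid/indexer/v1/task); automatic compaction configured on the Coordinator can achieve the same effect on an ongoing basis.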
If you query the sys.segments system table to inspect the current segment sizes of a dataSource (see the sketch below), please note that the query result might include overshadowed segments. In this case, you may want to look only at the rows with the max version per interval (pair of start and end).
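A possible shape for such a query, assuming Druid SQL is enabled; 'your_dataSource' is a placeholder:

```sql
SELECT
  "start",
  "end",
  "version",
  COUNT(*)        AS num_segments,
  AVG("num_rows") AS avg_num_rows,
  AVG("size")     AS avg_size_bytes
FROM sys.segments
WHERE datasource = 'your_dataSource' AND is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3 DESC
```

Intervals whose avg_num_rows is far below 5 million, or whose avg_size_bytes is far below roughly 300 MB, are good candidates for compaction.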
- Running periodic Hadoop batch ingestion jobs and using a `dataSource` inputSpec to read from the segments generated by the Kafka indexing tasks (see the sketch after this list). This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found in the Updating existing data section of the data management page.
- For an overview of compaction and how to submit a manual compaction task, see Compaction.
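For the Hadoop-based option above, the relevant part is the inputSpec inside the Hadoop task's ioConfig. The fragment below is a sketch assuming the `dataSource` inputSpec type; the dataSource name and intervals are placeholders, and the rest of the Hadoop ingestion spec (dataSchema, tuningConfig) is omitted.

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "your_kafka_dataSource",
      "intervals": ["2020-01-01/2020-02-01"]
    }
  }
}
```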