Segment Size Optimization
- When a query is submitted, that query is distributed to all Historicals and realtime tasks which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool to process a single segment. If segment sizes are too large, data might not be well distributed between data servers, decreasing the degree of parallelism possible during query processing. At the other extreme where segment sizes are too small, the scheduling overhead of processing a larger number of segments per query can reduce performance, as the threads that process each segment compete for the fixed slots of the processing pool.
Ideally, you would optimize segment size at ingestion time, but that is not always easy, especially with stream ingestion, where the amount of ingested data can vary over time. In that case, you can create segments with a suboptimal size first and optimize them later using compaction, as sketched below.
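For example, a manual compaction task can rewrite the segments of one time chunk into better-sized segments. The following is a minimal sketch rather than a complete spec: the datasource name and interval are placeholders, and the exact fields accepted by the `compact` task type can vary between Druid versions, so check the Compaction documentation for your release.

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2023-01-01/2023-02-01"
    }
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}
```

Auto-compaction configured on the Coordinator can achieve the same result on an ongoing basis without submitting tasks by hand.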
- Segment byte size: a size of 300 MB to 700 MB per segment is recommended. If this target conflicts with the recommended number of rows per segment (roughly 5 million rows or fewer), prioritize optimizing the number of rows per segment rather than the byte size.
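If you want to target these limits at ingestion time instead, the `partitionsSpec` in a batch ingestion task's `tuningConfig` is where the row target is set. The fragment below is a sketch for an `index_parallel` task using dynamic partitioning; the values shown are illustrative and should be tuned for your data.

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000,
    "maxTotalRows": 20000000
  }
}
```

For streaming ingestion, the supervisor `tuningConfig` exposes a similar `maxRowsPerSegment` setting, though handoff timing can still produce smaller segments, which is why compaction remains useful.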
Please note that the query result might include overshadowed segments. In this case, you may want to see only rows of the max version per interval (that is, per pair of `dataSource` and `interval`).
- Running periodic Hadoop batch ingestion jobs and using a `dataSource` inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found in the data management documentation.
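As a rough illustration of that approach, the `ioConfig` of a Hadoop ingestion spec can point at existing segments through a `dataSource` inputSpec. The fragment below assumes a hypothetical `wikipedia` datasource and a placeholder interval; see the Hadoop-based ingestion documentation for the full set of supported fields.

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "wikipedia",
      "intervals": ["2023-01-01/2023-02-01"]
    }
  }
}
```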
For an overview of compaction and how to submit a manual compaction task, see Compaction.