Statistics of query execution

FE: Frontend, frontend node of Doris. Responsible for metadata management and request access.
BE: Backend, backend node of Doris. Responsible for query execution and data storage.
Fragment: FE will convert the execution of specific SQL statements into corresponding fragments and distribute them to BE for execution. BE will execute corresponding fragments and gather the result of RunningProfile to send back FE.

FE splits the query plan into fragments and distributes them to BE for task execution. BE records the statistics of Running State when executing fragment. BE print the outputs statistics of fragment execution into the log. FE can also collect these statistics recorded by each fragment and print the results on FE’s web page.

Turn on the report switch on FE through MySQL command

The latest 100 statements executed will be listed here. We can view detailed statistics of RunningProfile.

Here is a detailed list of query ID, execution time, execution statement and other summary information. The next step is to print the details of each fragment collected from be.

The fragment ID is listed here; hostname show the be node executing the fragment; active: 10s270msshow the total execution time of the node; non child: 0.14% means the execution time of the execution node itself (not including the execution time of child nodes) as a percentage of the total time.

PeakMemoryUsage indicates the peak memory usage of EXCHANGE_NODE; RowsReturned indicates the number of rows returned by EXCHANGE_NODE; RowsReturnedRate\=RowsReturned/ActiveTime; the meaning of these three statistics in other NODE the same.

Subsequently, the statistics of the child nodes will be printed in turn. here you can distinguish the parent-child relationship by intent.

`Fragment`

AverageThreadTokens: Number of threads used to execute fragment, excluding the usage of thread pool
PeakReservation: Peak memory used by buffer pool
MemoryLimit: Memory limit at query
PeakMemoryUsage: Peak memory usage of instance
RowsProduced: Number of rows that process

`BlockMgr`

BlocksCreated: Number of Block be created by BlockMgr
BlocksRecycled: Number of Block be recycled by BlockMgr
BytesWritten: How many bytes be writen to spill to disk
MaxBlockSize: Max size of one Block
TotalReadBlockTime: Total time read block from disk

`DataStreamSender`

BytesSent: Total bytes data sent
IgnoreRows: Rows filtered
LocalBytesSent: The amount bytes of local node send to it’s self during Exchange
OverallThroughput: Total throughput = BytesSent / Time
SerializeBatchTime: Sending data serialization time
UncompressedRowBatchSize: Size of rowbatch before sending data compression

TupleConvertTime: Time consuming of sending data serialization to insert statement
ResultSendTime: Time consuming of writing through ODBC driver

`EXCHANGE_NODE`

BytesReceived: Size of bytes received by network
DataArrivalWaitTime: Total waiting time of sender to push data
MergeGetNext: When there is a sort in the lower level node, exchange node will perform a unified merge sort and output an ordered result. This indicator records the total time consumption of merge sorting, including the time consumption of MergeGetNextBatch.
MergeGetNextBatch：It takes time for merge node to get data. If it is single-layer merge sort, the object to get data is network queue. For multi-level merge sorting, the data object is child merger.
ChildMergeGetNext: When there are too many senders in the lower layer to send data, single thread merge will become a performance bottleneck. Doris will start multiple child merge threads to do merge sort in parallel. The sorting time of child merge is recorded, which is the cumulative value of multiple threads.
ChildMergeGetNextBatch: It takes time for child merge to get data，If the time consumption is too large, the bottleneck may be the lower level data sending node.
FirstBatchArrivalWaitTime: The time waiting for the first batch come from sender
DeserializeRowBatchTimer: Time consuming to receive data deserialization
SendersBlockedTotalTimer(*): When the DataStreamRecv’s queue buffer is full, wait time of sender
ConvertRowBatchTime: Time taken to transfer received data to RowBatch
RowsReturned: Number of receiving rows
RowsReturnedRate: Rate of rows received

`SORT_NODE`

InMemorySortTime: In memory sort time
InitialRunsCreated: Number of initialize sort run
MergeGetNext: Time cost of MergeSort from multiple sort_run to get the next batch (only show spilled disk)
MergeGetNextBatch: Time cost MergeSort one sort_run to get the next batch (only show spilled disk)
SortDataSize: Total sorted data
TotalMergesPerformed: Number of external sort merges

`AGGREGATION_NODE`

PartitionsCreated: Number of partition split by aggregate
GetResultsTime: Time to get aggregate results from each partition
HTResizeTime: Time spent in resizing hashtable
HTResize: Number of times hashtable resizes
HashBuckets: Number of buckets in hashtable
HashBucketsWithDuplicate: Number of buckets with duplicatenode in hashtable
HashCollisions: Number of hash conflicts generated
HashDuplicateNodes: Number of duplicate nodes with the same buckets in hashtable
HashFailedProbe: Number of failed probe operations
HashFilledBuckets: Number of buckets filled data
HashProbe: Number of hashtable probe
HashTravelLength: The number of steps moved when hashtable queries

`HASH_JOIN_NODE`

ExecOption: The way to construct a HashTable for the right child (synchronous or asynchronous), the right child in Join may be a table or a subquery, the same is true for the left child
BuildBuckets: The number of Buckets in HashTable
BuildRows: the number of rows of HashTable
BuildTime: Time-consuming to construct HashTable
ProbeRows: Traverse the number of rows of the left child for Hash Probe
ProbeTime: Time consuming to traverse the left child for Hash Probe, excluding the time consuming to call GetNext on the left child RowBatch
PushDownComputeTime: The calculation time of the predicate pushdown condition
PushDownTime: The total time consumed by the predicate push down. When Join, the right child who meets the requirements is converted to the left child’s in query

`CROSS_JOIN_NODE`

ExecOption: The way to construct RowBatchList for the right child (synchronous or asynchronous)
BuildRows: The number of rows of RowBatchList (ie the number of rows of the right child)
BuildTime: Time-consuming to construct RowBatchList
LeftChildRows: the number of rows of the left child
LeftChildTime: The time it takes to traverse the left child and find the Cartesian product with the right child, not including the time it takes to call GetNext on the left child RowBatch

`UNION_NODE`

MaterializeExprsEvaluateTime: When the field types at both ends of the Union are inconsistent, the time spent to evaluates type conversion exprs and materializes the results

`ANALYTIC_EVAL_NODE`

EvaluationTime: Analysis function (window function) calculation total time
GetNewBlockTime: It takes time to apply for a new block during initialization. Block saves the cache line window or the entire partition for analysis function calculation
PinTime: the time it takes to apply for a new block later or reread the block written to the disk back to the memory
UnpinTime: the time it takes to flush the data of the block to the disk when the memory pressure of the block that is not in use or the current operator is high

`OLAP_SCAN_NODE`

The OLAP_SCAN_NODE is responsible for specific data scanning tasks. One OLAP_SCAN_NODE will generate one or more OlapScanner. Each Scanner thread is responsible for scanning part of the data.

Some or all of the predicate conditions in the query will be pushed to OLAP_SCAN_NODE. Some of these predicate conditions will continue to be pushed down to the storage engine in order to use the storage engine’s index for data filtering. The other part will be kept in OLAP_SCAN_NODE to filter the data returned from the storage engine.

The profile of the OLAP_SCAN_NODE node is usually used to analyze the efficiency of data scanning. It is divided into three layers: OLAP_SCAN_NODE, OlapScanner, and SegmentIterator according to the calling relationship.

The profile of a typical OLAP_SCAN_NODE is as follows. Some indicators will have different meanings depending on the storage format (V1 or V2).

The predicate push down and index usage can be inferred from the related indicators of the number of data rows in the profile. The following only describes the profile in the reading process of segment V2 format data. In segment V1 format, the meaning of these indicators is slightly different.

When reading a segment V2, if the query has key_ranges (the query range composed of prefix keys), first filter the data through the SortkeyIndex index, and the number of filtered rows is recorded in RowsKeyRangeFiltered.
After that, use the Bitmap index to perform precise filtering on the columns containing the bitmap index in the query condition, and the number of filtered rows is recorded in RowsBitmapIndexFiltered.
After that, according to the equivalent (eq, in, is) condition in the query condition, use the BloomFilter index to filter the data and record it in RowsBloomFilterFiltered. The value of RowsBloomFilterFiltered is the difference between the total number of rows of the Segment (not the number of rows filtered by the Bitmap index) and the number of remaining rows after BloomFilter, so the data filtered by BloomFilter may overlap with the data filtered by Bitmap.
After that, use the ZoneMap index to filter the data according to the query conditions and delete conditions and record it in RowsStatsFiltered.
RowsConditionsFiltered is the number of rows filtered by various indexes, including the values of RowsBloomFilterFiltered and RowsStatsFiltered.
So far, the Init phase is completed, and the number of rows filtered by the condition to be deleted in the Next phase is recorded in RowsDelFiltered. Therefore, the number of rows actually filtered by the delete condition are recorded in RowsStatsFiltered and respectively.
RawRowsRead is the final number of rows to be read after the above filtering.
RowsRead is the number of rows finally returned to Scanner. RowsRead is usually smaller than RawRowsRead, because returning from the storage engine to the Scanner may go through a data aggregation. If the difference between RawRowsRead and RowsRead is large, it means that a large number of rows are aggregated, and aggregation may be time-consuming.
RowsReturned is the number of rows finally returned by ScanNode to the upper node. RowsReturned is usually smaller than RowsRead. Because there will be some predicate conditions on the Scanner that are not pushed down to the storage engine, filtering will be performed once. If the difference between RowsRead and RowsReturned is large, it means that many rows are filtered in the Scanner. This shows that many highly selective predicate conditions are not pushed to the storage engine. The filtering efficiency in Scanner is worse than that in storage engine.

Many indicators under OlapScanner, such as IOTimer, BlockFetchTime, etc., are the accumulation of all Scanner thread indicators, so the value may be relatively large. And because the Scanner thread reads data asynchronously, these cumulative indicators can only reflect the cumulative working time of the Scanner, and do not directly represent the time consumption of the ScanNode. The time-consuming ratio of ScanNode in the entire query plan is the value recorded in the Active field. Sometimes it appears that IOTimer has tens of seconds, but Active is actually only a few seconds. This situation is usually due to:
- IOTimer is the accumulated time of multiple Scanners, and there are more Scanners.
- The upper node is time-consuming. For example, the upper node takes 100 seconds, while the lower ScanNode only takes 10 seconds. The field reflected in Active may be only a few milliseconds. Because while the upper layer is processing data, ScanNode has performed data scanning asynchronously and prepared the data. When the upper node obtains data from ScanNode, it can obtain the prepared data, so the Active time is very short.
NumScanners represents the number of Tasks submitted by the Scanner to the thread pool. It is scheduled by the thread pool in RuntimeState. The two parameters doris_scanner_thread_pool_thread_num and doris_scanner_thread_pool_queue_size control the size of the thread pool and the queue length respectively. Too many or too few threads will affect query efficiency. At the same time, some summary indicators can be divided by the number of threads to roughly estimate the time consumption of each thread.
TabletCount indicates the number of tablets to be scanned. Too many may mean a lot of random read and data merge operations.
UncompressedBytesRead indirectly reflects the amount of data read. If the value is large, it means that there may be a lot of IO operations.
CachedPagesNum and TotalPagesNum can check the hitting status of PageCache. The higher the hit rate, the less time-consuming IO and decompression operations.

AllocTime: Memory allocation time
CumulativeAllocationBytes: Cumulative amount of memory allocated
CumulativeAllocations: Cumulative number of memory allocations
PeakReservation: Peak of reservation
PeakUnpinnedBytes: Amount of memory data of unpin
PeakUsedReservation: Peak usage of reservation