Query Tracing and Replaying

Currently, the trace file can be replayed through db_bench. The query records in the trace file are replayed to the target DB instance according to their timestamps, so the replayed workload is nearly the same as the one that was collected, which provides a more production-like test case.

A simple example of using the tracing APIs:
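The following is a minimal sketch, assuming the TraceOptions struct and the NewFileTraceWriter, DB::StartTrace, and DB::EndTrace APIs declared in the RocksDB headers; the paths are placeholders:

  #include <cassert>
  #include <memory>
  #include <utility>

  #include "rocksdb/db.h"
  #include "rocksdb/env.h"
  #include "rocksdb/options.h"
  #include "rocksdb/trace_reader_writer.h"

  int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_db", &db);
    assert(s.ok());

    // Create a trace writer that persists the captured queries to a file
    std::unique_ptr<rocksdb::TraceWriter> trace_writer;
    s = rocksdb::NewFileTraceWriter(rocksdb::Env::Default(), rocksdb::EnvOptions(),
                                    "/tmp/trace_example", &trace_writer);
    assert(s.ok());

    // Start tracing; all queries to this DB instance are recorded from now on
    rocksdb::TraceOptions trace_options;
    s = db->StartTrace(trace_options, std::move(trace_writer));
    assert(s.ok());

    // ... issue the workload: Get, Put, Merge, iterators, etc. ...

    // Stop tracing and finalize the trace file
    s = db->EndTrace();
    assert(s.ok());

    delete db;
    return 0;
  }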

To replay the trace:
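For example, with db_bench's replay benchmark (the paths and options are placeholders to adapt to your setup):

  ./db_bench --benchmarks=replay --trace_file=/tmp/trace_example --db=/tmp/rocksdb_db --use_existing_db=true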

Trace Analyzing, Visualizing, and Modeling

After the user finishes tracing via the trace_replay APIs, the user gets one binary trace file. In the trace file, Get, Seek, and SeekForPrev are tracked as separate trace records, while Put, Merge, Delete, SingleDelete, and DeleteRange queries are packed into WriteBatches. A tool is needed to 1) interpret the trace into a human readable format for further analysis, 2) provide rich and powerful in-memory processing options to analyze the trace and output the corresponding results, and 3) make it easy to add new analyzing options and query types.

The RocksDB team developed the initial version of such a tool: trace_analyzer. It provides the following analyzing options and output results.

Note that most of the generated analysis output files are separated by column family and query type; that is, one query type in one column family has its own output files. Usually, one specified output option generates one output file.

The trace analyzer options

  -analyze_delete (Analyze the Delete queries.) type: bool default: false
  -analyze_get (Analyze the Get queries.) type: bool default: false
  -analyze_iterator (Analyze the iterator queries like Seek() and
      SeekForPrev().) type: bool default: false
  -analyze_merge (Analyze the Merge queries.) type: bool default: false
  -analyze_put (Analyze the Put queries.) type: bool default: false
  -analyze_range_delete (Analyze the DeleteRange queries.) type: bool
      default: false
  -analyze_single_delete (Analyze the SingleDelete queries.) type: bool
      default: false
  -convert_to_human_readable_trace (Convert the binary trace file to a human
      readable txt file for further processing. This file will be extremely
      large (similar in size to the original binary trace file). You can
      specify 'no_key' to reduce the size if the key is not needed in the
      next step.
      File name: <prefix>_human_readable_trace.txt
      Format: [type_id cf_id value_size time_in_microsec <key>].) type: bool
      default: false
  -key_space_dir (<the directory that stores the full key space files>
      The key space files should be named: <column family id>.txt)
      type: string default: ""
  -no_key (Do not output the keys to the result files, to make them smaller.)
      type: bool default: false
  -no_print (Do not print out any result.) type: bool default: false
  -output_access_count_stats (Output the access count distribution statistics
      to a file.
      File name: <prefix>-<query type>-<cf_id>-accessed_key_count_distribution.txt
      Format: [access_count number_of_access_count]) type: bool default: false
  -output_dir (The directory to store the output files.) type: string
      default: ""
  -output_ignore_count (<threshold>, ignore the keys with access count <= this
      value; it shortens the output.) type: int32 default: 0
  -output_key_distribution (Output the key size distribution.) type: bool
      default: false
  -output_key_stats (Output the key access count statistics to files.
      For the accessed keys:
      File name: <prefix>-<query type>-<cf_id>-accessed_key_stats.txt
      Format: [cf_id value_size access_keyid access_count]
      For the whole key space:
      File name: <prefix>-<query type>-<cf_id>-whole_key_stats.txt
      Format: [whole_key_space_keyid access_count]) type: bool default: false
  -output_prefix (The prefix used for all the output files.) type: string
      default: "trace"
  -output_prefix_cut (The number of bytes used as the prefix to cut the keys.
      If it is enabled, it generates the following.
      For the accessed keys:
      File name: <prefix>-<query type>-<cf_id>-accessed_key_prefix_cut.txt
      Format: [accessed_keyid access_count_of_prefix number_of_keys_in_prefix
      average_key_access prefix_succ_ratio prefix]
      For the whole key space:
      File name: <prefix>-<query type>-<cf_id>-whole_key_prefix_cut.txt
      Format: [start_keyid_in_whole_keyspace prefix]
      If 'output_qps_stats' and 'top_k' are enabled, it also outputs:
      File name: <prefix>-<query type>-<cf_id>-accessed_top_k_qps_prefix_cut.txt
      Format: [the_top_ith_qps_time QPS], [prefix qps_of_this_second].)
      type: int32 default: 0
  -output_qps_stats (Output the queries per second (QPS) statistics.
      For the overall QPS, it contains the QPS of each query type. The time
      starts from the first trace record.
      File name: <prefix>_qps_stats.txt
      Format: [qps_type_1 qps_type_2 ...... overall_qps]
      For each cf and query type, it has its own QPS output.
      File name: <prefix>-<query type>-<cf_id>_qps_stats.txt
      Format: [query_count_in_this_second].) type: bool default: false
  -output_time_series (Output the access time in seconds of each key, such
      that we have the time series data of the queries.
      File name: <prefix>-<query type>-<cf_id>-time_series.txt
      Format: [type_id time_in_sec access_keyid].) type: bool default: false
  -output_value_distribution (Output the value size distribution, only
      available for Put and Merge.
      File name: <prefix>-<query type>-<cf_id>-accessed_value_size_distribution.txt
      Format: [Number_of_value_size_between x and x+value_interval is: <the
      count>]) type: bool default: false
  -print_correlation (Input format: [correlation pairs][.,.]
      Output the query correlations between the pairs of query types listed
      in the parameter; select the operations from: get, put, delete,
      single_delete, range_delete, merge. No space between the pairs, which
      are separated by commas. Example: =[get,get]... It prints out the
      number of pairs of 'A after B' and the average time interval between
      the two queries.) type: string default: ""
  -print_overall_stats (Print the stats of the whole trace, like total
      requests, keys, and so on.) type: bool default: true
  -print_top_k_access (<top K of the variables to be printed> Print the top K
      accessed keys, the top K accessed prefixes, and so on.) type: int32
      default: 1
  -trace_path (The trace file path.) type: string default: ""
  -value_interval (To output the value size distribution, we need to set the
      value interval, and the statistics of the value size distribution are
      made per interval. The default is 8.) type: int32 default: 8

One Example

  ./trace_analyzer -analyze_get -output_access_count_stats -output_dir=/data/trace/result -output_key_stats -output_qps_stats -convert_to_human_readable_trace -output_value_distribution -output_key_distribution -print_overall_stats -print_top_k_access=3 -output_prefix=test -trace_path=/data/trace/trace

Query Type Options

Users can specify which query types should be analyzed via the corresponding "-analyze_<query type>" options.
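For example, to analyze only the Get and iterator queries in a trace (the paths are placeholders):

  ./trace_analyzer -analyze_get -analyze_iterator -trace_path=/data/trace/trace -output_dir=/data/trace/result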

Output Human Readable Traces

The original binary trace stores encoded data structures and content; to interpret the trace, a tool has to rely on the RocksDB library. Thus, to simplify further analysis of the trace, users can specify:

  -convert_to_human_readable_trace

The original trace is converted to a txt file whose records have the format "[type_id cf_id value_size time_in_microsec <key>]". If the key is not needed, users can specify "-no_key" to reduce the file size. This option is independent of all other options; once it is specified, the converted trace is generated. If the original key is included, the txt file might be similar in size to, or even larger than, the original trace file.
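For example, to generate only the human readable trace without the key content (the paths are placeholders):

  ./trace_analyzer -convert_to_human_readable_trace -no_key -trace_path=/data/trace/trace -output_dir=/data/trace/result -output_prefix=test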

Input and Output Options

To analyze a trace file, users need to provide the path to the trace file:

  -trace_path=<path to the trace>

To store the output files, users can specify a directory (make sure the directory exists before running the analyzer):

  -output_dir=<the path to the output directory>

If users want to analyze the accessed keys together with the existing keyspace, they need to specify a directory that stores the keyspace files. The files should be named "<column family id>.txt", and each line is one key. Usually, users can use "./ldb scan" of the LDB tool to dump out all the existing keys. To specify the directory:

  -key_space_dir=<the path to the key space directory>
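As a sketch, one way to produce such a file with the ldb tool (the DB path, column family, and output path are placeholders; this assumes "ldb scan" prints one "<key> : <value>" pair per line, so adjust the delimiter if your build prints a different separator):

  # Dump all existing keys of one column family into <column family id>.txt,
  # one key per line
  ./ldb --db=/data/rocksdb --column_family=default scan | awk -F' : ' '{print $1}' > /data/trace/keyspace/0.txt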

To collect the output files more easily, users can specify a "prefix" for all the output files:

  -output_prefix=<the prefix, like "trace1">

If users do not want the general statistics printed to the screen, they can specify:

  -no_print

The Analyzing Options

When this option is specified, general information about the workload is printed to the screen, such as the total number of keys, the number of analyzed queries of each column family, the key and value size statistics (average and median), and the number of keys being accessed:

  -print_overall_stats

To get the total access count of each key and the size of the values, the user can specify:

  -output_key_stats

It outputs the access count of each key to a file, and the keys are sorted in lexicographical order. Each key is assigned an integer as its ID for further processing.

In some workloads, the keys share a common composition. For example, in MyRocks, the first X bytes of the key is the table index_num, so we can use the first X bytes to cut the keys into different prefix ranges. Given the number of bytes to cut on, trace_analyzer generates a prefix cut file. Each record in the file represents one prefix cut, storing the corresponding KeyID and the prefix content. If -key_space_dir is specified, there are two separate files: one for the accessed keys, the other for the whole key space. Usually, the prefix cut file is used together with accessed_key_stats.txt and whole_key_stats.txt respectively.

  -output_prefix_cut=<number of bytes as prefix>

If users want to visualize the accesses of the keys along the tracing timeline, they can specify:

  -output_time_series

Each access to one key is stored as one record in the time series file.

If users are interested in the detailed QPS changes during the tracing time, they can specify:

  -output_qps_stats

For each query type of one column family, one file with the number of queries per second is generated. Also, one file with the QPS of each query type across all column families, as well as the overall QPS, is output separately. The average QPS and peak QPS are printed out.

Sometimes, users might be interested in the top-K statistics. They can specify:

  -print_top_k_access=<K>

The top K accessed keys and their access counts are printed. Also, if the prefix_cut option is specified, the top K accessed prefixes with their total access counts are printed. At the same time, the top K prefixes with the highest average access are printed.

If users are interested in the value size distribution (only applicable to Put and Merge), they can specify:

  -output_value_distribution

Since value sizes vary a lot, users might just want to know how many values fall in each value size range. They can specify -value_interval=x to count the number of values in the ranges [0,x), [x,2x), and so on.

The key size distribution is output to a file if the user specifies:

  -output_key_distribution

After processing the trace with trace_analyzer, the user gets a set of output files. Some of these files can be used to visualize the workload, for example as a heatmap (the hotness or coldness of keys), a time series graph (an overview of the key accesses over time), or the QPS of the analyzed queries.

Here, we use the open source plotting tool gnuplot as an example to generate the graphs. More details about gnuplot can be found at http://gnuplot.info/. Users can directly write gnuplot commands to draw the graphs or, to make it simpler, use the following shell scripts to generate the gnuplot source files (before using a script, make sure the file names and some of the content are replaced with effective ones).

To draw the heat map of accessed keys

  #!/bin/bash
  # The query type
  ops="iterator"
  # The column family ID
  cf="9"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  form="accessed"
  # The column number that will be plotted
  use="4"
  # The upper bound of the Y-axis
  y=2
  # The upper bound of the X-axis
  x=29233
  echo "set output '${cf_name}-${ops}-${form}-key_heatmap.png'" > plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set term png size 2000,500" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set title 'CF: ${cf_name} ${form} Key Space Heat Map'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xlabel 'Key Sequence'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set ylabel 'Key access count'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  # If the prefix cut is available, draw a vertical line at each prefix boundary
  while read f1 f2
  do
    echo "set arrow from $f1,0 to $f1,$y nohead lc rgb 'red'" >> plot-${cf_name}-${ops}-${form}-heatmap.gp
  done < "trace.1532381594728669-${ops}-${cf}-${form}_key_prefix_cut.txt"
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}_key_stats.txt' using ${use} notitle w dots lt 2" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  gnuplot plot-${cf_name}-${ops}-${form}-heatmap.gp

To draw the time series map of accessed keys

  #!/bin/bash
  # The query type
  ops="iterator"
  # The upper bound of the X-axis
  x=29233
  # The column family ID
  cf="8"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  # The type of the output file
  form="time_series"
  # The column numbers that will be plotted
  use="3:2"
  # The total time of the tracing duration, in seconds
  y=88000
  echo "set output '${cf_name}-${ops}-${form}-key_heatmap.png'" > plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set term png size 3000,3000" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set title 'CF: ${cf_name} time series'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xlabel 'Key Sequence'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set ylabel 'Time in seconds'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  # If the prefix cut is available, draw a vertical line at each prefix boundary
  while read f1 f2
  do
    echo "set arrow from $f1,0 to $f1,$y nohead lc rgb 'red'" >> plot-${cf_name}-${ops}-${form}-heatmap.gp
  done < "trace.1532381594728669-${ops}-${cf}-accessed_key_prefix_cut.txt"
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}.txt' using ${use} notitle w dots lt 2" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  gnuplot plot-${cf_name}-${ops}-${form}-heatmap.gp

To plot out the QPS

  #!/bin/bash
  # The query type
  ops="iterator"
  # The upper bound of the QPS
  y=5
  # The column family ID
  cf="9"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  # The type of the output file
  form="qps_stats"
  # The column number that will be plotted
  use="1"
  # The total time of the tracing duration, in seconds
  x=88000
  echo "set output '${cf_name}-${ops}-${form}-IO_per_second.png'" > plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set term png size 2000,1200" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set title 'CF: ${cf_name} QPS Over Time'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xlabel 'Time in second'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set ylabel 'QPS'">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}-heatmap.gp
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}.txt' using ${use} notitle with linespoints" >>plot-${cf_name}-${ops}-${form}-heatmap.gp
  gnuplot plot-${cf_name}-${ops}-${form}-heatmap.gp

We can use different tools, scripts, and models to fit the workload statistics. Typically, users can fit the distributions of the key access counts and the key-range access counts to a model, and the QPS can also be modeled. Here, we use Matlab as an example to fit the key access counts, key-range access counts, and QPS. Note that before we model the key-range hotness, we need to decide how to cut the whole key space into key-ranges. Our suggestion is to make the key-range size similar to the number of KV-pairs in an SST file.

Users can try different distributions to fit the model. In this example, we use a two-term exponential model to fit the access count distributions, and a two-term sin() to fit the QPS.

To fit the key access count distribution to the model, the user can run the following script:

  % This script is used to fit the key access count distribution
  % to the two-term exponential distribution and get the parameters
  % The input file has the suffix: accessed_key_stats.txt
  fileID = fopen('trace-get-4-accessed_key_stats.txt'); % replace with your file
  txt = textscan(fileID,'%f %f %f %f');
  fclose(fileID);
  % Get the number of keys that have access count x
  t2=sort(txt{4},'descend');
  % The number of access counts that is used to fit the data.
  % The value depends on the accuracy you demand of the model fitting,
  % and the value of count should never be greater than the size of t2
  count=30000;
  % Generate the access count x
  x=1:1:count;
  x=x';
  % Adjust the matrix and normalize it
  y=t2(1:count);
  y=y/(sum(y));
  figure;
  % Fit the data to the exp2 model
  f=fit(x,y,'exp2')
  % Plot the original data and the fitted line to compare
  plot(f,x,y);

To fit the key-range average access count to the model, the user can run the following script:

  % This script is used to fit the prefix average access count distribution
  % to the two-term exponential distribution and get the parameters
  % The input file has the suffix: accessed_key_prefix_cut.txt
  fileID = fopen('trace-get-4-accessed_key_prefix_cut.txt');
  txt = textscan(fileID,'%f %f %f %f %s');
  fclose(fileID);
  % The per-key (average) access count of each prefix, sorted
  t2=sort(txt{4},'descend');
  % The number of access counts that is used to fit the data.
  % The value depends on the accuracy you demand of the model fitting,
  % and the value of count should never be greater than the size of t2
  count=1000;
  % Generate the access count x
  x=1:1:count;
  x=x';
  % Adjust the matrix and normalize it
  y=t2(1:count);
  y=y/(sum(y));
  % Fit the data to the exp2 model
  figure;
  f=fit(x,y,'exp2')
  % Plot the original data and the fitted line to compare
  plot(f,x,y);

To fit the QPS to the model, the user can run the following script:

  % This script is used to fit the QPS of one query type in one of the column
  % families to the sin'x' model. 'x' can be 1 to 10. With a higher value
  % of 'x', you can get a more accurate fitting of the QPS. However,
  % the model will be more complex and sometimes will be overfitted.
  % The suggestion is to use sin1 or sin2.
  % The input file has the suffix: qps_stats.txt
  fileID = fopen('trace-get-4_qps_stats.txt'); % replace with your file
  txt = textscan(fileID,'%f');
  fclose(fileID);
  t1=txt{1};
  % The input is the queries per second. If you directly use the QPS,
  % you may get a high value of noise. Here, 'n' is the number of QPS
  % values that you want to combine into one average value, such that
  % you reduce it to queries per n*seconds.
  n=10;
  s1 = size(t1, 1);
  M = s1 - mod(s1, n);
  t2 = reshape(t1(1:M), n, []);
  y = transpose(sum(t2, 1) / n);
  % At this point, you need to move the data down to the x-axis;
  % the offset is ave. So the model will be
  % s(x) = a1*sin(b1*x+c1) + a2*sin(b2*x+c2) + ave
  ave = mean(y);
  y=y-ave;
  % Adjust the matrix
  count = size(y,1);
  x=1:1:count;
  x=x';
  % Fit the model to 'sin2' in this example and draw the points and
  % fitted line to compare
  figure;
  s = fit(x,y,'sin2')
  plot(s,x,y);

Users can use the models for further analysis or use them to generate synthetic workloads.

Synthetic Workload Generation based on Models

In the previous section, users fit the traced workload to different models with the fitting functions of Matlab, so that a set of parameters and functions can profile the workload. We focus on 4 variables to profile the workload: 1) value size; 2) KV-pair access hotness; 3) QPS; 4) iterator scan length. According to our current research, the value size and the iterator scan length follow the Generalized Pareto Distribution, whose probability density function is f(x) = (1/sigma) * (1 + k*(x-theta)/sigma)^(-1-1/k). The KV-pair accesses follow a power-law, in which about 80% of the KV-pairs have an access count of less than 4. We sort the keys by access count in descending order and fit them to the models; the two-term power model fits the KV-pair access distribution best, with probability density function f(x) = a*x^b + c. The Sine function fits the QPS best: F(x) = A*sin(B*x + C) + D.

Here is one example of the parameters we got from a workload collected on the Facebook social graph:

  1. Value Size: sigma = 226.409, k = 0.923, theta = 0
  2. KV-pair Access: a = 0.001636, b = -0.7094, and c = 3.217*10^-9
  3. QPS: A = 147.9, B = 8.3*10^-5, C = -1.734, D = 1064.2
  4. Iterator scan length: sigma = 1.747, k = 0.0819, theta = 0
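With these parameters, users can already sample synthetic data outside of db_bench. The following is a minimal Matlab sketch, assuming the Statistics and Machine Learning Toolbox's gprnd function; it draws value sizes and iterator scan lengths from the fitted Generalized Pareto Distributions above:

  % Draw 1000 synthetic value sizes from the fitted Generalized Pareto
  % Distribution of the social graph workload (k = 0.923, sigma = 226.409,
  % theta = 0)
  value_sizes = gprnd(0.923, 226.409, 0, 1000, 1);
  % Draw 1000 synthetic iterator scan lengths (k = 0.0819, sigma = 1.747)
  scan_lengths = gprnd(0.0819, 1.747, 0, 1000, 1);
  % Visually check the sampled value size distribution
  histogram(value_sizes);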

We developed a benchmark called "mixgraph" in db_bench, which uses the four sets of parameters to generate a synthetic workload that is statistically similar to the original one. Note that only workloads that fit the models used for the four variables can be generated by mixgraph. For example, if the value size follows a power distribution instead of the Generalized Pareto Distribution, we cannot use mixgraph to generate the workload.

For more details about the workloads, benchmarks, and models, please refer to the paper published at FAST 2020.

To enable the "mixgraph" benchmark, the user needs to specify:

  ./db_bench --benchmarks="mixgraph"

To set the parameters of the value size distribution (Generalized Pareto Distribution only), the user needs to specify:

  -value_k=<> -value_sigma=<> -value_theta=<>

To set the parameters of the KV-pair access distribution (power distribution only, with c == 0), the user needs to specify:

  -key_dist_a=<> -key_dist_b=<>

To set the parameters of the key-range distribution, which follows the two-term exponential distribution (f(x) = a*exp(b*x) + c*exp(d*x)), the user needs to specify both the key-range hotness distribution parameters and the number of key-ranges (keyrange_num):

  -keyrange_dist_a=<> -keyrange_dist_b=<> -keyrange_dist_c=<> -keyrange_dist_d=<> -keyrange_num=<>

To set the parameters of the QPS model (Sine), the user needs to specify:

  -sine_a=<> -sine_b=<> -sine_c=<> -sine_d=<> -sine_mix_rate_interval_milliseconds=<>

The generated QPS at time x follows sine_a*sin(sine_b*x + sine_c) + sine_d. The mix rate interval sets how often the query rate is corrected to follow the distribution; the smaller it is, the more closely the generated QPS fits the model.

To set the parameters of the iterator scan length distribution (Generalized Pareto Distribution only), the user needs to specify:

  -iter_k=<> -iter_sigma=<> -iter_theta=<>

Users need to specify the query ratios between Get, Put, and Seek, so that the generated mixed workload can be similar to the social graph workload (hence the name "mix graph"). Make sure the ratios sum to 1:

  -mix_get_ratio=<> -mix_put_ratio=<> -mix_seek_ratio=<>

Finally, users need to specify how many queries they want to execute:

  -reads=<>

and how many KV-pairs are in the current testing DB:

  -num=<>

The -num value, together with the aforementioned distributions, determines which keys the queries access.

Here is one example that can be directly used to generate a workload that simulates the queries of ZippyDB, as described in the published paper. The workload has 0.42 billion queries over 50 million KV-pairs in total, and we use 30 key-ranges. Note that if the user runs the benchmark following the 24-hour Sine period, it takes about 22-24 hours. To speed up the benchmarking, the user can increase sine_d to a larger value, such as 45000, to increase the workload intensity, and also reduce sine_b accordingly.

  ./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=4500 --perf_level=2 -reads=420000000 -num=50000000 -key_size=48