Presto Verifier

    During each Presto release, Verifier is run to ensure that there is no correctness regression.

    In a MySQL database, create the following table and load it with the queries you would like to run:

    Next, create a file:

    1. source-query.suites=suite
    2. source-query.database=jdbc:mysql://localhost:3306/mydb?user=my_username&password=my_password
    3. control.hosts=127.0.0.1
    4. control.http-port=8080
    5. control.jdbc-port=8080
    6. control.application-name=verifier-test
    7. test.hosts=127.0.0.1
    8. test.http-port=8081
    9. test.jdbc-port=8081
    10. test.application-name=verifier-test
    11. test-id=1

    Download presto-verifier-0.272.1-executable.jar and rename it to verifier.jar. To run the Verifier:

    Verifier Procedures

    The following steps summarize the workflow of Verifier.

    • Importing Source Queries

      • Reads the list of source queries (query pairs with configuration) from the MySQL table.
    • Query Pre-processing and Filtering

      • Applies overrides to the catalog, schema, username, and password of each query.

      • Filters queries according to whitelist and blacklist. Whitelist is applied before blacklist.

      • Filters out queries with invalid syntax.

      • Filters out queries not supported for validation. Select, Insert, and CreateTableAsSelect are supported.

    • Query rewriting

      • Rewrites Select queries to CreateTableAsSelect

        • Column names are determined by running the Select query with LIMIT 0.

        • Artificial names are used for unnamed columns.

    1. - Rewrites `Insert` and `CreateTableAsSelect` queries to have their table names replaced.
    2. - Constructs a setup query to create the table necessary for an `Insert` query.
    • Query Execution

      • Conceptually, Verifier is configured with a control cluster and a test cluster. However, they may be pointed to the same Presto cluster for certain tests.

      • For each source query, executes the following queries in order.

        • Control setup queries

        • Control query

        • Test setup queries

        • Test query

        • Control and test teardown queries

    • Results Comparison

      • For Select, Insert, and CreateTableAsSelect queries, results are written into temporary tables.

      • Constructs and runs the checksum queries for both control and test.

      • Verifies table schema and row count are the same for the control and the test result table.

      • Verifies checksums are matching for each column. See Column Checksums for special handling of different column types.

      • See for handling of non-deterministic queries.

    • Emitting Results

      • Verification results can be exported as JSON, or human readable text.

    For each column in the control/test query, one or more columns are generated in the checksum queries.

    • Floating Point Columns

      • For DOUBLE and REAL columns, 4 columns are generated for verification:

        • Sum of the finite values of the column

        • NAN count of the column

        • Positive infinity count of the column

        • Negative infinity count of the column

      • Checks if NAN count, positive and negative infinity count matches.

      • Checks the nullity of control sum and test sum.

      • If either control mean or test mean very close 0, checks if both are close to 0.

      • Checks the relative error between control sum and test sum.

    1. - For an array column `arr` of type `array(E)`:
    2. - If `E` is not orderable, array checksum is `checksum(arr)`.
    3. - If `E` is orderable, array checksum `coalesce(checksum(try(array_sort(arr))), checksum(arr))`.
    • Map Columns

      • 4 columns are generated for verification:

        • Sum of the cardinality

        • Checksum of the map

        • Array checksum of the key set

        • Array checksum of the value set

    • Row Columns

      • Checksums row fields recursively according to the type of the fields.
    • For all other column types, generates a simple checksum using the checksum() function.

    Determinism

    A result mismatch, either a row count mismatch or a column mismatch, can be caused by non-deterministic query features. To avoid false alerts, we perform determinism analysis for the control query. If a query is found non-deterministic, we skip the verification as it does not provide insights.

    Determinism analysis follows the following steps. If a query is found non-deterministic at any point, the analysis will conclude.

    • Non-deterministic catalogs can be specified with determinism.non-deterministic-catalog. If a query references any table from those catalogs, the query is considered non-deterministic.

    • Runs the control query again and compares the results with the initial control query run.

    • If a query has a LIMIT n clause but no ORDER BY clause at the top level:

      • If the resulting row count is greater than n, treats the control query as non-deterministic.

    The differences in configuration, including cluster size, can cause a query to succeed on the control cluster but fail on the test cluster. A checksum query can also fail, which may be due to limitation of Presto or Presto Verifier. Thus, we allow Verifier to automatically resolve certain query failures.

    • EXCEEDED_GLOBAL_MEMORY_LIMIT: Resolves if the control query uses more memory than the test query.

    • EXCEEDED_TIME_LIMIT: Resolves unconditionally.

    • TOO_MANY_HIVE_PARTITIONS: Resolves if the test cluster does not have enough workers to make sure the number of partitions assigned to each worker stays within the limit.

    • COMPILER_ERROR, GENERATED_BYTECODE_TOO_LARGE: Resolves if the control checksum query fails with this error. If the control query has too many columns, generated checksum queries might be too large in certain cases.

    In cases of result mismatches, Verifier may be giving noisy signals, and we allow Verifier to automatically resolve certain mismatches.

    • Structured-typed Columns: If array element or map key/value contains floating point types, column checksum is unlikely to match.

      • For an array column, resolve if the element type contains floating point types and the cardinality checksum matches.

      • For a map column, resolve the mismatch when both of the following conditions are true:

        • The cardinality checksum matches.

        • The checksum of the key or value that does not contains floating point types matches.

      • Resolve a test case only when all columns are resolved.

    Explain Mode

    In explain mode, Verifier checks whether source queries can be explained instead of whether they produces the same results. Verification is marked as succeeded when both control query and test query can be explained.

    The field matchType in the output event can be used as an indicator whether there are plan differences between the control run and the test run.

    For non-DML queries, the control query and the plan comparison are skipped.

    Verifier can be extended for further behavioral changes in addition to configuration properties.

    shows the components that be extended. Implement the abstract class and create a command line wrapper similar to PrestoVerifier.

    Configuration Reference

    The following configurations control the behavior of query metadata modification before verification starts. Counterparts are also available for test queries with prefix control being replaced with test.

    The following configurations control the behavior of query execution on the control cluster. Counterparts are also available for test clusters with prefix control being replaced with test.