TiDB in Kubernetes Sysbench Performance Test

    Test purpose:

    • To test the performance of TiDB on a typical public cloud platform
    • To test the influence that the public cloud platform, the network, the CPU, and different Pod networks have on the performance of TiDB

    In this test:

    • TiDB 3.0.1 and TiDB Operator 1.0.0 are used.
    • Three instances are deployed for PD, TiDB, and TiKV respectively.
    • Each component is configured as shown below; configuration items that are not listed use their default values.

    PD:

    TiDB:

    level = "error"
    [prepared-plan-cache]
    enabled = true
    [tikv-client]
    max-batch-wait-time = 2000000

    TiKV:

    log-level = "error"
    [server]
    status-addr = "0.0.0.0:20180"
    grpc-concurrency = 6
    [readpool.storage]
    normal-concurrency = 10
    [rocksdb.defaultcf]
    block-cache-size = "14GB"
    [rocksdb.writecf]
    block-cache-size = "8GB"
    [rocksdb.lockcf]
    block-cache-size = "1GB"
    [raftstore]
    apply-pool-size = 3
    store-pool-size = 3

    TiDB parameter configuration

    Machine types

    For the test in a single AZ (Availability Zone), the following machine types are chosen:

    For the test (2019.08) that compares the result in multiple AZs with that in a single AZ, the c2 machine type is not simultaneously available in three AZs within the same GCP region, so the following machine types are chosen:

    Component | Instance type | Count
    --------- | ------------- | -----
    PD | n1-standard-4 | 3
    TiKV | n1-standard-16 | 3
    TiDB | n1-standard-16 | 3
    Sysbench | n1-standard-16 | 3

    Sysbench, the pressure test tool, demands a lot of CPU in the high-concurrency read test. Therefore, it is recommended that you use high-specification, multi-core machines so that the test client does not become the bottleneck.

    Note

    The available machine types vary among GCP regions, and disk performance also differs across regions. Therefore, only machines in us-central1 are used in this test.
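
    To check whether a machine type is offered in every zone you plan to use, you can list it per zone with gcloud. The command below is an illustrative sketch; the zone names are assumptions:

    # List the zones in us-central1 that offer the c2-standard-16 machine type.
    gcloud compute machine-types list \
        --filter="name=c2-standard-16" \
        --zones=us-central1-a,us-central1-b,us-central1-c,us-central1-f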

    Disk

    The NVMe local disks on GKE are still in the Alpha phase; using them requires a special application and they are not generally available. Therefore, the SCSI interface type is used for all local SSD disks in this test. With reference to the relevant recommendations, the discard,nobarrier options are added to the mount options. Below is an example:
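
    A representative mount command is sketched below; the local SSD device path and mount directory are placeholders:

    # Mount a local SSD with the discard,nobarrier options (paths are placeholders).
    sudo mkdir -p /mnt/disks/[MNT_DIR]
    sudo mount -o defaults,discard,nobarrier /dev/[LOCAL_SSD_ID] /mnt/disks/[MNT_DIR]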

    Network

    GKE uses the more scalable and powerful VPC-Native mode as its network mode. In the performance comparison, TiDB is tested in Pod network mode and Host network mode respectively.

    CPU

    • In the test on a single AZ cluster, the c2-standard-16 machine type is chosen for TiDB/TiKV.
    • In the comparison test between a single AZ cluster and a multiple-AZ cluster, the c2-standard-16 machine type cannot be simultaneously adopted in three AZs within the same GCP region, so the n1-standard-16 machine type is chosen.

    Operating system and parameters

    GKE supports two operating systems: COS (Container-Optimized OS) and Ubuntu. The Point Select test is conducted on both systems and the results are compared. Other tests are conducted only on Ubuntu.

    The kernel parameters are configured as follows:

    sysctl net.core.somaxconn=32768
    sysctl vm.swappiness=0
    sysctl net.ipv4.tcp_syncookies=0

    The maximum number of open files is configured as 1000000.
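
    The open-file limit can be raised, for example, with ulimit or persisted via limits.conf. This is a sketch; the exact mechanism depends on how the nodes are provisioned:

    # Raise the open-file limit for the current session.
    ulimit -n 1000000
    # Or persist it for all users (illustrative limits.conf entries).
    echo '*  soft  nofile  1000000' | sudo tee -a /etc/security/limits.conf
    echo '*  hard  nofile  1000000' | sudo tee -a /etc/security/limits.conf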

    In this test, the version of sysbench is 1.0.17.

    Before the test, the prewarm command of oltp_common is used to warm up data.

    Initialization

    ${tidb_host} is the address of the TiDB database, which is specified according to the actual test needs: for example, a Pod IP, a Service domain name, a Host IP, or a Load Balancer IP (the same below).
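
    The test data can be prepared with a command along the following lines. This is a sketch that mirrors the warming-up and pressure test commands below; the table count and table size should match the test plan:

    sysbench \
        --mysql-host=${tidb_host} \
        --mysql-port=4000 \
        --mysql-user=root \
        --mysql-db=sbtest \
        --threads=16 \
        --report-interval=10 \
        --db-driver=mysql \
        --rand-type=uniform \
        --rand-seed=$RANDOM \
        --tables=16 \
        --table-size=10000000 \
        oltp_common \
        prepare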

    Warming-up

    sysbench \
        --mysql-host=${tidb_host} \
        --mysql-port=4000 \
        --mysql-user=root \
        --mysql-db=sbtest \
        --threads=16 \
        --report-interval=10 \
        --db-driver=mysql \
        --rand-type=uniform \
        --rand-seed=$RANDOM \
        --tables=16 \
        --table-size=10000000 \
        oltp_common \
        prewarm

    Pressure test

    sysbench \
        --mysql-host=${tidb_host} \
        --mysql-port=4000 \
        --mysql-user=root \
        --mysql-db=sbtest \
        --time=600 \
        --threads=${threads} \
        --report-interval=10 \
        --db-driver=mysql \
        --rand-type=uniform \
        --rand-seed=$RANDOM \
        --tables=16 \
        --table-size=10000000 \
        ${test} \
        run

    ${test} is the test case of sysbench. In this test, oltp_point_select, oltp_update_index, oltp_update_no_index, and oltp_read_write are chosen as ${test}.
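
    For example, the full matrix of test cases and thread counts in this report could be driven by a wrapper like the one below. This is an illustrative sketch, not part of the original test scripts; the result file names are made up:

    for test in oltp_point_select oltp_update_index oltp_update_no_index oltp_read_write; do
        for threads in 150 300 600 900 1200 1500; do
            sysbench \
                --mysql-host=${tidb_host} \
                --mysql-port=4000 \
                --mysql-user=root \
                --mysql-db=sbtest \
                --time=600 \
                --threads=${threads} \
                --report-interval=10 \
                --db-driver=mysql \
                --rand-type=uniform \
                --rand-seed=$RANDOM \
                --tables=16 \
                --table-size=10000000 \
                ${test} \
                run > "result-${test}-${threads}.log"
        done
    done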

    In single AZ

    Pod Network vs Host Network

    Kubernetes allows Pods to run in Host network mode. This way of deployment is suitable when a TiDB instance occupies the whole machine and no Pod conflicts occur. The Point Select test is conducted in both modes.

    In this test, the operating system is COS.

    Pod Network:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 246386.44 | 0.95
    300 | 346557.39 | 1.55
    600 | 396715.66 | 2.86
    900 | 407437.96 | 4.18
    1200 | 415138.00 | 5.47
    1500 | 419034.43 | 6.91

    Host Network:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 255981.11 | 1.06
    300 | 366482.22 | 1.50
    600 | 421279.84 | 2.71
    900 | 438730.81 | 3.96
    1200 | 441084.13 | 5.28
    1500 | 447659.15 | 6.67

    QPS comparison:

    Latency comparison:

    Pod vs Host Network

    From the images above, the performance in Host network mode is slightly better than that in Pod network mode.

    Ubuntu vs COS

    GKE provides Ubuntu and COS for each node. In this test, the Point Select test of TiDB is conducted on both systems.

    The network mode is Host.

    COS:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 255981.11 | 1.06
    300 | 366482.22 | 1.50
    600 | 421279.84 | 2.71
    900 | 438730.81 | 3.96
    1200 | 441084.13 | 5.28
    1500 | 447659.15 | 6.67

    Ubuntu:

    QPS comparison:

    Latency comparison:

    COS vs Ubuntu

    From the images above, TiDB performs better on Ubuntu than on COS in the Point Select test.

    Note

    • This test is conducted only on a single test case, and it indicates only that performance might be affected by different operating systems, optimizations, and default settings. Therefore, PingCAP makes no recommendation on the operating system.
    • COS is officially recommended by GKE, because it is optimized for containers and offers substantial improvements in security and disk performance.

    Kubernetes Service vs GCP LoadBalancer

    After TiDB is deployed on Kubernetes, there are two ways of accessing TiDB: via Kubernetes Service inside the cluster, or via Load Balancer IP outside the cluster. TiDB is tested in both ways.
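
    For example, assuming a cluster named tidb-cluster in the tidb namespace (illustrative names, following TiDB Operator's <cluster-name>-tidb Service naming), the two access paths look roughly like this:

    # Inside the cluster: connect through the Kubernetes Service domain name.
    mysql -h tidb-cluster-tidb.tidb.svc -P 4000 -u root
    # Outside the cluster: connect through the LoadBalancer IP (placeholder variable).
    mysql -h ${load_balancer_ip} -P 4000 -u root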

    In this test, the operating system is Ubuntu and the network mode is Host.

    Service:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 290690.51 | 0.74
    300 | 422941.17 | 1.10
    600 | 476663.44 | 2.14
    900 | 484405.99 | 3.25
    1200 | 489220.93 | 4.33
    1500 | 489988.97 | 5.47

    Load Balancer:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 255981.11 | 1.06
    300 | 366482.22 | 1.50
    600 | 421279.84 | 2.71
    900 | 438730.81 | 3.96
    1200 | 441084.13 | 5.28
    1500 | 447659.15 | 6.67

    QPS comparison:

    Service vs Load Balancer

    Latency comparison:

    From the images above, TiDB performs better in the Point Select test when accessed via Kubernetes Service than when accessed via the GCP Load Balancer.

    n1-standard-16 vs c2-standard-16

    In the Point Select read test, TiDB’s CPU usage exceeds 1400% (16 cores) while TiKV’s CPU usage is about 1000% (16 cores).
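
    As a rough way to observe this, per-Pod CPU usage can be checked with the metrics-server based kubectl top command (the namespace name here is an assumption); 1400% corresponds to about 14000m of CPU:

    # Show per-Pod CPU and memory usage in the TiDB cluster's namespace.
    kubectl top pod -n tidb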

    In this test, the operating system is Ubuntu, the network mode is Host, and TiDB is accessed via Kubernetes Service.

    n1-standard-16:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 203879.49 | 1.37
    300 | 272175.71 | 2.3
    600 | 287805.13 | 4.1
    900 | 295871.31 | 6.21
    1200 | 294765.83 | 8.43
    1500 | 298619.31 | 10.27

    c2-standard-16:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 290690.51 | 0.74
    300 | 422941.17 | 1.10
    600 | 476663.44 | 2.14
    900 | 484405.99 | 3.25
    1200 | 489220.93 | 4.33
    1500 | 489988.97 | 5.47

    QPS comparison:

    n1-standard-16 vs c2-standard-16

    Latency comparison:

    The Point Select test above is conducted on different operating systems and in different network modes, and the results are compared. In addition, the other tests in the OLTP test set are conducted on Ubuntu in Host network mode, with the TiDB cluster accessed via Kubernetes Service.

    OLTP Update Index

    OLTP Update Index

    OLTP Update Non Index

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 9230.60 | 23.95
    300 | 16543.63 | 54.83
    600 | 23551.01 | 61.08
    900 | 31100.10 | 65.65
    1200 | 33942.60 | 54.83
    1500 | 42603.13 | 125.52

    OLTP Update No Index

    OLTP Read Write

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 60732.84 | 69.29
    300 | 91005.98 | 90.78
    600 | 110517.67 | 167.44
    900 | 119866.38 | 235.74
    1200 | 125615.89 | 282.25
    1500 | 128501.34 | 344.082

    OLTP Read Write

    Performance comparison between single AZ and multiple AZs

    The network latency of communication across multiple AZs in GCP is slightly higher than that within a single zone. In this test, machines of the same configuration are used in single-AZ and multi-AZ deployments under the same test standard, in order to learn how the extra latency across AZs affects the performance of TiDB.
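
    The extra inter-zone latency between nodes can be sanity-checked, for example, with a simple ping from a node in one AZ to a node in another (the target address is a placeholder):

    # Measure round-trip latency from a node in one AZ to a node in another AZ.
    ping -c 20 ${node_ip_in_other_az}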

    Single AZ:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 203879.49 | 1.37
    300 | 272175.71 | 2.30
    600 | 287805.13 | 4.10
    900 | 295871.31 | 6.21
    1200 | 294765.83 | 8.43
    1500 | 298619.31 | 10.27

    Multiple AZs:

    Threads | QPS | 95% latency (ms)
    ------- | --- | ----------------
    150 | 141027.10 | 1.93
    300 | 220205.85 | 2.91
    600 | 250464.34 | 5.47
    900 | 257717.41 | 7.70
    1200 | 258835.24 | 10.09
    1500 | 280114.00 | 12.75

    QPS comparison:

    Single Zonal vs Regional

    Latency comparison:

    From the images above, the impact of network latency goes down as the concurrency pressure increases. In this situation, the extra network latency is no longer the main bottleneck of performance.

    This is a test of TiDB, run with sysbench, in Kubernetes deployed on a typical public cloud platform. The purpose is to learn how different factors affect the performance of TiDB. On the whole, these influencing factors include the following:

    • In the VPC-Native mode, TiDB performs slightly better in Host network than in Pod network. (The difference, ~7%, is measured in QPS. Performance differences caused by the factors below are also measured in QPS.)
    • In Host network, TiDB performs better (~9%) in the read test on the Ubuntu image provided by GCP than on COS.
    • The TiDB performance is slightly lower (~5%) when it is accessed from outside the cluster via the Load Balancer.
    • Increased latency among nodes across multiple AZs has a certain impact on the TiDB performance (~30% down to ~6%; the impact diminishes as the concurrency increases).
    • The QPS is greatly improved (50% ~ 60%) when the Point Select read test is run on compute-optimized machine types (c2) rather than general-purpose types (n1), because the test mainly consumes CPU resources.

    Note

    • The sysbench test cases cannot fully represent actual business scenarios. It is recommended that you simulate your actual business in the test and weigh all the costs involved (machines, the differences between operating systems, the limitations of Host network, and so on).