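1.2.3.1.3 Deploy a TiDB Cluster on Alibaba Cloud

This document describes how to deploy a TiDB cluster on Alibaba Cloud from a personal computer running Linux or macOS.

1. Prerequisites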

aliyun-cli >= 3.0.15, with aliyun-cli configured
kubectl >= 1.12
helm >= 2.11.0
jq >= 1.6
terraform 0.12.*
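
Before going further, it can help to confirm that the tools are on the PATH. The following is a minimal sketch: the helper name missing_tools is made up for illustration, the binary names (aliyun for aliyun-cli, kubectl, helm, jq, terraform) are assumptions, and the script checks only presence, not the version constraints listed above.

```shell
# Print the name of every given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Binary names are assumptions: aliyun (aliyun-cli), kubectl, helm, jq, terraform.
missing="$(missing_tools aliyun kubectl helm jq terraform)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All listed tools are on PATH."
fi
```

Versions still need to be checked by hand, for example with each tool's own version command.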

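Alternatively, you can perform the operations in the Alibaba Cloud Cloud Shell service, where all of these tools are pre-installed and configured.

Deploying a complete cluster requires the following permissions: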

AliyunECSFullAccess
AliyunESSFullAccess
AliyunVPCFullAccess
AliyunSLBFullAccess
AliyunCSFullAccess
AliyunEIPFullAccess
AliyunECIFullAccess
AliyunVPNGatewayFullAccess
AliyunNATGatewayFullAccess

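2. Overview

With the default configuration, the following resources are created:

A new VPC
An ECS instance used as the bastion host
A managed ACK (Alibaba Cloud Kubernetes) cluster and a set of worker nodes:
- 2 ECS instances (2 cores, 2 GB) in one auto-scaling group. The default auto-scaling group of a managed Kubernetes cluster must contain at least two instances, which host the system services such as CoreDNS
- 3 instances in one auto-scaling group, for deploying PD
- 3 ecs.i2.2xlarge instances in one auto-scaling group, for deploying TiKV
- 2 ecs.c5.4xlarge instances in one auto-scaling group, for deploying TiDB
- 1 ecs.c5.xlarge instance in one auto-scaling group, for deploying the monitoring components
- A 100 GB cloud disk for storing monitoring data

All instances except those in the default auto-scaling group are deployed across availability zones. An auto-scaling group keeps the number of healthy instances in the cluster equal to the desired number, so when a node failure or even an availability zone failure occurs, the auto-scaling group automatically creates new instances to maintain service availability.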

3. Install and deploy

(1) Set the target region and the Alibaba Cloud access key (you can also enter them when prompted while running the terraform command):
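
A minimal sketch of this step, assuming the scripts in deploy/aliyun declare Terraform variables named ALICLOUD_REGION, ALICLOUD_ACCESS_KEY, and ALICLOUD_SECRET_KEY. These names are an assumption and should be checked against the variable definitions in that directory. Terraform reads any environment variable prefixed with TF_VAR_ as the value of the variable with the same name. The region value is only an example.

```shell
# Variable names are assumed, not confirmed by this document; verify them first.
export TF_VAR_ALICLOUD_REGION="cn-hangzhou"            # example region
export TF_VAR_ALICLOUD_ACCESS_KEY="<YOUR_ACCESS_KEY>"  # replace with your access key
export TF_VAR_ALICLOUD_SECRET_KEY="<YOUR_SECRET_KEY>"  # replace with your secret key
```

The placeholders are quoted so that the shell does not interpret the angle brackets as redirections.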

(2) Install with Terraform:

```shell
git clone --depth=1 https://github.com/pingcap/tidb-operator && \
cd tidb-operator/deploy/aliyun
```
```shell
terraform init
```
During `apply`, you need to enter `yes` to confirm the execution:
```shell
terraform apply
```
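If `terraform apply` reports an error, fix the problem according to the error message (for example, a missing permission) and run `terraform apply` again. When the apply completes, it prints the key information of the cluster, for example: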
```
Apply complete! Resources: 3 added, 0 changed, 1 destroyed.
Outputs:
bastion_ip = 47.96.174.214
cluster_id = c2d9b20854a194f158ef2bc8ea946f20e
kubeconfig_file = /tidb-operator/deploy/aliyun/credentials/kubeconfig
monitor_endpoint = 121.199.195.236:3000
region = cn-hangzhou
ssh_key_file = /tidb-operator/deploy/aliyun/credentials/my-cluster-keyZ.pem
tidb_endpoint = 172.21.5.171:4000
tidb_version = v3.0.0
vpc_id = vpc-bp1v8i5rwsc7yh8dwyep5
```

(3) You can then use kubectl or helm to operate the cluster:

```shell
export KUBECONFIG=$PWD/credentials/kubeconfig
```
```shell
kubectl version
```
```shell
helm ls
```

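4. Access the database

You can connect to the TiDB cluster through the bastion host for testing. The required information is available in the output printed after installation: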

ssh -i credentials/<cluster_name>-key.pem root@<bastion_ip>
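
Once logged in to the bastion host, a MySQL client can connect to the address reported as tidb_endpoint. The sketch below splits the sample endpoint from the installation output into host and port and prints the resulting command. It assumes that a mysql client is available on the bastion host and that the default root user has no password, neither of which this document states.

```shell
# Sample value copied from the installation output; replace it with your own.
tidb_endpoint="172.21.5.171:4000"

# Split "host:port" with POSIX parameter expansion.
tidb_host="${tidb_endpoint%:*}"   # strip the shortest ":..." suffix
tidb_port="${tidb_endpoint##*:}"  # strip the longest "...:" prefix

# Print the command instead of running it, since no client may be installed here.
echo "mysql -h ${tidb_host} -P ${tidb_port} -u root"
```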

5. Monitoring

Visit <monitor_endpoint> to view the Grafana dashboards. The endpoint is in the output printed after installation. The default credentials are:

Username: admin
Password: admin

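6. Upgrade the TiDB cluster

Set the tidb_version parameter in terraform.tfvars and run terraform apply again to complete the upgrade.

The upgrade may take a long time. You can watch its progress with the following command: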

kubectl get pods --namespace <tidb_cluster_name> -o wide --watch

7. Scale the TiDB cluster horizontally

Set tikv_count and tidb_count in terraform.tfvars as needed, then run terraform apply again to scale the TiDB cluster horizontally.
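
As an illustration, the sketch below writes example values to a scratch file rather than touching terraform.tfvars directly. The file name scale.tfvars.example and the counts are hypothetical.

```shell
# Write illustrative overrides to a scratch file (the counts are examples only).
cat > scale.tfvars.example <<'EOF'
tikv_count = 5
tidb_count = 3
EOF

# Show what was written.
cat scale.tfvars.example

# After copying these lines into terraform.tfvars, review and apply (not run here):
#   terraform plan
#   terraform apply
```

Running terraform plan first shows which resources Terraform intends to add or remove before anything changes.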

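8. Destroy the cluster

To destroy the cluster, run terraform destroy.

If the Kubernetes cluster was not created successfully, destroy reports an error and cannot clean up normally. In that case, you need to manually remove the Kubernetes resources from the local state: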

terraform state list
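
terraform state list only prints the resource addresses held in the local state. Removing one is done with terraform state rm followed by the address. The sketch below shows one way to find the relevant address, assuming the ACK cluster is tracked by a resource whose type starts with alicloud_cs_. The helper name filter_ack_resources and the sample addresses are made up for illustration.

```shell
# Keep only addresses that look like ACK (Container Service) resources.
# Assumption: such resource types start with "alicloud_cs_".
filter_ack_resources() {
  grep 'alicloud_cs_'
}

# Demo on a made-up listing; real addresses come from `terraform state list`.
printf '%s\n' \
  'module.example.alicloud_vpc.vpc' \
  'module.example.alicloud_cs_managed_kubernetes.k8s' | filter_ack_resources

# Against a real state (not run here):
#   terraform state list | filter_ack_resources
#   terraform state rm '<address printed by the previous command>'
```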

Destroying the cluster takes a long time.

9. Configuration

Set operator_helm_values in terraform.tfvars:

  operator_helm_values = "./my-operator-values.yaml"

Set operator_helm_values in main.tf:

  operator_helm_values = file("./my-operator-values.yaml")

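In addition, the Terraform scripts create a new VPC by default. To use an existing VPC, set vpc_id in variable.tf. Note that when an existing VPC is used, no Kubernetes nodes are deployed in availability zones that have no vswitch configured.

The TiDB cluster uses ./my-cluster.yaml as its values.yaml configuration file; modify this file to configure the TiDB cluster. For the supported configuration items, see the document on TiDB cluster configuration on Kubernetes.

10. Manage multiple TiDB clusters

To manage multiple TiDB clusters in one Kubernetes cluster, edit ./main.tf and add tidb-cluster declarations as needed, for example: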

module "tidb-cluster-dev" {
  source = "../modules/aliyun/tidb-cluster"
  providers = {
    helm = helm.default
  }
  cluster_name = "dev-cluster"
  ack          = module.tidb-operator
  pd_count                   = 1
  tikv_count                 = 1
  tidb_count                 = 1
  override_values            = file("dev-cluster.yaml")
}
module "tidb-cluster-staging" {
  source = "../modules/aliyun/tidb-cluster"
  providers = {
    helm = helm.default
  }
  cluster_name = "staging-cluster"
  ack          = module.tidb-operator
  pd_count                   = 3
  tikv_count                 = 3
  tidb_count                 = 2
  override_values            = file("staging-cluster.yaml")
}

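Note that cluster_name must be unique across the TiDB clusters. The following are all the configurable parameters of the tidb-cluster module:

11. Manage multiple Kubernetes clusters

It is recommended to manage each Kubernetes cluster with a separate Terraform module (a Terraform module is a directory that contains .tf scripts).

deploy/aliyun is in fact a composition of several reusable Terraform scripts from deploy/modules. To manage multiple clusters, perform the following operations in the root directory of the tidb-operator project:

(1) First, create a directory for each cluster, for example:

(2) Write your own script based on the main.tf in deploy/aliyun. The following is a simple example: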
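
A minimal sketch of this step; the directory name aliyun-staging is hypothetical. Placing the directory directly under deploy/ keeps relative module sources such as ../modules/aliyun/tidb-operator valid.

```shell
# Run from the tidb-operator project root; "aliyun-staging" is a hypothetical name.
mkdir -p deploy/aliyun-staging
```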

```hcl
provider "alicloud" {
    region     = <YOUR_REGION>
    access_key = <YOUR_ACCESS_KEY>
    secret_key = <YOUR_SECRET_KEY>
}
module "tidb-operator" {
    source     = "../modules/aliyun/tidb-operator"
    region          = <YOUR_REGION>
    access_key      = <YOUR_ACCESS_KEY>
    secret_key      = <YOUR_SECRET_KEY>
    cluster_name    = "example-cluster"
    key_file        = "ssh-key.pem"
    kubeconfig_file = "kubeconfig"
}
provider "helm" {
    alias    = "default"
    insecure = true
    install_tiller = false
    kubernetes {
        config_path = module.tidb-operator.kubeconfig_filename
    }
}
module "tidb-cluster" {
    source = "../modules/aliyun/tidb-cluster"
    providers = {
        helm = helm.default
    }
    cluster_name = "example-cluster"
    ack          = module.tidb-operator
}
module "bastion" {
    source = "../modules/aliyun/bastion"
    bastion_name             = "example-bastion"
    key_name                 = module.tidb-operator.key_name
    vpc_id                   = module.tidb-operator.vpc_id
    vswitch_id               = module.tidb-operator.vswitch_ids[0]
    enable_ssh_to_worker     = true
    worker_security_group_id = module.tidb-operator.security_group_id
}
```

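The script above can be customized freely. For example, if you do not need a bastion host, remove the module "bastion" declaration.

Currently, settings such as the service CIDR and the node instance types cannot be modified after the cluster is created.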