FAQ
Node Requirement
CPU Architecture: x86_64 only. Pigsty does not support ARM yet.
CPU Number: at least 1 core for a common node, and at least 2 cores for the admin node.
Memory: at least 1GB for the common node and 2GB for the admin node.
Using at least 3~4 x (2C / 4G / 100G) nodes for serious production deployment is recommended.
OS Requirement
Pigsty is now developed and tested on CentOS 7.9, Rocky 8.6 & 9.0. RHEL, Alma, Oracle, and any EL-compatible distribution also work.
We strongly recommend using EL 7.9, 8.6, and 9.0 to avoid wasting effort on RPM troubleshooting.
And PLEASE USE FRESH NEW NODES to avoid any unexpected issues.
Versioning Policy
!> Please always use a version-specific release; do not use the GitHub master branch unless you know what you are doing.
Pigsty uses semantic version numbers such as <major>.<minor>.<release>. Alpha/Beta/RC versions are suffixed to the version number as -a1, -b1, -rc1.
Major updates mean fundamental changes and massive features; minor updates suggest new features, package version bumps, and minor API changes; release updates mean bug fixes and doc updates, and they do not change the offline package version (i.e., v1.0.1 and v1.0.0 will use the same pkg.tgz).
Pigsty tries to release a Minor Release every 1-3 months and a Major Release every 1-2 years.
Download
Where to download the Pigsty source code?
!> bash -c "$(curl -fsSL http://download.pigsty.cc/get)"
The above command will automatically download the latest stable version of pigsty.tgz and extract it to the ~/pigsty dir. You can also manually download a specific version of the Pigsty source code from the GitHub release page.
If you need to install it in an environment without Internet access, you can download it in advance and upload it to the production server via scp/sftp.
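For example, a minimal sketch (assuming version v2.0.2 and the GitHub release asset naming; adjust version and node address to your case):
curl -L https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-v2.0.2.tgz -o ~/pigsty.tgz   # download source tarball
scp ~/pigsty.tgz <node>:~/pigsty.tgz                                                                 # upload to the production server
ssh <node> 'tar -xf ~/pigsty.tgz -C ~'                                                               # extract to ~/pigsty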
How to accelerate RPM downloading from the upstream repo?
Consider using the upstream repo mirror of your region. Define them with repo_upstream and region.
For example, with region = china, the baseurl with the key china will be used instead of the default one.
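For instance, a sketch of how region selects a baseurl key in a repo_upstream entry (the repo entry itself is illustrative, not the actual default):
region: china                                  # pick the 'china' baseurl below
repo_upstream:
  - name: pigsty
    baseurl:
      default: https://repo.example.com/yum    # hypothetical default mirror
      china: https://mirror.example.cn/yum     # hypothetical china mirror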
If a firewall or GFW blocks some repo, consider using a proxy (e.g., via the proxy_env parameter) to bypass that.
Where to download pigsty offline packages
Offline packages can be downloaded during bootstrap, or you can download them directly from the GitHub release page:
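A sketch, assuming the v2.0.2 release and its offline package naming for EL9 (adjust version and OS accordingly):
https://github.com/Vonng/pigsty/releases/download/v2.0.2/pigsty-pkg-v2.0.2.el9.x86_64.tgz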
Configuration
What does configure do?
!> Detect the environment, generate the configuration, enable the offline package (optional), and install the essential tool Ansible.
After downloading the Pigsty source package and unpacking it, you may have to execute ./configure to complete the environment configuration. This is optional if you already know how to configure Pigsty properly.
The configure procedure will detect your node environment and generate a pigsty config file, pigsty.yml, for you.
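For example (the -i flag for passing the primary IP is an assumption; run ./configure --help to confirm the available options):
./configure                   # interactive wizard: detect env, prompt for the primary IP
./configure -i 10.10.10.10    # assumption: pass the primary IP non-interactively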
What is the Pigsty config file?
!> pigsty.yml under the pigsty home dir is the default config file.
Pigsty uses a single config file, pigsty.yml, to describe the entire environment, and you can define everything there. There are many config examples under files/pigsty for your reference.
You can pass -i <path> to playbooks to use another config file. For example, to install redis according to another config, redis.yml:
./redis.yml -i files/pigsty/redis.yml
How to use the CMDB as config inventory
The default config file path is specified in ansible.cfg: inventory = pigsty.yml
You can switch to a dynamic CMDB inventory with bin/inventory_cmdb, and switch back to the local config file with bin/inventory_conf. You must also load the current config file inventory into the CMDB with bin/inventory_load.
If the CMDB is used, you must edit the inventory in the database rather than in the config file.
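A sketch of the typical switch-over sequence (using the bin/inventory_* scripts named above):
bin/inventory_load    # load the current pigsty.yml inventory into the CMDB
bin/inventory_cmdb    # switch ansible.cfg to the dynamic CMDB inventory
bin/inventory_conf    # switch back to the local pigsty.yml config file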
What is the IP address placeholder in the config file?
!> Pigsty uses 10.10.10.10 as a placeholder for the current node IP, which will be replaced with the primary IP of the current node during configuration.
When configure detects multiple NICs with multiple IPs on the current node, the config wizard will prompt for the primary IP to be used, i.e., the IP used to access the node from the internal network. Note: do not use a public IP.
This IP will be used to replace 10.10.10.10 in the config file template.
Which parameters need your attention?
!> Usually, in a singleton installation, there is no need to make any adjustments to the config files.
Pigsty provides 265 config parameters to customize the entire infra/node/etcd/minio/pgsql. However, there are a few parameters that can be adjusted in advance if needed:
- infra_portal: the domain names used when accessing web service components (some services can only be accessed through the Nginx proxy using a domain name).
- Pigsty assumes a /data dir exists to hold all data; you can adjust these paths if your data disk mount point differs.
- Don't forget to change the passwords in the config file for your production deployment.
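For instance, a sketch of such adjustments in pigsty.yml (the infra_portal entries are illustrative; password parameter names like grafana_admin_password and pg_admin_password follow pigsty's convention — verify them against your version's parameter list):
infra_portal:                    # domain names for web services behind the Nginx proxy
  home    : { domain: h.pigsty }
  grafana : { domain: g.pigsty, endpoint: "${admin_ip}:3000" }
grafana_admin_password: <strong-password>   # change default passwords for production
pg_admin_password: <strong-password>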
What was executed during installation?
!> When running make install, the ansible playbook install.yml will be invoked to install everything on all nodes, which will:
- Install the INFRA module on the current node.
- Install the NODE module on the current node.
- Install the ETCD module on the current node.
- The MinIO module is optional and will not be installed by default.
- Install the PGSQL module on the current node.
How to resolve RPM conflict?
There is a slight chance that RPM conflicts occur during node/infra/pgsql package installation.
The simplest way to resolve this is to install without offline packages, which will download directly from the upstream repo.
If there are only a few problematic RPMs, you can use a trick to fix the yum repo quickly:
rm -rf /www/pigsty/repo_complete # delete the repo_complete flag file to mark this repo incomplete
rm -rf SomeBrokenRPMPackages # delete problematic RPMs
./infra.yml -t repo_upstream # write upstream repos. you can also use /etc/yum.repos.d/backup/*
./infra.yml -t repo_pkg # download rpms according to your current OS
How to create local VMs with vagrant
!> The first time you use Vagrant to pull up a particular OS, it will download the corresponding BOX.
Pigsty sandbox uses the generic/rocky9 box by default, and Vagrant will download it the first time the VM is started.
Using a proxy may increase the download speed. Box only needs to be downloaded once, and will be reused when recreating the sandbox.
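For example, you can pre-fetch the box once with standard Vagrant commands:
vagrant box add generic/rocky9    # one-time box download (a proxy may speed this up)
vagrant up                        # cached boxes are reused on later runs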
RPM errors on Aliyun CentOS 7.9
!> Aliyun CentOS 7.9 servers have the DNS caching service nscd installed by default. Just remove it.
Run yum remove -y nscd on all nodes to resolve this issue; with Ansible, you can do it in batch:
ansible all -b -a 'yum remove -y nscd'
Monitoring
Performance impact of monitoring exporter
Not much: a scrape takes around 200ms every 10–15 seconds.
How to monitor an existing PostgreSQL instance?
Check PGSQL Monitor for details.
How to remove monitor targets from prometheus?
./pgsql-rm.yml -t prometheus -l <cls> # remove prometheus targets of cluster 'cls'
Or
bin/pgmon-rm <ins> # shortcut for removing prometheus targets of pgsql instance 'ins'
INFRA
Which components are included in INFRA
- Ansible for automation, deployment, and administration;
- Nginx for exposing any WebUI service and serving the yum repo;
- Self-signed CA for SSL/TLS certificates;
- Prometheus for monitoring metrics;
- Grafana for monitoring/visualization;
- Loki for log collection;
- AlertManager for alert aggregation;
- Chronyd for NTP time sync on the admin node;
- DNSMasq for DNS registration and resolution;
- ETCD as DCS for PGSQL HA (dedicated module);
- PostgreSQL on meta nodes as CMDB (optional);
- Docker for stateless applications & tools (optional).
How to restore Prometheus targets
If you accidentally deleted the Prometheus targets dir, you can register monitoring targets to Prometheus again with:
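A sketch, assuming each module playbook exposes a register_prometheus task tag (mirroring the register_grafana and register_nginx tags shown below; verify the tag name in your version):
./infra.yml -t register_prometheus   # register infra targets to prometheus
./node.yml  -t register_prometheus   # register node targets to prometheus
./etcd.yml  -t register_prometheus   # register etcd targets to prometheus
./pgsql.yml -t register_prometheus   # register pgsql targets to prometheus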
How to restore Grafana datasource
PGSQL databases defined in pg_databases are registered as Grafana datasources by default.
If you accidentally deleted a registered postgres datasource in Grafana, you can register them again with:
./pgsql.yml -t register_grafana # register all pgsql database (in pg_databases) as grafana datasource
How to restore the HAProxy admin page proxy
The haproxy admin page is proxied by Nginx under the default server.
If you accidentally deleted the registered haproxy proxy settings in /etc/nginx/conf.d/haproxy, you can restore them again with:
./node.yml -t register_nginx # register all haproxy admin page proxy settings to nginx on infra nodes
How to restore the DNS registration
PGSQL cluster/instance domain names are registered to /etc/hosts.d/<name> on infra nodes by default.
You can restore them again with the following:
./pgsql.yml -t pg_dns # register pg DNS names to dnsmasq on infra nodes
How to expose new Nginx upstream service
If you wish to expose a new WebUI service via the Nginx portal, you can add the service definition to the parameter.
And re-run ./infra.yml -t nginx_config,nginx_launch
to update & apply the Nginx configuration.
If you wish to access it via HTTPS, you must remove files/pki/csr/pigsty.csr to force regeneration of the Nginx SSL/TLS certificate, so that it includes the new upstream's domain name.
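For example, a sketch of a new entry in infra_portal (the service name, domain, and endpoint here are hypothetical):
infra_portal:
  myapp: { domain: myapp.pigsty, endpoint: "10.10.10.10:8080" }   # hypothetical upstream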
How to configure NTP service?
!> If NTP is not configured, use a public NTP service or sync time with the admin node.
If your nodes already have NTP configured, you can leave it there by setting node_ntp_enabled to false.
Otherwise, if you have Internet access, you can use public NTP services such as pool.ntp.org.
If you don't have Internet access, you can at least sync time with the admin node with the following:
node_ntp_servers: # NTP servers in /etc/chrony.conf
- pool cn.pool.ntp.org iburst
- pool ${admin_ip} iburst # assume non-admin nodes do not have internet access
How to force sync time on nodes?
!> Use chronyc to sync time. You have to configure the NTP service first.
ansible all -b -a 'chronyc -a makestep' # sync time
You can replace all with any group or host IP address to limit the execution scope.
Remote nodes are not accessible via SSH commands.
!> Specify a different port via the host instance-level ansible connection parameters.
Consider using Ansible connection parameters if the target machine is hidden behind an SSH springboard machine, or if some customizations prevent direct access via ssh <ip>. An alternate SSH port can be specified with ansible_port, and an SSH alias with ansible_host.
pg-test:
  vars: { pg_cluster: pg-test }
  hosts:
    10.10.10.11: { pg_seq: 1, pg_role: primary, ansible_host: node-1 }
    10.10.10.12: { pg_seq: 2, pg_role: replica, ansible_port: 22223, ansible_user: admin }
    10.10.10.13: { pg_seq: 3, pg_role: offline, ansible_port: 22224 }
Password required for remote node SSH and SUDO
!> Use the -k and -K parameters, enter the password at the prompt, and refer to admin provisioning.
When performing deployments and changes, the admin user used must have ssh and sudo privileges on all nodes. Passwordless access is not required: you can pass in the ssh and sudo passwords via the -k|-K parameters when executing the playbook, or even run the playbook as another user via -e ansible_user=<another_user>. However, Pigsty strongly recommends configuring passwordless SSH login with passwordless sudo for the admin user.
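For example (standard Ansible flags; install.yml stands in for any playbook here):
./install.yml -k -K                         # prompt for SSH and sudo passwords
./install.yml -k -K -e ansible_user=admin   # run as another admin user 'admin'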
Create an admin user with the existing admin user.
!> ./node.yml -k -K -e ansible_user=<another_admin> -t node_admin
This will create an admin user specified by node_admin_username with the existing one on that node.
Exposing node services with HAProxy
!> You can expose services with haproxy_services in node.yml.
Here's an example of exposing the MinIO service with it:
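A sketch (ports, IPs, health-check options, and server entries are illustrative; adapt them to your MinIO deployment):
haproxy_services:                  # expose minio on port 9002 via haproxy
  - name: minio
    port: 9002
    options:
      - option httpchk
      - option http-keep-alive
      - http-check send meth OPTIONS uri /minio/health/live
      - http-check expect status 200
    servers:
      - { name: minio-1, ip: 10.10.10.10, port: 9000, options: 'check-ssl ca-file /etc/pki/ca.crt check port 9000' }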
Why are my nodes' /etc/yum.repos.d/* nuked?
Pigsty will try to include all dependencies in the local yum repo on infra nodes. This repo file will be added according to node_repo_local_urls, and existing repo files will be removed by default according to the default value of node_repo_remove. This prevents nodes from using Internet repos and avoids assorted repo issues.
If you want to keep existing repo files, just set node_repo_remove to false.
ETCD
What is the impact of ETCD failure?
[ETCD](/en/docs/etcd) availability is critical for the PGSQL cluster's HA, which is guaranteed by using multiple nodes. With a 3-node ETCD cluster, one node can fail while the other two still function normally; with a 5-node ETCD cluster, a two-node failure can be tolerated. If more than half of the ETCD nodes are down, the ETCD cluster and its service become unavailable. Before Patroni 3.0, this could lead to a global [PGSQL](/en/docs/pgsql) outage: all primaries would be demoted and reject write requests.
Since pigsty 2.0, the patroni 3.0 DCS failsafe mode is enabled by default, which will LOCK the PGSQL cluster status if the ETCD cluster is unavailable and all PGSQL members are still known to the primary.
The PGSQL cluster can still function normally, but you must recover the ETCD cluster ASAP. (you can’t configure the PGSQL cluster through patroni if etcd is down)
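This corresponds to Patroni's failsafe_mode switch in the Patroni dynamic configuration (a sketch; pigsty enables it by default since v2.0):
failsafe_mode: true   # Patroni 3.0+: keep the primary running while DCS is unreachable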
How to use an existing external etcd cluster?
The hard-coded group, `etcd`, will be used as DCS servers for PGSQL. You can initialize them with `etcd.yml`, or assume it is an existing external etcd cluster.
To use an existing external etcd cluster, define them as usual and make sure your current etcd cluster certificate is signed by the same CA as your self-signed CA for PGSQL.
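For example, a sketch of defining three existing external etcd nodes in the etcd group (etcd_seq and etcd_cluster follow pigsty's etcd module convention):
etcd:
  hosts:
    10.10.10.11: { etcd_seq: 1 }
    10.10.10.12: { etcd_seq: 2 }
    10.10.10.13: { etcd_seq: 3 }
  vars: { etcd_cluster: etcd }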
How to add a new member to the existing etcd cluster?
!> Check Add member to etcd cluster
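Roughly, the procedure looks like this (standard etcdctl learner commands; the etcd_launch task tag and the placeholders follow pigsty convention — verify against the linked doc):
etcdctl member add <ins_name> --learner --peer-urls=https://<new_ins_ip>:2380   # add the new member as a learner
./etcd.yml -l <new_ins_ip> -t etcd_launch                                       # init the new etcd instance
etcdctl member promote <new_ins_server_id>                                      # promote the learner to a voting member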
How to remove a member from an existing etcd cluster?
!> Check Remove member from etcd cluster
etcdctl member remove <etcd_server_id> # kick member out of the cluster (on admin node)
./etcd.yml -l <ins_ip> -t etcd_purge # purge etcd instance
MINIO
Fail to launch a multi-node / multi-drive MinIO cluster.
In Multi-Driver or Multi-Node mode, MinIO will refuse to start if the data dir is not a valid mount point.
How to deploy a multi-node multi-drive MinIO cluster?
!> Check Create Multi-Node Multi-Driver MinIO Cluster
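As a sketch, a multi-node multi-drive MinIO cluster definition might look like this (minio_seq, minio_cluster, and minio_data follow pigsty's minio module convention; see the linked doc for the authoritative example):
minio:
  hosts:
    10.10.10.10: { minio_seq: 1 }
    10.10.10.11: { minio_seq: 2 }
    10.10.10.12: { minio_seq: 3 }
  vars:
    minio_cluster: minio
    minio_data: '/data{1...2}'   # two drives per node; each must be a real mount point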
How to add a member to the existing MinIO cluster?
!> You'd better plan the MinIO cluster before deployment, since adding members requires a global restart.
Check the MinIO documentation on cluster expansion for details.
How to use a HA MinIO deployment for PGSQL?
!> Access the HA MinIO cluster with an optional load balancer and different ports.
Here is an example: Access MinIO Service
ABORT due to existing redis instance
!> Use redis_clean = true and redis_safeguard = false to force clean redis data.
This happens when you run redis.yml to init a redis instance that is already running, and redis_clean is set to false.
If redis_clean is set to true (and redis_safeguard is set to false, too), the redis.yml playbook will remove the existing redis instance and re-init it as a new one, which makes the redis.yml playbook fully idempotent.
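For example (standard ansible -e overrides; <cls> is a placeholder for your redis cluster):
./redis.yml -l <cls> -e redis_clean=true -e redis_safeguard=false   # force re-init the existing redis instance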
ABORT due to redis_safeguard enabled
!> This happens when removing a redis instance with redis_safeguard set to true.
You can disable redis_safeguard to remove the Redis instance. This is what redis_safeguard is for.
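For example (assuming the redis-rm.yml removal playbook accepts the same -e override):
./redis-rm.yml -l <cls> -e redis_safeguard=false   # force remove the redis instance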
How to add a single new redis instance on this node?
!> Use bin/redis-add <ip> <port> to deploy a new redis instance on that node.
How to remove a single redis instance from the node?
!> Use bin/redis-rm <ip> <port> to remove a single redis instance from a node.
PGSQL
ABORT due to postgres exists
!> Set pg_clean = true and pg_safeguard = false to force clean postgres data during pgsql.yml.
This happens when you run pgsql.yml on a node with postgres running, and pg_clean is set to false.
If pg_clean is true (and pg_safeguard is false, too), the pgsql.yml playbook will remove the existing pgsql data and re-init it as a new one, which makes this playbook fully idempotent.
You can still purge the existing PostgreSQL data by using the special task tag pg_purge:
./pgsql.yml -t pg_clean # honor pg_clean and pg_safeguard
./pgsql.yml -t pg_purge # ignore pg_clean and pg_safeguard
ABORT due to pg_safeguard enabled
!> Disable pg_safeguard to remove the Postgres instance.
If pg_safeguard is enabled, you cannot remove a running pgsql instance with bin/pgsql-rm and the pgsql-rm.yml playbook.
To disable pg_safeguard, set it to false in the inventory or pass -e pg_safeguard=false as a CLI arg to the playbook:
./pgsql-rm.yml -e pg_safeguard=false -l <cls_to_remove> # force override pg_safeguard
Fail to wait for postgres/patroni primary
This usually happens when the cluster is misconfigured, or the previous primary is improperly removed. (e.g., trash metadata in DCS with the same cluster name).
You must check /pg/log/* to find the reason.
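If stale DCS metadata is the culprit, a hypothetical cleanup (assuming etcd as DCS and pigsty's default /pg namespace; inspect carefully before deleting anything):
etcdctl get /pg/<cls> --prefix --keys-only   # inspect leftover metadata for cluster <cls>
etcdctl del /pg/<cls> --prefix               # remove it, then re-run pgsql.yml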
Fail to wait for postgres/patroni replica
There are several possible reasons:
Failed Immediately: usually this happens because of misconfiguration, network issues, broken DCS metadata, etc.; you have to inspect /pg/log to find the actual reason.
Failed After a While: This may be due to source instance data corruption. Check PGSQL FAQ: How to create replicas when data is corrupted?
Timeout: if the wait for postgres replica task takes 30 minutes or more and fails due to timeout, this is common for a huge cluster (e.g., 1TB+, which may take hours to create a replica). In this case, the underlying replica creation is still proceeding. You can check the cluster status with pg list <cls> and wait until the replica catches up with the primary, then continue the remaining tasks:
./pgsql.yml -t pg_hba,pg_backup,pgbouncer,pg_vip,pg_dns,pg_service,pg_exporter,pg_register
How to enable hugepages for PostgreSQL?
!> Use node_hugepage_count and node_hugepage_ratio, or /pg/bin/pg-tune-hugepage.
If you plan to enable hugepages, consider using node_hugepage_count and node_hugepage_ratio, and apply them with ./node.yml -t node_tune.
It's good to allocate enough hugepages before postgres starts, and shrink them later with pg-tune-hugepage.
If your postgres is already running, you can use /pg/bin/pg-tune-hugepage to enable hugepages on the fly:
sync; echo 3 > /proc/sys/vm/drop_caches # drop system cache (ready for performance impact)
sudo /pg/bin/pg-tune-hugepage # write nr_hugepages to /etc/sysctl.d/hugepage.conf
pg restart <cls> # restart postgres to use hugepage
How to guarantee zero data loss during failover?
!> Use the crit.yml template, set pg_rpo to 0, or config the cluster with synchronous mode.
Consider using Sync Standby and Quorum Commit to guarantee zero data loss during failover.
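For example, a sketch of the relevant cluster parameters (crit.yml and pg_rpo are named above; placing them in the cluster's vars follows pigsty convention):
pg-test:
  vars:
    pg_cluster: pg-test
    pg_conf: crit.yml   # use the crit template with synchronous replication
    pg_rpo: 0           # tolerate zero data loss on failover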
How to survive from disk full?
!> rm -rf /pg/dummy will free some emergency space.
The pg_dummy_filesize parameter is set to 64MB by default. Consider increasing it to 8GB or larger in a production environment.
The dummy file will be placed at /pg/dummy, on the same disk as the PGSQL main data disk. You can remove that file to free some emergency space; at least you will be able to run some shell scripts on that node.
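For example (pg_dummy_filesize is named above; 8GiB is an illustrative production value):
pg_dummy_filesize: 8GiB   # reserve more space at /pg/dummy for emergencies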
How to create replicas when data is corrupted?
!> Disable clonefrom on bad instances and reload the patroni config.
Pigsty sets the clonefrom: true tag in every instance's patroni config, which marks the instance as available for cloning replicas.
If an instance has corrupt data files, you can set clonefrom: false to avoid pulling data from the evil instance. To do so:
$ vi /pg/bin/patroni.yml
tags:
nofailover: false
clonefrom: true # ----------> change to false
noloadbalance: false
nosync: false
version: '15'
spec: '4C.8G.50G'
conf: 'oltp.yml'
$ systemctl reload patroni