Architecture

    In production environments, PGSQL is organized into clusters. A cluster is a logical entity consisting of a set of database instances associated by primary-replica replication. Each database cluster is a self-organizing business service unit consisting of at least one database instance.


    Let’s get started with the ER diagram. There are four types of core entities in Pigsty’s PGSQL module:

    • PGSQL Cluster: An autonomous PostgreSQL business unit, used as the top-level namespace for other entities.
    • PGSQL Service: A named abstraction of cluster capability that routes traffic and exposes postgres services via node ports.
    • PGSQL Instance: A single postgres server, which is a group of running processes and database files on a single node.
    • PGSQL Node: An abstraction of hardware resources, which can be bare metal, a virtual machine, or even k8s pods.

    Naming Convention

    • Cluster name should be a valid domain name, without any dot:
    • Service name should be prefixed with the cluster name and suffixed with a single word such as primary, replica, offline, or delayed, joined by -.
    • Instance name is prefixed with the cluster name and suffixed with an integer, joined by -, e.g., ${cluster}-${seq}.
    • Node is identified by its IP address, and its hostname is usually the same as the instance name since they are deployed 1:1.
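
    The naming rules above can be sketched in a few lines of Python. This is an illustration only, not Pigsty's own code: the helper names are hypothetical, and the regular expression assumes that "a valid domain name without any dot" means a single lowercase DNS label.

```python
import re

# Assumption: a dot-free domain name is modelled as one DNS label
# (letters, digits, hyphens; no leading or trailing hyphen; max 63 chars).
_DNS_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name looks like a single DNS label."""
    return bool(_DNS_LABEL.match(name))


def instance_name(cluster: str, seq: int) -> str:
    """Instance name: cluster name and an integer, joined by '-'."""
    return f"{cluster}-{seq}"


def service_name(cluster: str, suffix: str) -> str:
    """Service name: cluster name and a single word, joined by '-'."""
    return f"{cluster}-{suffix}"
```

    For example, instance_name("pg-test", 2) yields pg-test-2, and service_name("pg-test", "replica") yields pg-test-replica.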

    Pigsty uses identity parameters to identify entities: PG_ID.

    In addition to the node IP address, three parameters: , pg_role, and are the minimum set of parameters necessary to define a postgres cluster. Take the sandbox testing cluster pg-test as an example:

    There are:

    • One Cluster: The cluster is named pg-test.
    • Two Roles: primary and replica.
    • Three Instances: The cluster consists of three instances: pg-test-1, pg-test-2, pg-test-3.
    • Three Nodes: The cluster is deployed on three nodes: 10.10.10.11, 10.10.10.12, and 10.10.10.13.
    • Four services:
      • read-write service:
      • directly connected management service: pg-test-default
      • offline read service:

    And in the monitoring system (prometheus/grafana/loki), corresponding metrics will be labeled with these identities:

    pg_up{cls="pg-meta", ins="pg-meta-1", ip="10.10.10.10", job="pgsql"}
    pg_up{cls="pg-test", ins="pg-test-1", ip="10.10.10.11", job="pgsql"}
    pg_up{cls="pg-test", ins="pg-test-2", ip="10.10.10.12", job="pgsql"}
    pg_up{cls="pg-test", ins="pg-test-3", ip="10.10.10.13", job="pgsql"}
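
    To make the link between identity and labels concrete, here is a minimal sketch that derives such a series string from a cluster name, a sequence number, and a node IP. The function name is hypothetical, and the sketch only formats text; it does not talk to prometheus.

```python
def pg_up_series(cluster: str, seq: int, ip: str, job: str = "pgsql") -> str:
    """Format a pg_up series selector from identity values."""
    ins = f"{cluster}-{seq}"  # instance name follows ${cluster}-${seq}
    return f'pg_up{{cls="{cluster}", ins="{ins}", ip="{ip}", job="{job}"}}'


# The three pg-test members, as (cluster, seq, ip) tuples.
PG_TEST_MEMBERS = [
    ("pg-test", 1, "10.10.10.11"),
    ("pg-test", 2, "10.10.10.12"),
    ("pg-test", 3, "10.10.10.13"),
]

if __name__ == "__main__":
    for member in PG_TEST_MEMBERS:
        print(pg_up_series(*member))
```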

    Here are the PostgreSQL module components and their interactions, from top to bottom:

    • Cluster DNS is resolved by DNSMASQ on infra nodes
    • Cluster VIP is managed by vip-manager and is bound to the cluster primary.
      • vip-manager acquires the cluster leader info written by patroni directly from the etcd cluster.
    • Cluster services are exposed by HAProxy on nodes; services are distinguished by node ports (543x).
      • HAProxy port 9101: monitoring metrics, stats, and admin page
      • HAProxy port 5433: default service that routes to the primary pgbouncer:
      • HAProxy port 5434: default service that routes to the replica pgbouncer: replica
      • HAProxy port 5436: default service that routes to the primary postgres:
      • HAProxy port 5438: default service that routes to the offline postgres: offline
      • HAProxy routes traffic based on health check information provided by patroni.
    • Pgbouncer is a connection pool middleware that buffers connections, exposes extra metrics, and brings extra flexibility @ port 6432
      • Pgbouncer is stateless and deployed 1:1 with the postgres server, connecting through a local unix socket.
      • Default/Offline services always bypass pgbouncer and connect to the target postgres directly.
    • Postgres provides relational database services @ port 5432
      • Installing the PGSQL module on multiple nodes will automatically form an HA cluster based on streaming replication
      • PostgreSQL is supervised by patroni by default.
    • Patroni supervises the PostgreSQL server @ port 8008 by default
      • Patroni spawns postgres servers as child processes
      • Patroni uses etcd as DCS: config storage, failure detection, and leader election.
      • Patroni provides postgres information through health checks, which are used by HAProxy
      • Patroni metrics will be scraped by prometheus on infra nodes
    • PG Exporter will expose postgres metrics @ port 9630
      • PostgreSQL’s metrics will be scraped by prometheus on infra nodes
    • Pgbouncer Exporter will expose pgbouncer metrics @ port 9631
      • Pgbouncer metrics will be scraped by prometheus on infra nodes
    • pgBackRest works on a local repo by default
      • If local (default) is used as the backup repo, pgBackRest will create a local repo under the primary’s pg_fs_bkup
      • If minio is used as the backup repo, pgBackRest will create the repo on a dedicated minio cluster in
    • Postgres-related logs (postgres, pgbouncer, patroni, pgbackrest) are exposed by promtail @ port 9080
      • Promtail sends logs to loki on infra nodes
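
    The service ports in the list above can be summarized as a small lookup table. The sketch below is illustrative only: the Service type and helper names are hypothetical, the service keys "primary" and "default" for ports 5433 and 5436 are inferred from the routing targets and the naming convention, and the user and database in the connection string are placeholders.

```python
from typing import NamedTuple


class Service(NamedTuple):
    port: int            # HAProxy node port
    target_role: str     # which instances receive the traffic
    via_pgbouncer: bool  # True: goes through pgbouncer; False: straight to postgres


# Assumption: keys "primary" and "default" are inferred, see the note above.
SERVICES = {
    "primary": Service(5433, "primary", True),
    "replica": Service(5434, "replica", True),
    "default": Service(5436, "primary", False),
    "offline": Service(5438, "offline", False),
}

PGBOUNCER_PORT = 6432
POSTGRES_PORT = 5432


def backend_port(service: str) -> int:
    """Port on the database node that the service ultimately reaches."""
    return PGBOUNCER_PORT if SERVICES[service].via_pgbouncer else POSTGRES_PORT


def service_url(host: str, service: str, user: str = "postgres", dbname: str = "postgres") -> str:
    """Build a connection URL for a service; user and dbname are placeholders."""
    return f"postgres://{user}@{host}:{SERVICES[service].port}/{dbname}"
```

    With this table, backend_port("replica") resolves to 6432 (pgbouncer), while backend_port("offline") resolves to 5432 (postgres directly), matching the bypass rule for Default/Offline services.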

    Figure: pigsty-infra


    Pigsty’s PostgreSQL cluster has batteries-included high availability powered by patroni, etcd, and haproxy.

    When the primary fails, one of the replicas will be promoted to primary automatically, and read-write traffic will be routed to the new primary immediately. The impact: write queries will be blocked for 15 ~ 40s until the new leader is elected.

    When a replica fails, read-only traffic will be routed to the other replicas; if all replicas fail, read-only traffic will fall back to the primary. The impact is very small: a few running queries on that replica will abort due to connection reset.
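
    The read-only fallback rule can be modelled as a small selection function. This is a simplified sketch, not the actual HAProxy configuration: the member dictionaries and the function name are hypothetical, and health is assumed to be known already.

```python
def read_only_backends(members):
    """Pick backends for read-only traffic.

    members: list of dicts with keys 'name', 'role', and 'healthy'.
    Healthy replicas are preferred; the primary is used only when
    no healthy replica is left.
    """
    replicas = [m["name"] for m in members
                if m["role"] == "replica" and m["healthy"]]
    if replicas:
        return replicas
    return [m["name"] for m in members
            if m["role"] == "primary" and m["healthy"]]
```

    With one replica down, traffic goes to the remaining replica; with both replicas down, the function returns the primary.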

    Failure detection is done by patroni and etcd. The leader holds a lease; if the leader fails, the lease is released upon timeout, and the other instances elect a new leader to take over.
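
    The lease mechanism can be illustrated with a toy in-memory model. It is a sketch of the general idea only, assuming a single lease with a fixed ttl; it does not use the etcd or patroni APIs, and the class name is hypothetical. Timestamps are passed in explicitly so the behavior is deterministic.

```python
class LeaderLease:
    """Toy leader lease: whoever holds it is the leader until it expires."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, member: str, now: float) -> bool:
        """Take or renew the lease.

        Succeeds if the lease is free, has expired, or is already
        held by the same member (a renewal).
        """
        if self.holder is None or now >= self.expires_at or self.holder == member:
            self.holder = member
            self.expires_at = now + self.ttl
            return True
        return False


if __name__ == "__main__":
    lease = LeaderLease(ttl=30)
    lease.try_acquire("pg-test-1", now=0)    # pg-test-1 becomes leader
    lease.try_acquire("pg-test-1", now=10)   # renewal, lease now expires at t=40
    # pg-test-1 crashes after t=10 and stops renewing.
    print(lease.try_acquire("pg-test-2", now=39))  # lease still valid
    print(lease.try_acquire("pg-test-2", now=40))  # lease expired, takeover
```

    The gap between the last renewal and the expiry is the window during which no new leader can be elected, which is why the ttl bounds the failover wait time.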

    The ttl can be tuned with pg_rto, which is 30s by default. Increasing it causes a longer failover wait time, while decreasing it increases the false-positive failover rate (e.g. due to network jitter).
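
    The trade-off can be made tangible with a deliberately simplified model: a transient network blip triggers an unnecessary failover only if it lasts at least as long as the ttl, while a real failure may go undetected for up to roughly one ttl. The blip durations below are made-up illustration data, and the function name is hypothetical.

```python
def ttl_tradeoff(blip_durations_s, ttl_s):
    """Summarize the effect of a ttl under a simplified lease model.

    blip_durations_s: durations (seconds) of transient outages that
    would have recovered on their own.
    """
    false_positives = sum(1 for d in blip_durations_s if d >= ttl_s)
    return {
        "false_positive_failovers": false_positives,
        "max_detection_wait_s": ttl_s,  # lease renewed just before a real crash
    }


# Hypothetical blips observed on a jittery network, in seconds.
BLIPS = [2, 5, 12, 20]

if __name__ == "__main__":
    for ttl in (10, 30, 60):
        print(ttl, ttl_tradeoff(BLIPS, ttl))
```

    In this toy data set, a ttl of 10s would turn two harmless blips into failovers, whereas 30s or 60s would ignore all of them at the cost of a longer wait after a real failure.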

    Pigsty uses availability-first mode by default: when the primary fails, it tries to fail over ASAP, and data not yet replicated to the replica may be lost (usually 100KB). The max potential data loss is controlled by , which is 1MB by default.
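
    One way to picture the data-loss bound is as a filter on failover candidates: a replica is eligible for promotion only if its replication lag is within the allowed limit. The sketch below assumes the limit is applied this way; the function name, the constant name, and the lag figures are hypothetical.

```python
MAX_LAG_BYTES = 1024 * 1024  # 1MB, the default limit mentioned above


def failover_candidates(replica_lag_bytes, max_lag_bytes=MAX_LAG_BYTES):
    """Return replicas eligible for promotion, least-lagging first.

    replica_lag_bytes: dict mapping replica name to replication lag in bytes.
    """
    eligible = [(lag, name) for name, lag in replica_lag_bytes.items()
                if lag <= max_lag_bytes]
    return [name for lag, name in sorted(eligible)]
```

    For instance, with pg-test-2 lagging by 100KB and pg-test-3 by 5MB, only pg-test-2 would qualify under the default limit.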

    Last modified 2023-02-27: add v2.0 images and docs (5b09f12)