Cluster Status Monitoring

    You need a user granted a role that includes the Cluster Management authorization. For example, you can log in to the console as a user who already has this authorization, or create a new role with the authorization and assign it to a user.

    Cluster Status Monitoring

    1. If you have enabled the multi-cluster feature and imported member clusters, you can select a specific cluster to view its application resources. If you have not enabled the feature, skip to the next step.

    2. Choose Cluster Status under Monitoring & Alerting to see the overview of cluster status monitoring, including Cluster Node Status, Component Status, Cluster Resource Usage, etcd Monitoring, and Service Component Monitoring.

    Cluster node status

    1. Cluster Node Status displays the status of all nodes and marks the active ones separately. To view the real-time resource usage of all nodes, click Node Online Status to go to the Cluster Nodes page.

    2. In Cluster Nodes, click the node name to view usage details in Running Status, including Resource Usage, Allocated Resources, and Health Status.

    3. Click the Monitoring tab to view how the node performs over a given period based on different metrics, including CPU Usage, Average CPU Load, Memory Usage, Disk Usage, Inode Usage, IOPS, Disk Throughput, and Network Bandwidth.

      Tip

      You can customize the time range from the drop-down list in the upper-right corner to view historical data.

    Component status

    KubeSphere monitors the health of the service components in the cluster. When a key component malfunctions, the system may become unavailable. If a component fails, the monitoring mechanism of KubeSphere notifies tenants of the issue so that they can quickly locate the problem and take corrective action.

    1. All components are listed in this section. Components marked in green are functioning normally, while those marked in orange require special attention because they signal potential issues.

      Tip

      Components marked in orange may turn green after a period of time. The reasons vary, for example image pull retries or pod recreation. You can click a component to see its service details.

    Cluster resource usage

    Cluster Resource Usage displays CPU Usage, Memory Usage, Disk Usage, and Pods for all nodes in the cluster. Click a pie chart on the left to switch indicators. The line chart on the right then shows the trend of the selected indicator over a period.

    Monitoring data in Physical Resource Monitoring help users better observe their physical resources and establish normal standards for resource and cluster performance. KubeSphere allows users to view cluster monitoring data within the last 7 days, including CPU Usage, Memory Usage, Average CPU Load (1 minute/5 minutes/15 minutes), Disk Usage, Inode Usage, Disk Throughput (read/write), IOPS (read/write), Network Bandwidth, and Pod Status. You can customize the time range and time interval to view historical monitoring data of physical resources in KubeSphere. The following sections briefly introduce each monitoring indicator.

    CPU usage

    CPU usage shows how CPU resources are used over a period. If the CPU usage of the platform soars during a certain period, first locate the process that consumes the most CPU resources. For example, for Java applications, you may expect a CPU usage spike in the case of memory leaks or infinite loops in the code.
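    As an illustration of what a CPU usage percentage over a period means, the sketch below derives it from two samples of cumulative CPU time counters, such as the first eight values (user through steal) on the `cpu` line of `/proc/stat` on Linux. This is a generic calculation under that assumption, not a description of how KubeSphere collects its data; the function name is hypothetical.

```python
def cpu_usage_percent(prev, curr):
    """Return CPU usage (%) between two samples of cumulative CPU counters.

    Each sample is a sequence ordered as: user, nice, system, idle, iowait,
    irq, softirq, steal (assumed layout, as on the Linux /proc/stat cpu line).
    """
    total_delta = sum(curr) - sum(prev)
    if total_delta <= 0:
        # No time elapsed between samples (or counters reset): report 0.
        return 0.0
    # Time spent idle or waiting for I/O counts as "not busy".
    idle_delta = (curr[3] + curr[4]) - (prev[3] + prev[4])
    return 100.0 * (total_delta - idle_delta) / total_delta


# Illustrative counter values, not measurements from a real cluster.
earlier = (100, 0, 50, 800, 50, 0, 0, 0)
later = (160, 0, 70, 1100, 70, 0, 0, 0)
usage = cpu_usage_percent(earlier, later)
```

    A sustained high value from a calculation like this is the cue to look for the process that consumes the most CPU.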

    Memory usage

    Memory is one of the key components of a machine and serves as a bridge for communication with the CPU, so memory performance has a great impact on the machine. Data loading, thread concurrency, and I/O buffering all depend on memory when a program is running. The size of available memory determines whether a program can run normally and how well it performs. Memory usage reflects how memory resources are used within the cluster as a whole, displayed as the percentage of available memory in use at a given moment.
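    To make the percentage concrete, here is a minimal sketch that computes memory usage for a single Linux machine from text in the format of `/proc/meminfo`. It assumes the `MemTotal` and `MemAvailable` fields are present and treats everything that is not available as in use. The function name and sample values are illustrative, and KubeSphere may compute its cluster-wide figure differently.

```python
def memory_usage_percent(meminfo_text):
    """Return memory in use (%) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # value in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Illustrative sample, not real measurements.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""
used_percent = memory_usage_percent(SAMPLE_MEMINFO)
```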

    Average CPU load

    Average CPU load is the average number of processes in a runnable or uninterruptible state per unit of time. In other words, it is the average number of active processes. Note that average CPU load and CPU usage are not directly related. Ideally, the average load equals the number of CPUs, so you need to take the number of CPUs into account when you look at the average load. A system is overloaded only when the average load is greater than the number of CPUs.

    KubeSphere provides users with three different time periods to view the average load: 1 minute, 5 minutes, and 15 minutes. Normally, it is recommended that you review all of them to gain a comprehensive understanding of average CPU load:

    • If the curves of 1 minute / 5 minutes / 15 minutes are similar within a certain period, it indicates that the CPU load of the cluster is relatively stable.
    • If the 1-minute value in a certain period, or at a specific point in time, is much greater than the 15-minute value, the load has been increasing over the last minute and you need to keep observing. Once the 1-minute value exceeds the number of CPUs, the system may be overloaded, and you need to analyze the source of the problem further.
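    The rules of thumb above can be sketched as a small function. The name `assess_load` and the factor of 1.5 used for "much greater" are assumptions chosen for illustration; the text does not define a threshold. The helper `assess_local_host` uses `os.getloadavg()`, which is available on Unix-like systems only.

```python
import os


def assess_load(load1, load5, load15, cpu_count, rising_factor=1.5):
    """Classify load averages as 'overloaded', 'rising', or 'stable'.

    rising_factor is a hypothetical threshold for "much greater than".
    load5 is accepted for completeness but not used by this simple rule.
    """
    if load1 > cpu_count:
        return "overloaded"  # more active processes than CPUs
    if load1 > rising_factor * load15:
        return "rising"      # recent load well above the 15-minute trend
    return "stable"


def assess_local_host():
    """Apply the same rule to the machine running this script (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    return assess_load(load1, load5, load15, os.cpu_count() or 1)
```

    For example, on a 4-CPU node, load averages of 3.0, 1.5, and 1.0 would be reported as rising, while 5.2, 3.0, and 2.0 would be reported as overloaded.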

    Disk usage

    Some KubeSphere workloads rely on persistent volumes, and some components and services also require a persistent volume. This backend storage relies on disks, such as block storage or network shared storage. Real-time monitoring of disk usage is therefore an important part of keeping data highly reliable.

    Inode usage

    Each file must have an inode, which stores the file’s meta-information, such as the file’s creator and creation date. Inodes also consume hard disk space, and many small cache files can easily exhaust inode resources. Inodes can be used up even when the hard disk is not full. In this case, new files cannot be created on the hard disk.

    In KubeSphere, the monitoring of inode usage can help you detect such situations in advance, as you can have a clear view of cluster inode usage. The mechanism prompts users to clean up temporary files in time, preventing the cluster from being unable to work due to inode exhaustion.
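    For a single machine, inode usage can be checked independently of disk space with a short script. The sketch below uses `os.statvfs`, which is available on Unix-like systems; the function names are illustrative, and some filesystems report zero total inodes, in which case no percentage is returned.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Return the share of inodes in use (%), or None if the total is unknown."""
    if total_inodes <= 0:
        return None  # some filesystems do not report a fixed inode count
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path="/"):
    """Inode usage of the filesystem that contains path (Unix only)."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)
```

    A filesystem with 1000 inodes of which 250 are free is at 75% inode usage, regardless of how many bytes are still free.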

    Disk throughput

    Monitoring disk throughput and IOPS is an indispensable part of disk monitoring. It helps cluster administrators adjust data layout and carry out other management activities to optimize the overall performance of the cluster. Disk throughput refers to the speed of the data stream transmitted by the disk (shown in MB/s), counting both reads and writes. This indicator is an important reference when large blocks of data are being transmitted.

    IOPS

    IOPS (Input/Output Operations Per Second) measures the number of read and write operations per second. Specifically, the IOPS of a disk is the sum of its read and write operations per second. This indicator is an important reference when small blocks of discontinuous data are being transmitted.
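    IOPS and throughput are linked through the size of each operation: throughput is roughly IOPS multiplied by the average block size. The worked example below uses made-up numbers and a hypothetical helper name to show why a workload with small blocks can have high IOPS but modest throughput, and the reverse for large blocks.

```python
def throughput_mib_per_s(iops, block_size_kib):
    """Approximate throughput in MiB/s from IOPS and average block size in KiB."""
    return iops * block_size_kib / 1024.0


# Many small operations: high IOPS, modest throughput.
small_blocks = throughput_mib_per_s(iops=5000, block_size_kib=4)

# Few large operations: low IOPS, high throughput.
large_blocks = throughput_mib_per_s(iops=150, block_size_kib=1024)
```

    This is why the two indicators are read together: neither one alone describes how busy a disk is.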

    Network bandwidth

    Network bandwidth is the amount of data that the network card can receive or send per second, shown in Mbps (megabits per second).
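    Because network counters are usually kept in bytes while bandwidth is shown in megabits per second, a unit conversion is involved. The sketch below assumes two readings of a cumulative byte counter taken a known number of seconds apart; the function name is illustrative.

```python
def bandwidth_mbps(bytes_before, bytes_after, interval_seconds):
    """Average bandwidth in Mbps between two cumulative byte counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    bits = (bytes_after - bytes_before) * 8  # 1 byte = 8 bits
    return bits / interval_seconds / 1_000_000  # 1 Mbps = 1,000,000 bits/s


# 125,000,000 bytes transferred in 10 seconds.
rate = bandwidth_mbps(0, 125_000_000, 10)
```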

    Pod status

    Pod status displays the total number of pods in different states, including Running, Completed, and Warning. A pod tagged Completed usually belongs to a Job or a CronJob. Pods marked Warning are in an abnormal state, so their number requires special attention.

    etcd Monitoring

    etcd monitoring helps you to make better use of etcd, especially to locate performance problems. The etcd service provides metrics interfaces natively, and the KubeSphere monitoring system features a highly graphic and responsive dashboard to display its native data.
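    etcd exposes its metrics in the Prometheus text format, which is simple enough to read with a few lines of code. The sketch below parses label-free samples from such text. The sample payload is illustrative rather than captured from a real cluster, the parser assumes no trailing timestamps, and it skips labelled series to stay short.

```python
def parse_metrics(text):
    """Parse label-free samples from Prometheus text exposition format."""
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if not name or "{" in name:
            continue  # skip malformed lines and labelled series
        samples[name] = float(value)
    return samples


# Illustrative payload using two etcd metric names.
SAMPLE_METRICS = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
"""
metrics = parse_metrics(SAMPLE_METRICS)
```

    A value of 0 for `etcd_server_has_leader` would mean the member currently sees no leader, which is the kind of condition a dashboard surfaces.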

    API Server Monitoring

    API Server is the hub for the interaction of all components in a Kubernetes cluster. The following table lists the main indicators monitored for the API Server.

    Scheduler Monitoring

    The scheduler watches the Kubernetes API for newly created pods and determines which nodes these pods run on. It makes this decision based on available data, including resource availability and the resource requirements of the pod. Monitoring scheduling delays lets you see any delays the scheduler is facing.
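    The idea of matching a pod's resource requirements against what nodes have available can be illustrated with a toy function. This is a deliberately simplified sketch, not the actual Kubernetes scheduler algorithm: the data layout, the function name, and the "most free CPU wins" rule are all assumptions made for the example.

```python
def pick_node(pod_request, nodes):
    """Pick a node that can fit the pod, or None if no node can.

    pod_request: {"cpu": cores, "memory": MiB}
    nodes: {node_name: {"cpu": free cores, "memory": free MiB}}
    """
    # Step 1: keep only nodes with enough free CPU and memory.
    feasible = {
        name: free
        for name, free in nodes.items()
        if free["cpu"] >= pod_request["cpu"]
        and free["memory"] >= pod_request["memory"]
    }
    if not feasible:
        return None  # the pod would stay pending
    # Step 2: prefer the node with the most free CPU, then memory.
    return max(feasible, key=lambda n: (feasible[n]["cpu"], feasible[n]["memory"]))


# Hypothetical cluster state.
free_resources = {
    "node-a": {"cpu": 0.5, "memory": 2048},
    "node-b": {"cpu": 2.0, "memory": 4096},
    "node-c": {"cpu": 4.0, "memory": 512},
}
chosen = pick_node({"cpu": 1.0, "memory": 1024}, free_resources)
```

    When no node passes the first step, the pod cannot be placed, which is one way scheduling delays show up in monitoring.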

    You can sort nodes in ascending or descending order by indicators such as CPU usage, average CPU load, memory usage, disk usage, inode usage, and pod usage. This enables administrators to quickly find potential problems or identify nodes with insufficient resources.
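    The same kind of ranking can be reproduced on exported data. The sketch below sorts a list of node records by a chosen indicator; the field names and values are hypothetical and do not correspond to a KubeSphere API.

```python
def rank_nodes(nodes, indicator, descending=True):
    """Return node records sorted by the given indicator."""
    return sorted(nodes, key=lambda node: node[indicator], reverse=descending)


# Hypothetical node records with usage expressed in percent.
node_records = [
    {"name": "node-a", "cpu_usage": 35.0, "inode_usage": 12.0},
    {"name": "node-b", "cpu_usage": 82.5, "inode_usage": 8.0},
    {"name": "node-c", "cpu_usage": 10.0, "inode_usage": 91.0},
]
busiest_first = [n["name"] for n in rank_nodes(node_records, "cpu_usage")]
fewest_inodes_used_first = [
    n["name"] for n in rank_nodes(node_records, "inode_usage", descending=False)
]
```

    Sorting in descending order by an indicator puts the nodes closest to exhaustion at the top of the list.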