Monitoring & Metrics

    Cilium and Hubble metrics can be enabled independently of each other.

    Cilium metrics provide insights into the state of Cilium itself, namely of the cilium-agent and cilium-operator processes. To run Cilium with Prometheus metrics enabled, deploy it with the global.prometheus.enabled=true Helm value set.

    Cilium metrics are exported under the cilium_ Prometheus namespace. When running and collected in Kubernetes, they will be tagged with a pod name and namespace.

    You can enable metrics for cilium-agent with the Helm value global.prometheus.enabled=true. To enable metrics for cilium-operator, use global.operatorPrometheus.enabled=true.

    The ports can be configured via the global.prometheus.port and global.operatorPrometheus.port Helm values, respectively.
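
    For example, a minimal sketch that enables metrics for both components on their default ports (9090 for the agent, 6942 for the operator):

    helm install cilium cilium/cilium --version 1.8.10 \
      --namespace kube-system \
      --set global.prometheus.enabled=true \
      --set global.operatorPrometheus.enabled=true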

    When metrics are enabled, all Cilium components will have the following annotations. They can be used to signal Prometheus whether to scrape metrics:

    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"

    Prometheus will pick up the Cilium metrics automatically if a matching entry is set in the scrape_configs section.
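
    A minimal sketch of such an entry, keyed off the annotations above (the label names are standard Prometheus Kubernetes service-discovery meta labels; adapt the job name and any further relabeling to your deployment):

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods that carry the prometheus.io/scrape: "true" annotation.
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true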

    While Cilium metrics allow you to monitor the state of Cilium itself, Hubble metrics allow you to monitor the network behavior of your Cilium-managed Kubernetes pods with respect to connectivity and security.

    To deploy Cilium with Hubble metrics enabled, you need to enable Hubble with global.hubble.enabled=true and provide a set of Hubble metrics you want to enable via global.hubble.metrics.enabled.

    Some of the metrics can also be configured with additional options. See the Hubble exported metrics section for the full list of available metrics and their options.

    helm install cilium cilium/cilium --version 1.8.10 \
      --namespace kube-system \
      --set global.hubble.enabled=true \
      --set global.hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

    Note

    L7 metrics, such as HTTP, are only emitted for pods that enable Layer 7 Protocol Visibility.
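
    One way to enable visibility is a pod annotation; the io.cilium.proxy-visibility annotation and the value syntax below are assumptions used to illustrate the idea, so check the Layer 7 Protocol Visibility documentation for the authoritative format:

    # Hypothetical example: route egress DNS and HTTP traffic of pod my-pod
    # through the proxy so that L7 metrics are emitted for it.
    kubectl annotate pod my-pod io.cilium.proxy-visibility="<Egress/53/UDP/DNS>,<Egress/80/TCP/HTTP>"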

    When deployed with a non-empty global.hubble.metrics.enabled Helm value, the Cilium chart will create a Kubernetes headless service named hubble-metrics with the prometheus.io/scrape: 'true' annotation set:
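
    A sketch of what the generated Service may look like (the selector, namespace, and port are assumptions; 9091 matches the default Hubble metrics port mentioned in the Hubble configuration section below):

    apiVersion: v1
    kind: Service
    metadata:
      name: hubble-metrics
      namespace: kube-system          # assumption: Cilium's install namespace
      annotations:
        prometheus.io/scrape: 'true'
    spec:
      clusterIP: None                 # headless service
      ports:
        - name: hubble-metrics
          port: 9091
          protocol: TCP
          targetPort: 9091
      selector:
        k8s-app: cilium               # assumption: matches the cilium-agent pods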

    Set the following options in the scrape_configs section of Prometheus to have it scrape all Hubble metrics from the endpoints automatically:

    scrape_configs:
      - job_name: 'kubernetes-endpoints'
        scrape_interval: 30s
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)(?::\d+);(\d+)
            replacement: $1:$2

    If you don’t have an existing Prometheus and Grafana stack running, you can deploy a stack as follows.
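
    A sketch (the manifest path is an assumption based on the example add-ons shipped in the Cilium repository):

    kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/addons/prometheus/monitoring-example.yaml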

    It will run Prometheus and Grafana in the cilium-monitoring namespace. If you have enabled either Cilium or Hubble metrics, they will automatically be scraped by Prometheus. You can then expose Grafana to access it via your browser.
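
    One way to expose Grafana is a port-forward (a sketch; the service name grafana is an assumption):

    kubectl -n cilium-monitoring port-forward service/grafana 3000:3000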

      Open your browser and access http://localhost:3000/

      Configuration

      To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr option. This option takes an IP:Port pair, but passing an empty IP (e.g. :9090) will bind the server to all available interfaces (there is usually only one in a container).
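
      For example (a sketch; every other agent flag is omitted):

      cilium-agent --prometheus-serve-addr=":9090"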

      Exported Metrics

      Endpoint
      Services
      Name | Labels | Description
      services_events_total | | Number of services events labeled by action type
      Datapath
      Name | Labels | Description
      datapath_errors_total | area, name, family | Total number of errors that occurred in datapath management
      datapath_conntrack_gc_runs_total | status | Number of times that the conntrack garbage collector process was run
      datapath_conntrack_gc_key_fallbacks_total | family | Number of times a key fallback was needed during a conntrack garbage collector run
      datapath_conntrack_gc_entries | family | The number of alive and deleted conntrack entries at the end of a garbage collector run
      datapath_conntrack_gc_duration_seconds | status | Duration in seconds of the garbage collector process
      BPF
      Name | Labels | Description
      bpf_syscall_duration_seconds | operation, outcome | Duration of BPF system call performed
      bpf_map_ops_total | mapName, operation, outcome | Number of BPF map operations performed
      bpf_maps_virtual_memory_max_bytes | | Max memory used by BPF maps installed in the system
      bpf_progs_virtual_memory_max_bytes | | Max memory used by BPF programs installed in the system

      Both bpf_maps_virtual_memory_max_bytes and bpf_progs_virtual_memory_max_bytes currently report the system-wide BPF memory usage, including BPF that is not directly managed by Cilium. This might change in the future to report only the BPF memory usage directly managed by Cilium.

      Drops/Forwards (L3/L4)
      Name | Labels | Description
      drop_count_total | reason, direction | Total dropped packets
      drop_bytes_total | reason, direction | Total dropped bytes
      forward_count_total | direction | Total forwarded packets
      forward_bytes_total | direction | Total forwarded bytes
      Policy
      Name | Labels | Description
      policy_count | | Number of policies currently loaded
      policy_regeneration_total | | Total number of policies regenerated successfully
      policy_regeneration_time_stats_seconds | scope | Policy regeneration time stats labeled by the scope
      policy_max_revision | | Highest policy revision number in the agent
      policy_import_errors | | Number of times a policy import has failed
      policy_endpoint_enforcement_status | | Number of endpoints labeled by policy enforcement status
      Policy L7 (HTTP/Kafka)
      Name | Labels | Description
      proxy_redirects | protocol | Number of redirects installed for endpoints
      proxy_upstream_reply_seconds | | Seconds waited for upstream server to reply to a request
      policy_l7_total | type | Number of total L7 requests/responses
      Identity
      Name | Labels | Description
      identity_count | | Number of identities currently allocated
      Events external to Cilium
      Name | Labels | Description
      event_ts | source | Last timestamp when we received an event
      Controllers
      SubProcess
      Name | Labels | Description
      subprocess_start_total | subsystem | Number of times that Cilium has started a subprocess
      Kubernetes
      Name | Labels | Description
      kubernetes_events_received_total | scope, action, validity, equal | Number of Kubernetes events received
      kubernetes_events_total | scope, action, outcome | Number of Kubernetes events processed
      k8s_cnp_status_completion_seconds | attempts, outcome | Time in seconds it took to complete a CNP status update
      IPAM
      Name | Labels | Description
      ipam_events_total | | Number of IPAM events received labeled by action and datapath family type
      KVstore
      Name | Labels | Description
      kvstore_operations_duration_seconds | action, kind, outcome, scope | Duration of kvstore operation
      kvstore_events_queue_seconds | action, scope | Duration in seconds that a received event was blocked before it could be queued
      kvstore_quorum_errors_total | error | Number of quorum errors
      Agent
      Name | Labels | Description
      agent_bootstrap_seconds | scope, outcome | Duration of various bootstrap phases
      api_process_time_seconds | | Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code
      FQDN
      Name | Labels | Description
      fqdn_gc_deletions_total | | Number of FQDNs that have been cleaned by the FQDN garbage collector job
      API Rate Limiting
      Name | Labels | Description
      cilium_api_limiter_adjustment_factor | api_call | Most recent adjustment factor for automatic adjustment
      cilium_api_limiter_processed_requests_total | api_call, outcome | Total number of API requests processed
      cilium_api_limiter_processing_duration_seconds | api_call, value | Mean and estimated processing duration in seconds
      cilium_api_limiter_rate_limit | api_call, value | Current rate limiting configuration (limit and burst)
      cilium_api_limiter_requests_in_flight | api_call, value | Current and maximum allowed number of requests in flight
      cilium_api_limiter_wait_duration_seconds | api_call, value | Mean, min, and max wait duration
      cilium_api_limiter_wait_history_duration_seconds | api_call | Histogram of wait duration per API call processed

      Configuration

      cilium-operator can be configured to serve metrics by running with the option --enable-metrics. By default, the operator will expose metrics on port 6942; the port can be changed with the option --metrics-address.
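
      For example (a sketch; the address format is assumed to follow the same IP:Port convention as the agent):

      cilium-operator --enable-metrics --metrics-address=":6942"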

      Exported Metrics

      IPAM
      Name | Labels | Description
      ipam_ips | type | Number of IPs allocated
      ipam_allocation_ops | subnetId | Number of IP allocation operations
      ipam_interface_creation_ops | subnetId, status | Number of interface creation operations
      ipam_available | | Number of interfaces with addresses available
      ipam_nodes_at_capacity | | Number of nodes unable to allocate more addresses
      ipam_resync_total | | Number of synchronization operations with external IPAM API
      ipam_api_duration_seconds | operation, responseCode | Duration of interactions with external IPAM API
      ipam_api_rate_limit_duration_seconds | operation | Duration of rate limiting while accessing external IPAM API

      Configuration

      Hubble metrics are served by a Hubble instance running inside cilium-agent. The command-line options to configure them are --enable-hubble, --hubble-metrics-server, and --hubble-metrics. --hubble-metrics-server takes an IP:Port pair, but passing an empty IP (e.g. :9091) will bind the server to all available interfaces. --hubble-metrics takes a comma-separated list of metrics.

      Some metrics can take additional semicolon-separated options per metric, e.g. --hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=pod-short" will enable the dns metric with the query and ignoreAAAA options, and the http metric with the destinationContext=pod-short option.
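
      Putting these flags together, a sketch of an agent invocation with Hubble metrics enabled (all other required agent flags omitted):

      cilium-agent --enable-hubble \
        --hubble-metrics-server=":9091" \
        --hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=pod-short"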

      Context Options

      Most Hubble metrics can be configured to add the source and/or destination context as a label. The options are called sourceContext and destinationContext. The possible values include identity, namespace, pod, pod-short, dns, and ip.
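
      For example, a sketch that labels flow metrics with the source namespace and the destination pod:

      --hubble-metrics="flow:sourceContext=namespace;destinationContext=pod"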

      Exported Metrics

      Hubble metrics are exported under the hubble_ Prometheus namespace.

      dns
      Name | Labels | Description
      dns_queries_total | rcode, qtypes, ips_returned | Number of DNS queries observed
      dns_responses_total | rcode, qtypes, ips_returned | Number of DNS responses observed
      dns_response_types_total | type, qtypes | Number of DNS response types
      Options
      Option Key | Option Value | Description
      query | N/A | Include the query as label “query”
      ignoreAAAA | N/A | Ignore any AAAA requests/responses

      This metric supports Context Options.

      drop
      Name | Labels | Description
      drop_total | reason, protocol | Number of drops
      Options

      This metric supports Context Options.

      flow
      Name | Labels | Description
      flows_processed_total | type, subtype, verdict | Total number of flows processed
      Options

      This metric supports Context Options.

      http
      Name | Labels | Description
      http_requests_total | method, protocol | Count of HTTP requests
      http_responses_total | method, status | Count of HTTP responses
      http_request_duration_seconds | method | Quantiles of HTTP request duration in seconds
      Options

      This metric supports Context Options.

      icmp
      Name | Labels | Description
      icmp_total | family, type | Number of ICMP messages
      Options

      This metric supports Context Options.

      port-distribution
      Name | Labels | Description
      port_distribution_total | protocol, port | Number of packets distributed by destination port
      Options

      This metric supports Context Options.

      tcp
      Name | Labels | Description
      tcp_flags_total | flag, family | TCP flag occurrences
      Options

      This metric supports Context Options.