Involve Prometheus into Linkis

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

In a microservice context, it provides service discovery, enabling it to find targets dynamically from a service registry such as Eureka or Consul, and to pull metrics from API endpoints over HTTP.

This diagram illustrates the architecture of Prometheus and some of its ecosystem components:

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data.


In the context of Linkis, we will use the Eureka service discovery (SD) support in Prometheus to retrieve scrape targets via the Eureka REST API. Prometheus will periodically check the REST endpoint and create a target for every app instance.

Modify the configuration item in linkis-env.sh of Linkis, as sketched below.
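A minimal sketch, assuming the PROMETHEUS_ENABLE switch recognized by the Linkis deployment scripts:

```bash
## linkis-env.sh ##
# Enable Prometheus-related configuration during installation
export PROMETHEUS_ENABLE=true
```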

After running install.sh, you should see the Prometheus-related configuration appended to the following files:

```yaml
## application-linkis.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:${prometheus.endpoint}}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```yaml
## application-eureka.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:/actuator/prometheus}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
wds.linkis.server.user.restful.uri.pass.auth=/api/rest_j/v1/actuator/prometheus,
...
```

Modify ${LINKIS_HOME}/conf/application-linkis.yml and add prometheus to the exposed endpoints.

```yaml
## application-linkis.yml ##
management:
  endpoints:
    web:
      exposure:
        # Add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify ${LINKIS_HOME}/conf/application-eureka.yml and add prometheus to the exposed endpoints.

```yaml
## application-eureka.yml ##
management:
  endpoints:
    web:
      exposure:
        # Add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify ${LINKIS_HOME}/conf/linkis.properties and remove the comment marker # before wds.linkis.prometheus.enable.

```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
...
```

After starting the services, the prometheus endpoint of each microservice in Linkis should be accessible, for example under the path /api/rest_j/v1/actuator/prometheus.

Note

The prometheus endpoints of the gateway and eureka services do not include the prefix api/rest_j/v1; their complete endpoint is /actuator/prometheus.
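For example, the endpoints can be checked with curl; the host names and ports below are placeholders for your own instances:

```bash
# Most Linkis microservices expose metrics under the restful prefix:
curl http://<linkis-service-host>:<service-port>/api/rest_j/v1/actuator/prometheus

# The gateway and eureka services expose them without the api/rest_j/v1 prefix:
curl http://<eureka-host>:20303/actuator/prometheus
```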

Usually the monitoring setup for a cloud-native application is deployed on Kubernetes with service discovery and high availability (e.g. using a Kubernetes operator like the Prometheus Operator). To quickly prototype dashboards and experiment with different metric types (e.g. histogram vs. gauge), you may need a similar setup locally. This section explains how to set up a Prometheus/Alertmanager and Grafana monitoring stack locally with Docker Compose.

First, let's define the general components of the stack as follows:

  • An Alertmanager container that exposes its UI on port 9093 and reads its configuration from alertmanager.yml

  • A Prometheus container that exposes its UI on port 9090, reads its configuration from prometheus.yml, and loads its list of alert rules from alertrule.yml

  • A Grafana container that exposes its UI on port 3000 and reads its dashboards and data sources from the provisioning directories

The following docker-compose.yml file summarizes the configuration of all those components:

```yaml
## docker-compose.yml ##
version: "3"
networks:
  default:
    external: true
    name: my-network
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alertrule.yml:/etc/prometheus/alertrule.yml
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=123456
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
```

Second, to define some alerts based on metrics in Prometheus, you can group them into an alertrule.yml file, so that you can validate that those alerts are properly triggered in your local setup before configuring them on the production instance. As an example, the following configuration covers the usual metrics used to monitor Linkis services:

  • a. Down instance
  • b. High Cpu for each JVM instance (>80%)
  • c. High Heap memory for each JVM instance (>80%)
  • d. High NonHeap memory for each JVM instance (>80%)
  • e. High Waiting thread count for each JVM instance (>100)
```yaml
## alertrule.yml ##
groups:
  - name: LinkisAlert
    rules:
    - alert: LinkisNodeDown
      expr: last_over_time(up{job="linkis", application=~"LINKIS.*", application!="LINKIS-CG-ENGINECONN"}[1m]) == 0
      for: 15s
      labels:
        severity: critical
        service: Linkis
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} down"
        description: "Linkis instance(s) is/are down in last 1m"
        value: "{{ $value }}"
    - alert: LinkisNodeCpuHigh
      expr: system_cpu_usage{job="linkis", application=~"LINKIS.*"} >= 0.8
      for: 1m
      labels:
        severity: warning
        service: Linkis
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} cpu overload"
        description: "CPU usage is over 80% for over 1min"
        value: "{{ $value }}"
    - alert: LinkisNodeHeapMemoryHigh
      expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) *100/sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) >= 50
      for: 1m
      labels:
        severity: warning
        service: Linkis
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} memory(heap) overload"
        description: "Memory usage(heap) is over 80% for over 1min"
        value: "{{ $value }}"
    - alert: LinkisNodeNonHeapMemoryHigh
      expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) *100/sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) >= 60
      for: 1m
      labels:
        severity: warning
        service: Linkis
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} memory(nonheap) overload"
        description: "Memory usage(nonheap) is over 80% for over 1min"
        value: "{{ $value }}"
    - alert: LinkisWaitingThreadHigh
      expr: jvm_threads_states_threads{job="linkis", application=~"LINKIS.*", state="waiting"} >= 100
      for: 1m
      labels:
        severity: warning
        service: Linkis
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} waiting threads is high"
        description: "waiting threads is over 100 for over 1min"
        value: "{{ $value }}"
```

Note: once a service instance is shut down, it is no longer one of the targets discovered by the Prometheus Eureka SD, so its up metric stops returning data after a short time. We therefore check whether up was 0 at any point during the last minute (last_over_time(up{...}[1m]) == 0) to determine whether the service is alive.

Third, and most importantly, define the Prometheus configuration in the prometheus.yml file. It defines:

  • the global settings, like the scrape interval and the rule evaluation interval
  • the connection information to reach Alertmanager and the rule files to be evaluated
  • the connection information for the application metrics endpoints

This is an example configuration file for Linkis:
```yaml
## prometheus.yml ##
# my global config
global:
  scrape_interval: 30s     # Scrape targets every 30 seconds.
  evaluation_interval: 30s # Evaluate rules every 30 seconds.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alertrule.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: linkis
    eureka_sd_configs:
      # the endpoint of your eureka instance
      - server: {{linkis-host}}:20303/eureka
    relabel_configs:
      - source_labels: [__meta_eureka_app_name]
        target_label: application
      - source_labels: [__meta_eureka_app_instance_metadata_prometheus_path]
        action: replace
        target_label: __metrics_path__
```

Fourth, the following configuration defines how alerts are sent to an external webhook.
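The alertmanager.yml mounted in the Docker Compose file is not reproduced here, so the sketch below is only illustrative: it routes every alert to a single webhook receiver, and the URL is a placeholder you must replace with your own alert-receiving endpoint.

```yaml
## alertmanager.yml (illustrative sketch) ##
global:
  resolve_timeout: 5m
route:
  receiver: 'webhook'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'webhook'
    webhook_configs:
      # Placeholder URL: point this at your own alert-receiving service.
      - url: 'http://your-webhook-host:8080/alert'
        send_resolved: true
```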

Finally, after defining all the configuration files as well as the docker-compose file, we can start the monitoring stack.
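Assuming the configuration files are laid out relative to docker-compose.yml as referenced by the volume mounts above, the stack can be brought up with Docker Compose:

```bash
# Run from the directory containing docker-compose.yml;
# the external network referenced in the compose file must already exist.
docker network create my-network   # only needed once
docker-compose up -d
```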

On the Prometheus targets page, you should see all the Linkis service instances.

Once Grafana is accessible, add Prometheus as a data source in Grafana and import the dashboard template with id 11378, which is commonly used for Spring Boot services (2.1+). You can then view a live dashboard of Linkis.
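As an alternative to adding the data source manually, it can be pre-provisioned through the datasources directory mounted in the compose file above. This is a minimal sketch, assuming a file placed under ./grafana/provisioning/datasources/ (the file name is arbitrary):

```yaml
## grafana/provisioning/datasources/prometheus.yml (illustrative sketch) ##
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # The prometheus container name resolves on the shared Docker network.
    url: http://prometheus:9090
    isDefault: true
```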