[Prometheus & Grafana] Chapter 7. Configuration File (prometheus.yml)
Note: This post is a summary based on the official Prometheus (v3.2.1) and Grafana documentation. For precise details, please refer to the official docs.
In Chapter 6. Installation, Prometheus was installed and the up query returned 1. At that point, Prometheus was scraping nothing but itself -- exactly as the default config file instructed. What to collect, how often, and through which path are all dictated by prometheus.yml. This chapter dissects that single YAML file.
7.1 Top-Level Structure
The top level of prometheus.yml is divided into a handful of purpose-built sections. Each one governs a distinct axis of Prometheus's behavior.
global:                       # Defaults applied to every job
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
rule_files:                   # Paths to Recording/Alerting Rule files
  - '/etc/prometheus/rules/*.yml'
scrape_configs:               # What to scrape and how (the core)
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
alerting:                     # Alertmanager to send alerts to
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
remote_write:                 # Write to remote storage
  - url: 'http://remote-storage:9201/write'
remote_read:                  # Read from remote storage
  - url: 'http://remote-storage:9201/read'
In practice, the two sections touched daily are global and scrape_configs. The rest get added when rules, alerting, or long-term remote storage become necessary. The sections below walk through each in order.
7.2 global: Default Values
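The global section defines the defaults that apply when an individual job does not specify its own. Re-declaring scrape_interval or scrape_timeout inside a job overrides the global value for that job only. evaluation_interval and external_labels are server-wide and have no job-level counterpart.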
global:
  scrape_interval: 15s       # Scrape frequency (default 1m)
  scrape_timeout: 10s        # Scrape timeout (default 10s)
  evaluation_interval: 15s   # Rule evaluation frequency (default 1m)
  external_labels:
    monitor: 'production-monitor'
    region: 'ap-northeast-2'
scrape_interval vs evaluation_interval
The names look alike, but the two serve different roles: one sets how often data is pulled, the other how often rules are computed.
| Setting | Role | Impact |
|---|---|---|
| scrape_interval | How often metrics are pulled from targets | Shorter = higher resolution, higher load |
| evaluation_interval | How often Recording/Alerting Rules are evaluated | Affects alert detection latency |
Recommended values depend on the environment. General infrastructure monitoring sits well at 15s to 30s, while detailed application monitoring fits 5s to 15s. Keep evaluation_interval equal to or a multiple of scrape_interval: evaluating more often than scraping just re-runs rules against unchanged data.
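As a rough sketch of these recommendations, the fragment below keeps infrastructure on a 30s global cadence, overrides one application job down to 5s, and evaluates rules at the same 30s as the global scrape interval. The job names and target addresses are hypothetical.

```yaml
global:
  scrape_interval: 30s        # Infrastructure default
  scrape_timeout: 10s         # Kept below scrape_interval
  evaluation_interval: 30s    # Equal to the global scrape_interval

scrape_configs:
  - job_name: 'node'          # Hypothetical infrastructure job: inherits 30s
    static_configs:
      - targets: ['node1:9100']

  - job_name: 'checkout-api'  # Hypothetical application job
    scrape_interval: 5s       # Job-level override of the global value
    scrape_timeout: 4s        # Lowered as well so it stays within the 5s interval
    static_configs:
      - targets: ['checkout:8080']
```

Only the overridden job pays the cost of the higher resolution; every other job keeps the global values.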
external_labels
external_labels act as a provenance tag attached to time series and alerts leaving the server. They are automatically attached to data sent via federation, remote write, or to Alertmanager, letting you tell which Prometheus a series came from when running several servers.
external_labels:
  cluster: 'production'
  region: 'ap-northeast-2'
  environment: 'prod'
One caveat matters here. external_labels are not added to the local TSDB. They apply only on outbound transmission, so they will not appear in local queries.
7.3 scrape_configs: Scrape Configuration
scrape_configs is the heart of prometheus.yml. Which targets to scrape, over which path, how often, and with what authentication -- it is all defined here.
Basic Structure
scrape_configs:
  - job_name: 'my-service'       # Required. Attached as the job label
    scrape_interval: 10s         # Overrides global
    metrics_path: '/metrics'     # Default /metrics
    scheme: 'https'              # Default http
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
        labels:
          env: 'production'
          team: 'backend'
job_name must be unique across the entire file and is automatically added as a job label to every collected metric.
metrics_path and scheme
Most exporters serve /metrics, but plenty do not. Spring Boot Actuator uses /actuator/prometheus, and federation uses /federate.
metrics_path: '/actuator/prometheus' # Spring Boot Actuator
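Setting scheme to https makes Prometheus scrape over TLS. A tls_config block is needed when the target's certificate is signed by a private CA (ca_file) or when the target requires a client certificate (cert_file and key_file).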
scheme: 'https'
tls_config:
  ca_file: '/etc/prometheus/ca.crt'
  cert_file: '/etc/prometheus/client.crt'
  key_file: '/etc/prometheus/client.key'
Authentication
When a target demands authentication, configure Basic Auth or a Bearer Token. Tokens can be read from a file, which keeps secrets out of the config file itself.
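scrape_configs:
  # Option 1: Basic Auth
  - job_name: 'authenticated-service'
    basic_auth:
      username: 'prometheus'
      password: 'secret'        # Prefer password_file to keep the secret out of this file
  # Option 2: a Bearer Token read from a file
  # (basic_auth and authorization cannot be combined in the same job)
  - job_name: 'authenticated-service-token'
    authorization:
      type: 'Bearer'
      credentials_file: '/etc/prometheus/token'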
7.4 honor_labels and honor_timestamps
These two options sit untouched until the moment you attach federation or a Pushgateway -- then they become unavoidable. They settle the question of who owns the labels and timestamps of a scraped metric.
honor_labels
When a scraped metric already carries job or instance labels, they collide with the same-named labels Prometheus wants to attach. honor_labels decides which side wins.
| honor_labels | Behavior on conflict |
|---|---|
| false (default) | Original labels are pushed to exported_<name>; Prometheus labels are used |
| true | Original labels are kept as-is; Prometheus labels are ignored |
The decision rule is simple. Use true when the source's labels must be preserved, and false when Prometheus needs to identify the target precisely. Federation and Pushgateway use true; ordinary exporter scraping keeps the default false.
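The two honor_labels: true cases can be sketched as follows. The federation job follows the shape shown in the official federation documentation; the hostnames and the match[] selectors are placeholders to adapt, not values from this series.

```yaml
scrape_configs:
  # Federation: pull selected series from another Prometheus server
  - job_name: 'federate'
    honor_labels: true            # Keep the job/instance labels of the source server
    metrics_path: '/federate'
    params:
      'match[]':                  # Selectors for the series to pull (placeholders)
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['source-prometheus:9090']   # Hypothetical hostname

  # Pushgateway: keep the labels that the pushing batch jobs set themselves
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']         # Hypothetical hostname
```

Without honor_labels: true, every federated or pushed series would carry the job and instance of the intermediate endpoint, and the originals would be renamed to exported_job and exported_instance.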
honor_timestamps
This decides whether to trust a timestamp embedded in the scraped metric. The default true uses the metric's own timestamp; false overwrites it with the scrape time. Keep the default unless you have a specific reason not to.
7.5 relabel_configs and metric_relabel_configs
Relabeling is the most powerful -- and most confusing -- feature in Prometheus. It adds, rewrites, and drops labels dynamically, and can even decide whether a target gets scraped at all. There are two similarly named variants, and the decisive difference is when they apply.
| | relabel_configs | metric_relabel_configs |
|---|---|---|
| Timing | Before scraping | After scraping |
| Subject | Target labels (service discovery output) | Labels of collected metrics |
| Purpose | Filter targets, transform labels | Drop unwanted metrics, clean up labels |
Put in one sentence: relabel_configs decides "whether and where to scrape this target," while metric_relabel_configs decides "which of the already-scraped metrics to keep."
relabel_configs: Before Scraping
It massages the __meta_* labels produced by service discovery to filter targets or promote them into real labels. This is especially common in Kubernetes.
relabel_configs:
  # Scrape only pods with the prometheus.io/scrape=true annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Promote the pod name into a pod label
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
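To illustrate the "where to scrape" half, here is a sketch of a complete pod-discovery job, modeled on the Kubernetes example configuration in the Prometheus repository. It assumes the common (but not built-in) prometheus.io/path and prometheus.io/port annotation convention and IPv4 or hostname addresses; the job name is hypothetical.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Whether to scrape: only pods annotated prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Where to scrape (path): take it from prometheus.io/path when present
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Where to scrape (port): rewrite host:port using prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Copy every Kubernetes pod label onto the target as a regular label
      - action: labelmap
        regex: '__meta_kubernetes_pod_label_(.+)'
```

The special labels __address__ and __metrics_path__ are what Prometheus actually uses to build the scrape URL, which is why rewriting them changes the scrape destination.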
metric_relabel_configs: After Scraping
It selects which of the already-collected metrics to store. Use it to discard cardinality-exploding metrics or drop noisy go_* internals to save storage.
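metric_relabel_configs:
  # Allowlist: keep only the core node_ metrics (everything else is discarded)
  - source_labels: [__name__]
    regex: 'node_cpu.*|node_memory.*|node_disk.*|node_network.*'
    action: keep
  # Denylist: drop go_ runtime metrics
  # (redundant after the keep rule above; use it on its own when there is no allowlist)
  - source_labels: [__name__]
    regex: 'go_.*'
    action: drop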
Common action Types
The action key determines what a relabeling rule does. The seven most-used actions are below.
| Action | Description |
|---|---|
| keep | Keep only targets/metrics matching the regex |
| drop | Remove targets/metrics matching the regex |
| replace | Replace a label value (default action) |
| labelmap | Copy the values of labels whose names match the regex to new label names |
| labeldrop | Delete labels matching the regex |
| labelkeep | Keep only labels matching the regex |
| hashmod | Set target_label to the hash of source_labels modulo modulus (used for sharding) |
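The less obvious actions are easier to read in context. The sketch below combines hashmod with keep to shard targets across three Prometheus servers, and uses labeldrop to strip a label after scraping. The job name, targets, shard count, and the request_id label are all hypothetical.

```yaml
scrape_configs:
  - job_name: 'node-shard-0'          # Hypothetical: this server owns shard 0 of 3
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']
    relabel_configs:
      # Hash each target address into one of 3 buckets (0, 1, 2)
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      # Scrape only the targets that landed in bucket 0
      - source_labels: [__tmp_shard]
        regex: '0'
        action: keep
    metric_relabel_configs:
      # Remove a hypothetical high-cardinality label from every scraped series
      - regex: 'request_id'
        action: labeldrop
```

The other two servers would use the same rules with regex '1' and '2'. With labeldrop, make sure the remaining labels still identify each series uniquely, otherwise distinct series collapse into one.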
7.6 rule_files: Loading Rule Files
rule_files is a list of paths to files holding Recording Rules and Alerting Rules. It supports glob patterns, so an entire directory can be loaded at once.
rule_files:
  - '/etc/prometheus/rules/recording_rules.yml'
  - '/etc/prometheus/rules/*.yml'   # glob pattern
How to write the rule files themselves is covered in Chapter 13-14. For now, the point is that pointing at a path is enough -- Prometheus loads them automatically.
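For orientation only, a minimal rule file that such a path could match might look like the sketch below. The file name, group name, and rule contents are illustrative, not taken from this series.

```yaml
# /etc/prometheus/rules/example.yml (hypothetical file matched by the glob pattern)
groups:
  - name: example
    rules:
      # Recording rule: precompute the fraction of reachable targets per job
      - record: job:up:avg
        expr: avg by (job) (up)
      # Alerting rule: fire when a target has been unreachable for 5 minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```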
7.7 alerting: Alertmanager Integration
The alerting section defines where firing alerts go. Prometheus never delivers notifications (email, Slack, and so on) itself; it hands firing alerts to Alertmanager, which takes care of routing and delivery.
alerting:
  alert_relabel_configs:
    - source_labels: [severity]
      regex: 'info'
      action: drop              # Do not forward severity=info alerts
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
      timeout: 10s
alert_relabel_configs applies relabeling to alerts just before dispatch -- commonly used to filter out severity=info alerts, as above.
Listing multiple Alertmanager instances provides high availability. Prometheus sends each alert to all instances, and the Alertmanager cluster deduplicates them on its own, provided the instances are configured as a cluster with each other. As a result, no alert is dropped even if one instance dies.
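A related use of alert_relabel_configs shows up when Prometheus itself runs as an HA pair. Each server usually carries a distinguishing external label, and since external_labels are attached to outgoing alerts, the two servers would otherwise send alerts that differ in that one label. The sketch below assumes a hypothetical replica label and removes it just before dispatch so that both servers emit identical alerts.

```yaml
global:
  external_labels:
    cluster: 'production'
    replica: 'prometheus-1'     # Hypothetical; the second server would use 'prometheus-2'

alerting:
  alert_relabel_configs:
    # Strip the per-server label so both replicas send identical alerts
    - regex: 'replica'
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
```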
7.8 remote_write / remote_read
A local TSDB alone struggles with long-term retention and a unified view across servers. remote_write and remote_read are the conduits linking Prometheus to remote storage like Thanos or Mimir.
remote_write
It pushes locally stored data to remote storage. Because the volume is high, throughput is tuned with queue settings, and write_relabel_configs typically filters out expensive metrics in advance.
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'
    queue_config:
      max_samples_per_send: 5000
      max_shards: 200
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'expensive_metric.*'
        action: drop
remote_read
It reads data from remote storage and transparently merges it with local data when serving PromQL queries. With read_recent: false (the default), time ranges that local storage already covers completely are answered locally and only older ranges are fetched from the remote endpoint, which reduces load.
remote_read:
  - url: 'http://thanos-query:9090/api/v1/read'
    read_recent: false
7.9 Reloading the Configuration
Changing the config does not require restarting Prometheus. A restart creates a collection gap and severs in-flight queries. There are two zero-downtime reload methods, and validation should always come first.
| Method | Command | Prerequisite |
|---|---|---|
| SIGHUP signal | kill -HUP $(pidof prometheus) or systemctl reload prometheus | None |
| HTTP API | curl -X POST http://localhost:9090/-/reload | --web.enable-lifecycle enabled |
| Pre-validation | promtool check config prometheus.yml | (always recommended before reload) |
Validate Before Reloading
Before reloading, verify the syntax with promtool check config. It is the last line of defense before throwing a broken config at a running server.
./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: prometheus.yml is valid prometheus config file syntax
If the config file contains an error, the reload is rejected and the existing config stays in effect. A reload succeeds only when every rule file referenced in rule_files is valid too. So pushing a bad config will not crash a running Prometheus -- but you must check the logs to confirm whether the change actually took effect.
Summary
| Section | Role | Key Point |
|---|---|---|
| global | Global defaults | scrape_interval (collection) vs evaluation_interval (evaluation); external_labels apply only on outbound transmission |
| scrape_configs | Scrape definitions | job_name must be unique; metrics_path/scheme/auth settings |
| honor_labels | Label conflict handling | true for federation/Pushgateway, false for ordinary scraping |
| relabel_configs | Before scraping | Target filtering, __meta_* manipulation |
| metric_relabel_configs | After scraping | Cardinality control, dropping unwanted metrics |
| alerting | Alertmanager integration | Sent to all instances at once; Alertmanager handles deduplication |
| remote_write/read | Remote storage | Long-term retention and unified view, queue and filter settings |
| Reload | Zero-downtime apply | SIGHUP or HTTP API, with promtool check config first |
You can now fill out prometheus.yml by hand -- but hardcoding targets one by one in static_configs collapses the moment the fleet exceeds a few dozen servers. The next chapter, Chapter 8. Service Discovery, covers automatically discovering scrape targets in dynamic environments like Kubernetes, EC2, and Consul.