[Prometheus & Grafana] Chapter 5. Jobs and Instances

Note: This post is a summary based on the official Prometheus (v3.2.1) and Grafana documentation. For precise details, please refer to the official docs.


Every metric Prometheus collects carries two labels that identify its origin: job and instance. These are not arbitrary tags -- they reflect Prometheus's fundamental model for organizing scrape targets. Understanding this model is the final piece of Part 02's data model coverage.


5.1 Instance: The Scrape Endpoint

An Instance is a single endpoint that Prometheus can scrape. It typically corresponds to one process and is identified by a <host>:<port> pair.

localhost:9090   <- Prometheus itself
10.0.1.5:9100    <- Node Exporter
10.0.1.5:4000    <- Web application

Each of these is an Instance. One host can run multiple Instances on different ports, and each is tracked independently.
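Because each endpoint is tracked on its own, a single Instance can be addressed directly in a query. A small sketch using the auto-generated up metric (covered in 5.4) against one of the example endpoints above:

```
# Health of one specific endpoint, regardless of how many
# other processes run on the same host
up{instance="10.0.1.5:4000"}
```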


5.2 Job: A Logical Group of Same-Purpose Instances

A Job is a logical group of replicated Instances that serve the same purpose. Running multiple copies of the same process for scalability or availability is standard practice -- a Job bundles them under one name.

# prometheus.yml
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets:
          - '10.0.1.5:5670'
          - '10.0.1.5:5671'
          - '10.0.2.5:5670'
          - '10.0.2.5:5671'

The api-server Job above contains four Instances; a tree view makes the relationship clear.

Job: api-server
├── Instance: 10.0.1.5:5670
├── Instance: 10.0.1.5:5671
├── Instance: 10.0.2.5:5670
└── Instance: 10.0.2.5:5671

Job: node-exporter
├── Instance: 10.0.1.5:9100
└── Instance: 10.0.2.5:9100

A Job groups what is logically the same service. An Instance pinpoints exactly which process within that service produced a given metric.
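This hierarchy maps directly onto PromQL aggregation. A sketch, assuming a counter such as http_requests_total exposed by the api-server Instances above:

```
# Per-service request rate: collapse all Instances into one series per Job
sum by (job) (rate(http_requests_total[5m]))

# Drill down: one series per Instance, to spot an outlier replica
sum by (job, instance) (rate(http_requests_total[5m]))
```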


5.3 Auto-Generated Labels: job and instance

Prometheus automatically attaches two labels to every scraped metric.

Label      Value                                    Example
job        The job_name from the scrape config      api-server
instance   The <host>:<port> of the scrape target   10.0.1.5:5670

Every collected metric is therefore traceable to its exact origin.

http_requests_total{job="api-server", instance="10.0.1.5:5670", method="GET"} = 1234
http_requests_total{job="api-server", instance="10.0.1.5:5671", method="GET"} = 5678

honor_labels

A conflict arises when a scrape target already exposes its own job or instance labels. The honor_labels setting resolves this.

honor_labels      Behavior
false (default)   Renames the target's labels to exported_job and exported_instance; uses the Prometheus-assigned labels
true              Uses the target's labels as-is; Prometheus-assigned labels are discarded

Federation setups typically use honor_labels: true to preserve the original labels from upstream Prometheus servers.
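A minimal federation scrape config might look like the sketch below. The upstream address and the match[] selector are illustrative assumptions, not fixed values:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep job/instance labels set by the upstream server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="api-server"}'    # pull only this Job's series (example selector)
    static_configs:
      - targets:
          - 'upstream-prometheus:9090'  # hypothetical upstream address
```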


5.4 Auto-Generated Metrics

Beyond labels, Prometheus generates several metrics per scrape target automatically. These are essential for monitoring the health of the monitoring system itself.

The up Metric

up is the most important auto-generated metric: it indicates whether the most recent scrape of a target succeeded.

Value   Meaning
1       Scrape successful -- instance is up
0       Scrape failed -- instance is down or unreachable

# Find all downed instances
up == 0

# Healthy instance ratio for a specific Job
avg(up{job="api-server"})

An alert rule on up == 0 is often the first alert any Prometheus deployment configures. If up is 0, everything else about that target is unknown.
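Such a rule might be written as follows. The group name, severity value, and the 5-minute grace period are choices for illustration, not requirements:

```yaml
# alert_rules.yml (hypothetical file name)
groups:
  - name: instance-health
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m                   # tolerate brief scrape hiccups before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
```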

Other Auto-Generated Metrics

Metric                                  Description
scrape_duration_seconds                 Time taken to complete the scrape
scrape_samples_scraped                  Number of samples collected
scrape_samples_post_metric_relabeling   Samples remaining after metric relabeling
scrape_series_added                     New time series added in this scrape (v2.10+)

extra-scrape-metrics Feature Flag

Enabling --enable-feature=extra-scrape-metrics exposes additional scrape diagnostics.

Metric                   Description
scrape_timeout_seconds   Configured scrape timeout
scrape_sample_limit      Configured sample limit (0 = unlimited)
scrape_body_size_bytes   Uncompressed size of the last scrape response

Practical PromQL

These auto-generated metrics become powerful when combined in queries.

# Targets where scraping takes over 3 seconds (timeout risk)
scrape_duration_seconds > 3

# Targets where sample count doubled compared to 1 hour ago (cardinality explosion suspect)
scrape_samples_scraped / scrape_samples_scraped offset 1h > 2

# Instance health summary by Job
count by (job) (up == 1)
count by (job) (up == 0)

The scrape_duration_seconds > 3 query is particularly useful. If a target consistently approaches the scrape timeout, it will eventually start failing -- catching it early prevents gaps in your data.


Part 02 Recap

This chapter concludes Part 02. The table below summarizes every concept covered across Chapter 3 (Data Model), Chapter 4 (Metric Types), and this chapter.

Concept       Definition                                               Key Point
Time Series   Time-ordered values identified by metric name + labels   Fundamental data unit
Metric Name   Describes what is measured                               prefix + base unit + suffix convention
Labels        Key-value pairs for multi-dimensional distinction        Cardinality management is essential
Counter       Monotonically increasing cumulative value                Use with rate(), _total suffix
Gauge         Mutable snapshot value                                   Direct query, predict_linear()
Histogram     Bucket-based distribution                                Server-side aggregation, histogram_quantile()
Summary       Client-side quantiles                                    Not aggregatable, precise quantiles
Job           Logical group of same-purpose instances                  Auto job label
Instance      Single scrape endpoint                                   Auto instance label, up metric

Part 02 established the theoretical foundation -- what Prometheus stores and how it categorizes that data. Part 03 shifts to practice. The next chapter covers installation of Prometheus and Grafana.
