Sunday, March 15, 2026

Oracle GoldenGate Performance Metrics Service & Observability

Topic boundary and naming matter

Current GoldenGate Microservices material uses the term Performance Metrics Service. Older training and long-lived operator vocabulary often say Performance Metrics Server, and deployment internals still expose names such as PMSRVR. Treat those as naming-era differences, not different components.

The boundary is simple but operationally important: this service collects and stores GoldenGate runtime metrics and exposes them through drill-down views and service endpoints. It is central, but it is not the only observability surface in the product.

Mental model. Use the Performance Metrics Service to answer, "What are services and processes doing over time?" Use path pages, reports, logs, and heartbeat tables to answer, "What failed, where, and is the target actually current?"

What this article covers

The metrics plane, the supporting observability surfaces around it, and a disciplined workflow for diagnosing lag and monitoring ambiguity.

What it does not cover

End-to-end deployment build steps, Extract or Replicat creation details, or broad platform observability architecture beyond GoldenGate itself.

Architecture of the metrics plane inside a deployment

GoldenGate Microservices treats observability as a deployment-local capability. Deployment creation can include the Performance Metrics Service and a selected local datastore, while Extract, Replicat, and the core microservices publish runtime metrics into that local plane.

Older GoldenGate material describes the metrics store in terms of Berkeley DB or LMDB, and current deployment workflows still expose a data-store choice. The practical point is not the brand of the store but the locality of the service: the metrics layer belongs to the deployment, not to a shared enterprise monitoring cluster.

Oracle's newer Microservices monitoring material also calls out Unix Domain Sockets as the default local communication mechanism on Unix from 21c-era behavior onward. That matters because it reinforces two operational assumptions: the metrics plane is intentionally local, and a broken metrics surface should be investigated inside the deployment before blaming remote tools.

Design implication. Skipping the Performance Metrics Service during deployment design is not just skipping a convenience screen. It removes GoldenGate's built-in time-series and drill-down metrics surface for that deployment.

Which surface answers which question

Outages stretch when teams ask the right question in the wrong place. The fastest diagnosis usually comes from choosing the surface that owns the symptom first.

Question | Best first surface | Why it belongs there | Common mistake
Is the monitoring plane itself healthy? | Deployment health and metrics-service health views | They distinguish a monitoring failure from a replication failure. | Assuming blank charts always mean the data path is broken.
Is Extract or Replicat running, stopped, or abended? | Administration Service, Admin Client, or process-status REST | These give authoritative current process state. | Starting with historical graphs when state is the first unknown.
Is transport between deployments the bottleneck? | Distribution Service and Receiver Service path pages | Path ownership, incoming-path detail, and network behavior live here. | Trying to infer path health from process charts alone.
Did throughput or resource behavior change over time? | Performance Metrics Service drill-down tabs | This is the best trend and comparative view for services and processes. | Reading one status snapshot as if it were a trend diagnosis.
Is the target truly current? | Automatic heartbeat tables and GG_LAG | Heartbeat lag measures end-to-end replication flow, not just process posture. | Declaring victory because process lag looks low.
What exact error or parameter context caused the issue? | Process reports and Service Manager Diagnosis | These hold evidence text, timeline, and runtime context. | Restarting or retuning before reading the report.
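
The routing above can be sketched as a small lookup, handy when wiring runbooks or chat-ops triage. Everything here is illustrative: the symptom keys and surface descriptions are invented labels, not GoldenGate identifiers or API calls.

```python
# Hypothetical triage map distilled from the table above: each symptom
# class points at the surface to check first.
FIRST_SURFACE = {
    "monitoring_plane_health": "Deployment health and metrics-service health views",
    "process_state": "Administration Service, Admin Client, or process-status REST",
    "transport_bottleneck": "Distribution and Receiver Service path pages",
    "behavior_over_time": "Performance Metrics Service drill-down tabs",
    "target_freshness": "Automatic heartbeat tables (GG_LAG)",
    "exact_error_context": "Process reports and Service Manager Diagnosis",
}

def first_surface(symptom: str) -> str:
    """Return the surface that owns a symptom; default to state checks,
    since process state is usually the cheapest first unknown to resolve."""
    return FIRST_SURFACE.get(symptom, FIRST_SURFACE["process_state"])
```

Encoding the routing this way keeps incident channels consistent: the bot answers "check here first" instead of each responder improvising.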

How to read the Performance Metrics Service correctly

The service is strongest as a drill-down trend surface. Its overview and detail pages are built to show how service and process behavior changes over time, not to replace every other form of evidence.

Microservice pages

Use Process Performance, Thread Performance, and Status and Configuration to judge whether the microservices themselves are healthy and balanced.

Extract detail

Expect trail-file, database, cache, and queue-oriented views. Useful when capture pressure, trail movement, or internal buffering is in doubt.

Replicat detail

Use trail and database-oriented views for apply behavior, but still verify end-to-end truth with heartbeat lag when target freshness is under dispute.

Limits of the page

Pause and clear controls help with live viewing, not evidence preservation. When you need durable proof, pivot into reports, logs, or an external retention surface.

Metric family | Where it commonly appears | Why it matters
Process Performance | Microservices, Extracts, Replicats | Confirms that a slowdown is real and shows broad resource behavior.
Thread Performance | Microservices, Extracts, Replicats | Useful when a process is running but behaving unevenly under load.
Status and Configuration | Microservices, Extracts, Replicats | Stops teams from tuning the wrong object or misreading runtime context.
Trail Files | Extracts and Replicats | Separates capture, transport, and apply movement.
Database, Cache, and Queue statistics | Primarily process-specific views | Explain why a process is busy, not just that it is busy.
Interpretation warning. Charts are excellent at answering "when did behavior change?" They are weaker at answering "what exact failure text caused that change?"

Adjacent observability surfaces that complete the picture

GoldenGate observability is deliberately plural. The right workflow crosses the Performance Metrics Service, service-specific pages, deployment logs, and database-side lag signals.

Admin Service & CLI

Use INFO, LAG, STATS, and VIEW REPORT for authoritative process state and targeted evidence gathering.

Distribution & Receiver

Use path status, incoming-path statistics, and target-initiated path visibility to localize transport ownership.

Service Manager Diagnosis

Use it to correlate lag messages, heartbeat activity, status changes, and service-level errors across the deployment timeline.

Reports and logs

Use process reports for parameter context, mappings, and runtime messages. Use service logs when sequence and timing matter.

Heartbeat tables

Use GG_HEARTBEAT and GG_LAG to prove end-to-end freshness at the database level.

REST, StatsD, and OCI

Use REST for automation, StatsD for export into external platforms, and OCI metrics where cloud integration is the operating model.
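
As a sketch of the StatsD side, the wire format is a plain UDP datagram such as `name:value|g`. The metric name, host, and port below are assumptions; GoldenGate's own StatsD forwarding is configured on the service side, and this only illustrates the line protocol an external collector receives (or that a custom exporter could emit for derived metrics).

```python
import socket

def statsd_gauge(name: str, value: float) -> bytes:
    """Format a StatsD gauge line, e.g. b'ogg.extract.lag_seconds:4.0|g'."""
    return f"{name}:{value}|g".encode("ascii")

def send_gauge(name: str, value: float, host: str = "127.0.0.1", port: int = 8125) -> None:
    # StatsD is fire-and-forget UDP; a lost datagram is acceptable for metrics,
    # so there is no response to read and no retry loop.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_gauge(name, value), (host, port))
```

The fire-and-forget design is the point: exporting metrics must never be able to stall the replication processes being measured.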

Common blind spot. Teams often use the Performance Metrics Service as if it were the whole observability system and then miss a path-level issue that Distribution or Receiver already makes obvious.
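
One way to make the blind spot concrete: sample path statistics twice and flag a path that reports itself as running while its counters stop advancing. The field names below (`bytes_received`, `lcrs_received`, `state`) are hypothetical stand-ins for whatever the Distribution or Receiver REST payload actually exposes.

```python
def path_stalled(sample_a: dict, sample_b: dict) -> bool:
    """Compare two snapshots of path statistics taken a few minutes apart.
    A path is suspect when nothing advances while it still claims to run."""
    advancing = (
        sample_b.get("bytes_received", 0) > sample_a.get("bytes_received", 0)
        or sample_b.get("lcrs_received", 0) > sample_a.get("lcrs_received", 0)
    )
    return sample_b.get("state") == "running" and not advancing
```

A "running but not advancing" path is exactly the case process charts hide, because the Extract feeding it can look perfectly healthy.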

A disciplined investigation sequence under pressure

When latency rises or a dashboard looks wrong, do not jump straight to restart commands. Use a fixed sequence that prevents category errors.

Step 01

Validate deployment and metrics-plane health before trusting any chart.

Step 02

Check Extract and Replicat state through Admin Service or REST.

Step 03

Inspect Distribution or Receiver path ownership when trail movement is in doubt.

Step 04

Query heartbeat lag on the destination that matters to the application.

Step 05

Read reports and deployment messages before changing parameters.

Phase | Healthy signal | If not healthy
Metrics-plane validation | Responsive health views and current metrics pages | Treat stale or empty charts as an observability-layer issue first.
Process validation | Processes are running and lag is explainable | Move directly to the affected process report or service log.
Transport validation | Path statistics advance coherently | Investigate the owning path service before touching process parameters.
Freshness validation | Heartbeat lag aligns with business expectation | If heartbeat lag and process lag disagree, trust the disagreement and explain it.
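
The sequence and table above can be condensed into one deliberately boring function: given booleans you derive from each surface, it returns the first phase that needs attention. The phase wording is illustrative, not tool output.

```python
def next_check(health_ok: bool, processes_ok: bool, path_ok: bool, heartbeat_ok: bool) -> str:
    """Walk the fixed sequence: metrics plane -> process state -> transport
    -> freshness. The ordering prevents category errors: each later phase
    is only trustworthy once the earlier ones have been validated."""
    phases = [
        (health_ok, "metrics-plane: treat stale charts as an observability issue"),
        (processes_ok, "process: read the affected report or service log"),
        (path_ok, "transport: inspect the owning path service"),
        (heartbeat_ok, "freshness: explain the heartbeat vs process-lag disagreement"),
    ]
    for ok, action in phases:
        if not ok:
            return action
    return "healthy: preserve evidence, no intrusive change needed"
```

Hard-coding the order is the discipline: under pressure, nobody gets to skip to the restart step.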

Reusable command and query bundles

These bundles are intentionally short. Each one answers one class of question cleanly instead of mixing every possible check into a hard-to-interpret blob.

Admin Client: fast process-state and report bundle
CONNECT <deployment-endpoint> deployment OBS_EDGE1 as oggops password "replace-me"

INFO ALL
INFO EXTRACT EORDSRC, DETAIL
INFO REPLICAT RORDAP1, DETAIL
LAG EXTRACT EORDSRC
LAG REPLICAT RORDAP1
STATS EXTRACT EORDSRC, TOTAL
STATS REPLICAT RORDAP1, TOTAL
VIEW REPORT EORDSRC
VIEW REPORT RORDAP1
REST: health and process-status checks
curl -k -u oggops:replace-me "<admin-endpoint>/services/v2/config/health/check"
curl -k -u oggops:replace-me "<admin-endpoint>/services/v2/config/health"
curl -k -u oggops:replace-me "<admin-endpoint>/services/v2/extracts/EORDSRC/info/status"
curl -k -u oggops:replace-me "<admin-endpoint>/services/v2/replicats/RORDAP1/info/status"
curl -k -u oggops:replace-me "<metrics-endpoint>/services/v2/mpoints/ADMINSRVR/serviceHealth"
Heartbeat: prove end-to-end lag
DBLOGIN USERIDALIAS trg_oggops
INFO HEARTBEATTABLE

SELECT remote_database,
       local_database,
       incoming_path,
       incoming_heartbeat_age,
       incoming_lag,
       current_local_ts
FROM   gg_lag
ORDER  BY remote_database, incoming_path;
Clock note. GoldenGate documents that heartbeat timestamps are stored in UTC and that clock skew can produce negative lag values. If that happens, fix time sync before arguing about the SQL.
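
A minimal sketch of why skew produces negative lag, assuming the simplified model that incoming lag is receive time minus heartbeat origin time, with both clocks expected to be in UTC:

```python
from datetime import datetime, timedelta, timezone

def incoming_lag(heartbeat_ts_utc: datetime, received_ts_utc: datetime) -> float:
    """Lag in seconds: receive time minus heartbeat origin time (both UTC)."""
    return (received_ts_utc - heartbeat_ts_utc).total_seconds()

# A target clock running 30 s behind the source makes lag go negative even
# though replication actually delivered the heartbeat 5 s after origin.
origin = datetime(2026, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
true_receive = origin + timedelta(seconds=5)
skewed_receive = true_receive - timedelta(seconds=30)  # slow target clock
```

Here `incoming_lag(origin, skewed_receive)` comes out at -25 seconds: the heartbeat arrived 5 seconds after it was generated, but the slow target clock subtracts 30. The SQL was never the problem.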

Failure patterns and version-aware notes

Symptom | Likely misunderstanding | Inspect next
Charts are empty or stale | The team assumes a data-path failure instead of a monitoring-plane problem. | Deployment health and metrics-service health.
Extract lag is low but target data is stale | Process lag is being treated as end-to-end freshness. | Replicat status, path statistics, and heartbeat lag.
Receiver seems quiet in target-initiated transport | Path ownership is assumed to be source-centric in every topology. | Receiver path detail and target-side path definition.
Heartbeat lag is negative | The query is blamed instead of the clocks. | Time synchronization on source and target hosts.

Naming across releases

Older material says Performance Metrics Server; current documentation says Performance Metrics Service; internals may still say PMSRVR.

21c-era Unix behavior

Unix Domain Sockets become the default local communication path to the metrics service on Unix in newer documentation.

OCI extension

OCI GoldenGate adds cloud metrics and alarms, but those do not replace local path, report, and heartbeat evidence.

The operating standard to keep

A mature GoldenGate observability posture is not "we have dashboards." It is a repeatable habit of correlating the right built-in surfaces in the right order.

For any serious incident, gather at least one signal from service health, one from process state, one from path ownership, and one from heartbeat lag. Preserve reports and deployment messages before making intrusive changes. Use the Performance Metrics Service as the trend and drill-down hub, not as the only truth source.

If a team uses one vague phrase for everything from a dashboard gap to a path stall, fix the language first. Better observability language usually produces better troubleshooting behavior.

In Oracle GoldenGate Microservices, the Performance Metrics Service is the time-series observability hub, not the entire monitoring plane. Serious operations work still depends on Administration Service state, Distribution and Receiver path views, Service Manager Diagnosis, process reports, and heartbeat-derived lag at the database layer.

Test your understanding


Q1 — Where does Performance Metrics Service sit in the GoldenGate Microservices deployment model?

Q2 — Which external metrics protocol does Performance Metrics Service natively support for forwarding metrics?

Q3 — What is the primary metric used to detect replication pipeline slowness in GoldenGate?

Q4 — How can you query current process metrics via the REST API?
