Tuesday, January 30, 2024

Database Replay - Real Application Testing (RAT)


Database Replay and Real Application Testing: how to validate upgrades, patches, migrations, and risky changes with real workload behavior

Database Replay is one of the most practical risk-reduction tools in the Oracle DBA toolbox. Instead of trusting synthetic benchmarks, isolated SQL tests, or intuition, you capture a real production workload, restore a test system to the same logical starting point, replay that workload, and analyze whether performance, errors, timing, and transactional behavior still look safe.

What Real Application Testing actually is

Real Application Testing, usually shortened to RAT, is the umbrella feature set Oracle uses for validating database changes with production-like evidence. It has two major parts:

Database Replay

Captures external client calls from a live system and replays them on a test system with original timing, concurrency, and transaction dependencies.

SQL Performance Analyzer

Extracts SQL into a SQL tuning set and compares statement-level performance before and after a change.
Key insight

Think of Database Replay as the system-level test and SQL Performance Analyzer as the statement-level microscope. When used together, they answer both: "Did the workload still behave correctly?" and "Which SQL got better or worse?"

That pairing is why RAT is so valuable before upgrades, release updates, platform migrations, Exadata moves, storage changes, parameter changes, or application releases that alter workload shape.

Recollect: Database Replay is not just about performance. It is also about preserving concurrency patterns, commit ordering, connection behavior, and error visibility in a way that hand-written test scripts rarely can.

Why Database Replay matters more than synthetic load tests

Many test environments rely on hand-made scripts, partial application smoke tests, or benchmark tools that are good for capacity checks but weak at reproducing real user behavior. Database Replay is different because it records actual client requests with their original sequence characteristics.

What synthetic tests miss

Irregular think times, mixed transaction types, hot objects, real bind patterns, and unpredictable concurrency bursts.

What replay preserves

Timing, overlapping sessions, transaction dependencies, connection maps, and the shape of the real production workload.

What you still must do

Restore the test system to a logically equivalent starting point, remap connections carefully, and interpret divergence rather than blindly trusting one score.
When it shines

Database Replay is strongest when the change is wide in scope: database upgrades, patching, host moves, RAC changes, storage changes, and large parameter or configuration shifts. These are exactly the changes where application-only testing often leaves blind spots.

Change type | Why replay helps | What you want to learn
Database upgrade | Same workload, new optimizer and database engine behavior | Do errors appear, and do response patterns regress?
Patch or RU | Small code changes can still alter workload behavior | Did a previously stable path get slower or behave differently?
Storage or host migration | I/O and concurrency behavior can shift sharply | Does the new stack preserve latency under real overlap?
RAC or service redesign | Connection routing and session placement matter | Are connections remapped correctly, and does throughput hold?
Initialization parameter change | Some settings affect broad runtime behavior | Did the system become faster, noisier, or less predictable?

The end-to-end workflow, visually

At a high level, Database Replay is a six-phase process: define a clean starting point, capture production workload, move and preprocess the files, restore a test database to the same logical state, calibrate and run replay clients, then analyze divergence and performance.

1. Define start state - backup or restore point; quiet start if possible.
2. Capture - record external client calls; optional filters, optional STS.
3. Preprocess - generate replay metadata; done once per capture.
4. Restore test DB - same logical data state as at capture start.
5. Replay - initialize, remap, prepare; wrc clients drive the workload.
6. Analyze - performance, errors, data divergence, AWR, SQL.
Recollect: The most common conceptual mistake is to focus on the replay command and forget the restore step. Replay quality depends heavily on having a test database that starts from the same logical application state as the original capture start.

Capture design and prerequisites

A strong replay starts before the capture itself. You need to decide what business window to record, how representative the workload really is, whether maintenance jobs should be in or out, and how you will reconstruct the capture start state in test. Good capture design is what separates a replay that teaches you something from a replay that merely generates files.

Choose the right window

Capture a period that includes real peaks, not only quiet time. If month-end or batch overlap is what hurts in production, capture that.

Know the start state

Plan how the test database will be restored to the same logical point as capture start - backup, clone, or restore point strategy.

Decide what to exclude

Some sessions or modules should be filtered out if they would distort the purpose of the test.

Plan replay clients

The wrc replay clients must be able to simulate the original connection load. Calibrate before replay.
Before you capture

If you cannot recreate the same application data state in test, your replay analysis becomes much harder to trust. Performance differences can come from data drift rather than from the change you are validating.
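One lightweight way to pin that start state, assuming flashback logging is enabled and the fast recovery area has room, is a guaranteed restore point created just before capture begins. The restore point name below is illustrative:

```sql
-- Just before capture starts: pin the logical start state for a later restore.
CREATE RESTORE POINT before_rat_capture GUARANTEE FLASHBACK DATABASE;

-- Verify it and note the SCN; this SCN anchors the test-system restore later.
SELECT name, scn, time, guarantee_flashback_database
FROM   v$restore_point
WHERE  name = 'BEFORE_RAT_CAPTURE';
```

On the replay side you can then flash a clone back to that restore point, or feed the recorded SCN into an RMAN restore or duplicate, so the test database starts exactly where the capture did.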

What gets captured

Database Replay captures external client requests to the database. The capture records call order, timing, and session interaction information needed to reconstruct workload behavior later. That is why it is much more realistic than hand-written replay scripts. At the same time, you still need to interpret results carefully because some calls may be unsupported, filtered, remapped, or intentionally excluded.

Capture objective

Preserve realism: concurrency, think time, transaction shape, and user overlap.

Replay objective

Stress the changed system with the same workload pattern, not just the same SQL text.

Analysis objective

Determine whether errors, latency, and workload behavior changed enough to block go-live.

Example capture start

SQL - Create directory and start capture
-- Create an OS-backed directory for the capture files
CREATE OR REPLACE DIRECTORY capture_dir AS '/u01/dbreplay/capture';

-- Optional: inspect prior captures
SELECT id, name, status, start_time, end_time
FROM   dba_workload_captures
ORDER BY id DESC;

-- Start workload capture on the source system
BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name             => 'UPGRADE_CAPTURE_MAR_2026',
    dir              => 'CAPTURE_DIR',
    duration         => 3600,
    capture_sts      => TRUE,
    sts_cap_interval => 300,
    plsql_mode       => 'EXTENDED'
  );
END;
/

-- Stop capture early if required
EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;

Capture filters matter

Filters are useful when the production workload contains noise that would distort the test objective. For example, you may want to exclude health checks, maintenance accounts, or modules unrelated to the change being evaluated. Use filters sparingly: over-filtering can turn a realistic workload into a curated demo.

SQL - Add a capture filter
-- Example: exclude a maintenance service account from capture
BEGIN
  DBMS_WORKLOAD_CAPTURE.ADD_FILTER(
    fname      => 'exclude_maint_user',
    fattribute => 'USER',
    fvalue     => 'MAINT_USER'
  );
END;
/
Recollect: A capture is only as good as its representativeness. Ask yourself, "If replay passes, would I be comfortable betting production on that window?" If the answer is no, recapture.

Preprocess, calibrate, initialize, remap, and replay

Once you move the capture files to the replay environment, the next phase is procedural and exacting. This is where many first-time teams make mistakes: they preprocess correctly but forget client calibration, they initialize replay but skip connection remapping review, or they start wrc without confirming how many clients are truly needed.

1. Preprocess the capture

SQL - Process capture files
BEGIN
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(
    capture_dir => 'CAPTURE_DIR'
  );
END;
/

Processing builds the metadata the replay engine needs in order to drive the captured workload later. You do this once for the capture set after it has been copied to the replay environment.

2. Calibrate replay clients with wrc

The replay clients are external OS processes. Before running replay, calibrate them so you know how many wrc clients are required to simulate the captured concurrency. This is one of the most important operational steps because under-sizing replay clients can make a healthy system look artificially slow.

OS - Calibrate replay clients
# Run from the replay host
wrc mode=calibrate replaydir=/u01/dbreplay/capture

# Typical result: it tells you the recommended number of clients
# Start at least that many replay clients before you launch replay

3. Initialize replay and review connection mapping

SQL - Initialize replay
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'REPLAY_UPGRADE_TEST',
    replay_dir  => 'CAPTURE_DIR'
  );
END;
/

-- Inspect how captured connections were interpreted
SELECT replay_id, conn_id, capture_conn, replay_conn
FROM   dba_workload_connection_map
ORDER BY conn_id;

If the target environment differs from source - new host, new service name, different SCAN listeners, new PDB service, or remapped users - you must review and possibly remap those connections before replay.

SQL - Example remapping
-- Example: remap all captured connections to a new replay service
BEGIN
  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 1,
    replay_connection => 'replay_scan:1521/APPTEST_SVC'
  );
END;
/

4. Prepare replay and start clients

SQL and OS - Prepare and start replay
-- Prepare replay using SCN synchronization
EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => 'SCN');

# In one or more OS sessions, start the replay clients
# (start as many as wrc calibration recommended)
wrc system/<password> mode=replay replaydir=/u01/dbreplay/capture

-- Then start replay from SQL
EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;

Process capture (build replay metadata) → Calibrate wrc (size client count correctly) → Initialize and remap (connection strings and users) → Prepare (SCN or time sync) → Start replay (clients first, then database)
Recollect: Calibration answers "How many replay clients do I need?" Remapping answers "Where should captured connections land now?" Those are different questions, and both matter.

How to read replay results without fooling yourself

The replay report is the beginning of interpretation, not the end. Teams often look for one headline number, but the real value lies in a layered read: did replay complete cleanly, were there new or missing errors, did throughput or DB time move materially, did certain transaction classes slow down, and were there any data divergences that indicate semantic mismatch?

What to inspect | Why it matters | What a concern looks like
Error divergence | Tells you whether replay introduced new failures or lost expected behavior | New errors, missing commits, or changed return codes
DB time and response patterns | Shows whether the changed system does more work or stalls more often | Large response inflation or much higher DB time for the same workload
Average active sessions / concurrency shape | Helps confirm replay pressure resembled capture pressure | Replay clients under-sized, causing workload to flatten unnaturally
Data divergence | Signals semantic differences rather than simple timing differences | Any unexpected divergence needs explanation
AWR and SQL detail | Explains where the time moved | One SQL family, wait class, or object suddenly dominates
Important judgment call

A replay can "finish successfully" and still be unsafe for go-live. If the workload completed but new errors appeared, data diverged, or latency became unpredictable, treat that as a failed validation until you understand the cause.

Useful report generation

SQL - Generate replay report
SELECT DBMS_WORKLOAD_REPLAY.REPORT(
         replay_id => 1,
         format    => 'HTML'
       )
FROM dual;

-- Also inspect replay history
SELECT id, replay_name, status, start_time, end_time
FROM   dba_workload_replays
ORDER BY id DESC;

A good analysis routine

Layer 1: sanity

Did replay complete? Did the client count look sufficient? Was the environment restored correctly?

Layer 2: divergence

Did errors, commits, or data outcomes change? Any unexplained semantic drift matters.

Layer 3: root cause

Use AWR, ASH, and SQL-level analysis to explain where time moved and why.
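To feed that root-cause layer, both packages can export the AWR data that was collected for the capture and replay periods, so the two windows can be compared side by side. A minimal sketch; the IDs are illustrative and would normally come from DBA_WORKLOAD_CAPTURES and DBA_WORKLOAD_REPLAYS:

```sql
-- Export AWR data for the capture period (run where the capture metadata lives)
EXEC DBMS_WORKLOAD_CAPTURE.EXPORT_AWR(capture_id => 1);

-- Export AWR data for the replay period on the replay system
EXEC DBMS_WORKLOAD_REPLAY.EXPORT_AWR(replay_id => 1);
```

The exported AWR sets can then be imported and diffed with the usual AWR comparison tooling, which is often where a "replay is slower" headline resolves into one wait class or SQL family.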
Recollect: Never ask only "Is replay faster?" Ask "Is replay faithful, semantically safe, and acceptably performant?"

Pitfalls, edge cases, and unreplayable work

Database Replay is powerful, but it is not magic. Some applications depend on external state, non-deterministic calls, wall-clock assumptions, messaging systems, or environment-specific side effects that cannot be reproduced perfectly. Replay is strongest at database-call realism; it is weaker when the behavior depends on things outside the database boundary.

Common first-time mistakes

  • Replaying against data that no longer matches capture start state
  • Skipping wrc calibration and under-driving the test system
  • Ignoring connection remapping differences
  • Capturing an unrepresentative quiet period
  • Reading only headline report values and not investigating divergences

Work that needs special care

  • External web service dependencies
  • Calls whose correctness depends on current time or sequence drift
  • Batch chains coordinated outside the database
  • Environment-dependent authentication or network behavior
  • Side effects in systems not restored alongside the database
Do not overclaim

Replay can tell you a great deal, but it does not prove every integration path outside the database. Treat it as a high-value validation layer, not the only test you run.

A practical acceptance mindset

When unreplayable or partially divergent calls exist, the right question is not "Can I force perfect replay?" The better question is "Can I explain the divergence, quantify the risk, and decide whether the change is still safe?" Mature teams log these exceptions explicitly rather than hand-waving them away.

Database Replay versus SQL Performance Analyzer

Because both belong to RAT, people sometimes ask which one they should use. In practice, the answer is often both - but for different reasons.

Question you need answered | Best tool | Why
Will the whole application workload still behave safely after this change? | Database Replay | It preserves timing, concurrency, and workload interaction at system level.
Which SQL statements regressed after the change? | SQL Performance Analyzer | It compares statement performance directly and highlights regressed SQL.
I need one broad go/no-go signal plus detailed SQL diagnosis | Both | Replay gives system confidence; SPA explains SQL-level regressions.
Practical pairing

A common pattern is: capture production workload, create or refresh an STS, run SPA to pinpoint statement regressions, then run Database Replay to validate whole-system behavior under concurrency.
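The SPA side of that pairing can be sketched with the DBMS_SQLPA package. This is a compressed outline, assuming an STS named APP_STS already exists from the capture; the task and execution names are illustrative:

```sql
-- Create a SQL Performance Analyzer task from the captured SQL tuning set
VARIABLE task_name VARCHAR2(64)
EXEC :task_name := DBMS_SQLPA.CREATE_ANALYSIS_TASK(sqlset_name => 'APP_STS', task_name => 'APP_SPA_TASK');

-- Run a "before change" trial, apply the change, then run an "after change" trial
EXEC DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(task_name => 'APP_SPA_TASK', execution_type => 'TEST EXECUTE', execution_name => 'before_change');
-- ... apply the change being validated ...
EXEC DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(task_name => 'APP_SPA_TASK', execution_type => 'TEST EXECUTE', execution_name => 'after_change');

-- Compare the two trials and report regressed SQL
EXEC DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(task_name => 'APP_SPA_TASK', execution_type => 'COMPARE PERFORMANCE');
SELECT DBMS_SQLPA.REPORT_ANALYSIS_TASK(task_name => 'APP_SPA_TASK', type => 'HTML') FROM dual;
```

The comparison report is what pinpoints the regressed statements that a whole-system replay report only hints at.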

Practical playbooks for real DBA change windows

Playbook A: database upgrade

  1. Capture a representative peak business window.
  2. Preserve or clone the source start state.
  3. Upgrade the replay environment.
  4. Run SPA for statement-level drift if an STS was captured.
  5. Run replay, compare errors, DB time, and SQL outliers.
  6. Repeat after fixes until the divergence story is understood.

Playbook B: infrastructure or Exadata move

  1. Keep application logic constant and change only the platform variables.
  2. Replay the same workload more than once to distinguish jitter from real pattern.
  3. Compare not just elapsed time but also wait profile and throughput.

Playbook C: risky parameter change

  1. Run a baseline replay without the parameter change.
  2. Run a second replay with the change.
  3. Compare both reports and AWR data to isolate the parameter impact.
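Those two runs can also be compared directly from the replay side. A sketch assuming both replays used the same capture; the replay IDs are illustrative:

```sql
-- Compare the baseline replay against the changed-parameter replay
DECLARE
  cmp_rpt CLOB;
BEGIN
  DBMS_WORKLOAD_REPLAY.COMPARE_PERIOD_REPORT(
    replay_id1 => 1,   -- baseline replay (parameter unchanged)
    replay_id2 => 2,   -- replay with the parameter change applied
    format     => DBMS_WORKLOAD_REPLAY.TYPE_HTML,
    result     => cmp_rpt
  );
  -- cmp_rpt now holds the comparison report; spool it or write it to a file
END;
/
```

Comparing replay-to-replay rather than replay-to-capture removes replay-environment noise from the parameter question.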

End-to-end demo: validating a release update for an order-entry service

Here is a realistic sample use case that stays within the supported Database Replay workflow. Imagine a production OLTP database named APPDB that serves an order-entry application through a service called APP_SVC. The goal is to validate a database release update in a replay environment before touching production.

Demo principle

This demo uses verified package procedures, views, and wrc commands. Where outputs vary by environment, the article shows the exact verification query or the exact line you should expect to see, instead of inventing fake numbers.

Step 0: choose a clean capture window

Pick a window that contains real concurrency - for example, 10:00 to 11:00 when order entry, payment checks, and fulfillment updates overlap. If possible, begin from a well-defined start state so the replay database can be restored to the same logical point later.

Step 1: create the capture directory and start capture on production

SQL - Start production capture
CREATE OR REPLACE DIRECTORY rat_cap_dir AS '/u01/rat/capture/appdb_mar11';

BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name             => 'APPDB_RU_CAPTURE_20260311',
    dir              => 'RAT_CAP_DIR',
    duration         => 3600,
    capture_sts      => TRUE,
    sts_cap_interval => 300,
    plsql_mode       => 'EXTENDED'
  );
END;
/

-- When the business window finishes:
EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;

Step 2: confirm the capture metadata and record the start SCN

SQL - Verify capture metadata
SELECT id,
       name,
       status,
       start_time,
       end_time,
       start_scn,
       end_scn
FROM   dba_workload_captures
WHERE  name = 'APPDB_RU_CAPTURE_20260311';

The result you want is simple: the capture row should show COMPLETED, and you should record the START_SCN. That SCN becomes part of your restore story for the replay system.

Step 3: copy the capture files to the replay environment

OS - Copy capture files
# Example using rsync from source to replay host
rsync -av /u01/rat/capture/appdb_mar11/ oracle@testdb01:/u01/rat/capture/appdb_mar11/

Step 4: restore the replay database to the capture start state

This part is environment-specific, so the exact restore command depends on your backup strategy. The important accuracy point is this: the replay database must be restored to the same logical application state that existed at capture start. If you use RMAN duplication, your command should restore to the capture start SCN you recorded in the previous step.

RMAN - Example duplicate until capture SCN
# Run from RMAN on the replay side, adapting names and channels to your environment
DUPLICATE TARGET DATABASE TO APPDBRPL
  UNTIL SCN <capture_start_scn>
  NOFILENAMECHECK;

After the restore or clone completes, verify the replay environment is open, reachable, and aligned with the application release you want to test.
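A couple of quick checks along those lines; APP_RAT_SVC is the demo replay service from the later remapping step:

```sql
-- Confirm the replay database is open read write and note its current SCN
SELECT name, open_mode, current_scn FROM v$database;

-- Confirm the service the remapped connections will use is actually running
SELECT name FROM v$active_services WHERE name = 'APP_RAT_SVC';
```

If the service is missing here, the connection remapping in a later step will fail at replay time, so this is cheap insurance.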

Step 5: preprocess the capture on the replay database version

SQL - Process capture
CREATE OR REPLACE DIRECTORY rat_cap_dir AS '/u01/rat/capture/appdb_mar11';

BEGIN
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(
    capture_dir => 'RAT_CAP_DIR',
    plsql_mode  => 'EXTENDED'
  );
END;
/

Step 6: calibrate replay clients before you start them

OS - Calibrate wrc
wrc mode=calibrate replaydir=/u01/rat/capture/appdb_mar11

In calibrate mode, the important output is the recommendation line telling you how many replay clients and CPUs are needed. Use that recommendation as your starting point rather than guessing.

Step 7: initialize replay, inspect connections, and remap to the replay service

SQL - Initialize and inspect mapping
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'APPDB_RU_REPLAY_20260311',
    replay_dir  => 'RAT_CAP_DIR'
  );
END;
/

SELECT replay_id,
       conn_id,
       capture_conn,
       replay_conn
FROM   dba_workload_connection_map
ORDER BY conn_id;
SQL - Remap to replay service
-- Use the CONN_ID values returned by DBA_WORKLOAD_CONNECTION_MAP
BEGIN
  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 1,
    replay_connection => 'testscan01:1521/APP_RAT_SVC'
  );

  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 2,
    replay_connection => 'testscan01:1521/APP_RAT_SVC'
  );
END;
/

If your replay users or schemas differ from source, this is also the stage where you apply user remapping. If they do not differ, do not invent remaps you do not need.
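When a user remap is needed, the package provides a user-level mapping call. A hedged sketch, assuming the captured APP_USER schema was renamed to APP_USER_TEST in the replay environment; both names are illustrative:

```sql
-- Map a captured user to the user that should run its calls during replay.
-- Run after INITIALIZE_REPLAY and before PREPARE_REPLAY.
BEGIN
  DBMS_WORKLOAD_REPLAY.SET_USER_MAPPING(
    capture_user => 'APP_USER',
    replay_user  => 'APP_USER_TEST'
  );
END;
/
```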

Step 8: prepare replay, start the replay clients, and then start the database replay

SQL and OS - Prepare and launch replay
EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => 'SCN');

# Start the number of replay clients recommended by calibration
wrc system/<password>@APP_RAT_SVC mode=replay replaydir=/u01/rat/capture/appdb_mar11

-- After all replay clients are attached, start the replay itself
EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;
SQL - Monitor replay threads and replay history
SELECT *
FROM   v$workload_replay_thread;

SELECT id,
       replay_name,
       status,
       start_time,
       end_time,
       capture_id
FROM   dba_workload_replays
ORDER BY id DESC;

During execution, watch for replay clients to attach cleanly and for the replay row to move through expected status changes. Do not treat a replay as trustworthy if clients were missing or obviously under-sized.

Step 9: generate the replay report and investigate divergence

SQL - Generate report and inspect divergence
DECLARE
  cap_id  NUMBER;
  rep_id  NUMBER;
  rep_rpt CLOB;
BEGIN
  cap_id := DBMS_WORKLOAD_REPLAY.GET_REPLAY_INFO(dir => 'RAT_CAP_DIR');

  SELECT MAX(id)
  INTO   rep_id
  FROM   dba_workload_replays
  WHERE  capture_id = cap_id;

  rep_rpt := DBMS_WORKLOAD_REPLAY.REPORT(
               replay_id => rep_id,
               format    => DBMS_WORKLOAD_REPLAY.TYPE_HTML);
END;
/

SELECT *
FROM   dba_workload_div_summary
ORDER BY replay_id DESC;
What counts as a good result here?

A good result is not a magical fixed ratio. A good result is: the replay completes, there are no unexplained new errors, no unacceptable data divergence, connection mapping behaved as intended, and any performance shifts are understood through replay reporting plus AWR or SQL analysis.

Recollect: The safest end-to-end replay demo is the one that shows exact commands, exact verification points, and honest interpretation criteria. That is much better than inventing a perfect-looking report with fake numeric results.
Decision rule

The best go-live decision is evidence-based and specific: "Replay completed with no unexplained data divergence, no material new errors, and performance changes are understood and acceptable." Anything weaker should trigger more analysis.

Knowledge check

These questions are designed to reinforce judgment, not just memorization. After submitting, read the explanations carefully - that is where the concepts become sticky.

Q1. What is the most important difference between Database Replay and a synthetic load test?

  • Synthetic tests are always faster to run, so they are always better.
  • Database Replay preserves real workload timing and concurrency patterns from production rather than inventing a test workload.
  • Database Replay only captures SQL text and ignores session behavior.
  • There is no meaningful difference if the CPU count is the same.

Correct answer: Replay preserves real workload shape. That is the core advantage. It is not merely generating more SQL; it is preserving the production call pattern closely enough to expose system-level problems more realistically.

Q2. Why is restoring the replay database to the same logical start state as capture so important?

  • Because replay cannot run unless the database name is identical.
  • Because AWR reports require matching DBID values.
  • Because otherwise performance and data differences may come from changed data state rather than from the system change you are testing.
  • Because replay only supports empty databases.

Correct answer: State equivalence matters. Replay quality depends on comparing like with like. If the data state moved, you may be measuring drift in data and application activity rather than the upgrade or configuration change itself.

Q3. What problem does wrc mode=calibrate help solve?

  • It estimates how many replay clients are needed to drive the captured workload realistically.
  • It patches the database before replay begins.
  • It eliminates all divergence automatically.
  • It converts Database Replay into SQL Performance Analyzer mode.

Correct answer: Client sizing. If you under-size replay clients, the target system may never see the same concurrency pressure that production saw, and your results become less trustworthy.

Q4. Which replay outcome should make you stop and investigate before approving go-live?

  • The HTML report is generated successfully.
  • Replay completes slightly faster than capture.
  • The connection map contains multiple services.
  • New error divergence or unexplained data divergence appears.

Correct answer: Unexplained divergence is the real red flag. A replay can technically finish and still reveal unacceptable semantic or error differences. Completion alone is not approval.

Q5. When should you prefer SQL Performance Analyzer over Database Replay?

  • When you need to reproduce full application concurrency and session overlap.
  • When the main question is which specific SQL statements regressed after a change.
  • When replay clients are unavailable on the operating system.
  • When the database is too fast.

Correct answer: SPA is the statement-level microscope. Database Replay is broader and more realistic at workload level; SPA is sharper when you need SQL-by-SQL evidence.

Q6. Why can over-filtering a capture be dangerous?

  • Because filters are illegal during capture.
  • Because a filtered capture always runs slower than an unfiltered one.
  • Because you may remove the very concurrency or workload behavior that made the change risky in production.
  • Because filtered captures cannot be processed.

Correct answer: Representativeness is everything. A replay that looks clean because you excluded the hard parts may give false confidence.

Q7. What is the healthiest mindset for a DBA reading a replay report?

  • Use the report as a structured investigation starting point: check sanity, then divergence, then root cause.
  • If replay completed, sign off immediately.
  • Ignore SQL detail because replay is a system-level feature.
  • Only compare elapsed time and disregard errors.

Correct answer: Replay analysis is layered reasoning. Completion status is only one signal. The real value comes from interpreting divergence, workload shape, and root cause evidence together.
