Database Replay and Real Application Testing

How to validate upgrades, patches, migrations, and risky changes with real workload behavior
Database Replay is one of the most practical risk-reduction tools in the Oracle DBA toolbox. Instead of trusting synthetic benchmarks, isolated SQL tests, or intuition, you capture a real production workload, restore a test system to the same logical starting point, replay that workload, and analyze whether performance, errors, timing, and transactional behavior still look safe.
What Real Application Testing actually is
Real Application Testing, usually shortened to RAT, is the umbrella feature set Oracle uses for validating database changes with production-like evidence. It has two major parts:
Database Replay
SQL Performance Analyzer
Think of Database Replay as the system-level test and SQL Performance Analyzer as the statement-level microscope. When used together, they answer both: "Did the workload still behave correctly?" and "Which SQL got better or worse?"
That pairing is why RAT is so valuable before upgrades, release updates, platform migrations, Exadata moves, storage changes, parameter changes, or application releases that alter workload shape.
Why Database Replay matters more than synthetic load tests
Many test environments rely on hand-made scripts, partial application smoke tests, or benchmark tools that are good for capacity checks but weak at reproducing real user behavior. Database Replay is different because it records actual client requests with their original sequence characteristics.
What synthetic tests miss
What replay preserves
What you still must do
Database Replay is strongest when the change is wide in scope: database upgrades, patching, host moves, RAC changes, storage changes, and large parameter or configuration shifts. These are exactly the changes where application-only testing often leaves blind spots.
| Change type | Why replay helps | What you want to learn |
|---|---|---|
| Database upgrade | Same workload, new optimizer and database engine behavior | Do errors appear, and do response patterns regress? |
| Patch or RU | Small code changes can still alter workload behavior | Did a previously stable path get slower or behave differently? |
| Storage or host migration | I/O and concurrency behavior can shift sharply | Does the new stack preserve latency under real overlap? |
| RAC or service redesign | Connection routing and session placement matter | Are connections remapped correctly and does throughput hold? |
| Initialization parameter change | Some settings affect broad runtime behavior | Did the system become faster, noisier, or less predictable? |
The end-to-end workflow, visually
At a high level, Database Replay is a six-phase process: define a clean starting point, capture production workload, move and preprocess the files, restore a test database to the same logical state, calibrate and run replay clients, then analyze divergence and performance.
Capture design and prerequisites
A strong replay starts before the capture itself. You need to decide what business window to record, how representative the workload really is, whether maintenance jobs should be in or out, and how you will reconstruct the capture start state in test. Good capture design is what separates a replay that teaches you something from a replay that merely generates files.
Choose the right window
Know the start state
Decide what to exclude
Plan replay clients
wrc replay clients must be able to simulate the original connection load, so calibrate before replay. If you cannot recreate the same application data state in test, your replay analysis becomes much harder to trust: performance differences can come from data drift rather than from the change you are validating.
What gets captured
Database Replay captures external client requests to the database. The capture records call order, timing, and session interaction information needed to reconstruct workload behavior later. That is why it is much more realistic than hand-written replay scripts. At the same time, you still need to interpret results carefully because some calls may be unsupported, filtered, remapped, or intentionally excluded.
Capture objective
Replay objective
Analysis objective
Example capture start
```sql
-- Create an OS-backed directory for the capture files
CREATE OR REPLACE DIRECTORY capture_dir AS '/u01/dbreplay/capture';

-- Optional: inspect prior captures
SELECT id, name, status, start_time, end_time
FROM   dba_workload_captures
ORDER  BY id DESC;

-- Start workload capture on the source system
BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name             => 'UPGRADE_CAPTURE_MAR_2026',
    dir              => 'CAPTURE_DIR',
    duration         => 3600,
    capture_sts      => TRUE,
    sts_cap_interval => 300,
    plsql_mode       => 'extended');
END;
/

-- Stop capture early if required
EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;
```
Capture filters matter
Filters are useful when the production workload contains noise that would distort the test objective. For example, you may want to exclude health checks, maintenance accounts, or modules unrelated to the change being evaluated. Use filters sparingly: over-filtering can turn a realistic workload into a curated demo.
```sql
-- Example: exclude a maintenance service account from capture
BEGIN
  DBMS_WORKLOAD_CAPTURE.ADD_FILTER(
    fname      => 'exclude_maint_user',
    fattribute => 'USER',
    fvalue     => 'MAINT_USER');
END;
/
```
Preprocess, calibrate, initialize, remap, and replay
Once you move the capture files to the replay environment, the next phase is procedural and exacting. This is where many first-time teams make mistakes: they preprocess correctly but forget client calibration, they initialize replay but skip connection remapping review, or they start wrc without confirming how many clients are truly needed.
1. Preprocess the capture
```sql
BEGIN
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(
    capture_dir => 'CAPTURE_DIR');
END;
/
```
Processing builds the metadata the replay engine needs in order to drive the captured workload later. You do this once for the capture set after it has been copied to the replay environment.
2. Calibrate replay clients with wrc
The replay clients are external OS processes. Before running replay, calibrate them so you know how many wrc clients are required to simulate the captured concurrency. This is one of the most important operational steps because under-sizing replay clients can make a healthy system look artificially slow.
```shell
# Run from the replay host
wrc mode=calibrate replaydir=/u01/dbreplay/capture

# Typical result: it reports the recommended number of replay clients.
# Start at least that many replay clients before you launch replay.
```
3. Initialize replay and review connection mapping
```sql
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'REPLAY_UPGRADE_TEST',
    replay_dir  => 'CAPTURE_DIR');
END;
/

-- Inspect how captured connections were interpreted
SELECT conn_id, capture_conn, replay_conn
FROM   dba_workload_connection_map
ORDER  BY conn_id;
```
If the target environment differs from source - new host, new service name, different SCAN listeners, new PDB service, or remapped users - you must review and possibly remap those connections before replay.
```sql
-- Example: remap a captured connection to a new replay service
BEGIN
  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 1,
    replay_connection => 'replay_scan:1521/APPTEST_SVC');
END;
/
```
4. Prepare replay and start clients
```sql
-- Prepare replay using SCN synchronization
EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => 'SCN');
```

```shell
# In one or more OS sessions, start the replay clients
wrc system/<password> mode=replay replaydir=/u01/dbreplay/capture
```

```sql
-- Then start replay from SQL
EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;
```
How to read replay results without fooling yourself
The replay report is the beginning of interpretation, not the end. Teams often look for one headline number, but the real value lies in a layered read: did replay complete cleanly, were there new or missing errors, did throughput or DB time move materially, did certain transaction classes slow down, and were there any data divergences that indicate semantic mismatch?
| What to inspect | Why it matters | What a concern looks like |
|---|---|---|
| Error divergence | Tells you whether replay introduced new failures or lost expected behavior | New errors, missing commits, or changed return codes |
| DB time and response patterns | Shows whether the changed system does more work or stalls more often | Large response inflation or much higher DB time for same workload |
| Average active sessions / concurrency shape | Helps confirm replay pressure resembled capture pressure | Replay clients under-sized, causing workload to flatten unnaturally |
| Data divergence | Signals semantic differences rather than simple timing differences | Any unexpected divergence needs explanation |
| AWR and SQL detail | Explains where the time moved | One SQL family, wait class, or object suddenly dominates |
A replay can "finish successfully" and still be unsafe for go-live. If the workload completed but new errors appeared, data diverged, or latency became unpredictable, treat that as a failed validation until you understand the cause.
Useful report generation
```sql
SELECT DBMS_WORKLOAD_REPLAY.REPORT(
         replay_id => 1,
         format    => 'HTML')
FROM   dual;

-- Also inspect replay history
SELECT id, replay_name, status, start_time, end_time
FROM   dba_workload_replays
ORDER  BY id DESC;
```
A good analysis routine
Layer 1: sanity
Layer 2: divergence
Layer 3: root cause
Pitfalls, edge cases, and unreplayable work
Database Replay is powerful, but it is not magic. Some applications depend on external state, non-deterministic calls, wall-clock assumptions, messaging systems, or environment-specific side effects that cannot be reproduced perfectly. Replay is strongest at database-call realism; it is weaker when the behavior depends on things outside the database boundary.
Common first-time mistakes
- Replaying against data that no longer matches capture start state
- Skipping wrc calibration and under-driving the test system
- Ignoring connection remapping differences
- Capturing an unrepresentative quiet period
- Reading only headline report values and not investigating divergences
Work that needs special care
- External web service dependencies
- Calls whose correctness depends on current time or sequence drift
- Batch chains coordinated outside the database
- Environment-dependent authentication or network behavior
- Side effects in systems not restored alongside the database
Replay can tell you a great deal, but it does not prove every integration path outside the database. Treat it as a high-value validation layer, not the only test you run.
A practical acceptance mindset
When unreplayable or partially divergent calls exist, the right question is not "Can I force perfect replay?" The better question is "Can I explain the divergence, quantify the risk, and decide whether the change is still safe?" Mature teams log these exceptions explicitly rather than hand-waving them away.
Database Replay versus SQL Performance Analyzer
Because both belong to RAT, people sometimes ask which one they should use. In practice, the answer is often both - but for different reasons.
| Question you need answered | Best tool | Why |
|---|---|---|
| Will the whole application workload still behave safely after this change? | Database Replay | It preserves timing, concurrency, and workload interaction at system level. |
| Which SQL statements regressed after the change? | SQL Performance Analyzer | It compares statement performance directly and highlights regressed SQL. |
| I need one broad go/no-go signal plus detailed SQL diagnosis | Both | Replay gives system confidence; SPA explains SQL-level regressions. |
A common pattern is: capture production workload, create or refresh an STS, run SPA to pinpoint statement regressions, then run Database Replay to validate whole-system behavior under concurrency.
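As an illustration of the SPA leg of that pattern, the sketch below drives DBMS_SQLPA against an STS. The STS name (APP_STS), task name (SPA_UPG_TASK), and execution name are hypothetical, and a real run needs a pre-change trial (or the STS's captured statistics) as the comparison baseline, so treat this as an outline to adapt, not a complete recipe.

```sql
-- Hedged sketch: APP_STS and SPA_UPG_TASK are hypothetical names.
DECLARE
  tname VARCHAR2(64);
BEGIN
  -- Build an analysis task from a SQL Tuning Set captured on the source system
  tname := DBMS_SQLPA.CREATE_ANALYSIS_TASK(
             sqlset_name => 'APP_STS',
             task_name   => 'SPA_UPG_TASK');

  -- Test-execute the statements on the changed system
  DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
    task_name      => 'SPA_UPG_TASK',
    execution_type => 'TEST EXECUTE',
    execution_name => 'after_change');

  -- Compare against the baseline trial and flag regressed SQL
  DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(
    task_name      => 'SPA_UPG_TASK',
    execution_type => 'COMPARE PERFORMANCE');
END;
/

-- Generate the regression report
SELECT DBMS_SQLPA.REPORT_ANALYSIS_TASK('SPA_UPG_TASK', 'HTML') FROM dual;
```

The COMPARE PERFORMANCE step is where regressed statements surface; the replay run then confirms whether those regressions matter under real concurrency.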
Practical playbooks for real DBA change windows
Playbook A: database upgrade
- Capture a representative peak business window.
- Preserve or clone the source start state.
- Upgrade the replay environment.
- Run SPA for statement-level drift if an STS was captured.
- Run replay, compare errors, DB time, and SQL outliers.
- Repeat after fixes until the divergence story is understood.
Playbook B: infrastructure or Exadata move
- Keep application logic constant and change only the platform variables.
- Replay the same workload more than once to distinguish jitter from real pattern.
- Compare not just elapsed time but also wait profile and throughput.
Playbook C: risky parameter change
- Run a baseline replay without the parameter change.
- Run a second replay with the change.
- Compare both reports and AWR data to isolate the parameter impact.
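For the report-comparison step, DBMS_WORKLOAD_REPLAY can produce a compare-period report between two replays. The sketch below assumes replay IDs 1 and 2 are your baseline and changed-parameter runs; verify the procedure signature on your release before relying on it.

```sql
-- Hedged sketch: replay IDs are placeholders for your two runs.
DECLARE
  rpt CLOB;
BEGIN
  DBMS_WORKLOAD_REPLAY.COMPARE_PERIOD_REPORT(
    replay_id1 => 1,   -- baseline replay (parameter unchanged)
    replay_id2 => 2,   -- replay with the parameter change applied
    format     => DBMS_WORKLOAD_REPLAY.TYPE_HTML,
    result     => rpt);
END;
/
```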
End-to-end demo: validating a release update for an order-entry service
Here is a realistic sample use case that stays within the supported Database Replay workflow. Imagine a production OLTP database named APPDB that serves an order-entry application through a service called APP_SVC. The goal is to validate a database release update in a replay environment before touching production.
This demo uses verified package procedures, views, and wrc commands. Where outputs vary by environment, the article shows the exact verification query or the exact line you should expect to see, instead of inventing fake numbers.
Step 0: choose a clean capture window
Pick a window that contains real concurrency - for example, 10:00 to 11:00 when order entry, payment checks, and fulfillment updates overlap. If possible, begin from a well-defined start state so the replay database can be restored to the same logical point later.
Step 1: create the capture directory and start capture on production
```sql
CREATE OR REPLACE DIRECTORY rat_cap_dir AS '/u01/rat/capture/appdb_mar11';

BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name             => 'APPDB_RU_CAPTURE_20260311',
    dir              => 'RAT_CAP_DIR',
    duration         => 3600,
    capture_sts      => TRUE,
    sts_cap_interval => 300,
    plsql_mode       => 'extended');
END;
/

-- When the business window finishes:
EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;
```
Step 2: confirm the capture metadata and record the start SCN
```sql
SELECT id, name, status, start_time, end_time, start_scn, end_scn
FROM   dba_workload_captures
WHERE  name = 'APPDB_RU_CAPTURE_20260311';
```
The result you want is simple: the capture row should show COMPLETED, and you should record the START_SCN. That SCN becomes part of your restore story for the replay system.
Step 3: copy the capture files to the replay environment
```shell
# Example using rsync from source to replay host
rsync -av /u01/rat/capture/appdb_mar11/ oracle@testdb01:/u01/rat/capture/appdb_mar11/
```

Step 4: restore the replay database to the capture start state
This part is environment-specific, so the exact restore command depends on your backup strategy. The important accuracy point is this: the replay database must be restored to the same logical application state that existed at capture start. If you use RMAN duplication, your command should restore to the capture start SCN you recorded in the previous step.
```
# Run from RMAN on the replay side, adapting names and channels to your environment
DUPLICATE TARGET DATABASE TO APPDBRPL
  UNTIL SCN <capture_start_scn>
  NOFILENAMECHECK;
```
After the restore or clone completes, verify the replay environment is open, reachable, and aligned with the application release you want to test.
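A minimal sanity check at this point can be done in SQL. The service name APP_RAT_SVC below is this demo's replay service, and the version column name varies by release, so adjust to your environment.

```sql
-- Confirm the clone is open read/write and identify the database
SELECT name, open_mode FROM v$database;

-- Confirm the release you intend to test
-- (VERSION_FULL exists in 19c+; older releases expose VERSION)
SELECT version_full FROM v$instance;

-- Confirm the replay service the remapped connections will use is running
SELECT name FROM v$active_services WHERE name = 'APP_RAT_SVC';
```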
Step 5: preprocess the capture on the replay database version
```sql
CREATE OR REPLACE DIRECTORY rat_cap_dir AS '/u01/rat/capture/appdb_mar11';

BEGIN
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(
    capture_dir => 'RAT_CAP_DIR',
    plsql_mode  => 'EXTENDED');
END;
/
```
Step 6: calibrate replay clients before you start them
```shell
wrc mode=calibrate replaydir=/u01/rat/capture/appdb_mar11
```
In calibrate mode, the important output is the recommendation line telling you how many replay clients and CPUs are needed. Use that recommendation as your starting point rather than guessing.
Step 7: initialize replay, inspect connections, and remap to the replay service
```sql
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'APPDB_RU_REPLAY_20260311',
    replay_dir  => 'RAT_CAP_DIR');
END;
/

SELECT replay_id, conn_id, capture_conn, replay_conn
FROM   dba_workload_connection_map
ORDER  BY conn_id;
```
```sql
-- Use the CONN_ID values returned by DBA_WORKLOAD_CONNECTION_MAP
BEGIN
  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 1,
    replay_connection => 'testscan01:1521/APP_RAT_SVC');
  DBMS_WORKLOAD_REPLAY.REMAP_CONNECTION(
    connection_id     => 2,
    replay_connection => 'testscan01:1521/APP_RAT_SVC');
END;
/
```
If your replay users or schemas differ from source, this is also the stage where you apply user remapping. If they do not differ, do not invent remaps you do not need.
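If you do need user remapping, recent releases provide DBMS_WORKLOAD_REPLAY.SET_USER_MAPPING, called after INITIALIZE_REPLAY. The sketch below assumes a single (non-scheduled) capture and uses hypothetical user names; check that the procedure and its parameter names match your release before using it.

```sql
-- Hedged sketch: APP_OWNER_PROD / APP_OWNER_TEST are hypothetical users.
-- Verify availability and signature of SET_USER_MAPPING on your release.
BEGIN
  DBMS_WORKLOAD_REPLAY.SET_USER_MAPPING(
    schedule_cap_id => NULL,              -- NULL for a single capture
    capture_user    => 'APP_OWNER_PROD',
    replay_user     => 'APP_OWNER_TEST');
END;
/
```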
Step 8: prepare replay, start the replay clients, and then start the database replay
```sql
EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => 'SCN');
```

```shell
# Start the number of replay clients recommended by calibration
wrc system/<password>@APP_RAT_SVC mode=replay replaydir=/u01/rat/capture/appdb_mar11
```

```sql
-- After all replay clients are attached, start the replay itself
EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;
```
```sql
SELECT * FROM v$workload_replay_thread;

SELECT id, replay_name, status, start_time, end_time, capture_id
FROM   dba_workload_replays
ORDER  BY id DESC;
```
During execution, watch for replay clients to attach cleanly and for the replay row to move through expected status changes. Do not treat a replay as trustworthy if clients were missing or obviously under-sized.
Step 9: generate the replay report and investigate divergence
```sql
DECLARE
  cap_id  NUMBER;
  rep_id  NUMBER;
  rep_rpt CLOB;
BEGIN
  cap_id := DBMS_WORKLOAD_REPLAY.GET_REPLAY_INFO(dir => 'RAT_CAP_DIR');

  SELECT MAX(id)
  INTO   rep_id
  FROM   dba_workload_replays
  WHERE  capture_id = cap_id;

  rep_rpt := DBMS_WORKLOAD_REPLAY.REPORT(
               replay_id => rep_id,
               format    => DBMS_WORKLOAD_REPLAY.TYPE_HTML);
END;
/

SELECT * FROM dba_workload_div_summary ORDER BY replay_id DESC;
```
A good result is not a magical fixed ratio. A good result is: the replay completes, there are no unexplained new errors, no unacceptable data divergence, connection mapping behaved as intended, and any performance shifts are understood through replay reporting plus AWR or SQL analysis.
The best go-live decision is evidence-based and specific: "Replay completed with no unexplained data divergence, no material new errors, and performance changes are understood and acceptable." Anything weaker should trigger more analysis.