Sunday, January 8, 2023

Exadata X8M: Storage High Availability Demo
Oracle Exadata Series

Oracle Exadata Storage HA Explained: failure groups, mirroring, resync, rebalance, and the checks that tell you whether a cell outage is actually safe.

Exadata storage high availability is not one feature. It is the combined result of ASM mirroring across cell-based failure groups, Exadata-specific maintenance workflows, short-interruption resync behavior, and enough free mirrored space to keep the system protected when something goes wrong. Once those pieces are separated, storage events become much easier to reason about without overpromising what the platform can tolerate.

  • Cell = failure group: the core Exadata HA idea
  • Resync or rebalance: depends on outage type
  • Required mirror free space matters: mirror headroom is not optional
  • Plan before shutdown: use deactivation checks first

High availability starts with failure domains, not just with the word “redundancy”

In Exadata, Oracle ASM uses failure groups so that mirrored copies of an extent land in different failure domains. On Exadata, all grid disks created on the same cell are expected to belong to the same ASM failure group because the cell is the unit whose loss must be isolated from its mirrors. That is the architectural reason a single cell outage can often be absorbed cleanly by surviving mirrors: the copies were placed with that failure domain in mind.

That does not mean every cell outage is automatically harmless. The real question is whether the disk group still has the redundancy, health, and mirrored free capacity required for the event you are about to tolerate. Exadata gives you explicit checks for that, which is why careful operators ask the platform first instead of assuming a shutdown is safe because the rack is “redundant”.

[Diagram] Cell 01 / FG01, Cell 02 / FG02, Cell 03 / FG03: grid disks from one cell stay together in one failure group, the mirror copy is placed away from FG01, and a third cell adds recovery headroom after a loss. Operational meaning: mirrors survive only if copies stay separated and the remaining disk group still has enough healthy capacity.
A cell outage story is really a failure-group story. That is the right abstraction level for Exadata HA.

Placement rule

Mirrored extent copies must not share the same failure domain if you expect the loss of that domain to be survivable.

Cell perspective

All grid disks from a storage cell align to one ASM failure group, which makes the cell the practical HA boundary.

Operator perspective

Before maintenance or fault response, confirm what the disk group says about safety rather than assuming the mirror layout is healthy enough.

Good mental shortcut

If someone says “we can lose a cell,” translate that into a more precise question: “Can the relevant disk groups currently lose one failure group and remain protected?”
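One way to turn that question into evidence is to confirm that each failure group really maps to a single cell. On Exadata, ASM disk paths start with the cell address (in the form o/<cell-address>/<griddisk-name>), so grouping on the path prefix is a quick sanity check. A sketch, assuming the standard Exadata path naming; columns are from V$ASM_DISK:

```sql
-- Sketch: confirm each ASM failure group maps to exactly one storage cell.
-- On Exadata, disk paths look like o/<cell-address>/<griddisk-name>,
-- so the path prefix identifies the cell behind each disk.
SELECT failgroup,
       REGEXP_SUBSTR(path, '^o/[^/]+') AS cell_address,
       COUNT(*)                        AS disks
FROM   v$asm_disk
WHERE  group_number > 0
GROUP  BY failgroup, REGEXP_SUBSTR(path, '^o/[^/]+')
ORDER  BY failgroup;
-- Expect one cell_address per failgroup; seeing more than one means
-- mirrors may not be separated the way the design assumes.
```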

Outage behavior: short interruptions resync, longer losses rebalance

Exadata and ASM do not respond to every storage interruption in the same way. A short interruption can follow a lighter path: ASM can take the affected disks offline, track the extents that change while they are away, and resynchronize only those extents when the disks return (fast mirror resync), instead of forcing a full rebalance. Longer or permanent losses follow the more familiar drop-and-rebalance path. Mixing those two paths together creates a lot of confusion during incidents.

That distinction matters operationally. A cell reboot, brief outage, or maintenance window can look very different from a failed disk that must be dropped and rebuilt. The first case tends to be about rapid return and resync eligibility. The second is about surviving mirrors and how much work ASM must do to restore protection.

Short interruption path

  • Disk temporarily disappears or is intentionally taken out for a short period.
  • Changed regions are tracked so ASM can resynchronize efficiently.
  • The goal is fast restoration of redundancy without a full data movement cycle.

Longer or permanent loss path

  • Disk or cell loss lasts too long or becomes a true failure event.
  • ASM drops or permanently loses access to those mirrors.
  • Redundancy is restored through rebalance onto surviving healthy capacity.
[Flow] Event starts: a disk or cell becomes unavailable. Path A (short interruption): ASM tracks changed regions; return the disk or cell promptly and protection is restored through resync. Path B (longer loss): surviving mirrors carry the workload and ASM restores protection with rebalance, which requires healthy remaining capacity. Duration, health, and return timing decide the path; not every outage becomes a rebalance.
If you confuse resync with rebalance, you will misread the urgency, the expected runtime, and the validation plan.
Field rule

During triage, ask whether you are watching a temporary return-to-service event or a true loss-of-mirror-rebuild event. That one distinction cleans up most storage incident discussions.
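The platform also exposes the setting that separates the two paths: the disk group's repair timers, which control how long an offline disk stays eligible for resync before ASM gives up and drops it. A hedged way to read them; the attribute names below match recent ASM releases, so verify against your version:

```sql
-- Sketch: read the repair timers that decide how long an offline disk
-- remains resync-eligible before ASM drops it and rebalances.
SELECT g.name  AS diskgroup,
       a.name  AS attribute,
       a.value AS value
FROM   v$asm_attribute a
       JOIN v$asm_diskgroup g
         ON a.group_number = g.group_number
WHERE  a.name IN ('disk_repair_time', 'failgroup_repair_time')
ORDER  BY g.name, a.name;
```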

Disk group design: redundancy type, failure groups, and mirror headroom must all agree

ASM redundancy level is only part of the storage HA answer. Mirror-capacity indicators such as REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB matter because a disk group that is technically mirrored but short on mirror free space is not in the same operational condition as a comfortably protected one. Exadata maintenance decisions rely on those facts rather than on generic confidence.

Normal redundancy and high redundancy also have different design trade-offs. Normal redundancy stores two copies of each extent, while high redundancy stores three, trading usable capacity for the ability to survive a second failure. In smaller high-redundancy configurations, quorum disks are part of the design, which is another reminder that high availability is a full layout decision rather than just a disk-group label.

  • ASM redundancy type: whether the disk group stores two-way or three-way mirrors. Sets the baseline protection model for extent copies. Validate with TYPE in V$ASM_DISKGROUP.
  • Failure groups: which disks belong to which cell-level fault boundary. Determines whether mirrors are actually separated across cells. Validate with FAILGROUP in V$ASM_DISK.
  • Required mirror free space: the reservation needed to restore protection after a failure. Shows whether you have the cushion needed for recovery work. Compare REQUIRED_MIRROR_FREE_MB with free space.
  • Usable file space: the mirror-aware capacity actually available for new allocation. Prevents false comfort from raw free space alone. Watch USABLE_FILE_MB, not only FREE_MB.
SQL: prove the disk-group protection picture
-- Mirror-aware capacity view
SELECT name,
       type,
       total_mb,
       free_mb,
       required_mirror_free_mb,
       usable_file_mb,
       state
FROM   v$asm_diskgroup
ORDER BY name;

-- Failure-group layout and disk visibility
SELECT group_number,
       disk_number,
       name,
       failgroup,
       path,
       header_status,
       mode_status,
       state
FROM   v$asm_disk
ORDER BY group_number, failgroup, disk_number;
  • FREE_MB: raw free space only
  • REQUIRED_MIRROR_FREE_MB: recovery reservation
  • USABLE_FILE_MB: mirror-aware headroom
  • FAILGROUP: failure-domain mapping
The subtle trap

A disk group can look spacious in raw megabytes and still be in a weak HA position if mirror-aware free space is tight or if the remaining failure groups are already under stress.

Safe maintenance workflow: ask the cells and disk groups whether deactivation is safe

Planned maintenance on Exadata has a safer path than simply shutting services down and hoping ASM absorbs the event. Exadata provides deactivation checks that tell you whether taking grid disks inactive on a cell is safe for the relevant ASM disk groups. If the answer is not safe, that is not noise. It means your current redundancy state or free mirror condition is not good enough for the step you are considering.

This is the point where disciplined Exadata operations differ from casual storage administration. The right workflow is to validate, deactivate deliberately, perform the maintenance, then reactivate and verify. Doing those steps in order turns HA from a vague promise into an evidence-backed procedure.

1. Inspect deactivation outcome

Check whether any grid disk reports that taking it inactive would be unsafe.

2. Review ASM headroom

Confirm mirror-aware free space and current disk health before touching the cell.

3. Inactivate for maintenance

Use the Exadata cell workflow rather than forcing an abrupt surprise outage.

4. Reactivate and verify

Bring grid disks back, then monitor resync or rebalance as needed.

CellCLI + SQL: maintenance precheck and follow-through
-- Storage cell: identify any grid disks that are not safe to deactivate
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome

-- Optional focused review
CellCLI> LIST GRIDDISK WHERE asmDeactivationOutcome != 'Yes'
ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome

-- If the outcome is safe and maintenance is approved
CellCLI> ALTER GRIDDISK ALL INACTIVE

-- After maintenance, restore service exposure
CellCLI> ALTER GRIDDISK ALL ACTIVE

-- ASM side: confirm disk-group condition after the event
SELECT name, type, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM   v$asm_diskgroup
ORDER BY name;
Maintenance mindset

The best pre-maintenance question is not “Does Exadata have HA?” It is “Do the affected disk groups and grid disks say this exact maintenance action is safe right now?”

Operational proof points: what to watch while the platform absorbs the event

During a real storage event, the most useful signals are the simplest ones. You want to know which failure groups are affected, whether ASM sees disks as online or offline, whether a resync or rebalance is running, and whether mirror-aware capacity still looks healthy. Those checks usually establish the state of the event more clearly than a first pass through noisy logs.

Exadata also extends HA below the hard-disk layer. With write-back flash cache, Exadata supports resilvering: mirrored write-back cache content can be rebuilt after a flash device failure using the RDMA network fabric. That matters because HA on Exadata includes both persistent data protection and the restoration of performance-critical cache structures after certain failures.

What proves the storage event is contained

  • The affected failure group is clear and isolated.
  • Remaining disks and failure groups stay healthy.
  • V$ASM_OPERATION shows the expected recovery work.
  • Mirror-aware free space remains sensible after the event.

What should slow you down

  • Unexpected offline disks outside the target failure group.
  • Negative or weak usable capacity for recovery headroom.
  • Noisy assumptions that a returning cell means no validation is needed.
  • Maintenance plans that never checked deactivation safety first.
Runtime checks during outage, return, and rebuild
-- Which disks and failure groups are affected?
SELECT failgroup,
       mode_status,
       state,
       COUNT(*) AS disks
FROM   v$asm_disk
GROUP BY failgroup, mode_status, state
ORDER BY failgroup, mode_status, state;

-- Is ASM resyncing or rebalancing work?
SELECT group_number,
       operation,
       state,
       power,
       sofar,
       est_work,
       est_rate,
       est_minutes
FROM   v$asm_operation;

-- Mirror-aware capacity after the event
SELECT name, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM   v$asm_diskgroup
ORDER BY name;

For database storage

The question is whether mirrored database extents stay available and whether ASM is restoring protection as expected.

For flash write-back cache

The question is whether mirrored write-back cache contents are being rebuilt cleanly after a flash failure or replacement.

Caveats and edge cases: where confident storage assumptions get people in trouble

  • Claim: "A cell can always be taken down with no risk." More accurate reading: only if the current disk-group state, redundancy, and mirror-free conditions support it. Explicit deactivation outcomes exist because safety is state-dependent.
  • Claim: "All outages cause rebalance." More accurate reading: short interruptions can use ASM resync instead of a full rebalance path. That changes both expectations and incident handling.
  • Claim: "FREE_MB tells me whether I am safe." More accurate reading: mirror-aware metrics such as REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB matter too. Raw free space can hide a weak protection posture.
  • Claim: "High redundancy is just a larger normal redundancy." More accurate reading: it changes the mirror copy count and can involve quorum-disk rules in smaller high-redundancy systems. Design, capacity cost, and metadata behavior differ.
  • Claim: "Once the cell returns, the story is over." More accurate reading: you still need to verify whether the event is finishing via resync, rebalance, or another recovery step. Returning hardware is not the same thing as restored redundancy.

Misconception: redundancy type is enough

The protection story also depends on failure-group placement and mirror-aware free space.

Misconception: maintenance and failure are the same

Planned deactivation uses a different, safer workflow and should not be treated like an accidental outage.

Misconception: flash cache HA is irrelevant

Write-back flash cache protection and resilvering matter because cache state can affect post-failure performance behavior.

Misconception: a healthy rack means every disk group is healthy

Disk-group state must still be verified individually because HA is consumed at the disk-group level.

Best final check

Before any disruptive storage action, make the platform answer three questions: Is the target safe to deactivate, do the disk groups have mirror-aware headroom, and are there any unrelated offline disks already eroding redundancy?
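The ASM side of those three questions can be scripted as a single precheck. A sketch under the assumption that empty results mean "no known blocker"; the cell-side asmDeactivationOutcome check still runs on the cell and is not replaced by this:

```sql
-- Sketch: pre-action sanity check from the ASM side.
-- 1) Any disks already offline or in an abnormal state?
SELECT group_number, failgroup, COUNT(*) AS suspect_disks
FROM   v$asm_disk
WHERE  mode_status <> 'ONLINE'
   OR  state <> 'NORMAL'
GROUP  BY group_number, failgroup;

-- 2) Mirror-aware headroom: negative USABLE_FILE_MB means the group
--    could not restore full redundancy after another failure.
SELECT name, usable_file_mb, required_mirror_free_mb
FROM   v$asm_diskgroup
WHERE  usable_file_mb < 0
   OR  free_mb < required_mirror_free_mb;
-- Empty results from both queries support, but do not replace, the
-- cell-side deactivation safety check.
```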

Validation lab: prove storage HA from both CellCLI and ASM

A good Exadata HA validation lab is not a destructive outage simulation. It is a cross-checking workflow that confirms the protection layout, verifies whether maintenance would be safe, and shows whether recovery work is active after a real event. That approach is both safer and more useful because it teaches you how to read the platform under normal conditions and under stress.

Storage cell validation
-- 1) Check whether any grid disk reports unsafe deactivation
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome

-- 2) Focus only on problematic results if any exist
CellCLI> LIST GRIDDISK WHERE asmDeactivationOutcome != 'Yes'
ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome

-- 3) Review recent cell-side alert signals if needed
CellCLI> LIST ALERTHISTORY ATTRIBUTES alertSequenceID, collectionTime, severity, message
WHERE severity != 'clear'
ASM validation
-- 1) Protection posture
SELECT name, type, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM   v$asm_diskgroup
ORDER BY name;

-- 2) Failure-group visibility
SELECT failgroup, mode_status, state, COUNT(*) disks
FROM   v$asm_disk
GROUP BY failgroup, mode_status, state
ORDER BY failgroup, mode_status, state;

-- 3) Recovery work
SELECT group_number, operation, state, est_minutes
FROM   v$asm_operation;

What “ready for maintenance” looks like

  • Target grid disks report safe deactivation outcomes.
  • No surprise offline disks exist outside the target work.
  • Mirror-aware capacity is healthy enough for the event.
  • The failure-group layout matches your design expectations.

What “post-event recovery” looks like

  • Returned disks or cells are visible again.
  • ASM recovery work trends in the expected direction.
  • Disk-group state and usable capacity stabilize cleanly.
  • The platform story matches both CellCLI and ASM views.

Quick quiz

These questions test the distinctions that matter in real Exadata incidents: failure groups, mirror-aware headroom, and the difference between a returning outage and a real rebuild.

Seven questions covering ASM and CellCLI HA reasoning.
Q1. On Exadata, why are grid disks from the same cell aligned to one ASM failure group?
Because ASM cannot display more than one failure group per disk group
Because all cells must always use high redundancy
Because the storage cell is the failure domain whose mirrors must be separated from one another
Because CellCLI cannot create more than one grid disk
Correct answer: the cell is the failure domain, so mirrors must be separated away from it.
Q2. What is the best interpretation of REQUIRED_MIRROR_FREE_MB?
The recovery reservation needed to restore protection after a failure
The amount of flash cache currently in write-back mode
The total size of one storage cell
A synonym for raw free space
Correct answer: it is the reservation needed for mirror recovery, not just generic free space.
Q3. Why is it risky to say every storage interruption leads to rebalance?
Because rebalance is unsupported on Exadata
Because CellCLI performs all rebuild work outside ASM
Because only flash cache ever recovers on Exadata
Because short interruptions can return through ASM resync instead of a full rebuild path
Correct answer: temporary outages can follow a resync path rather than a full rebalance path.
Q4. Before planned cell maintenance, which question is most important?
Whether the rack has flash cache enabled
Whether the affected grid disks report that deactivation is safe right now
Whether SQL*Plus can connect without using ASM
Whether FREE_MB is larger than zero
Correct answer: safe deactivation is a stateful validation step, not an assumption.
Q5. What does USABLE_FILE_MB add beyond raw free space?
It shows only flash cache capacity
It shows the number of active network paths
It shows mirror-aware capacity actually usable for new allocation
It replaces failure-group checks entirely
Correct answer: it is the mirror-aware capacity view, which is why it is more operationally useful than raw free space alone.
Q6. After a flash failure in write-back flash cache, what Exadata behavior is relevant to HA?
Write-back flash cache content can be resilvered using mirrored copies over the RDMA fabric
ASM disables all mirroring until the cache is empty
The database must always restart to rebuild flash contents
Flash cache protection is unrelated to Exadata HA
Correct answer: Exadata documents resilvering of mirrored write-back flash cache content using RDMA.
Q7. Which statement is the safest DBA posture after a cell returns online?
The return alone proves full redundancy is restored
No verification is needed if the database stayed open
Only flash cache needs checking
Confirm whether resync, rebalance, or another recovery step is still active and validate disk-group state
Correct answer: returning hardware is not the same thing as completed recovery.
