Oracle Exadata Storage HA Explained
Failure groups, mirroring, resync, rebalance, and the checks that tell you whether a cell outage is actually safe.
Exadata storage high availability is not one feature. It is the combined result of ASM mirroring across cell-based failure groups, Exadata-specific maintenance workflows, short-interruption resync behavior, and enough free mirrored space to keep the system protected when something goes wrong. Once those pieces are separated, storage events become much easier to reason about without overpromising what the platform can tolerate.
High availability starts with failure domains, not just with the word “redundancy”
In Exadata, Oracle ASM uses failure groups so that mirrored copies of an extent land in different failure domains. On Exadata, all grid disks created on the same cell are expected to belong to the same ASM failure group because the cell is the unit whose loss must be isolated from its mirrors. That is the architectural reason a single cell outage can often be absorbed cleanly by surviving mirrors: the copies were placed with that failure domain in mind.
That does not mean every cell outage is automatically harmless. The real question is whether the disk group still has the redundancy, health, and mirrored free capacity required for the event you are about to tolerate. Exadata gives you explicit checks for that, which is why careful operators ask the platform first instead of assuming a shutdown is safe because the rack is “redundant”.
Placement rule
Mirrored extent copies must not share the same failure domain if you expect the loss of that domain to be survivable.
Cell perspective
All grid disks from a storage cell align to one ASM failure group, which makes the cell the practical HA boundary.
Operator perspective
Before maintenance or fault response, confirm what the disk group says about safety rather than assuming the mirror layout is healthy enough.
If someone says “we can lose a cell,” translate that into a more precise question: “Can the relevant disk groups currently lose one failure group and remain protected?”
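That placement question can be made concrete with a toy model. The sketch below is illustrative only: real ASM allocation also balances by capacity and disk partnerships, and the function names here are invented for the example. It captures the one rule that matters: mirror copies of an extent never share a failure group, so losing any single cell leaves a surviving copy.

```python
# Toy model of failure-group-aware mirror placement (illustrative only;
# real ASM allocation also balances by capacity and disk partnerships).
# Each storage cell is one failure group; a mirrored extent's copies
# must land in different failure groups.

def place_extent(failgroups, copies=2):
    """Pick `copies` distinct failure groups for one extent's mirrors."""
    if copies > len(failgroups):
        raise ValueError("not enough failure groups to separate mirrors")
    return failgroups[:copies]  # simplistic pick; ASM is far smarter

def survives_loss(placement, lost_failgroup):
    """Losing one cell is survivable if any copy lives elsewhere."""
    return any(fg != lost_failgroup for fg in placement)

cells = ["cell01", "cell02", "cell03"]
placement = place_extent(cells, copies=2)
# Whichever single cell fails, at least one mirror copy survives:
assert all(survives_loss(placement, lost) for lost in cells)
```

The model also shows why two failure groups are the bare minimum for normal redundancy: with only one, `place_extent` cannot separate the copies at all.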
Outage behavior: short interruptions resync, longer losses rebalance
Exadata and ASM do not respond to every storage interruption in the same way. A short interruption can follow the fast mirror resync path: offlined ASM disks are tracked with a staleness (dirty-region) bitmap and then resynchronized when they return within the disk group's repair window, instead of forcing a full rebalance. Longer or permanent losses follow the more familiar drop-and-rebalance path. Mixing those two paths together creates a lot of confusion during incidents.
That distinction matters operationally. A cell reboot, brief outage, or maintenance window can look very different from a failed disk that must be dropped and rebuilt. The first case tends to be about rapid return and resync eligibility. The second is about surviving mirrors and how much work ASM must do to restore protection.
Short interruption path
- Disk temporarily disappears or is intentionally taken out for a short period.
- Changed regions are tracked so ASM can resynchronize efficiently.
- The goal is fast restoration of redundancy without a full data movement cycle.
Longer or permanent loss path
- Disk or cell loss lasts too long or becomes a true failure event.
- ASM drops or permanently loses access to those mirrors.
- Redundancy is restored through rebalance onto surviving healthy capacity.
During triage, ask whether you are watching a temporary return-to-service event or a true loss-of-mirror-rebuild event. That one distinction cleans up most storage incident discussions.
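The fork between the two paths can be sketched in a few lines. This is a simplification, not ASM internals: the 180-minute repair window below is an arbitrary stand-in for the disk group's configured repair time, and the function is invented for illustration.

```python
# Sketch of the two recovery paths (simplified, not ASM internals):
# a short offline period tracked by a staleness bitmap resyncs only the
# changed allocation units (AUs); a loss that outlives the repair window
# drops the disk and rebalances everything onto surviving mirrors.

def recovery_work(total_aus, dirty_aus, offline_minutes, repair_minutes=180):
    """Return (path, aus_to_copy) for a disk that went offline."""
    if offline_minutes <= repair_minutes:
        # Fast resync: copy only AUs written while the disk was offline.
        return "resync", len(dirty_aus)
    # Repair window exceeded: disk is dropped; full re-mirroring follows.
    return "rebalance", total_aus

# Brief cell reboot: only 40 of 100,000 AUs changed while offline.
assert recovery_work(100_000, set(range(40)), offline_minutes=25) == ("resync", 40)
# Disk gone past the window: every extent must be re-protected.
assert recovery_work(100_000, set(range(40)), offline_minutes=600) == ("rebalance", 100_000)
```

The asymmetry in the two return values is the whole triage point: a resync is proportional to what changed, a rebalance is proportional to what was stored.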
Disk group design: redundancy type, failure groups, and mirror headroom must all agree
ASM redundancy level is only part of the storage HA answer. Mirror-capacity indicators such as REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB matter because a disk group that is technically mirrored but short on mirror free space is not in the same operational condition as a comfortably protected one. Exadata maintenance decisions rely on those facts rather than on generic confidence.
Normal redundancy and high redundancy also have different design trade-offs. Normal redundancy stores two copies of each extent, while high redundancy stores three, trading capacity for the ability to survive a double failure. In smaller high-redundancy configurations, quorum disks are part of the design, which is another reminder that high availability is a full layout decision rather than just a disk-group label.
| Design element | What it tells you | Why it matters during failure or maintenance | Validation habit |
|---|---|---|---|
| ASM redundancy type | Whether the disk group stores two-way or three-way mirrors. | Sets the baseline protection model for extent copies. | Check TYPE in V$ASM_DISKGROUP. |
| Failure groups | Which disks belong to which cell-level fault boundary. | Determines whether mirrors are actually separated across cells. | Check FAILGROUP in V$ASM_DISK. |
| Required mirror free space | The reservation needed to restore protection after a failure. | Shows whether you have the cushion needed for recovery work. | Compare REQUIRED_MIRROR_FREE_MB and free space. |
| Usable file space | The mirror-aware capacity actually available for new allocation. | Prevents false comfort from raw free space alone. | Watch USABLE_FILE_MB, not only FREE_MB. |
-- Mirror-aware capacity view
SELECT name,
type,
total_mb,
free_mb,
required_mirror_free_mb,
usable_file_mb,
state
FROM v$asm_diskgroup
ORDER BY name;
-- Failure-group layout and disk visibility
SELECT group_number,
disk_number,
name,
failgroup,
path,
header_status,
mode_status,
state
FROM v$asm_disk
ORDER BY group_number, failgroup, disk_number;
- FREE_MB: raw free space only.
- REQUIRED_MIRROR_FREE_MB: recovery reservation.
- USABLE_FILE_MB: mirror-aware headroom.
- FAILGROUP: failure-domain mapping.
A disk group can look spacious in raw megabytes and still be in a weak HA position if mirror-aware free space is tight or if the remaining failure groups are already under stress.
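The arithmetic behind USABLE_FILE_MB is worth internalizing. For mirrored disk groups it follows the documented relationship: raw free space minus the recovery reservation, divided by the mirror factor. The sketch below uses made-up numbers purely for illustration.

```python
# Mirror-aware headroom, per the documented V$ASM_DISKGROUP relationship:
# usable = (raw free space - recovery reservation) / mirror copy factor.
# The figures below are invented for illustration.

def usable_file_mb(free_mb, required_mirror_free_mb, redundancy):
    factor = {"EXTERN": 1, "NORMAL": 2, "HIGH": 3}[redundancy]
    return (free_mb - required_mirror_free_mb) // factor

# A normal-redundancy group that "looks" spacious in raw megabytes:
assert usable_file_mb(400_000, 150_000, "NORMAL") == 125_000

# Usable space can go negative: the group is protected right now, but
# could not re-mirror everything after losing its largest failure group.
assert usable_file_mb(100_000, 150_000, "NORMAL") == -25_000
```

A negative USABLE_FILE_MB is exactly the "weak HA position" the prose warns about: nothing is broken yet, but the next failure cannot be fully absorbed.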
Safe maintenance workflow: ask the cells and disk groups whether deactivation is safe
Planned maintenance on Exadata has a safer path than simply shutting services down and hoping ASM absorbs the event. Exadata provides deactivation checks that tell you whether taking grid disks inactive on a cell is safe for the relevant ASM disk groups. If the answer is not safe, that is not noise. It means your current redundancy state or free mirror condition is not good enough for the step you are considering.
This is the point where disciplined Exadata operations differ from casual storage administration. The right workflow is to validate, deactivate deliberately, perform the maintenance, then reactivate and verify. Doing those steps in order turns HA from a vague promise into an evidence-backed procedure.
1. Inspect deactivation outcome
Check whether any grid disk reports that taking it inactive would be unsafe.
2. Review ASM headroom
Confirm mirror-aware free space and current disk health before touching the cell.
3. Inactivate for maintenance
Use the Exadata cell workflow rather than forcing an abrupt surprise outage.
4. Reactivate and verify
Bring grid disks back, then monitor resync or rebalance as needed.
-- Storage cell: identify any grid disks that are not safe to deactivate
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome
-- Optional focused review
CellCLI> LIST GRIDDISK WHERE asmDeactivationOutcome != 'Yes'
ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome
-- If the outcome is safe and maintenance is approved
CellCLI> ALTER GRIDDISK ALL INACTIVE
-- After maintenance, restore service exposure
CellCLI> ALTER GRIDDISK ALL ACTIVE
-- ASM side: confirm disk-group condition after the event
SELECT name, type, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM v$asm_diskgroup
ORDER BY name;
The best pre-maintenance question is not “Does Exadata have HA?” It is “Do the affected disk groups and grid disks say this exact maintenance action is safe right now?”
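Answering that question can be scripted. The helper below is a hypothetical sketch: it parses captured output of the `LIST GRIDDISK` command shown above (assuming the simple whitespace-separated form CellCLI prints; adjust parsing to your release) and flags any grid disk whose asmDeactivationOutcome is anything other than "Yes".

```python
# Hypothetical go/no-go helper for captured CellCLI output, e.g. from
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmDeactivationOutcome"
# Assumes whitespace-separated output with the outcome last on each line;
# real "not safe" outcomes carry a multi-word reason, which still fails
# the simple "last field is Yes" check below.

def unsafe_griddisks(cellcli_output: str):
    """Return names of grid disks whose deactivation outcome is not 'Yes'."""
    unsafe = []
    for line in cellcli_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-1] != "Yes":
            unsafe.append(fields[0])
    return unsafe

sample = """
DATA_CD_00_cell01   Yes
DATA_CD_01_cell01   No
RECO_CD_00_cell01   Yes
"""
assert unsafe_griddisks(sample) == ["DATA_CD_01_cell01"]
```

An empty result is the precondition for proceeding; any non-empty result means the platform itself is telling you the step is not currently safe.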
Operational proof points: what to watch while the platform absorbs the event
During a real storage event, the most useful signals are the simplest ones. You want to know which failure groups are affected, whether ASM sees disks as online or offline, whether a resync or rebalance is running, and whether mirror-aware capacity still looks healthy. Those checks usually establish the state of the event more clearly than a first pass through noisy logs.
Exadata also extends HA below the hard-disk layer: with flash-cache write-back resilvering, mirrored write-back flash cache content can be rebuilt after a flash device failure using the RDMA network fabric. That matters because HA on Exadata includes both persistent data protection and the restoration of performance-critical cache structures after certain failures.
What proves the storage event is contained
- The affected failure group is clear and isolated.
- Remaining disks and failure groups stay healthy.
- V$ASM_OPERATION shows the expected recovery work.
- Mirror-aware free space remains sensible after the event.
What should slow you down
- Unexpected offline disks outside the target failure group.
- Negative or weak usable capacity for recovery headroom.
- Noisy assumptions that a returning cell means no validation is needed.
- Maintenance plans that never checked deactivation safety first.
-- Which disks and failure groups are affected?
SELECT failgroup,
mode_status,
state,
COUNT(*) AS disks
FROM v$asm_disk
GROUP BY failgroup, mode_status, state
ORDER BY failgroup, mode_status, state;
-- Is ASM running resync or rebalance work?
SELECT group_number,
operation,
state,
power,
sofar,
est_work,
est_rate,
est_minutes
FROM v$asm_operation;
-- Mirror-aware capacity after the event
SELECT name, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM v$asm_diskgroup
ORDER BY name;
For database storage
The question is whether mirrored database extents stay available and whether ASM is restoring protection as expected.
For flash write-back cache
The question is whether mirrored write-back cache contents are being rebuilt cleanly after a flash failure or replacement.
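During the event itself, the V$ASM_OPERATION check above reduces to a three-way classification. The sketch below is an assumption-laden triage helper: it takes rows shaped like that view (dicts with OPERATION, STATE, and EST_MINUTES keys, a simplification of the real column set) and summarizes what the platform is doing.

```python
# Rough triage of V$ASM_OPERATION-style rows (sketch, not Oracle tooling).
# Each row is assumed to be a dict with OPERATION, STATE, and EST_MINUTES;
# the real view carries more columns, and EST_MINUTES can be NULL (None).

def classify(ops):
    if not ops:
        return "idle: no recovery work registered"
    running = [o for o in ops if o.get("STATE") == "RUN"]
    if not running:
        return "queued: work registered but not running"
    worst = max(o.get("EST_MINUTES") or 0 for o in running)
    kinds = sorted({o.get("OPERATION", "?") for o in running})
    return f"active: {'/'.join(kinds)} running, ~{worst} min remaining"

assert classify([]) == "idle: no recovery work registered"
assert classify([{"OPERATION": "REBAL", "STATE": "WAIT"}]) \
    == "queued: work registered but not running"
assert classify([{"OPERATION": "REBAL", "STATE": "RUN", "EST_MINUTES": 42}]) \
    == "active: REBAL running, ~42 min remaining"
```

The "idle" answer is only reassuring once redundancy is already restored; an empty view during an ongoing outage should prompt more questions, not fewer.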
Caveats and edge cases: where confident storage assumptions get people in trouble
| Claim you may hear | More accurate reading | Why it matters |
|---|---|---|
| “A cell can always be taken down with no risk.” | Only if the current disk-group state, redundancy, and mirror-free conditions support it. | Explicit deactivation outcomes exist because safety is state-dependent. |
| “All outages cause rebalance.” | Short interruptions can use ASM resync instead of a full rebalance path. | It changes both expectations and incident handling. |
| “FREE_MB tells me whether I am safe.” | Mirror-aware metrics such as REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB matter too. | Raw free space can hide a weak protection posture. |
| “High redundancy is just a larger normal redundancy.” | It changes mirror copy count and can involve quorum-disk rules in smaller high-redundancy systems. | Design, capacity cost, and metadata behavior differ. |
| “Once the cell returns, the story is over.” | You still need to verify whether the event is finishing via resync, rebalance, or another recovery step. | Returning hardware is not the same thing as restored redundancy. |
Misconception: redundancy type is enough
The protection story also depends on failure-group placement and mirror-aware free space.
Misconception: maintenance and failure are the same
Planned deactivation uses a different, safer workflow and should not be treated like an accidental outage.
Misconception: flash cache HA is irrelevant
Write-back flash cache protection and resilvering matter because cache state can affect post-failure performance behavior.
Misconception: a healthy rack means every disk group is healthy
Disk-group state must still be verified individually because HA is consumed at the disk-group level.
Before any disruptive storage action, make the platform answer three questions: Is the target safe to deactivate, do the disk groups have mirror-aware headroom, and are there any unrelated offline disks already eroding redundancy?
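Those three questions fold naturally into a single checklist function. The sketch below is hypothetical glue code: its inputs mirror what the CellCLI and V$ASM queries earlier in the article return, and the zero-headroom threshold is illustrative, not Oracle guidance.

```python
# The three pre-action questions folded into one go/no-go decision
# (hypothetical glue code; thresholds are illustrative, not Oracle guidance).

def maintenance_go_no_go(unsafe_griddisks, usable_mb_by_dg, stray_offline_disks):
    """Return (go, reasons): go is True only if all three checks pass."""
    reasons = []
    if unsafe_griddisks:  # Q1: is the target safe to deactivate?
        reasons.append(f"unsafe deactivation: {unsafe_griddisks}")
    low = [dg for dg, mb in usable_mb_by_dg.items() if mb <= 0]
    if low:               # Q2: is there mirror-aware headroom?
        reasons.append(f"no mirror-aware headroom: {low}")
    if stray_offline_disks:  # Q3: is redundancy already eroded elsewhere?
        reasons.append(f"pre-existing offline disks: {stray_offline_disks}")
    return (len(reasons) == 0, reasons)

ok, why = maintenance_go_no_go([], {"DATA": 125_000, "RECO": 80_000}, [])
assert ok and why == []
ok, why = maintenance_go_no_go([], {"DATA": -25_000}, [])
assert not ok and why == ["no mirror-aware headroom: ['DATA']"]
```

The value of writing it down this way is that "go" requires every check to pass; a single failed answer vetoes the action, which is exactly how the platform's own deactivation outcome behaves.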
Validation lab: prove storage HA from both CellCLI and ASM
A good Exadata HA validation lab is not a destructive outage simulation. It is a cross-checking workflow that confirms the protection layout, verifies whether maintenance would be safe, and shows whether recovery work is active after a real event. That approach is both safer and more useful because it teaches you how to read the platform under normal conditions and under stress.
-- 1) Check whether any grid disk reports unsafe deactivation
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome
-- 2) Focus only on problematic results if any exist
CellCLI> LIST GRIDDISK WHERE asmDeactivationOutcome != 'Yes'
ATTRIBUTES name, asmDiskgroupName, asmDeactivationOutcome
-- 3) Review recent cell-side alert signals if needed
CellCLI> LIST ALERTHISTORY ATTRIBUTES alertSequenceID, collectionTime, severity, message
WHERE severity != 'clear'
-- 1) Protection posture
SELECT name, type, free_mb, required_mirror_free_mb, usable_file_mb, state
FROM v$asm_diskgroup
ORDER BY name;
-- 2) Failure-group visibility
SELECT failgroup, mode_status, state, COUNT(*) disks
FROM v$asm_disk
GROUP BY failgroup, mode_status, state
ORDER BY failgroup, mode_status, state;
-- 3) Recovery work
SELECT group_number, operation, state, est_minutes
FROM v$asm_operation;
What “ready for maintenance” looks like
- Target grid disks report safe deactivation outcomes.
- No surprise offline disks exist outside the target work.
- Mirror-aware capacity is healthy enough for the event.
- The failure-group layout matches your design expectations.
What “post-event recovery” looks like
- Returned disks or cells are visible again.
- ASM recovery work trends in the expected direction.
- Disk-group state and usable capacity stabilize cleanly.
- The platform story matches both CellCLI and ASM views.
Quick quiz
These questions test the distinctions that matter in real Exadata incidents: failure groups, mirror-aware headroom, and the difference between a returning outage and a real rebuild.
- What does REQUIRED_MIRROR_FREE_MB represent?
- What does USABLE_FILE_MB add beyond raw free space?