Oracle Machine Learning for SQL in 26ai: what changed and why it matters
Oracle AI Database 26ai extends OML4SQL in ways that are easy to underestimate if you only skim the feature bullets. The release strengthens forecast model search, algorithm expressiveness, and operational governance at the same time. For SQL-first teams, that directly affects how much manual preprocessing, tuning, and external orchestration is required before a model can be trusted in production.
Each enhancement changes a SQL-centric modeling pipeline in specific ways. The practical questions are where it fits, whether it matters for the workload, what to validate before rollout, and where the new features help less than the headline might suggest.
What 26ai adds
- Automated time series model search and support for multiple time series workflows.
- GLM link-function expansion for logistic regression and stronger XGBoost options.
- Improved prep for high-cardinality categorical inputs and persisted model build lineage.
- EM-based outlier detection, dense projection support with embeddings, and faster partitioned-model handling.
Who this is for
- DBAs and data engineers who operationalize in-database models.
- Architects comparing SQL-native ML with external pipelines.
- Developers and analysts who need practical OML4SQL decision criteria rather than marketing language.
This article starts with the release-level picture, then drills into the new forecasting, modeling, governance, and scale-oriented capabilities before ending with validation guidance and rollout checklists.
Why the OML4SQL changes in 26ai are more important than they first appear
The release is not one giant new algorithm. It is a set of targeted improvements that remove common sources of friction: choosing a forecasting approach, coping with awkward input distributions, applying business constraints, scaling segmented models, and proving how a model was built after the fact.
That combination matters because SQL-centric machine learning programs usually succeed or fail on workflow quality rather than on raw algorithm availability. The hard parts are often reproducibility, fitting the model to the data shape you actually have, preserving business semantics, and keeping the pipeline simple enough that DBAs and developers can operate it without a separate platform team.
Less manual search
Automated time series model search lowers the cost of getting to a defensible first forecast model when portfolio-style forecasting would otherwise require too much hand tuning.
More faithful constraints
GLM link functions and XGBoost constraints matter when the default model shape does not reflect the event process or the business rules you must preserve.
Cleaner governance
Persisted build-query lineage and stronger preprocessing support reduce the gap between model development and auditability.
The exact 26ai feature set in scope
Oracle's 26ai documentation groups the SQL-oriented machine learning enhancements into forecasting, feature engineering, supervised learning, governance, anomaly detection, and performance themes. The table below translates that release list into a practical view.
| Capability | What changed in 26ai | Why it matters | Main caveat |
|---|---|---|---|
| Automated time series model search | Forecasting workflows can search more automatically for a suitable time series model instead of relying as heavily on manual selection. | Reduces the cost of finding a good baseline and makes first-pass forecasting more repeatable. | Automation still needs holdout evaluation, horizon review, and series-quality checks. |
| Multiple time series | Forecasting support extends beyond a single isolated series workflow. | Useful for portfolios of products, branches, regions, devices, or accounts that share a modeling pattern. | Series must still be made operationally comparable in grain, missing-period handling, and governance. |
| Dense projection with embeddings | Explicit Semantic Analysis support extends to dense projection scenarios that use embeddings. | Helps teams turn modern dense representations into model-ready features inside the SQL-oriented stack. | This is a feature-engineering capability, not a substitute for vector indexing or semantic search. |
| GLM link functions | Logistic regression support expands beyond the default logit link to include probit, complementary log-log, and cauchit. | Lets the model shape align more closely with rare-event, tail-heavy, or domain-specific response behavior. | Changing the link function changes interpretation and should be validated, not treated as a cosmetic tuning knob. |
| Improved prep for high-cardinality categoricals | Automatic preprocessing better addresses awkward categorical feature shapes. | Reduces manual prep load when source systems emit many sparse category values. | Very high-cardinality features can still require explicit domain grouping or alternate feature design. |
| Lineage persisted with the model | The model keeps the data query used to build it. | Improves reproducibility, troubleshooting, and audit conversations. | Persisted query text is not the same thing as a frozen copy of source data. |
| EM clustering for outlier detection | Outlier workflows can use Expectation Maximization clustering. | Gives a stronger unsupervised option when anomalies are not labeled in advance. | Cluster-based anomaly reasoning still needs domain interpretation and threshold governance. |
| Partitioned model performance | Partitioned model handling improves, reducing friction for segmented modeling patterns. | Important for workloads where separate per-segment models are operationally preferable. | Partition design still matters; better performance does not rescue a poor segmentation strategy. |
| XGBoost constraints and survival analysis | XGBoost gains stronger support for constrained learning and time-to-event style analysis. | Useful when you need both gradient-boosted flexibility and more business-aligned modeling behavior. | Incorrect constraints or weak survival framing can make the model less credible, not more. |
Where the 26ai enhancements fit in an OML4SQL workflow
Thinking in terms of workflow stages helps separate capabilities that affect model quality from those that mostly improve operational discipline. That distinction is important when planning pilots and setting stakeholder expectations.
This release mostly improves the middle of the workflow: the parts where teams usually spend extra manual effort aligning raw relational data, algorithm behavior, and production governance.
Automated time series model search and multiple time series support
These are the changes most likely to affect daily modeling productivity. Forecasting pipelines often fail not because teams lack data, but because model-family choice, series shape, and manual search overhead make first-pass models expensive to get right.
Automated time series model search
Oracle positions 26ai to automate time series model search more directly inside OML4SQL. The practical meaning is straightforward: instead of treating forecasting as a manual algorithm-picking exercise, you can move closer to a governed search workflow that finds a credible starting model faster.
- Best when forecasting is frequent but not unique enough to justify hand-tuning every series from scratch.
- Particularly helpful for teams that need a reliable baseline before escalating to heavier experimentation.
- Most valuable when paired with disciplined backtesting and business-metric review rather than raw error metrics alone.
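As a concrete anchor, a baseline in-database forecast build using the long-standing CREATE_MODEL2 interface looks like the sketch below. All object names (demand_history, sale_date, units_sold, DEMAND_FCAST) are illustrative, and the exact 26ai setting that switches on automated model search should be taken from the release documentation rather than from this example.

```sql
declare
  v_settings dbms_data_mining.setting_list;
begin
  -- Baseline exponential-smoothing forecast at daily grain; the
  -- automated-search option in 26ai would be enabled by an additional
  -- setting documented in the release notes.
  v_settings('ALGO_NAME')     := 'ALGO_EXPONENTIAL_SMOOTHING';
  v_settings('EXSM_INTERVAL') := 'EXSM_INTERVAL_DAY';
  dbms_data_mining.create_model2(
    model_name          => 'DEMAND_FCAST',
    mining_function     => 'TIME_SERIES',
    data_query          => 'select sale_date, units_sold from demand_history',
    set_list            => v_settings,
    case_id_column_name => 'SALE_DATE',
    target_column_name  => 'UNITS_SOLD');
end;
/
```

Even with automated search, the resulting model should go through the same holdout and horizon review as a hand-picked one.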
Multiple time series
Supporting multiple time series changes the operational scope of in-database forecasting. Many real systems forecast product-by-region, branch-by-day, sensor-by-hour, or tenant-by-period rather than a single global series. 26ai makes that style of workload more natural.
- Good fit when many related series share a time grain and a repeatable governance process.
- Helps standardize model-building across large portfolios instead of proliferating one-off scripts.
- Does not remove the need to manage missing periods, hierarchy effects, or segment-specific anomalies.
What to inspect after a forecast build
Review these points before you call a forecasting pipeline production-ready:
- Whether the time grain is consistent across all participating series.
- How the model behaves at the exact horizon that matters to the business.
- Whether missing periods, structural breaks, or thin-history series were treated consistently.
- Whether forecast review uses segment-level diagnostics rather than only an overall average score.
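The grain and missing-period checks above are ordinary SQL. A minimal sketch, assuming an illustrative demand_history table keyed by branch_id and a DATE-typed sale_date at daily grain:

```sql
-- Per-series coverage check: series with large missing_days values
-- need explicit gap handling before a portfolio build.
select branch_id,
       min(sale_date)                                   as first_day,
       max(sale_date)                                   as last_day,
       count(*)                                         as observed_days,
       (max(sale_date) - min(sale_date) + 1) - count(*) as missing_days
  from demand_history
 group by branch_id
 order by missing_days desc;
```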
| Scenario | Why automated search helps | Why multiple-series support helps | What still requires human judgment |
|---|---|---|---|
| Hundreds of branch-level daily demand series | Speeds up first-pass model choice for many similar workloads. | Lets one governed workflow cover a portfolio rather than a single branch. | Outlier branches, promotions, local closures, and abnormal demand periods. |
| Monthly finance projections across business units | Improves repeatability when teams rebuild on a regular cadence. | Supports consistent treatment of multiple units without bespoke orchestration. | Calendar effects, accounting changes, and low-history units. |
| Telemetry or IoT metrics by device class | Reduces manual experimentation burden for many related metrics. | Fits naturally when the same training pattern must be applied across groups. | Sensor drift, device retirement, and maintenance-driven pattern changes. |
GLM, XGBoost, dense projection, and EM outlier detection
This is the part of the release that gives OML4SQL more modeling nuance. The core question is no longer just, "Can Oracle train a model in SQL?" It is, "Can the in-database model reflect the shape, constraints, and edge cases of the workload closely enough that we trust it?"
Logistic regression gains more link functions
Oracle documents support for additional logistic-regression link functions in 26ai: probit, cloglog, and cauchit, alongside the familiar logit framing. This matters when the default link is convenient but not the best fit for the response shape or interpretation needs.
- Probit is often considered when a latent-normal view of the response is more natural.
- Complementary log-log is often practical for asymmetric event behavior and rare-event style modeling.
- Cauchit gives you a heavier-tailed alternative when tail behavior matters more than a standard logit assumption would suggest.
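Selecting one of these links is a one-setting change at build time. In the sketch below, the GLMS_PROBIT value is an assumption extrapolated from the documented GLMS_CLOGLOG naming pattern, and the table and column names are illustrative; confirm both against the 26ai reference before use.

```sql
declare
  v_settings dbms_data_mining.setting_list;
begin
  v_settings('ALGO_NAME')          := 'ALGO_GENERALIZED_LINEAR_MODEL';
  v_settings('PREP_AUTO')          := 'ON';
  -- Assumed probit value, following the GLMS_CLOGLOG naming pattern.
  v_settings('GLMS_LINK_FUNCTION') := 'GLMS_PROBIT';
  dbms_data_mining.create_model2(
    model_name          => 'CHURN_GLM_PROBIT',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'select * from churn_train',
    set_list            => v_settings,
    case_id_column_name => 'CUSTOMER_ID',
    target_column_name  => 'CHURNED');
end;
/
```

Because the link changes coefficient interpretation, re-run calibration and threshold review after any link switch.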
Constraints and survival analysis support
These additions make XGBoost a better fit for production environments where unconstrained predictive power is not the whole story. Monotonic constraints help encode directional business expectations, and survival-analysis support broadens the algorithm beyond simple point classification or regression tasks.
- Use constraints when domain knowledge says a feature should move the prediction in one direction.
- Use survival-style modeling when the real question is time to event, not only whether an event happened.
- Validate aggressively, because poorly chosen constraints can hide real behavior and weak censoring logic can invalidate survival conclusions.
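Oracle's in-database XGBoost generally accepts native XGBoost parameter names as settings; whether monotone_constraints passes through unchanged in 26ai is an assumption to verify against the release documentation. All object names below are illustrative.

```sql
declare
  v_settings dbms_data_mining.setting_list;
begin
  v_settings('ALGO_NAME') := 'ALGO_XGBOOST';
  v_settings('max_depth') := '6';
  v_settings('objective') := 'binary:logistic';
  -- Assumed pass-through of the native XGBoost parameter: +1 forces the
  -- first listed feature to push predictions up, -1 down, 0 unconstrained.
  v_settings('monotone_constraints') := '(1,-1,0)';
  dbms_data_mining.create_model2(
    model_name          => 'RISK_XGB_MONO',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'select * from risk_train',
    set_list            => v_settings,
    case_id_column_name => 'ACCOUNT_ID',
    target_column_name  => 'DEFAULTED');
end;
/
```

Compare constrained and unconstrained fits on the segments that matter before committing to a constraint.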
Dense projection support with embeddings
This enhancement is best understood as a bridge between older feature-extraction ideas and newer dense-representation workflows. If your pipeline already produces embeddings, 26ai lets OML4SQL use that richer dense input in Explicit Semantic Analysis style projection scenarios, turning dense semantic representations into model-friendly features inside the database.
- Useful when semantic signal matters but the downstream task is still classic in-database ML rather than nearest-neighbor retrieval.
- Helps keep feature engineering closer to the data and closer to the SQL execution environment.
- Should not be confused with vector search infrastructure; the goal here is model input transformation.
Outlier detection using EM clustering
EM clustering is a natural addition for unlabeled anomaly work because it models data as probabilistic clusters rather than forcing a supervised label boundary that you may not have. In practice, this gives SQL-centric teams a stronger in-database path when they need anomaly scoring but cannot maintain curated anomaly labels.
- Best for early-warning and triage use cases where investigation capacity matters as much as raw detection coverage.
- Works best when anomaly review is tied to interpretable cluster behavior and threshold governance.
- Needs careful business review, because unsupervised outlier scores always reflect assumptions embedded in the clustering structure.
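An anomaly-style EM build would plausibly follow the established one-class pattern of training with a NULL target. The EMCS_OUTLIER_RATE setting name and all object names in this sketch are assumptions to confirm against the 26ai documentation.

```sql
declare
  v_settings dbms_data_mining.setting_list;
begin
  v_settings('ALGO_NAME')         := 'ALGO_EXPECTATION_MAXIMIZATION';
  v_settings('EMCS_OUTLIER_RATE') := '0.05';  -- assumed setting name
  dbms_data_mining.create_model2(
    model_name          => 'TXN_OUTLIERS_EM',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'select * from txn_features',
    set_list            => v_settings,
    case_id_column_name => 'TXN_ID',
    target_column_name  => NULL);  -- null target: anomaly-style build
end;
/
```

Scoring then typically uses the prediction operators to rank cases for triage, with thresholds reviewed against investigation capacity rather than fixed in advance.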
```sql
insert into glm_settings (setting_name, setting_value)
values ('GLMS_LINK_FUNCTION', 'GLMS_CLOGLOG');

insert into glm_settings (setting_name, setting_value)
values ('PREP_AUTO', 'ON');

-- Add the link-function row only after deciding that the
-- response shape and validation behavior justify the change.
-- Do not treat link choice as a cosmetic tuning step.
```
High-cardinality prep, persisted lineage, and partitioned model improvements
These features are easy to undersell because they do not sound glamorous. In practice, they often have the biggest effect on whether a modeling program remains operable after the first pilot.
Improved handling of high-cardinality categorical features
Oracle's automatic data preparation already distinguishes between low, medium, and high-cardinality categoricals. The practical message in 26ai is that category-heavy source data is a more first-class concern. That matters because categorical explosion is one of the fastest ways to turn a clean SQL table into an awkward modeling input.
- Oracle documents that automatic prep uses different strategies as cardinality rises, including one-hot style treatment for low-cardinality inputs and binary-style encoding for moderate cardinality.
- Values with very small frequencies can be grouped into an OTHER bucket, which helps reduce feature explosion.
- Oracle still cautions that truly very high-cardinality inputs may require user-directed preprocessing.
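Before relying on automatic prep, a quick cardinality profile shows which columns are at risk of category explosion. Plain SQL, with an illustrative orders table and columns:

```sql
-- Distinct-value counts per categorical column; the highest counts are
-- the candidates for domain grouping or alternate feature design.
select 'CHANNEL_CODE' as column_name, count(distinct channel_code) as n_values from orders
union all
select 'PRODUCT_SKU', count(distinct product_sku) from orders
union all
select 'REGION', count(distinct region) from orders
order by n_values desc;
```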
The build query is persisted with the model
This is one of the most useful 26ai changes for governed environments. The persisted build query gives you a direct record of how the training data was assembled. That simplifies review, troubleshooting, audit conversations, and reproducibility checks.
- Use it to prove which joins, filters, derived columns, and source objects were involved in model creation.
- Use it to compare model builds over time and detect when the training query changed.
- Remember that query persistence is lineage metadata, not a historical snapshot of the source rows themselves.
Partitioned models become more practical
Partitioned models remain important when one global model is operationally inferior to per-segment behavior. 26ai improves performance in that area, which matters when segmentation is not optional but essential because region, product, channel, or customer-type behavior differs materially.
- Good fit when segment-level dynamics are strong enough that a pooled model hides the real signal.
- Operationally attractive when model ownership naturally maps to business partitions.
- Still requires disciplined partition design, because too many weak partitions can produce fragile models even if the mechanics are faster.
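A partitioned build is driven by the ODMS_PARTITION_COLUMNS setting; everything else in this sketch (table, column, and model names) is illustrative.

```sql
declare
  v_settings dbms_data_mining.setting_list;
begin
  v_settings('ALGO_NAME')              := 'ALGO_GENERALIZED_LINEAR_MODEL';
  v_settings('PREP_AUTO')              := 'ON';
  -- One sub-model per distinct REGION value; partition design still matters.
  v_settings('ODMS_PARTITION_COLUMNS') := 'REGION';
  dbms_data_mining.create_model2(
    model_name          => 'CHURN_BY_REGION',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'select * from churn_train',
    set_list            => v_settings,
    case_id_column_name => 'CUSTOMER_ID',
    target_column_name  => 'CHURNED');
end;
/
```

Check row counts per partition value first; partitions with thin data produce fragile sub-models regardless of build speed.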
These are the features that lower long-run friction
Many teams focus on new algorithms and ignore the support systems around them. That is usually backward. Better prep, persisted lineage, and more workable partitioning often create more production value than a marginal algorithmic improvement because they keep the pipeline maintainable.
- Fewer ad hoc transformations outside the database.
- Clearer change control when a model must be rebuilt or reviewed.
- A more durable path from pilot success to scheduled, repeatable operations.
```sql
select model_name,
       algorithm,
       mining_function,
       build_source
  from user_mining_models
 where model_name = upper(:model_name);

select model_name,
       setting_name,
       setting_value
  from user_mining_model_settings
 where model_name = upper(:model_name)
 order by setting_name;
```
When to use which capability, and what to validate before calling it a success
The hardest part of a feature-rich release is deciding which additions matter for the workload and which ones are only marginally relevant. The matrix and diagnostics below are meant to accelerate that judgment.
| If your problem looks like this | Start with | Why it is a fit | Validate carefully |
|---|---|---|---|
| Many related forecast series with a repeatable training process | Automated time series model search + multiple time series | Reduces manual search and fits portfolio-style forecasting operations. | Series grain, holdout behavior, structural breaks, and segment-level error concentration. |
| Binary event modeling where probability shape matters | GLM link functions | Lets you choose a link more aligned with the event process and interpretation needs. | Calibration, threshold behavior, and coefficient interpretation under the selected link. |
| High predictive flexibility with directional business rules | XGBoost constraints | Encodes monotonic expectations while keeping a boosted-tree model family. | Whether the constraint is actually true and whether it hurts fit on important segments. |
| Time-to-event analysis rather than plain classification | XGBoost survival analysis | Moves the task closer to the real business question: when, not only whether. | Censoring logic, event definitions, and horizon interpretation. |
| Label-poor anomaly detection in operational data | EM clustering for outlier detection | Provides an unsupervised path when anomalies are too rare or too expensive to label well. | Thresholding, false-positive burden, and cluster interpretability. |
| Category-heavy tables sourced from many operational systems | Improved high-cardinality prep | Reduces manual preprocessing overhead for messy categorical columns. | Rare-category handling, leakage risk, and whether domain grouping is still needed. |
| Governed production modeling with audit or replay pressure | Persisted lineage | Provides a direct record of the build query inside the model metadata. | Whether source objects, data-retention rules, and change control remain reproducible. |
| Large segmented portfolios where one model per partition is operationally preferable | Partitioned model performance improvements | Makes segment-specific modeling more workable at production scale. | Partition count, data sufficiency per partition, and lifecycle ownership. |
The feature is helping
You are reducing manual setup, not just adding more settings. Model review becomes easier, and the modeling choice maps more directly to the real business question.
The feature is being overused
The team cannot explain why a specific link function, constraint, or partition scheme was chosen beyond "because 26ai supports it now."
The feature is mis-scoped
You are using a workflow enhancement to avoid fixing a data-quality problem, business-definition gap, or governance weakness that still exists underneath.
| Question | Where to inspect | What a healthy answer looks like | Common warning sign |
|---|---|---|---|
| Was the model built from the intended data slice? | USER_MINING_MODELS.BUILD_SOURCE | The stored query matches the approved joins, filters, and derived columns. | The model was rebuilt after an ad hoc query change that no one documented. |
| Did automatic prep help or hide a data-shape issue? | Model settings, source-column profiling, validation errors by segment | Rare categories are tamed without flattening business-critical distinctions. | A large OTHER bucket swallows meaningful business categories. |
| Did the new model form improve real decision quality? | Holdout evaluation, threshold review, business outcomes | The change improves operational decisions, not just a narrow metric. | Model selection is defended only with a single aggregate score. |
| Is segmentation actually justified? | Partition-level data sufficiency and monitoring plans | Each partition has enough data and clear ownership. | Partitions were added because a global model was inconvenient, not because segment behavior truly differs. |
A hands-on review lab, implementation checklist, and FAQ
Close this topic with a repeatable review pattern. Use it to validate whether a 26ai capability improves the pipeline you actually run.
Pick one business problem and one enhancement
Do not start with every new feature at once. Choose a workload where the enhancement addresses a real point of pain.
- Select one use case: binary response, anomaly detection, multi-series forecasting, or segmented modeling.
- Write down the current pain point clearly, not only in model terms.
- Name the exact 26ai enhancement you are evaluating and the reason it should help.
Build with reviewable settings and lineage
Whatever SQL interface or package workflow you use, treat the build as a governed artifact.
- Store settings in a reviewable table rather than burying them in an opaque script fragment.
- After training, inspect BUILD_SOURCE and the stored model settings immediately.
- Confirm that the captured query reflects the intended joins, filters, and feature columns.
Validate the business behavior, not only the model artifact
This is where many pilots go wrong. They prove that the database can train the model but do not prove that the enhancement improved the decision process.
- Use a holdout or backtesting pattern appropriate to the workload.
- Review segment-level behavior, not only a global average.
- Ask whether the model is easier to explain, govern, or rerun than the previous approach.
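Segment-level review can stay in SQL. A minimal sketch, assuming an illustrative churn_holdout table with a region column and a previously built classification model named churn_model:

```sql
-- Per-segment holdout accuracy instead of one global score; segments
-- with low accuracy and meaningful case counts deserve attention first.
select region,
       count(*) as n_cases,
       avg(case when prediction(churn_model using *) = churned
                then 1 else 0 end) as accuracy
  from churn_holdout
 group by region
 order by accuracy;
```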
Before rollout
- Profile the source data for category explosion, missing periods, and partition skew.
- Decide which enhancement is expected to help and how success will be measured.
- Make model settings explicit and reviewable.
- Plan validation at the same decision horizon the business actually uses.
After rollout
- Inspect persisted lineage after every controlled rebuild.
- Monitor segment-level drift rather than only overall averages.
- Review whether automatic prep or constraints are still aligned with current business semantics.
- Keep rebuild procedures simple enough that the operational team can execute them repeatedly.
Should automated time series search replace manual forecasting expertise?
No. It should reduce the cost of finding a strong candidate model. Domain review still matters for horizon choice, unusual periods, series comparability, and acceptance criteria.
Does persisted lineage mean I can reproduce a past model forever?
Not by itself. It preserves the build query, which is extremely useful, but reproducibility still depends on the durability of the source objects, data-retention policy, and change-control discipline around the upstream data.
Should I move from GLM to XGBoost just because XGBoost gained more capabilities?
Not automatically. If interpretability, stability, and controlled probability semantics are central, a better-specified GLM may still be the stronger choice. Use XGBoost when nonlinear fit, constraints, or survival-style framing genuinely improve the task.
When is automatic prep still not enough for high-cardinality categoricals?
When the business meaning of the categories matters more than generic encoding can capture, or when the category space is so large and sparse that manual grouping, alternate feature design, or a different modeling approach is still necessary.
What is the simplest way to get immediate value from the 26ai OML4SQL changes?
Start where friction is already obvious: lineage for governed rebuilds, automated forecasting search for repetitive forecast work, or improved categorical prep where source-system category sprawl has been slowing model development.
Quick quiz
Five self-check questions on the OML4SQL changes in Oracle AI Database 26ai.
Q1. Which forecasting improvement is highlighted in 26ai?
Q2. What does persisted build-query lineage mainly help with?
Q3. When should GLM link-function changes get attention first?
Q4. What warning remains true even after partitioned model performance improves?
Q5. How should dense projection with embeddings be understood here?