Select AI, Retrieval-Augmented Generation, and Synthetic Data

How to turn natural-language prompting into governed retrieval workflows, where vectors fit, and where synthetic data helps without becoming a compliance shortcut
Select AI in Oracle AI Database 26ai is best understood as an orchestration layer around prompts, AI profiles, database metadata, and optional retrieval. Once RAG enters the picture, the quality of the answer depends less on the model alone and more on profile design, chunking discipline, vector retrieval quality, object scoping, and evaluation practice. Synthetic data belongs in the same conversation because it helps teams test and scale these workflows safely, but it solves a different problem than masking, access control, or formal data governance.
This article covers what Select AI actually does, how RAG plugs into it, how profiles and vectors constrain behavior, where synthetic data helps, and how to validate the whole pipeline before anyone mistakes a polished answer for a trustworthy one.
What the feature line really bundles
The feature name "Select AI with RAG and Synthetic Data Generation" can sound like a single feature. Treat it as three related capabilities that share infrastructure but solve different jobs.
**Select AI.** Lets users express intent in natural language and route that intent through an AI profile. Depending on the action, Oracle can generate SQL, explain SQL, chat, narrate, or use retrieval-augmented variants such as chat_rag and narrate_rag.
**Retrieval-Augmented Generation.** Introduces an explicit retrieval step before generation. In Oracle terms, the AI profile can reference a vector index and an embedding model so relevant chunks can be pulled into the model context instead of relying only on the base model.
**Synthetic data generation.** Uses a generative model to populate existing tables or create new datasets for development, testing, demos, and data amplification scenarios. It is about creating plausible data, not proving policy compliance by itself.
**Common thread.** The profile, credential, object scope, and evaluation workflow remain central across all three. None of these capabilities removes the need for access design, validation, and production review.
Select AI is not a simple “model inside the database” story. It is a controlled interface between prompts, model providers, database metadata, and optional vector-backed retrieval. RAG improves grounding only when the retrieval layer is well-designed, and synthetic data becomes valuable only when its intended use is explicit.
| Capability | What it helps with | What it does not guarantee | What to verify in practice |
|---|---|---|---|
| Select AI prompt actions | Faster query authoring, explanation, and conversational access patterns. | Correct SQL, safe SQL, or correct domain interpretation in every case. | Generated SQL shape, access scope, and business meaning. |
| RAG | Injecting relevant local context into model prompts. | Truth, freshness, or perfect retrieval coverage. | Chunk quality, embedding choice, retrieval precision, and answer review. |
| Synthetic data | Non-production datasets, test coverage, demos, and edge-case expansion. | Formal anonymization, referential realism across every workflow, or bias-free output. | Constraint validity, representational fidelity, and downstream behavior. |
| AI profiles | Reusable configuration for provider, credentials, and retrieval settings. | A substitute for security policy or human governance. | Who can use the profile, what objects it can touch, and how prompts are reviewed. |
Mental model: prompt, retrieval, context, response
Select AI with RAG works as a pipeline with a small number of decision points. Each decision point changes what the model sees and therefore changes what the answer can be trusted to mean.
A plain chat or narrate action relies on the profile and the base prompt. It can still be useful for metadata-aware conversations or SQL assistance, but it does not retrieve extra business documents automatically.
A chat_rag or narrate_rag action adds retrieval. That changes the context window the model receives, which often improves specificity when the underlying corpus is chunked well and indexed consistently.
A GENERATESYNTHETICDATA action shifts the goal entirely. Instead of answering a question, the model generates rows or datasets according to a prompt and profile, often for non-production use cases.
Most failed RAG deployments are not model failures first. They are corpus failures, chunking failures, stale retrieval indexes, vague prompts, weak profile boundaries, or evaluation failures. Oracle’s feature set reduces plumbing work; it does not eliminate design responsibility.
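Under the hood, the retrieval step of a chat_rag action amounts to a nearest-neighbor search over embedded chunks. A minimal sketch of what that search looks like in Oracle AI Vector Search terms, assuming a hypothetical hr_policy_chunks table with a VECTOR column named embedding and a pre-computed query embedding in a bind variable:

```sql
-- Illustrative only: table and column names are assumptions, not Oracle defaults.
-- :query_vec would hold the embedding of the user's question.
SELECT doc_id,
       chunk_text
FROM   hr_policy_chunks
ORDER  BY VECTOR_DISTANCE(embedding, :query_vec, COSINE)
FETCH  FIRST 5 ROWS ONLY;
```

Select AI performs the equivalent lookup for you when the profile declares a vector index; the sketch only shows why chunk quality and embedding choice dominate answer quality, since they determine what this query can possibly return.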
Select AI profile design is the real control plane
Oracle’s AI profile is where operational intent becomes configuration. It ties a user or session to a provider, a credential, optional retrieval settings, and object-level expectations. That is why profile design usually matters more than prompt cleverness.
Profile responsibilities
- Identify the model provider and credential to use.
- Carry generation settings when supported by the chosen provider and profile attributes.
- Optionally declare retrieval-specific attributes such as vector_index_name and embedding_model.
- Constrain or shape object access through supported profile mechanisms rather than leaving intent purely implicit in prompts.
Profile anti-patterns
- One profile used for unrelated workloads with very different trust boundaries.
- Retrieval profiles pointed at poorly curated chunks that mix policy, drafts, and obsolete text.
- Allowing generated SQL paths into sensitive schemas without explicit review and validation.
- Assuming a model setting can rescue bad retrieval or weak object hygiene.
```sql
-- Create a retrieval-enabled profile tying provider, credential,
-- vector index, and embedding model together.
BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'DOC_RAG_PROFILE',
    attributes   => '{
      "provider": "oci",
      "credential_name": "OCI_GENAI_CRED",
      "vector_index_name": "HR_POLICY_HNSW_IDX",
      "embedding_model": "cohere.embed-english-v3.0"
    }'
  );
END;
/

-- Activate the profile for the current session.
BEGIN
  DBMS_CLOUD_AI.SET_PROFILE(
    profile_name => 'DOC_RAG_PROFILE'
  );
END;
/
```

The important point is not the exact provider choice but the pattern: the profile is the reusable object that turns a session into a governed AI client. Once the session profile is set, Select AI prompt actions can reference that contract instead of embedding provider details in every query.
```sql
SELECT AI chat_rag summarize the contractor leave policy differences for EMEA teams;

SELECT AI narrate_rag explain the approval workflow for laptop refresh requests;
```
If teams ask, “Which prompt should we standardize on?” the better first question is, “Which profile boundaries should exist?” Separate profiles for SQL generation, document-grounded support, and synthetic data generation are usually healthier than one everything-profile.
Practical RAG architecture patterns inside Oracle-centric estates
RAG sounds simple: embed documents, retrieve relevant chunks, add them to the prompt. In real estates, the hard work is deciding which corpus deserves retrieval, how much context to pass, how to keep indexes aligned with source changes, and when structured SQL is better than document retrieval.
| Pattern | Best fit | Strength | Main caveat |
|---|---|---|---|
| Policy and knowledge-base assistant | HR, IT support, internal procedures, operating standards. | High value when the corpus is document-heavy and changes in manageable batches. | Requires aggressive stale-content hygiene and chunk review. |
| Mixed structured plus unstructured workflow | Cases where users need both database facts and procedural text. | Lets SQL answer the numeric part and RAG answer the document part. | Easy to confuse the two paths if the UI hides provenance. |
| Analyst acceleration with showsql then chat_rag | Exploration and guided analysis. | Keeps structured retrieval visible before moving to narrative explanation. | Needs user training so narrative output is not mistaken for audited reporting. |
| Large shared corpus with many business domains | Enterprise knowledge search across multiple owners. | Broad reach and reuse. | Most likely to produce noisy retrieval if chunking and profile segmentation are weak. |
Use SQL when the answer is fundamentally structured
If the real task is “sum bookings by region and quarter,” start with SQL-oriented actions. RAG is not a better database access path for facts already modeled relationally.
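For a question like that, the generated SQL should look like ordinary aggregation, which is exactly why it is easy to inspect. A hedged sketch of what a reasonable showsql result might resemble, assuming a hypothetical BOOKINGS table with region, booking_date, and amount columns:

```sql
-- Plausible shape for "sum bookings by region and quarter"
-- (table and column names are assumed, not a real showsql output).
SELECT region,
       TO_CHAR(booking_date, 'YYYY-"Q"Q') AS quarter,
       SUM(amount)                        AS total_bookings
FROM   bookings
GROUP  BY region, TO_CHAR(booking_date, 'YYYY-"Q"Q')
ORDER  BY region, quarter;
```

If the generated query deviates from this shape, say, by joining unexpected tables or filtering silently, that is the review signal, not the answer text.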
Use RAG when context lives in prose, not tables
If the task depends on policy documents, runbooks, narrative procedures, or exception-handling text, vector retrieval earns its keep because the relevant context is document-native.
Chunk for meaning, not for page count
Small fragments improve recall but can destroy context; giant chunks preserve context but dilute retrieval precision. Chunking should follow document semantics such as policy sections, procedures, headings, or FAQ units.
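Oracle's DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS utility supports this kind of semantics-aware splitting. A hedged sketch, where the JSON parameter values are illustrative choices rather than recommendations and the document text is a placeholder:

```sql
-- Split a document into roughly 300-word chunks with overlap,
-- breaking on sentence boundaries rather than raw character counts.
SELECT JSON_VALUE(column_value, '$.chunk_offset') AS chunk_offset,
       JSON_VALUE(column_value, '$.chunk_data')   AS chunk_text
FROM   TABLE(DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS(
         TO_CLOB('...full policy document text...'),
         JSON('{"by":"words", "max":"300", "overlap":"30", "split":"sentence"}')));
```

Tuning "max" and "overlap" per corpus is normal; a policy manual with long sections usually wants different values than an FAQ page.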
Keep the vector index aligned with the source corpus
A sharp vector index over obsolete text is still obsolete. Refresh discipline matters as much as index choice when readers expect current policy or operating guidance.
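Index creation itself is a single statement; keeping it in step with the corpus is the operational work. A sketch of an HNSW-style index matching the HR_POLICY_HNSW_IDX name used in the earlier profile, where the table and column names are assumptions:

```sql
-- In-memory neighbor-graph (HNSW) vector index over the chunk embeddings.
CREATE VECTOR INDEX hr_policy_hnsw_idx
  ON hr_policy_chunks (embedding)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```

The refresh discipline the text calls for lives outside this statement: re-embedding changed documents and rebuilding or maintaining the index on a schedule tied to corpus updates.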
```sql
SELECT AI showsql list the customers with the highest sales in 2024;

SELECT AI explainsql show total revenue by region and quarter;
```
The most expensive mistake is treating RAG as a universal answer engine. If the task is a relational question, use generated SQL and inspect it. If the task is a document question, make sure the document corpus, embedding model, and retrieval scope are trustworthy. When teams blur those boundaries, confidence rises faster than correctness.
Synthetic data generation: valuable, but only with clear boundaries
Oracle exposes synthetic data generation through DBMS_CLOUD_AI.GENERATE with the GENERATESYNTHETICDATA action. That makes it easier to generate new rows or datasets from prompts, but the right operational question is not “can it generate plausible data?” The right question is “for which workload is plausible synthetic data a safe and useful substitute?”
Strong use cases
- Populate test tables when production data is unavailable or inappropriate.
- Create larger demo or training datasets from smaller seeds.
- Exercise edge-case application paths by asking for skew, anomalies, or rare combinations.
- Support RAG or analytics experiments without exposing live operational records.
Unsafe assumptions
- Assuming synthetic output preserves all production correlations automatically.
- Assuming prompt-generated data is a privacy control on its own.
- Assuming generated values will satisfy every business rule without validation.
- Assuming training and evaluation results on synthetic data will transfer cleanly to production behavior.
```sql
SELECT DBMS_CLOUD_AI.GENERATE(
         prompt       => 'Generate 100 customers in CUSTOMER_DEMO. Include realistic names, addresses, and purchase patterns for a retail loyalty application.',
         action       => 'GENERATESYNTHETICDATA',
         profile_name => 'SYNTHETIC_DATA_PROFILE') AS generated_text
FROM   dual;
```

| Question | If the answer is yes | If the answer is no |
|---|---|---|
| Do you only need realistic-looking non-production data? | Synthetic generation is often a good fit. | You may need masked or subsetted production-like data instead. |
| Do downstream tests depend on strict relational and business-rule fidelity? | Add validation and post-generation checks before using the data broadly. | Prompt-only generation may be enough for demos or exploratory testing. |
| Is privacy compliance the main driver? | Use formal data protection controls and review whether masking or subsetting is the actual requirement. | Synthetic generation can remain a productivity tool. |
| Will this data be used to benchmark models or critical business rules? | Treat the output as a test fixture that needs explicit quality scoring. | The tolerance for approximation may be higher. |
Oracle Data Safe data masking and subsetting remain purpose-built controls for sanitizing sensitive data copies. Synthetic data generation can reduce exposure and speed up non-production work, but it should not be treated as an automatic replacement for formal masking policy or data minimization decisions.
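Post-generation validation can stay in plain SQL. A sketch of the kind of checks worth scripting, assuming hypothetical CUSTOMER_DEMO and ORDERS_DEMO tables with a customer_id key and a region column:

```sql
-- Referential check: synthetic orders that reference no customer.
SELECT COUNT(*) AS orphan_orders
FROM   orders_demo o
WHERE  NOT EXISTS (SELECT 1
                   FROM   customer_demo c
                   WHERE  c.customer_id = o.customer_id);

-- Distribution check: does the synthetic category mix resemble expectations?
SELECT region,
       COUNT(*)                                          AS row_count,
       ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1) AS pct_of_total
FROM   customer_demo
GROUP  BY region;
```

The thresholds belong to the workload: a demo can tolerate orphans and skew that would disqualify the same dataset for integration testing.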
Governance, evaluation, and failure-mode reasoning
The operational maturity test for Select AI is not whether a demo works. It is whether the estate can explain why a result was produced, what data could influence it, how to review the risky paths, and how to keep behavior inside a documented boundary.
Separate profiles by trust zone
Keep SQL-generation profiles, document-grounded RAG profiles, and synthetic-data profiles distinct. The users, prompts, and review expectations are usually different enough that one profile per estate is a liability.
Preserve explainability for generated behavior
Use showsql, explainsql, and reporting capabilities such as DBMS_CLOUD_AI.GENERATE_REPORT during validation cycles so prompt transformations and generated SQL can be inspected instead of trusted blindly.
Audit stale and conflicting documents
RAG quality falls quickly when the vector corpus mixes current policy with draft text, superseded runbooks, or duplicated content with slight wording differences.
Score synthetic output against intended use
For application testing, measure constraints, null patterns, category distributions, and referential quality. For analytics experiments, measure distribution drift and whether critical business edge cases still appear.
Evaluation checklist for a serious rollout
- Review generated SQL separately from answer prose; never let narrative fluency hide a poor query.
- Test RAG prompts against stale, conflicting, and ambiguous document chunks to see how the system fails.
- Track which profiles can be set by which users and which workloads each profile is allowed to serve.
- Refresh vector indexes on a documented schedule tied to corpus updates, not just to database maintenance windows.
- Validate synthetic data with the same seriousness used for ETL outputs: constraints, joins, nullability, and key business ratios.
- Keep domain experts in the evaluation loop for prompts that affect finance, HR, compliance, customer support, or operational decision-making.
| Symptom | Likely root cause | First review step |
|---|---|---|
| Answer sounds confident but cites the wrong policy rule | Irrelevant or stale chunk retrieval, or prompt ambiguity. | Check corpus freshness and whether chunk boundaries preserve the actual rule text. |
| Generated SQL runs but answers the wrong business question | Natural-language interpretation drift rather than syntax failure. | Inspect showsql output and compare it to the intended metric definition. |
| Synthetic data passes a demo but breaks integration tests | Missing relational or constraint realism. | Run constraint and join validation before the data reaches application teams. |
| Different teams get different behavior from "the same" assistant | Profiles, corpora, or prompts differ more than people realize. | Inventory profiles, retrieval settings, and corpus lineage instead of comparing answers in isolation. |
A build playbook that keeps the project grounded
A good implementation sequence prevents the team from reaching the model-integration phase before basic controls exist. The sequence below is intentionally conservative because it reflects where most enterprise friction really appears.
Classify the workload
Decide whether each user task is primarily structured retrieval, document retrieval, narrative summarization, or synthetic-data generation. Do not let a single “AI use case” label blur these distinctions.
Design profiles before prompts
Separate trust zones, provider credentials, and retrieval-enabled use cases. If needed, create different profiles for analysts, support staff, and engineering workflows.
Curate the corpus
Choose which documents belong in retrieval, remove obsolete copies, and define chunking rules that follow the meaning of the source material rather than arbitrary token counts.
Validate generated SQL early
Use showsql and explainsql before rolling out conversational paths. If SQL generation is weak or poorly scoped, the rest of the user experience will be fragile too.
Test RAG with adversarial prompts
Ask questions that target ambiguity, stale policy, conflicting terminology, and overloaded abbreviations. Measure where retrieval fails, not just where the demo succeeds.
Gate synthetic data by purpose
Approve different validation thresholds for demos, application testing, analytics simulation, and model evaluation. A single quality bar is usually too vague to be useful.
Oracle’s reporting support for Select AI is useful during rollout because it gives teams a way to review generated behavior as an artifact rather than reconstructing it from memory after the fact. Use that capability during test cycles, especially for SQL-generation use cases and during RAG tuning.
If a use case can materially affect operations, spending, support actions, or compliance interpretation, promote it only after a repeatable review run shows the profile, SQL generation path, and document retrieval path all behave inside a known boundary.
Frequently asked questions
Does RAG replace the need to model data well in relational tables?
No. RAG is strongest when the important knowledge lives in prose documents. When the answer should come from structured facts, generated SQL or standard SQL remains the more direct and auditable path.
Should every Select AI profile be retrieval-enabled?
Usually not. Retrieval adds value when the task needs document context. For pure SQL generation or explanatory use cases, a narrower profile is often easier to govern and easier to evaluate.
Can synthetic data replace masking for regulated non-production environments?
Not as a blanket rule. Synthetic data can reduce exposure and accelerate testing, but masking and subsetting remain the formal controls for many copied-data scenarios. Treat these as different tools with different assurances.
What is the first thing to inspect when a Select AI answer feels wrong?
Inspect the path, not the prose. Check whether the action should have been SQL-oriented or RAG-oriented, then review the generated SQL, profile choice, and corpus quality before debating the answer text itself.
How should teams evaluate synthetic data for machine learning experiments?
Score the data against the workload: label balance, feature ranges, null patterns, joinability, rare-event presence, and whether the generated data preserves the business phenomena the experiment is trying to learn from. Plausibility alone is not enough.