Category: AI Data Grounding

Establishes Kojable as a technical authority on “Grounding” and “RAG” (Retrieval-Augmented Generation).

  • Persona-Specific Grounding: How Citation Sources Shift Across Financial Roles.

    Does AI use different citation sources for different personas? 

    Yes. In true persona-specific AI grounding, the total number of citations an AI generates is dictated by prompt complexity, but the specific domains it cites shift significantly with the assigned professional role.


    What is the Core Hypothesis Behind Persona-Specific AI Grounding?

    If an AI is truly persona-aware, it must change its underlying evidence base, not just its tone.

    Our hypothesis was simple: an AI prompted to act as a CFO should not pull data from the same websites as an AI prompted to act as an Accounts Payable Manager.

    True persona adoption requires structural shifts in citation volume and source composition.

    A mere change in vocabulary is just superficial styling; a change in the retrieval supply chain is a fundamental behavioral shift.

    Why is Persona-Specific Grounding Important?

    Understanding how persona-specific grounding alters an AI’s retrieval process fundamentally impacts how we build, optimize, and evaluate AI systems.

    • Product Teams: You can steer retrieval pipelines based on user profiles to radically improve UX.
    • Marketing Teams & SEOs: Tracking prompt intents is no longer enough; you must track who the prompt is designed for to optimize for AI visibility.
    • Evaluation Teams: QAing language model outputs requires testing the actual composition of evidence, verifying that the AI isn’t citing generic wikis for expert-level queries.
    • Governance: You must detect and mitigate retrieval bias to ensure that specific roles aren’t systematically fed lower-quality data.

    How Did We Test This? (Our Process)

    We built an end-to-end extraction and normalization workflow to rigorously test grounding behavior across 988 responses covering 12 distinct finance personas.

    First, we extracted the persona-specific AI grounding sources. Because the raw URI fields often contained generic Vertex AI redirect loops, we parsed the actual title fields and normalized them into clean root domains using tldextract.

    We then deduplicated these domains strictly within each response to prevent double-counting. Finally, we computed advanced informational metrics, transforming raw citation frequencies into Shannon entropy and Pielou’s Evenness (J) to measure true source diversity.
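    The diversity metrics in that last step are easy to sketch. The snippet below is a minimal illustration, not our production pipeline: the citation tally is hypothetical, and the root-domain normalization (done with tldextract in the study) is assumed to have already happened.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """H = -sum(p * ln p) over a persona's domain citation frequencies."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J = H / ln(S), where S is the number of distinct domains.
    J close to 1.0 means citations are spread near-uniformly across sources."""
    s = sum(1 for c in counts if c > 0)
    return 0.0 if s <= 1 else shannon_entropy(counts) / math.log(s)

# Hypothetical citation tally for one persona (domain -> times cited),
# assumed already deduplicated per response and normalized to root domains.
cfo_tally = Counter({"sec.gov": 4, "investopedia.com": 3,
                     "deloitte.com": 3, "wsj.com": 2})
j = pielou_evenness(list(cfo_tally.values()))
```

    Because J normalizes entropy by the number of distinct domains, a persona citing 4 sources can be compared fairly against one citing 40.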

    Why Did We Use Advanced Statistical Models?

    We avoided naive t-tests because they consistently generate false positives by failing to account for shared topic structures and structural confounders.

    When analyzing highly skewed, sparse count data across thousands of dimensions, basic statistics inflate significance. Because certain topics (like “fraud detection”) inherently require more citations than others, we needed models that could isolate the persona’s true marginal effect.

    • Negative Binomial GLM: We used this to properly analyze citation count data, controlling for query intent and cluster complexity to prove that volume differences were driven by the query, not the persona.
    • PERMANOVA (Bray-Curtis): We deployed this to test for actual, multi-dimensional composition differences across a massive 1,308-domain distance matrix without arbitrary cutoffs.
    • PERMDISP: We used this to verify that the domain shifts identified by PERMANOVA were driven by genuine persona-driven curation, rather than just statistical noise or varying dispersion spreads between groups.
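    For readers who want the mechanics, here is a compact sketch of the PERMANOVA step on Bray-Curtis distances. It is an illustrative reimplementation (real analyses typically use a library such as scikit-bio), and the toy domain-count vectors are invented for the demo, not drawn from our data.

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two domain-count vectors (0 = identical)."""
    return 1.0 - 2.0 * np.minimum(u, v).sum() / (u.sum() + v.sum())

def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F: between-group vs within-group sums of squared distances."""
    labels = np.asarray(labels)
    n, groups = len(labels), np.unique(labels)
    iu = np.triu_indices(n, k=1)
    ss_total = (dist[iu] ** 2).sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        sub = dist[np.ix_(idx, idx)]
        ss_within += (sub[np.triu_indices(len(idx), k=1)] ** 2).sum() / len(idx)
    a = len(groups)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, n_perm=999, seed=0):
    """Permutation p-value: shuffle group labels, count pseudo-F >= observed."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = pseudo_f(dist, labels)
    hits = sum(pseudo_f(dist, rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Toy demo: two "personas" with clearly different domain-count profiles.
counts = np.array([[10, 1, 0], [9, 2, 1], [10, 0, 1], [8, 1, 1],
                   [0, 1, 10], [1, 2, 9], [1, 0, 10], [1, 1, 8]], dtype=float)
D = np.array([[bray_curtis(u, v) for v in counts] for u in counts])
f_stat, p_val = permanova(D, ["A"] * 4 + ["B"] * 4)
```

    Because the test permutes labels rather than assuming a distribution, it stays valid for the sparse, skewed count data described above.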

    Key Findings: How Persona-Specific AI Grounding Adapts Its Evidence Base

    Our statistical suite revealed that the AI acts as a highly sophisticated routing mechanism, carefully matching domain supply to persona demand.

    1. Volume is Driven by Intent, Not Persona: The Kruskal-Wallis test initially suggested citation volume varied by persona. However, our Negative Binomial GLM ($p = 0.23$) showed this was a spurious correlation: once query intent and cluster complexity were controlled for, the persona effect disappeared. The complexity of the query dictates the amount of evidence, not the persona.
    2. Source Composition is Highly Persona-Dependent: Our PERMANOVA ($F = 1.31$, $p = 0.01$) definitively proved that the specific domains cited change based on the persona. The AI intelligently curates distinct informational diets for different roles.
    3. Cross-Persona Overlap is Shockingly Low: The Bray-Curtis similarity matrix revealed a mean off-diagonal overlap of just 14%. An AI acting as a Treasury Manager relies on a fundamentally distinct network of domains compared to an Internal Auditor.
    4. Source Diversity is Near-Perfect: Pielou’s Evenness scores consistently ranged between 0.96 and 0.99. The persona-specific AI grounding aggressively resists source monopolization, ensuring that no single persona becomes overly reliant on a single dominant domain.
    5. Algorithmic Clustering Validates Logic: When we mapped persona source similarities via hierarchical clustering, related roles like AP Manager and AR Manager organically grouped together. The math alone correctly mapped the latent business relationships.
    Figure: Citation volume varies by persona, while source evenness remains consistently high (near-uniform source spread per persona).
    Figure: Heatmap showing weak cross-persona overlap and clear structure in which personas share similar source profiles.
    Figure: Bubble size and color reflect citation frequency, revealing which domains dominate within each persona’s top source set.

    Key Terms (Glossary)

    • Ablation: Systematically removing components from an input (e.g., stripping the persona from a prompt) to isolate and measure that component’s true effect.
    • Negative Binomial GLM: A generalized linear model specifically designed to handle overdispersed count data (like citation volume), controlling for confounding variables to prevent false positives.
    • PERMANOVA: Permutational Multivariate Analysis of Variance; a non-parametric test used to assess whether different groups have significantly different compositions across a complex, high-dimensional space.
    • Bray-Curtis Similarity: A statistic used to quantify the compositional similarity between two different sites (or in our case, personas) based on counts across intersecting data points.
    • Pielou’s J (Evenness): A metric derived from Shannon entropy that measures how evenly distributed frequencies are, normalizing for sample size to allow fair comparisons between datasets of different sizes.

    Frequently Asked Questions (FAQ)

    Does prompting an AI with a specific persona make its answers longer?
    Not inherently. Our data shows that while certain personas appear to generate more citations or text, this is actually driven by the complexity of the underlying query topic, not the persona itself.

    How do we know the AI isn’t just pulling from the exact same sources every time?
    Our analysis using Pielou’s Evenness metrics proves the AI relies on a highly fragmented, ultra-diverse data supply. Across all personas, the AI effectively avoids monopolization by pulling from over 1,300 distinct root domains.

    Will optimizing for one persona hurt my visibility for another?
    Yes, it is highly likely. Because the AI demonstrates only ~14% source overlap across different B2B roles, ranking for an “FP&A Lead” prompt means you are competing in a largely distinct domain pool than an “AR Manager” prompt.

  • AEO Breakthrough: Track 85% Fewer Prompts Without Losing Visibility

    TL;DR

    We generated 180 finance-domain prompts across 3 topic clusters. We ran them through Google Gemini with live Google Search grounding. Then we measured how similar the AI’s responses and search queries were.

    The results were striking:

    • Similar prompts produce near-identical responses. r = 0.878. Bootstrap confidence interval confirms significance.
    • Similar prompts trigger similar grounding searches. r = 0.869. Mantel permutation test p is less than 0.001.
    • The implication: Companies can reduce AEO monitoring costs by approximately 85% by tracking seed prompts instead of every variation.

    The Problem: AEO Is Expensive

    Answer Engine Optimization is becoming critical for B2B companies. But it has a scaling problem.

    Unlike traditional SEO, where you optimize pages and track rankings for a defined keyword set, AEO requires monitoring how AI systems respond to natural-language prompts. And those prompts are infinite.

    “What is the best cash flow software for B2B SaaS?”

    “Top cash flow tools for mid-market companies”

    “How does cash flow forecasting work for fintech lenders?”

    “Cash flow management platforms with NetSuite integration”

    Each could trigger different AI responses, different grounding searches, and different brand mentions. Track them all individually and costs scale linearly. For a company monitoring 500+ prompts across multiple AI platforms, this becomes unsustainable.

    The question we set out to answer: Can you track one prompt and confidently infer what the AI would say for dozens of similar prompts?

    Research Design

    Two hypotheses:

    1. AI Output Similarity. Do semantically similar prompts produce semantically similar AI responses?
    2. Fan-Out Query Similarity. Do similar prompts trigger similar grounding searches?

    If both are true, companies can consolidate prompts into clusters and monitor only representative seed prompts, dramatically reducing cost and workload.

    Methodology

    We designed a controlled experiment with three distinct topic clusters in B2B finance:

    Cash Flow. Base queries on free cash flow and cash flow forecasting. Example: “free cash flow explained for B2B SaaS”

    Payment Processing. Base queries on B2B payment automation and cross-border payments. Example: “best cross-border payments tools with Stripe”

    Fraud Detection. Base queries on transaction fraud detection and AML compliance. Example: “how AML compliance works for a compliance officer”

    Each cluster contained 60 prompts. 180 total. Generated from 60 templates that varied across 7 context dimensions drawn from real B2B finance scenarios:

    • Personas: CFO, FP&A lead, treasury manager, AR manager, controller
    • Industries: B2B SaaS, fintech lender, payments platform, credit unions
    • Geographies: Ireland, US, UK
    • Integrations: NetSuite, Xero, SAP, Stripe, QuickBooks, Sage, HubSpot
    • Company sizes: SMB, mid-market
    • Time periods: daily, weekly, monthly, quarterly
    • Metrics: runway, DSO, DPO, burn rate, working capital, net revenue retention
    Figure: Prompt Similarity Distribution, showing that within-cluster prompts (blue) are more similar than cross-cluster prompts (red), with minimal overlap.

    Prompts ranged from 6 to 20 words. Mixed styles including questions, commands, fragments, and phrases to simulate realistic user behavior.

    Figure: Bootstrap Distribution of Within–Cross Cluster Difference (Prompt-Level). The entire confidence interval sits above zero, confirming robust separability.

    Measurement

    All 180 prompts went to Google Gemini 3.0 Flash with grounding enabled. For each prompt we captured:

    • The AI’s full text response
    • The grounding search queries the AI generated
    • The grounding source URLs and titles

    We computed semantic similarity using Gemini Embedding-001. Not TF-IDF. This captures meaning, not just word overlap. TF-IDF would score “money” and “capital” as zero percent similar. Embeddings correctly identify them as semantically close.

    All similarity scores used cosine similarity on L2-normalized embedding vectors.
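    That computation is simple once vectors are unit length: the cosine similarity reduces to a dot product. A minimal sketch (the vectors here are placeholders, not real Embedding-001 output):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so the dot product equals the cosine."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of any scale."""
    return float(l2_normalize(a) @ l2_normalize(b))
```

    Normalizing first means two embeddings pointing in the same direction score 1.0 regardless of magnitude, which is why scale differences between responses do not distort the comparison.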

    Results

    Case Study 1: AI Output Similarity

    Do similar prompts produce similar responses?

    Yes. With extremely strong evidence.

    The Pearson correlation between prompt similarity and response similarity was r = 0.878. This means 77% of the variance in response similarity is explained by prompt similarity alone.

    To put this in context:

    • r = 0.3 would be interesting but weak
    • r = 0.5 would be moderate, worth investigating
    • r = 0.878 is a near-perfect linear relationship

    Control Group Validation

    We verified our measurement using within-cluster versus cross-cluster comparisons:

    • Within-cluster response similarity, same topic: 0.664
    • Cross-cluster response similarity, different topics: 0.569
    • Cohen’s d: 1.27, classified as very large effect

    The AI clearly distinguished between topics. Cash flow prompts produced cash flow answers. Fraud prompts produced fraud answers. This confirms our embeddings capture real semantic differences, not noise.

    Figure: Case Study 1 – AI Output Similarity. Left: prompt similarity vs response similarity (r = 0.878). Middle: distribution of all response similarities. Right: within-cluster responses are more similar than cross-cluster responses (difference +0.066, 95% CI [0.066, 0.077]).

    Addressing Statistical Rigor

    A naive t-test on 16,110 pairs would report t = 77.7, p approximately 0. But this is pseudoreplication. Each prompt participates in 179 pairs, violating the independence assumption.

    We addressed this with a stratified prompt-level bootstrap. Two thousand iterations. Resampling prompts within each cluster to maintain balance and respect the dependence structure:

    • Observed difference (within minus cross): +0.066
    • 95% Bootstrap CI: [0.064, 0.078]
    • Interpretation: The CI does not include 0. The effect is robust to prompt-level dependence.
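    The bootstrap procedure can be sketched as follows. This is an illustrative version; the similarity matrix and cluster labels are toy stand-ins for our 180-prompt data, and edge cases (such as duplicated prompts pairing with themselves) are handled only loosely.

```python
import numpy as np

def within_cross_diff(sim, clusters, idx):
    """Mean within-cluster similarity minus mean cross-cluster similarity
    over the (possibly resampled) prompt indices in idx."""
    sub = sim[np.ix_(idx, idx)]
    labs = clusters[idx]
    same = labs[:, None] == labs[None, :]
    off_diag = ~np.eye(len(idx), dtype=bool)  # drop self-pairs
    return sub[same & off_diag].mean() - sub[~same].mean()

def stratified_bootstrap(sim, clusters, n_iter=2000, seed=0):
    """Resample prompts with replacement within each cluster, so cluster
    balance and the pairwise dependence structure are respected; return
    the observed within-minus-cross difference and a 95% percentile CI."""
    rng = np.random.default_rng(seed)
    clusters = np.asarray(clusters)
    observed = within_cross_diff(sim, clusters, np.arange(len(clusters)))
    diffs = []
    for _ in range(n_iter):
        idx = np.concatenate([
            rng.choice(np.where(clusters == c)[0], size=(clusters == c).sum())
            for c in np.unique(clusters)
        ])
        diffs.append(within_cross_diff(sim, clusters, idx))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return observed, (lo, hi)
```

    Resampling whole prompts, rather than individual pairs, is what prevents the pseudoreplication problem: a prompt that enters the resample brings all of its pairs with it.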

    Case Study 2: Fan-Out Query Similarity

    Do similar prompts trigger similar grounding searches?

    Yes. Also with strong evidence.

    The 180 prompts triggered 1,620 unique grounding searches. Approximately 9 per prompt. The correlation between prompt similarity and query-set similarity was r = 0.869.

    Figure: Fan-Out Query Similarity. Left: prompt similarity vs query similarity (r = 0.869). Right: distribution of query similarities across all prompt pairs.

    We used a symmetric best-match average to handle variable fan-out sizes. Some prompts triggered 5 searches, others 15. This prevents larger query sets from mechanically appearing more similar due to size alone.
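    A minimal sketch of that symmetric best-match average, assuming each prompt’s grounding queries have already been embedded as L2-normalized row vectors:

```python
import numpy as np

def best_match_similarity(A, B):
    """Symmetric best-match average between two query-embedding sets.
    A: (n, d) and B: (m, d), rows L2-normalized. Each query is paired
    with its most similar counterpart in the other set, and the two
    directional averages are then averaged, so a larger fan-out cannot
    mechanically inflate the score."""
    S = A @ B.T  # pairwise cosine similarities on unit vectors
    return 0.5 * (S.max(axis=1).mean() + S.max(axis=0).mean())
```

    Averaging both directions matters: a 15-query set that fully covers a 5-query set still gets penalized for the 10 queries the smaller set cannot match.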

    Figure: Within vs Cross-Cluster Query Similarity. Within-cluster queries are substantially more similar (0.655) than cross-cluster queries (0.580), with a large effect size (Cohen’s d = 1.42).

    Statistical significance was confirmed via a Mantel permutation test. Two thousand permutations. This accounts for the matrix dependence structure. The empirical p-value was less than 0.001. Zero out of 2,000 random permutations matched or exceeded the observed correlation.
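    A bare-bones version of the Mantel test looks like the sketch below. It is illustrative only; the distance matrices here are placeholders, and production code would typically use an established implementation such as scikit-bio’s.

```python
import numpy as np

def mantel(D1, D2, n_perm=2000, seed=0):
    """Mantel permutation test between two square distance matrices.
    Correlates their upper-triangle entries, then jointly permutes the
    rows and columns of D2 to build a null distribution that respects
    the dependence structure of the matrices."""
    iu = np.triu_indices(len(D1), k=1)
    observed = np.corrcoef(D1[iu], D2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(len(D2))
        if np.corrcoef(D1[iu], D2[np.ix_(p, p)][iu])[0, 1] >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

    Permuting whole rows and columns together, rather than individual cells, is what makes the test valid for distance matrices: every permuted matrix is still a coherent set of pairwise distances.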

    Grounding Source Analysis

    We examined the titles of grounding sources across clusters:

    • Over 80% of source titles were unique to a single topic cluster
    • Cash flow prompts cited cash flow-specific resources. Fraud prompts cited fraud-specific resources
    • Only generic finance portals like Investopedia appeared across multiple clusters
    Figure: Top 20 Grounding Source Titles. YouTube dominates, followed by Reddit and topic-specific vendor/reference sites.

    This high specificity means the AI is not lazily citing the same sources for everything. It’s performing targeted, topic-aware retrieval.

    What This Means for AEO Strategy

    1. Prompt Consolidation: Track Seeds, Not Everything

    The core finding, r = 0.878, means you can group prompts by semantic similarity and track only one seed prompt per group.

    Before consolidation: Track 500 prompts. 500 API calls per day. High cost.

    After consolidation: Cluster prompts using a cosine similarity threshold of 0.75. Track approximately 50 to 75 seed prompts. An 85% cost reduction.

    The seed prompt’s response can be confidently extrapolated to the entire cluster.
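    One simple way to build such clusters is a greedy pass over prompt embeddings. This sketch is one possible implementation, not the exact procedure we used; the 0.75 threshold is the heuristic cutoff mentioned above, and the embeddings are assumed to be L2-normalized.

```python
import numpy as np

def consolidate_prompts(embeddings, threshold=0.75):
    """Greedy one-pass consolidation: each prompt joins the first existing
    seed whose cosine similarity meets the threshold; otherwise it becomes
    a new seed. Expects L2-normalized embeddings, shape (n_prompts, dim).
    Returns (seed_indices, cluster_assignment)."""
    seeds, assignment = [], []
    for i, vec in enumerate(embeddings):
        for k, s in enumerate(seeds):
            if float(vec @ embeddings[s]) >= threshold:
                assignment.append(k)
                break
        else:  # no seed was similar enough: this prompt starts a new cluster
            seeds.append(i)
            assignment.append(len(seeds) - 1)
    return seeds, assignment
```

    Only the prompts in seed_indices need to be sent to the AI each day; every other prompt inherits its seed’s observed response.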

    2. Brand Mention Extrapolation

    If your brand appears or doesn’t appear in the response to a seed prompt, you can infer the same for all prompts in that cluster. Response similarity of 0.70 within a cluster means the structure, content, and likely brand ordering are preserved across variations.

    3. Fan-Out Query Coverage

    Instead of optimizing content for every possible grounding query, focus on the top 10 to 15 grounding queries per topic cluster. Since similar prompts trigger overlapping searches, addressing one prompt’s grounding queries provides coverage for the entire cluster.

    The math: 180 prompts generated 1,620 queries. But within a cluster, the top 15 queries cover the vast majority of search behavior. Optimizing for 45 queries (15 per cluster across 3 clusters) is far more efficient than optimizing for 1,620.

    4. Content Architecture

    The source title specificity, over 80% unique per cluster, tells you that generic catch-all content pages won’t work for AEO. The AI prefers topic-specific, authoritative content.

    Don’t: Write one giant “Complete Guide to B2B Finance”

    Do: Write dedicated pillar pages. “Cash Flow Forecasting for B2B SaaS”. “Cross-Border Payment Automation Guide”. “AML Compliance Checklist for Fintech”. Each pillar page should target the top grounding queries for its cluster.

    Limitations and Future Work

    What we didn’t test:

    1. Brand mention rank correlation. We measured overall response similarity but didn’t extract and compare the specific order in which brands are mentioned. A follow-up using Kendall’s tau on brand rankings would strengthen the consolidation argument.
    2. Temporal stability. Our data represents a single point in time. Running the same seeds weekly for 4 to 8 weeks would confirm whether the r = 0.878 relationship holds as the AI model updates.
    3. Cross-model consistency. This study used Google Gemini. Testing with ChatGPT with Bing grounding, Perplexity, and Claude would determine whether consolidation strategies transfer across AI platforms.
    4. Domain breadth. All prompts were in B2B finance. The consolidation ratio may differ for other verticals like healthcare, legal, or e-commerce.

    Methodological Notes

    • All statistical significance tests used dependence-aware methods. Prompt-level bootstrap and Mantel permutation test rather than naive pairwise tests.
    • Similarity was measured via neural embeddings, Gemini Embedding-001, not bag-of-words approaches.
    • Query-set similarity used symmetric best-match averaging to normalize for variable fan-out sizes.

    Conclusion

    This study provides strong, statistically robust evidence that similar prompts produce similar AI responses and trigger similar grounding searches. The practical implication is clear: AEO does not require tracking every conceivable prompt variation.

    By clustering prompts semantically and monitoring representative seeds, companies can achieve comprehensive AEO coverage at a fraction of the cost. The data suggests an 85% reduction in monitoring workload is achievable without sacrificing insight quality.

    For AEO practitioners, the message is simple: Work smarter, not harder. One prompt can represent many.


    This research was conducted by Kojable as part of our ongoing work in Answer Engine Optimization. The full methodology, code, and data are available on request.

    Tools used: Google Gemini 3.0 Flash with grounding, Gemini Embedding-001, Python with NumPy, SciPy, Plotly, and scikit-learn.