Image and Video Inclusion in AI Answers: When Visuals Get Cited

Most AEO discussion focuses on text. AI engines increasingly include images and video in answers, with their own selection logic. Here is what gets shown, what gets cited, and what to do about it.

Diagram showing how AI engines select images and videos from indexed content for inclusion in synthesized answers.

What you are missing if you only think about text

AEO discourse has been dominated by text optimization: schema, direct answers, freshness, citations of written sources. This was the right focus through 2025. It is incomplete for 2026.

The visual layer of AI answers is real and growing. AI Overviews routinely show product images, infographics, and explanatory diagrams. Perplexity's Pro Search synthesizes video content. ChatGPT can search for and reference images.

If your content has no visual presence in your category, you are absent from a growing share of citations. The fix is not adding images to every page; it is publishing visual content that engines select.

What images get included

Five characteristics consistently determine whether an image gets included in AI answers.

ImageObject schema. Images marked up with ImageObject schema, including caption, contentUrl, license, and creator, are easier for engines to attribute and reproduce. Many sites have images but no markup; the markup is the multiplier.

Original creation. Stock photography is filtered out by most engines. Original infographics, diagrams, screenshots, and product photography are weighted higher. Engines optimize for unique visual content, not generic visuals.

Subject clarity. A clean diagram with one subject, clear labels, and high resolution is more often selected than a busy collage. Visual extractability matters.

Source page citability. Images on pages that already cite well in text are more likely to also be cited as visuals. The page authority signal carries.

Alt text and surrounding context. Engines use alt text and the surrounding paragraph to understand what the image shows. Pages with vague or missing alt text underperform.
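As an illustration of the ImageObject markup described above, here is a minimal Python sketch that builds the JSON-LD and wraps it in a script tag. The property names (caption, contentUrl, license, creator) come from schema.org; the function names, the example.com URLs, and the "Example Co" creator are placeholders assumed for the example, not a prescribed implementation.

```python
import json


def image_object_jsonld(content_url, caption, creator_name, license_url):
    """Build a schema.org ImageObject as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "caption": caption,
        "license": license_url,
        # Assumption: the creator is an organization; use "Person" for an individual.
        "creator": {"@type": "Organization", "name": creator_name},
    }


def to_script_tag(data):
    """Wrap JSON-LD in the script tag that is placed in the page's HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
markup = image_object_jsonld(
    content_url="https://example.com/images/aeo-citation-pipeline-2026.svg",
    caption="How AI engines select images for synthesized answers",
    creator_name="Example Co",
    license_url="https://example.com/image-license",
)
print(to_script_tag(markup))
```

The printed tag can be pasted into the page that hosts the image. Generating it from one function keeps the four attribution fields consistent across every image you mark up.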

Original infographics as a citation lever

Most marketing teams under-invest in original infographics. When done well, an infographic produces citation density beyond what the same data would produce as text alone.

The pattern that works:

Single-topic, single-takeaway. An infographic that tries to cover too much extracts poorly. One clear data point or one clear process diagram performs better.

Self-contained. The infographic has to make sense without external context. Title, key numbers, source attribution, and your brand mark, all in the image itself.

Stable URL with descriptive filename. /images/aeo-citation-pipeline-2026.svg cites better than /uploads/IMG_2347.png. The URL signals the content.

Schema markup. ImageObject with full attribution. Often paired with the article schema of the page hosting it.

Distribution beyond your site. Submit to industry publications. Allow embedding with attribution. The infographic that appears on five external sites cites more reliably than one that lives only on yours.

A B2B SaaS publishing 4-6 substantial original infographics per year typically sees measurable image-citation lift in the 6-12 month window.
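The descriptive-filename point in the pattern above is easy to automate. The sketch below is one possible approach: it derives a lowercase, hyphen-separated filename from an image title. The function name and the default extension are assumptions made for illustration.

```python
import re


def descriptive_filename(title, extension="svg"):
    """Turn an image title into a lowercase, hyphen-separated filename."""
    # Collapse every run of non-alphanumeric characters into one hyphen,
    # then trim hyphens left at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}.{extension}"


print(descriptive_filename("AEO Citation Pipeline 2026"))
# aeo-citation-pipeline-2026.svg
```

Running this at upload time avoids camera-style names such as IMG_2347.png without relying on anyone remembering to rename files by hand.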

Video citation patterns

Video citation in AI answers is newer and more uneven. The signals that drive video selection:

Transcripts and captions. Engines mostly extract from text representations of video. A video without captions or transcript is largely invisible to most AI engines. Auto-generated captions help; human-edited transcripts help more.

VideoObject schema. Include duration, contentUrl, thumbnailUrl, uploadDate, and transcript. This markup is foundational.

Topic clarity in the title and description. Engines select videos partly based on title-to-query match. "How to set up X (5 minutes)" cites better than "Tutorial Video #4."

Hosting platform. YouTube videos cite more reliably than self-hosted because engines have better access. Vimeo and Wistia are intermediate. Self-hosted videos with proper schema are still reachable.

Recency. Older videos with stale information are heavily down-ranked, just like text. Annual updates of evergreen video content help.

For most teams, the right video AEO investment is to maintain a YouTube channel with topical educational content, ensure complete schema and transcripts, and focus on quality over volume.
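To make the VideoObject fields above concrete, here is a hedged Python sketch that assembles the JSON-LD, including the ISO 8601 duration format that schema.org expects for duration. The property names are schema.org's; the helper names, URLs, title, and transcript text are placeholders assumed for the example.

```python
import json
from datetime import date


def iso8601_duration(total_seconds):
    """Format a length in seconds as an ISO 8601 duration, e.g. PT5M30S."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if minutes:
        result += f"{minutes}M"
    if seconds or result == "PT":
        result += f"{seconds}S"
    return result


def video_object_jsonld(name, description, content_url, thumbnail_url,
                        upload_date, duration_seconds, transcript):
    """Build a schema.org VideoObject as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "contentUrl": content_url,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date.isoformat(),
        "duration": iso8601_duration(duration_seconds),
        "transcript": transcript,
    }


# Placeholder values for illustration only.
video = video_object_jsonld(
    name="How to set up X (5 minutes)",
    description="A short walkthrough of the setup steps for X.",
    content_url="https://example.com/videos/set-up-x.mp4",
    thumbnail_url="https://example.com/images/set-up-x-thumbnail.png",
    upload_date=date(2026, 1, 15),
    duration_seconds=330,
    transcript="In this video we walk through the setup of X step by step.",
)
print(json.dumps(video, indent=2))
```

Putting the full transcript in the markup, rather than only on the page, gives engines a text representation of the video in the same place as the other metadata.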

Diagrams and screenshots

A specific subset of visual citation deserves attention: diagrams and product screenshots.

Diagrams (process flows, architecture diagrams, comparison charts) cite well on technical queries. Engines reach for them when synthesizing answers about how things work. Investing in 1-2 high-quality diagrams per pillar article pays back in image-citation rate.

Product screenshots cite well on product-feature queries. "What does X look like?" gets a screenshot answer if engines can find a clean, current screenshot of your product. Outdated screenshots are worse than none: they erode credibility when the actual product looks different.

The maintenance burden is real. Both diagrams and screenshots need updating when products or processes change. Plan for this; do not just publish once and forget.

What does not work

Four patterns produce no citation lift.

Stock photography on every blog post. Filler images reduce extraction quality without adding citation potential. If the image is decorative, it is wasted effort.

Heavily branded images. Logos and brand watermarks plastered across the image reduce reproducibility. Engines often skip images that look promotional rather than informational.

Image carousels and slideshows. Engines extract poorly from interactive image components. A single static image cites better than a carousel of ten.

Image-as-text content. Walls of text rendered as images instead of HTML. Engines cannot extract text from images reliably; the content is effectively invisible.

How Citevera scores this

The audit checks image schema completeness, alt-text quality, original-vs-stock detection, and infographic presence on cited topics. The visual axis is currently a smaller component of the overall AEO score (visual citations are still a smaller share than text) but is being weighted up as visual citation continues to grow.

For customers prioritizing visual AEO, the audit recommends specific infographic and diagram opportunities aligned with their topic clusters. The recommendations are content-specific: "build a diagram for your AEO pipeline article," not generic "add more images."

Run a free Citevera audit to assess your visual citation readiness

Frequently asked questions

Should I add ImageObject schema to every image?

To every meaningful image on cited pages, yes. To decorative or template images, no. The schema investment should match the citation potential of the image.
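One rough way to separate meaningful images from decorative ones is to look at alt text. The sketch below uses Python's standard html.parser module and relies on an assumed heuristic: an image with an empty alt attribute is treated as decorative (the usual HTML convention), an image with descriptive alt text is a candidate for ImageObject markup, and an image with no alt attribute needs attention first. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class ImageAltAudit(HTMLParser):
    """Sort the images on a page into three buckets by their alt attribute."""

    def __init__(self):
        super().__init__()
        self.needs_markup = []  # descriptive alt text: candidates for schema
        self.decorative = []    # alt="": treated as decorative, skip schema
        self.missing_alt = []   # no alt attribute: fix alt text first

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        alt = attributes.get("alt")
        if alt is None:
            self.missing_alt.append(src)
        elif alt.strip() == "":
            self.decorative.append(src)
        else:
            self.needs_markup.append(src)


# Hypothetical page fragment for illustration.
page = """
<img src="/images/aeo-citation-pipeline-2026.svg" alt="AEO citation pipeline diagram">
<img src="/images/divider.png" alt="">
<img src="/uploads/IMG_2347.png">
"""

audit = ImageAltAudit()
audit.feed(page)
print("Add ImageObject schema:", audit.needs_markup)
print("Decorative, skip:", audit.decorative)
print("Missing alt text:", audit.missing_alt)
```

This is only a first pass. Alt text shows that someone described the image, not that the image has citation potential, so the final decision still needs editorial judgment.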

Does AI Overviews use the same image selection as Gemini chat?

Largely yes, with some differences in how each surface presents images. AI Overviews tends to prefer images with strong visual hierarchy and clear subjects. Gemini chat is more flexible.

Can I block AI engines from using my images while allowing text use?

Indirectly via license metadata. Setting clear restrictive license fields in ImageObject schema discourages reuse. Some engines respect this; some do not. Watermarking is more reliable but reduces selection probability.

How important is image file size?

Less than you might think for citation. Engines fetch and process images server-side. User-facing performance still matters for SEO and UX, but does not significantly affect AI image citation.

What about AI-generated images?

Mixed. Engines increasingly detect and lower-weight AI-generated images. Original photography and human-created infographics consistently outperform AI-generated alternatives for citation. Use AI generation as a starting point; finish with human craft.

How important are image dimensions and aspect ratios?

Modest impact. AI engines select images partly on extractability, which favors clear single-subject images regardless of exact dimensions. 16:9 and 4:3 ratios both work. Very narrow or very wide images extract less cleanly. Square thumbnails for video work universally.

Do AI engines cite YouTube videos at the channel level or video level?

Video level primarily, but channel authority influences selection. A video on a credentialed channel with consistent topical focus cites better than the same video on a general-interest channel. Channel investment compounds: videos benefit from the channel's accumulated authority over months and years.