Customer Support

Multimodal customer troubleshooting in one session

A single multimodal model handles voice, text, images, and camera input together so customers can describe issues, show devices, and share screenshots in one continuous support conversation.

Why the human is still essential here

Human agents remain essential for complex diagnoses, sensitive customer situations, and escalations where judgment, empathy, or deeper product expertise is required.

How people use this

Voice-and-camera troubleshooting

Customers speak naturally while showing the device or screen on camera, and the AI gives step-by-step fixes in the same live session.

Gemini Live API
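Under the hood, a Live session begins with a single setup message that declares the model, the response modality, and the support persona; voice and camera frames then stream over the same connection. A minimal stdlib sketch of that payload, assuming the shape of the Live API's WebSocket setup message and a hypothetical model name (the real SDK wraps this in `client.aio.live.connect`):

```python
import json

def make_live_setup(model: str, system_prompt: str) -> dict:
    # Assumed shape of the Gemini Live API "setup" message; field names
    # follow the bidirectional streaming protocol but are trimmed here.
    return {
        "setup": {
            "model": f"models/{model}",
            "generation_config": {"response_modalities": ["AUDIO"]},
            "system_instruction": {"parts": [{"text": system_prompt}]},
        }
    }

setup = make_live_setup(
    "gemini-2.0-flash-live-001",  # assumed Live-capable model name
    "You are a device-support agent. Ask the customer to show the device "
    "on camera and give one troubleshooting step at a time.",
)
print(json.dumps(setup, indent=2))
```

Because the persona and modality travel in one setup message, the same session can accept spoken questions and camera frames without re-establishing context.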

Realtime voice support bot

A support application keeps one continuous conversation across speech, typed follow-ups, and uploaded images so customers do not have to repeat context.

OpenAI Realtime API
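Keeping one continuous conversation across speech and typed follow-ups usually means enabling both modalities on the session itself, so every turn shares the same context. A minimal sketch of a Realtime API `session.update` client event; the field set here is a trimmed assumption, not the full schema:

```python
import json

def make_session_update(instructions: str) -> dict:
    # "session.update" is a Realtime API client event type; the session
    # fields below are a trimmed, illustrative subset of the schema.
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],  # one session, both channels
            "instructions": instructions,
            "input_audio_transcription": {"model": "whisper-1"},
        },
    }

event = make_session_update(
    "You are a support agent. Carry context across voice and text turns "
    "so the customer never has to repeat themselves."
)
print(json.dumps(event, indent=2))
```

Sending this once over the WebSocket configures the session; later audio, text, and image turns all land in the same conversation state.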

Omnichannel multimodal service console

A service team handles customer voice, chat, screenshots, and image evidence inside one support workspace backed by multimodal AI assistance.

Salesforce Service Cloud / Agentforce


Community stories (1)

Medium · 9 min read

How I Built a Multimodal CX Agent with Just an SOP and Gemini Live API

I wanted to test a simple idea: what if you architected an AI support agent the same way? Give it a training manual instead of a workflow tree. Give it Google Search instead of a RAG pipeline. And use a single multimodal model so you don’t need separate systems for voice, text, and vision.

I built Cortado for the Gemini Live Agent Challenge to explore what that looks like in practice.
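The "manual plus search" architecture the story describes can be sketched as a single request body: the SOP rides along as the system instruction, and a built-in Google Search tool stands in for a retrieval pipeline. A minimal stdlib sketch assuming the Gemini API's `google_search` tool field; the names and SOP text here are illustrative, not Cortado's actual code:

```python
import json

def make_agent_request(sop_text: str, user_turn: str) -> dict:
    # Training manual instead of a workflow tree: the whole SOP becomes
    # the system instruction. Search tool instead of RAG: grounding is
    # delegated to the model's built-in Google Search tool.
    return {
        "system_instruction": {"parts": [{"text": sop_text}]},
        "tools": [{"google_search": {}}],  # assumed Gemini tool field name
        "contents": [{"role": "user", "parts": [{"text": user_turn}]}],
    }

request = make_agent_request(
    sop_text="Step 1: greet the customer. Step 2: identify the device.",
    user_turn="My espresso machine won't heat up.",
)
print(json.dumps(request, indent=2))
```

The appeal of this shape is that updating the agent means editing the SOP text, not retraining a workflow tree or rebuilding an index.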


...

Vasundra Srinivasan, AI Architecture and Data Strategy
Mar 14, 2026