
Multimodal Data Flow

A TuyaOpen AI device is multimodal: it takes in speech, text, camera images, and device/sensor data, and responds with speech, on-screen text, and actions. This page categorizes those four modalities and shows how each one travels between the hardware (mic, speaker, camera, display, sensors), the on-device software (ai_components), and the Tuya AI cloud.

On-device AI pipeline: hardware peripherals connect through the ai_components software layer and ai_agent to the Tuya AI cloud

The three-layer path

Every modality follows the same path, and ai_agent is the single bridge to the cloud. Each modality module owns one class of peripheral and hands its data to, or receives it from, the agent.

You enable only the modalities your product needs in Kconfig (ENABLE_COMP_AI_*); disabled modules are not compiled in.
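
As a rough illustration of what "not compiled in" means, the sketch below guards modality code with preprocessor checks. The symbol names ENABLE_COMP_AI_AUDIO and ENABLE_COMP_AI_VIDEO are hypothetical examples of the ENABLE_COMP_AI_* pattern, and modality_count is an illustrative helper, not a TuyaOpen function. In a real project the symbols would come from the generated Kconfig header rather than from #define lines in source.

```c
/* Hypothetical symbol following the ENABLE_COMP_AI_* pattern. Assumed to be
 * defined as 1 when the option is enabled, as Kconfig headers commonly do. */
#define ENABLE_COMP_AI_AUDIO 1
/* ENABLE_COMP_AI_VIDEO (also hypothetical) is deliberately left undefined,
 * so the preprocessor drops the video branch below. */

/* Counts the modalities that survived preprocessing. */
int modality_count(void)
{
    int n = 0;
#if defined(ENABLE_COMP_AI_AUDIO) && (ENABLE_COMP_AI_AUDIO == 1)
    n++; /* audio code would be compiled here */
#endif
#if defined(ENABLE_COMP_AI_VIDEO) && (ENABLE_COMP_AI_VIDEO == 1)
    n++; /* video code would be compiled here */
#endif
    return n;
}
```

Because the disabled branch never reaches the compiler, it costs no flash or RAM on the device.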

1. Audio: voice in, voice out

The core modality for a voice assistant.

  • In: microphone → ai_audio_input → Voice Activity Detection (VAD). VAD, either manual (a button press) or automatic (voice detection), slices the speech, which ai_agent uploads to cloud ASR. Wake-word listening is driven by the Wakeup chat mode.
  • Out: cloud TTS and music → ai_audio_player → decode and resample → speaker.
  • Hardware: microphone, speaker, and a button for press-to-talk.
  • Components: Audio input, Audio player.
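
To make the "slices the speech" step concrete, here is a minimal energy-threshold sketch of automatic VAD. It is an illustration of the general technique only, not the detector used by ai_audio_input. The names speech_span_t, frame_energy, and vad_slice are hypothetical, as are the frame length and threshold parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of scanning one buffer of 16-bit PCM samples. */
typedef struct {
    size_t start; /* index of the first sample of the first voiced frame */
    size_t end;   /* one past the last sample of the last voiced frame */
    int    found; /* 1 if at least one voiced frame was seen */
} speech_span_t;

/* Mean absolute amplitude of one frame (0 for an empty frame). */
uint32_t frame_energy(const int16_t *pcm, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = pcm[i];
        sum += (uint32_t)(s < 0 ? -s : s);
    }
    return n ? (uint32_t)(sum / n) : 0;
}

/* Marks the span from the first to the last frame whose energy reaches the
 * threshold. Trailing samples that do not fill a whole frame are ignored. */
speech_span_t vad_slice(const int16_t *pcm, size_t total,
                        size_t frame_len, uint32_t threshold)
{
    speech_span_t span = {0, 0, 0};
    if (frame_len == 0) {
        return span;
    }
    for (size_t off = 0; off + frame_len <= total; off += frame_len) {
        if (frame_energy(pcm + off, frame_len) >= threshold) {
            if (!span.found) {
                span.start = off;
                span.found = 1;
            }
            span.end = off + frame_len;
        }
    }
    return span;
}
```

Manual VAD replaces the energy test with the button state: the span starts on press and ends on release.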

2. Vision: images in, preview out

  • In: camera → ai_video_input captures a JPEG frame (ai_video_get_jpeg_frame) → ai_agent_send_image uploads it to cloud vision for visual Q&A or image understanding.
  • Out / preview: live camera frames render locally through the video display callback; cloud-pushed images stream in through ai_picture.
  • Hardware: camera, display.
  • Components: Video input.
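
Between capture and upload, an application may want to reject truncated frames. The sketch below checks that a buffer begins with the JPEG start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9). The helper name jpeg_frame_looks_complete is hypothetical and not part of ai_video_input. The signatures of ai_video_get_jpeg_frame and ai_agent_send_image are not shown on this page, so they are deliberately not called here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer starts with SOI (FF D8) and ends with EOI (FF D9),
 * otherwise 0. This is a cheap sanity check, not a full JPEG parser. */
int jpeg_frame_looks_complete(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 4) {
        return 0;
    }
    int has_soi = (buf[0] == 0xFF && buf[1] == 0xD8);
    int has_eoi = (buf[len - 2] == 0xFF && buf[len - 1] == 0xD9);
    return has_soi && has_eoi;
}
```

Some camera drivers pad the buffer after the end-of-image marker. In that case the check would need to scan backward for FF D9 instead of testing only the last two bytes.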

3. Text: typed or recognized in, rendered out

  • In: ai_agent_send_text sends a string directly; spoken input also comes back as ASR text.
  • Out: the NLG reply streams back token by token, and ai_ui renders it in your chosen style (WeChat-style bubbles, chatbot, or OLED).
  • Hardware: display (and serial, for the serial chatbot demo).
  • Components: AI Agent, UI management.
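
Because the reply arrives in pieces, a display layer typically accumulates chunks before redrawing. The following is a minimal sketch of such an accumulator, assuming chunks arrive as NUL-terminated strings. reply_buf_t, reply_reset, reply_append, and REPLY_BUF_SIZE are hypothetical names and do not reflect how ai_ui is implemented.

```c
#include <stddef.h>
#include <string.h>

#define REPLY_BUF_SIZE 256 /* illustrative capacity, including the NUL */

typedef struct {
    char   text[REPLY_BUF_SIZE];
    size_t len; /* bytes stored, excluding the terminating NUL */
} reply_buf_t;

/* Clears the buffer at the start of a new reply. */
void reply_reset(reply_buf_t *r)
{
    r->len = 0;
    r->text[0] = '\0';
}

/* Appends one streamed chunk and returns the number of bytes stored.
 * A chunk that does not fit is truncated to the remaining room. */
size_t reply_append(reply_buf_t *r, const char *chunk)
{
    size_t room = REPLY_BUF_SIZE - 1 - r->len;
    size_t n = strlen(chunk);
    if (n > room) {
        n = room;
    }
    memcpy(r->text + r->len, chunk, n);
    r->len += n;
    r->text[r->len] = '\0';
    return n;
}
```

Truncation here is byte-based, so it can split a multi-byte UTF-8 character. A production version would cut at a character boundary.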

4. Sensory / device data: state in, actions out

This is how the cloud AI perceives and controls the physical device.

  • The AI reads device state and triggers actions through MCP tools exposed by ai_mcp: query device information, switch chat mode, take a photo, and adjust volume, plus any custom tool you register for your own sensors and actuators.
  • Arbitrary byte payloads can also be uploaded with ai_agent_send_file.
  • Hardware: sensors, actuators, and GPIO, all reached through your MCP tool implementations.
  • Components: MCP server, MCP tools.
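
The idea of a named tool that the cloud can invoke can be pictured as a lookup table from tool name to handler. The sketch below is a generic dispatch pattern under that assumption. It does not use the ai_mcp registration API, whose signatures are not shown on this page, and every name in it (tool_entry_t, tool_dispatch, set_volume, get_volume) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler shape: one integer argument in, one integer result out. */
typedef int (*tool_handler_t)(int arg);

typedef struct {
    const char    *name;
    tool_handler_t handler;
} tool_entry_t;

static int g_volume = 50; /* stand-in for real device state */

/* Action tool: clamps the requested volume to 0..100 and stores it. */
int tool_set_volume(int arg)
{
    if (arg < 0) arg = 0;
    if (arg > 100) arg = 100;
    g_volume = arg;
    return g_volume;
}

/* Read tool: reports current state; the argument is unused. */
int tool_get_volume(int arg)
{
    (void)arg;
    return g_volume;
}

static const tool_entry_t g_tools[] = {
    {"set_volume", tool_set_volume},
    {"get_volume", tool_get_volume},
};

/* Looks up a tool by name and runs it. Returns 0 on success and writes the
 * handler's return value to *result, or returns -1 if the name is unknown. */
int tool_dispatch(const char *name, int arg, int *result)
{
    for (size_t i = 0; i < sizeof g_tools / sizeof g_tools[0]; i++) {
        if (strcmp(g_tools[i].name, name) == 0) {
            *result = g_tools[i].handler(arg);
            return 0;
        }
    }
    return -1;
}
```

Clamping inside the handler keeps the device safe even when the model asks for an out-of-range value.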

Where each modality is handled

| Modality | In (hardware → cloud) | Out (cloud → hardware) | Components |
| --- | --- | --- | --- |
| Audio | mic → VAD → agent | TTS / music → player → speaker | ai_audio_input, ai_audio_player |
| Vision | camera → JPEG → agent | preview / pushed image → display | ai_video_input, ai_picture |
| Text | send_text / ASR | NLG stream → UI → display | ai_agent, ai_ui |
| Sensory / device | MCP tool reads, send_file | MCP tool actions | ai_mcp |
Note: All four modalities share one cloud session through ai_agent. The chat modes decide when the device listens and uploads; the agent decides how data reaches the cloud; the cloud decides what comes back.

See also