Media Understanding (Inbound) - 2026-01-17

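OpenClaw can preprocess inbound media (images/audio/video) before the reply pipeline runs. It auto-detects whether local tools or provider API keys are available, and you can disable or customize it manually. When understanding is off, the model still receives the original files/URLs as usual.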

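Goals

Optional: pre-digest inbound media into short text to speed up routing and command parsing.
Always preserve delivery of the original media to the model.
Support provider APIs and CLI fallbacks.
Allow multiple models with ordered fallback (error/size/timeout).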

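Overall behavior

Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).
For each enabled capability (image/audio/video), select attachments per policy (default: the first).
Pick the first eligible model entry (size + capability + auth).
If the model fails or the media is too large, fall back to the next entry.
On success:
- Body becomes an [Image], [Audio], or [Video] block.
- Audio sets {{Transcript}}; command parsing uses the caption text when present, otherwise the transcript.
- Captions are preserved inside the block as User text:.

If understanding fails or is disabled, the reply flow continues as usual with the original body and attachments.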
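
The selection and fallback steps above can be sketched as a simple loop. This is a minimal illustration, not OpenClaw's actual implementation: the ModelEntry and Attachment shapes, the hasAuth flag, and the understand function are hypothetical names, and the runner is synchronous only to keep the example short (a real provider or CLI call would be asynchronous and subject to timeoutSeconds).

```typescript
// Hypothetical types; field names loosely mirror the config keys in this document.
type Capability = "image" | "audio" | "video";

interface ModelEntry {
  name: string; // illustrative label, e.g. "openai/gpt-5.2"
  capabilities: Capability[];
  maxBytes: number;
  hasAuth: boolean; // stand-in for "credentials are available"
}

interface Attachment {
  path: string;
  bytes: number;
}

// A runner returns the description/transcript or throws on error or timeout.
type Runner = (entry: ModelEntry, attachment: Attachment) => string;

function understand(
  capability: Capability,
  attachment: Attachment,
  entries: ModelEntry[],
  run: Runner,
): string | undefined {
  for (const entry of entries) {
    // Eligibility: capability + auth + size.
    const eligible =
      entry.capabilities.includes(capability) &&
      entry.hasAuth &&
      attachment.bytes <= entry.maxBytes;
    if (!eligible) continue;
    try {
      return run(entry, attachment);
    } catch {
      // Model failed: fall back to the next entry.
    }
  }
  // Nothing succeeded: the caller continues with the original body and attachments.
  return undefined;
}
```

Returning undefined rather than throwing reflects the best-effort behavior described above: a failed understanding pass never blocks the reply.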

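Config overview

tools.media supports a shared model list plus per-capability overrides:

tools.media.models: shared model list (use capabilities to scope entries).
tools.media.image / tools.media.audio / tools.media.video:
- defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
- provider overrides (baseUrl, headers, providerOptions)
- Deepgram audio options via tools.media.audio.providerOptions.deepgram
- audio transcript echo controls (echoTranscript, default false; echoFormat)
- optional per-capability models list (takes precedence over the shared models)
- attachments policy (mode, maxAttachments, prefer)
- scope (optional gating by channel/chat type/session key)
tools.media.concurrency: maximum number of concurrent capability runs (default 2).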

{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
        echoTranscript: true,
        echoFormat: '📝 "{transcript}"',
      },
      video: {
        /* optional overrides */
      },
    },
  },
}

Model entries

Each models[] entry can be either a provider entry or a CLI entry:

{
  type: "provider", // default when omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, for multimodal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}

{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}

CLI templates can also use:

{{MediaDir}} (directory containing the media file)
{{OutputDir}} (temp directory created for this run)
{{OutputBase}} (base path for temp files, without extension)
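
To make the placeholder mechanism concrete, here is a minimal sketch of how {{Name}} tokens in CLI args could be expanded. The expandTemplate function is hypothetical, and leaving unknown placeholders untouched is an assumption; the document does not say how OpenClaw treats them.

```typescript
// Hypothetical helper: replaces {{Name}} tokens with values from `vars`.
// Unknown placeholders are left as-is (an assumption, not documented behavior).
function expandTemplate(arg: string, vars: Record<string, string>): string {
  return arg.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value !== undefined ? value : match;
  });
}

// Expand every element of a CLI entry's args array.
function expandArgs(args: string[], vars: Record<string, string>): string[] {
  return args.map((arg) => expandTemplate(arg, vars));
}
```

For example, expanding "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters." with MediaPath set to /tmp/in.png and MaxChars set to 500 yields "Read the media at /tmp/in.png and describe it in <= 500 characters.".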

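Defaults and limits

Recommended defaults:

maxChars: 500 for image/video (short, easy to parse for commands)
maxChars: unset for audio (full transcript unless you set a limit)
maxBytes:
- image: 10MB
- audio: 20MB
- video: 50MB

Rules:

When media exceeds maxBytes, that model is skipped and the next model is tried.
Audio files smaller than 1024 bytes are treated as empty/corrupt and skipped before provider/CLI transcription.
Model output longer than maxChars is truncated.
prompt defaults to a simple "Describe the {media}." plus maxChars guidance (image/video only).
If <capability>.enabled: true but no models are configured, OpenClaw tries the active reply model, provided its provider supports that capability.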
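
The size and length rules above can be illustrated with a short sketch. The function names and the "tooSmall" reason string are hypothetical; only the 1024-byte audio threshold, the maxBytes skip, and the maxChars truncation come from the rules. Truncation here counts UTF-16 code units, which may differ from how OpenClaw counts characters.

```typescript
type Capability = "image" | "audio" | "video";

// From the rules: audio below this size is treated as empty/corrupt.
const MIN_AUDIO_BYTES = 1024;

interface Limits {
  maxBytes: number;
  maxChars?: number; // unset means "no limit" (the audio default)
}

// Returns a reason to skip this model for this media, or undefined to proceed.
function skipReason(capability: Capability, bytes: number, limits: Limits): string | undefined {
  if (capability === "audio" && bytes < MIN_AUDIO_BYTES) return "tooSmall"; // hypothetical label
  if (bytes > limits.maxBytes) return "maxBytes";
  return undefined;
}

// Truncates model output to maxChars when a limit is set.
function clampOutput(text: string, maxChars?: number): string {
  if (maxChars === undefined || text.length <= maxChars) return text;
  return text.slice(0, maxChars);
}
```

A "maxBytes" skip means only that the current model is passed over; the next entry in the list, which may have a larger limit, still gets a chance.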

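Auto-detecting media understanding (default behavior)

If tools.media.<capability>.enabled is not set to false and you have not configured any models, OpenClaw auto-detects in the following order and stops at the first option that works:

Local CLIs (audio only; if installed)
- sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR containing encoder/decoder/joiner/tokens)
- whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
- whisper (Python CLI; downloads models automatically)
Gemini CLI (gemini), using read_many_files
Provider API keys
- Audio: OpenAI → Groq → Deepgram → Google
- Image: OpenAI → Anthropic → Google → MiniMax
- Video: Google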
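
The detection order above can be sketched as follows. This is an illustration under assumptions: the Probe interface and autoDetect function are hypothetical, the probes stand in for "binary found on PATH" and "API key configured", and treating the Gemini CLI as a candidate for every capability is inferred from the list rather than stated.

```typescript
type Capability = "image" | "audio" | "video";

type Detected =
  | { kind: "cli"; command: string }
  | { kind: "provider"; provider: string };

// Hypothetical probes; how OpenClaw actually checks these is not shown here.
interface Probe {
  isInstalled(command: string): boolean;
  hasApiKey(provider: string): boolean;
}

const LOCAL_AUDIO_CLIS = ["sherpa-onnx-offline", "whisper-cli", "whisper"];

const PROVIDER_ORDER: Record<Capability, string[]> = {
  audio: ["openai", "groq", "deepgram", "google"],
  image: ["openai", "anthropic", "google", "minimax"],
  video: ["google"],
};

function autoDetect(capability: Capability, probe: Probe): Detected | undefined {
  // 1. Local CLIs, audio only.
  if (capability === "audio") {
    for (const command of LOCAL_AUDIO_CLIS) {
      if (probe.isInstalled(command)) return { kind: "cli", command };
    }
  }
  // 2. Gemini CLI.
  if (probe.isInstalled("gemini")) return { kind: "cli", command: "gemini" };
  // 3. Provider API keys, in the documented order per capability.
  for (const provider of PROVIDER_ORDER[capability]) {
    if (probe.hasApiKey(provider)) return { kind: "provider", provider };
  }
  return undefined;
}
```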

To disable auto-detection:

{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}

Note: binary detection is best-effort on macOS/Linux/Windows. Make sure the CLI is on PATH (~ expansion is supported), or configure an explicit CLI model with a full command path.

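Proxy environment variable support (provider models)

When provider-based audio and video media understanding is enabled, OpenClaw honors the standard outbound proxy environment variables: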

HTTPS_PROXY
HTTP_PROXY
https_proxy
http_proxy

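If no proxy environment variable is set, requests go out directly. If a proxy value is malformed, OpenClaw logs a warning and falls back to a direct connection.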
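
As a rough sketch of this behavior: the resolveProxy function below is hypothetical, the precedence among the four variables simply follows the listing order (an assumption), and "malformed" is approximated as "does not parse as an http or https URL".

```typescript
// Listing order from this document; the real precedence is an assumption.
const PROXY_VARS = ["HTTPS_PROXY", "HTTP_PROXY", "https_proxy", "http_proxy"];

// Returns the proxy URL to use, or undefined for a direct connection.
function resolveProxy(
  env: Record<string, string | undefined>,
  warn: (message: string) => void,
): URL | undefined {
  for (const name of PROXY_VARS) {
    const value = env[name];
    if (!value) continue;
    try {
      const url = new URL(value);
      if (url.protocol === "http:" || url.protocol === "https:") return url;
      warn(`Ignoring ${name}: unsupported proxy scheme`);
    } catch {
      warn(`Ignoring ${name}: malformed proxy URL`);
    }
    // Malformed value: log and fall back to a direct connection.
    return undefined;
  }
  return undefined;
}
```

In Node.js this could be called as resolveProxy(process.env, console.warn).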

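Capability scoping (optional)

When capabilities is set, the entry applies only to the listed media types. For shared lists, OpenClaw infers defaults:

openai, anthropic, minimax: image
google (Gemini API): image + audio + video
groq: audio
deepgram: audio

Set capabilities explicitly on CLI entries to avoid unexpected matches. When capabilities is omitted, the entry applies to every type of the list it appears in.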
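
One way to read these rules is sketched below. The effectiveCapabilities function and the ListKind type are hypothetical, and the precedence (explicit value, then the per-capability list's own type, then the inferred provider default for shared lists, then all types) is an interpretation of the text above rather than confirmed behavior.

```typescript
type Capability = "image" | "audio" | "video";
type ListKind = "shared" | Capability; // which models list the entry sits in

interface EntryLike {
  provider?: string; // absent for CLI entries
  capabilities?: Capability[];
}

const ALL_CAPABILITIES: Capability[] = ["image", "audio", "video"];

// Inferred defaults for shared-list provider entries, from the list above.
const INFERRED: Record<string, Capability[]> = {
  openai: ["image"],
  anthropic: ["image"],
  minimax: ["image"],
  google: ["image", "audio", "video"],
  groq: ["audio"],
  deepgram: ["audio"],
};

function effectiveCapabilities(entry: EntryLike, list: ListKind): Capability[] {
  // Explicit scoping always wins.
  if (entry.capabilities && entry.capabilities.length > 0) return entry.capabilities;
  // In a per-capability list, an unscoped entry applies to that list's type.
  if (list !== "shared") return [list];
  // In the shared list, known providers get inferred defaults.
  const inferred = entry.provider !== undefined ? INFERRED[entry.provider] : undefined;
  // Anything else (for example an unscoped CLI entry) matches every type.
  return inferred ?? ALL_CAPABILITIES;
}
```

The last branch is why explicit capabilities are recommended for CLI entries: an unscoped CLI entry in the shared list would otherwise be tried for every media type.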

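Provider support matrix (OpenClaw integrations)

Capability	Provider integration	Notes
Image	OpenAI / Anthropic / Google / others via `pi-ai`	Any image-capable model in the registry works.
Audio	OpenAI, Groq, Deepgram, Google, Mistral	Provider transcription (Whisper/Deepgram/Gemini/Voxtral).
Video	Google (Gemini API)	Provider video understanding.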

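Model selection guidance

When quality and safety matter, prefer the latest, strongest model available for each capability.
For tool-enabled agents that handle untrusted input, avoid older/weaker media models.
Keep at least one fallback per capability (a high-quality model plus a faster/cheaper one) for availability.
CLI fallbacks (whisper-cli, whisper, gemini) are useful when provider APIs are unavailable.
parakeet-mlx note: with --output-dir, OpenClaw reads <output-dir>/<media-basename>.txt when the output format is txt (or unspecified); non-txt formats fall back to parsing stdout.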

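Attachment policy

The per-capability attachments setting controls which attachments are processed:

mode: first (default) or all
maxAttachments: cap on the number processed (default 1)
prefer: first, last, path, url

With mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], and so on.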
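
A minimal sketch of how such a policy might be applied. The selectAttachments and blockLabel functions are hypothetical, and the meaning of prefer is assumed here (first/last pick by position; path/url move attachments with a local path or a URL to the front), since this document does not define it. The default for prefer is likewise assumed to be first.

```typescript
type Prefer = "first" | "last" | "path" | "url";

interface AttachmentPolicy {
  mode?: "first" | "all";
  maxAttachments?: number;
  prefer?: Prefer;
}

interface Attachment {
  path?: string;
  url?: string;
}

function selectAttachments(items: Attachment[], policy: AttachmentPolicy = {}): Attachment[] {
  const { mode = "first", maxAttachments = 1, prefer = "first" } = policy;
  let ordered = [...items];
  if (prefer === "last") {
    ordered.reverse();
  } else if (prefer === "path") {
    // Assumed semantics: attachments with a local path come first, order otherwise kept.
    ordered = [...items.filter((a) => a.path !== undefined), ...items.filter((a) => a.path === undefined)];
  } else if (prefer === "url") {
    ordered = [...items.filter((a) => a.url !== undefined), ...items.filter((a) => a.url === undefined)];
  }
  const limit = mode === "first" ? 1 : Math.max(1, maxAttachments);
  return ordered.slice(0, limit);
}

// Builds the block label, e.g. [Image] for one result or [Image 1/2] for several.
function blockLabel(kind: "Image" | "Audio" | "Video", index: number, total: number): string {
  return total > 1 ? `[${kind} ${index + 1}/${total}]` : `[${kind}]`;
}
```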

Config examples

1) Shared model list + overrides

{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          capabilities: ["image", "video"],
        },
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 },
      },
      video: {
        maxChars: 500,
      },
    },
  },
}

2) Audio + video only (image off)

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

3) Optional image understanding

{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-6" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

4) Multimodal single entry (explicit capabilities)

{
  tools: {
    media: {
      image: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      audio: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      video: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
    },
  },
}

Status output

When media understanding runs, /status includes a one-line summary:

📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)

It shows the outcome for each capability and the provider/model that was chosen.
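
For illustration only, a summary line in this shape could be assembled as below. The CapabilityResult type and formatMediaStatus function are hypothetical, and only the two outcomes visible in the example (ok and skipped) are modeled; other outcomes may exist.

```typescript
interface CapabilityResult {
  capability: "image" | "audio" | "video";
  outcome: "ok" | "skipped";
  detail?: string; // provider/model on success, or the skip reason
}

function formatMediaStatus(results: CapabilityResult[]): string {
  const parts = results.map((r) =>
    r.detail !== undefined ? `${r.capability} ${r.outcome} (${r.detail})` : `${r.capability} ${r.outcome}`,
  );
  return `📎 Media: ${parts.join(" · ")}`;
}
```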

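Notes

Understanding is best-effort. Errors do not block replies.
Attachments are still passed to the model even when understanding is disabled.
Use scope to limit where understanding runs (for example, only in DMs).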