Voice Messages (STT/TTS)

xopcbot supports voice message processing via Telegram:

STT (Speech-to-Text): Convert voice to text
TTS (Text-to-Speech): Convert text to voice

Overview

When a user sends a voice message via Telegram:

STT converts voice to text
Agent processes the text content
TTS converts reply to voice (optional)
Sends both text and voice reply

Quick Start

Add to ~/.xopcbot/config.json:

json

{
  "stt": {
    "enabled": true,
    "provider": "alibaba",
    "alibaba": {
      "apiKey": "your-dashscope-api-key"
    }
  },
  "tts": {
    "enabled": true,
    "provider": "openai",
    "trigger": "auto",
    "openai": {
      "apiKey": "your-openai-api-key"
    }
  }
}

STT Configuration

Alibaba Paraformer (Recommended for Chinese)

json

{
  "stt": {
    "enabled": true,
    "provider": "alibaba",
    "alibaba": {
      "apiKey": "your-dashscope-api-key",
      "model": "paraformer-v1"
    }
  }
}

Supported Models:

paraformer-v1: Chinese/English, 16kHz+ audio
paraformer-8k-v1: Phone recordings, 8kHz
paraformer-mtl-v1: Multi-language support

OpenAI Whisper

json

{
  "stt": {
    "enabled": true,
    "provider": "openai",
    "openai": {
      "apiKey": "your-openai-api-key",
      "model": "whisper-1"
    }
  }
}

Model:

whisper-1: Supports 99+ languages

Fallback Configuration

json

{
  "stt": {
    "enabled": true,
    "provider": "alibaba",
    "fallback": {
      "enabled": true,
      "order": ["alibaba", "openai"]
    }
  }
}

If primary provider fails, automatically tries fallback providers in order.

TTS Configuration

Trigger Modes

Mode	Description
`auto`	User sends voice → Agent replies with voice
`never`	Disable TTS, text only

OpenAI TTS

json

{
  "tts": {
    "enabled": true,
    "provider": "openai",
    "trigger": "auto",
    "openai": {
      "apiKey": "your-openai-api-key",
      "model": "tts-1",
      "voice": "alloy"
    }
  }
}

Voices: alloy, echo, fable, onyx, nova, shimmer

Models:

tts-1: Standard quality, faster
tts-1-hd: High definition quality

Alibaba CosyVoice (Recommended for Chinese)

json

{
  "tts": {
    "enabled": true,
    "provider": "alibaba",
    "trigger": "auto",
    "alibaba": {
      "apiKey": "your-dashscope-api-key",
      "model": "cosyvoice-v1",
      "voice": "longxiaochun"
    }
  }
}

See Alibaba documentation for available voices.

Complete Configuration Example

json

{
  "stt": {
    "enabled": true,
    "provider": "alibaba",
    "alibaba": {
      "apiKey": "${DASHSCOPE_API_KEY}",
      "model": "paraformer-v1"
    },
    "openai": {
      "apiKey": "${OPENAI_API_KEY}",
      "model": "whisper-1"
    },
    "fallback": {
      "enabled": true,
      "order": ["alibaba", "openai"]
    }
  },
  "tts": {
    "enabled": true,
    "provider": "openai",
    "trigger": "auto",
    "openai": {
      "apiKey": "${OPENAI_API_KEY}",
      "model": "tts-1",
      "voice": "alloy"
    },
    "alibaba": {
      "apiKey": "${DASHSCOPE_API_KEY}",
      "model": "cosyvoice-v1",
      "voice": "longxiaochun"
    }
  }
}

Limits

Limit	Value
Voice message duration	60 seconds (STT skipped if longer)
TTS text length	4000 characters
Supported channels	Telegram only

Environment Variables

Use environment variables instead of hardcoding API keys:

Variable	Purpose
`DASHSCOPE_API_KEY`	Alibaba DashScope API Key (STT/TTS)
`OPENAI_API_KEY`	OpenAI API Key (STT/TTS)

Example:

json

{
  "stt": {
    "alibaba": {
      "apiKey": "${DASHSCOPE_API_KEY}"
    }
  },
  "tts": {
    "openai": {
      "apiKey": "${OPENAI_API_KEY}"
    }
  }
}

Workflow

User sends voice message (Telegram)
        │
        ▼
┌─────────────────────┐
│ Download voice      │
│ message audio       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ STT Processing      │
│ (Alibaba/OpenAI)    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Agent processes     │
│ transcribed text    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ TTS Processing      │
│ (if trigger=auto)   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Send text + voice   │
│ reply to Telegram   │
└─────────────────────┘

Troubleshooting

STT Fails

Check API key: Verify API key is correct and has credits
Check audio length: Must be < 60 seconds
View logs: tail -f ~/.xopcbot/logs/xopcbot.log
Test fallback: Ensure fallback is configured

No Voice Reply

Check TTS enabled: Verify tts.enabled is true
Check trigger mode: Must be auto for voice replies
Verify user sent voice: TTS only triggers on voice messages
Check text length: Must be < 4000 characters
View logs: Check for TTS error messages

Poor Recognition Quality

Try different provider: Switch between Alibaba and OpenAI
Check audio quality: Ensure clear audio recording
Language match: Use provider optimized for your language
- Chinese: Alibaba Paraformer
- Multi-language: OpenAI Whisper

API Reference

STT Config Schema

typescript

interface STTConfig {
  enabled: boolean;
  provider: 'alibaba' | 'openai';
  alibaba?: {
    apiKey?: string;
    model?: string;
  };
  openai?: {
    apiKey?: string;
    model?: string;
  };
  fallback?: {
    enabled: boolean;
    order: ('alibaba' | 'openai')[];
  };
}

TTS Config Schema

typescript

interface TTSConfig {
  enabled: boolean;
  provider: 'openai' | 'alibaba';
  trigger: 'auto' | 'never';
  alibaba?: {
    apiKey?: string;
    model?: string;
    voice?: string;
  };
  openai?: {
    apiKey?: string;
    model?: string;
    voice?: string;
  };
}

Best Practices

Use fallback: Configure fallback providers for reliability
Monitor usage: STT/TTS APIs consume credits
Set length limits: Inform users about 60-second limit
Test voices: Choose appropriate voice for your use case
Environment variables: Store API keys securely

Voice Messages (STT/TTS) ​

Overview ​

Quick Start ​

STT Configuration ​

Alibaba Paraformer (Recommended for Chinese) ​

OpenAI Whisper ​

Fallback Configuration ​

TTS Configuration ​

Trigger Modes ​

OpenAI TTS ​

Alibaba CosyVoice (Recommended for Chinese) ​

Complete Configuration Example ​

Limits ​

Environment Variables ​

Workflow ​

Troubleshooting ​

STT Fails ​

No Voice Reply ​

Poor Recognition Quality ​

API Reference ​

STT Config Schema ​

TTS Config Schema ​

Best Practices ​

Voice Messages (STT/TTS)

Overview

Quick Start

STT Configuration

Alibaba Paraformer (Recommended for Chinese)

OpenAI Whisper

Fallback Configuration

TTS Configuration

Trigger Modes

OpenAI TTS

Alibaba CosyVoice (Recommended for Chinese)

Complete Configuration Example

Limits

Environment Variables

Workflow

Troubleshooting

STT Fails

No Voice Reply

Poor Recognition Quality

API Reference

STT Config Schema

TTS Config Schema

Best Practices