Speech-to-Text
This document explains how to use the IMKit SDK to convert audio content from voice messages into text.
Prerequisites
Before getting started, ensure your SDK version is ≥ 5.22.0 and the Speech-to-Text feature is enabled in the RC Console.
Speech-to-Text only works with voice messages sent using SDK 5.22.0 or later. High-definition voice messages must also meet these requirements: sampling rate of 8000 Hz or 16000 Hz, mono channel, AAC format, and duration ≤ 60 seconds. Historical voice messages are not supported.
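The high-definition voice requirements above can be expressed as a simple pre-check. The following is a minimal sketch only: `HdVoiceSttRequirements` and `isEligible` are hypothetical names, not part of the IMKit or IMLib API, and the caller is assumed to already know the clip's sampling rate, channel count, format, and duration.

```java
// Hypothetical helper for illustration; not part of the IMKit or IMLib SDK.
final class HdVoiceSttRequirements {
    // Maximum duration stated in the prerequisites.
    static final int MAX_DURATION_SECONDS = 60;

    private HdVoiceSttRequirements() {}

    /**
     * Returns true when a clip matches the documented constraints:
     * 8000 Hz or 16000 Hz, mono, AAC, and at most 60 seconds long.
     */
    static boolean isEligible(int sampleRateHz, int channelCount,
                              String format, int durationSeconds) {
        boolean rateOk = sampleRateHz == 8000 || sampleRateHz == 16000;
        boolean monoOk = channelCount == 1;
        // Case-insensitive comparison is an assumption of this sketch.
        boolean formatOk = "aac".equalsIgnoreCase(format);
        boolean durationOk = durationSeconds <= MAX_DURATION_SECONDS;
        return rateOk && monoOk && formatOk && durationOk;
    }
}
```

For example, `HdVoiceSttRequirements.isEligible(16000, 1, "aac", 42)` returns true, while a stereo clip or a 44100 Hz clip is rejected.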
Demo





Overview
The IMKit SDK supports recording voice messages up to 60 seconds. In the chat UI, users can long-press a voice message bubble and select Convert to Text from the menu; IMKit then uses the IMLib SDK to transcribe the audio into text. The SDK tracks the visibility state of converted text, so the UI stays consistent when the user re-enters the conversation.
The Convert to Text option won't appear if:
- The feature is disabled
- The message failed to send
- Conversion is in progress
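The three conditions above amount to a single boolean rule. The sketch below is illustrative only: `ConvertToTextMenuRule` and its parameters are hypothetical names, and how IMKit actually evaluates these conditions internally is not shown in this document.

```java
// Hypothetical rule for illustration; not the IMKit implementation.
final class ConvertToTextMenuRule {
    private ConvertToTextMenuRule() {}

    /**
     * The menu item is shown only when the feature is enabled,
     * the message was sent successfully, and no conversion is running.
     */
    static boolean shouldShow(boolean featureEnabled,
                              boolean sendFailed,
                              boolean conversionInProgress) {
        return featureEnabled && !sendFailed && !conversionInProgress;
    }
}
```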
Component Workflow
Key Classes
Class | Purpose | Description |
---|---|---|
SpeechToTextHandler | Core processor | Handles conversion requests, visibility toggles, and callbacks. |
VoiceMessageItemProvider | Standard voice message UI | Manages VoiceMessage UI elements including transcription display. |
HQVoiceMessageItemProvider | HD voice message UI | Manages HQVoiceMessage UI elements including transcription display. |
MessageViewModel | Message view model | Integrates with SpeechToTextHandler for menu actions and UI state. |
SpeechToTextViews | UI component group | Encapsulates transcription-related UI elements. |
SpeechToTextInfo | Data model | Stores transcription status and results. |
Class diagram:
SpeechToTextInfo is the IMLib SDK's data model for tracking transcription states. Below are key properties (see full API docs for details).
SpeechToTextInfo properties:
Property | Type | Description |
---|---|---|
status | SpeechToTextStatus | Conversion state: NOT_CONVERTED, CONVERTING, SUCCESS, FAILED. |
text | String | Transcribed text content. |
isVisible | boolean | Visibility flag (default: false). |
errorCode | int | Error code (valid when status=FAILED). |
- After conversion is initiated, SpeechToTextInfo.isVisible is set to true in the database.
- The Android SDK maintains additional UI states via UiMessage.businessState.
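To show how the status and visibility fields might combine when rendering a voice bubble, here is a minimal sketch. `SttStatus`, `SttSnapshot`, and `TranscriptionDisplay` are stand-in types invented for this example and only mirror the properties in the table; the real SpeechToTextInfo class and its accessors are defined by the IMLib SDK. The rule that text is rendered only when the status is SUCCESS and the visibility flag is true is an assumption of this sketch.

```java
// Stand-in types for illustration only; the real classes ship with the IMLib SDK.
enum SttStatus { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

final class SttSnapshot {
    final SttStatus status;
    final String text;
    final boolean visible;   // mirrors isVisible (default: false)
    final int errorCode;     // meaningful only when status is FAILED

    SttSnapshot(SttStatus status, String text, boolean visible, int errorCode) {
        this.status = status;
        this.text = text;
        this.visible = visible;
        this.errorCode = errorCode;
    }
}

final class TranscriptionDisplay {
    private TranscriptionDisplay() {}

    /** Returns the text to render under the bubble, or null when nothing should be shown. */
    static String textToShow(SttSnapshot info) {
        if (info == null) {
            return null;
        }
        // Assumed rule: render only a successful, visible transcription.
        if (info.status == SttStatus.SUCCESS && info.visible) {
            return info.text;
        }
        return null;
    }
}
```

Keeping the visibility flag separate from the status lets a user hide a successful transcription without discarding the converted text.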
SpeechToTextModel is the IMKit SDK's data model for tracking transcription states. Below are key properties (see full API docs for details).
SpeechToTextModel properties:
Property | Type | Description |
---|---|---|
status | SpeechToTextStatus | Conversion state. |
sttInfo | SpeechToTextInfo | The corresponding sttInfo stored in the database. |
isVisible | boolean | UI visibility flag (default: false). |
- After conversion is initiated, SpeechToTextInfo.isVisible is set to true in the database.