Speech-to-Text

This guide explains how to use the IMKit SDK to transcribe voice messages into text.

Prerequisites

Before you start, make sure you are using SDK version 5.22.0 or later and that the Speech-to-Text feature is enabled in the RC Console.

tip

This feature is available from SDK 5.22.0 onward and works only with voice messages sent from SDK 5.22.0 or later. Historical voice messages are not supported. High-definition voice messages must also meet all of the following requirements: a sampling rate of 8,000 Hz or 16,000 Hz, a single (mono) channel, AAC format, and a duration of 60 seconds or less.
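If an app wants to check up front whether a high-definition voice recording meets the requirements above, a minimal sketch could look like the following. The class `HdVoiceSttPrecheck`, its method, and the way the audio parameters are passed in (sample rate, channel count, a format string such as `"aac"`, duration in whole seconds) are hypothetical and not part of the IMKit or IMLib API; the SDK performs its own checks.

```java
import java.util.Locale;

/**
 * Hypothetical client-side pre-check that mirrors the documented requirements
 * for high-definition voice messages. Not part of the IMKit/IMLib API.
 */
final class HdVoiceSttPrecheck {
    static final int MAX_DURATION_SECONDS = 60;

    private HdVoiceSttPrecheck() {}

    static boolean isEligible(int sampleRateHz, int channelCount, String format, int durationSeconds) {
        boolean supportedRate = sampleRateHz == 8000 || sampleRateHz == 16000; // 8 kHz or 16 kHz only
        boolean mono = channelCount == 1;                                      // single channel
        boolean aac = format != null && format.toLowerCase(Locale.ROOT).equals("aac");
        boolean shortEnough = durationSeconds <= MAX_DURATION_SECONDS;         // at most 60 s
        return supportedRate && mono && aac && shortEnough;
    }

    public static void main(String[] args) {
        System.out.println(isEligible(16000, 1, "aac", 45)); // prints true
        System.out.println(isEligible(44100, 1, "aac", 45)); // prints false (unsupported sample rate)
    }
}
```

A check like this only filters out recordings that are certain to fail; whether a given message can be converted still depends on the sender's SDK version and the Console setting.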

Demo

Overview

The IMKit SDK supports recording voice messages of up to 60 seconds. In the chat UI, users can long-press a voice message bubble and select Convert to Text from the menu; IMKit then calls the IMLib SDK to transcribe the audio. The SDK also stores whether each transcription is shown or hidden, so the UI looks the same when the user re-enters the conversation.

tip

The Convert to Text option does not appear when:

The feature is not enabled
The message failed to send
A conversion is already in progress
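The three conditions above amount to a simple guard on the long-press menu. The sketch below expresses that rule in isolation; the enums `SendState` and `SttStatus` and the method `shouldShowConvertToText` are hypothetical stand-ins, not IMKit types (the real menu logic lives in the SDK, for example around `MessageViewModel`).

```java
/** Hypothetical stand-in for a message's send state. */
enum SendState { SENDING, SENT, FAILED }

/** Hypothetical mirror of the documented conversion states. */
enum SttStatus { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

final class ConvertToTextMenuRule {
    private ConvertToTextMenuRule() {}

    /** Returns true when the "Convert to Text" menu item may be offered. */
    static boolean shouldShowConvertToText(boolean featureEnabled, SendState sendState, SttStatus sttStatus) {
        if (!featureEnabled) return false;                   // feature not enabled
        if (sendState == SendState.FAILED) return false;     // message failed to send
        if (sttStatus == SttStatus.CONVERTING) return false; // conversion already in progress
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldShowConvertToText(true, SendState.SENT, SttStatus.NOT_CONVERTED)); // prints true
        System.out.println(shouldShowConvertToText(true, SendState.SENT, SttStatus.CONVERTING));    // prints false
    }
}
```

The sketch encodes only the three documented exclusions; any other menu behavior (for example, what is shown once a transcription already exists) is left to the SDK.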

Component Workflow

Key Classes

Class	Purpose	Description
`SpeechToTextHandler`	Core processor	Handles conversion requests, visibility toggles, and callbacks.
`VoiceMessageItemProvider`	Standard voice message UI	Manages `VoiceMessage` UI elements, including the transcription display.
`HQVoiceMessageItemProvider`	HD voice message UI	Manages `HQVoiceMessage` UI elements, including the transcription display.
`MessageViewModel`	Message view model	Works with `SpeechToTextHandler` to handle menu actions and UI state.
`SpeechToTextViews`	UI component group	Encapsulates transcription-related UI elements.
`SpeechToTextInfo`	Data model	Stores transcription status and results.
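To make the handler's listed responsibilities (conversion requests, visibility toggles, callbacks) concrete, here is a minimal state-machine sketch. `SttHandlerSketch`, its `Entry` type, and the message-UID keys are hypothetical; the actual `SpeechToTextHandler` and the IMLib transcription call are not modeled, and the result is fed in manually through `onResult`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical mirror of the documented conversion states. */
enum SttStatus { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

/** Hypothetical sketch of the handler's responsibilities; not the SDK implementation. */
final class SttHandlerSketch {
    static final class Entry {
        SttStatus status = SttStatus.NOT_CONVERTED;
        String text;
        boolean visible;
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final BiConsumer<String, Entry> onChanged; // UI callback, e.g. refresh the bubble

    SttHandlerSketch(BiConsumer<String, Entry> onChanged) {
        this.onChanged = onChanged;
    }

    Entry entry(String messageUid) {
        return entries.computeIfAbsent(messageUid, k -> new Entry());
    }

    /** Starts a conversion; returns false if one is already running for this message. */
    boolean startConversion(String messageUid) {
        Entry e = entry(messageUid);
        if (e.status == SttStatus.CONVERTING) return false;
        e.status = SttStatus.CONVERTING;
        e.visible = true; // mirrors the documented behavior: visible once conversion starts
        onChanged.accept(messageUid, e);
        return true;
    }

    /** Records the outcome reported by the (not modeled) transcription backend. */
    void onResult(String messageUid, String text, boolean success) {
        Entry e = entry(messageUid);
        e.status = success ? SttStatus.SUCCESS : SttStatus.FAILED;
        e.text = success ? text : null;
        onChanged.accept(messageUid, e);
    }

    /** Shows or hides an existing transcription; only allowed after SUCCESS. */
    boolean toggleVisibility(String messageUid) {
        Entry e = entry(messageUid);
        if (e.status != SttStatus.SUCCESS) return false;
        e.visible = !e.visible;
        onChanged.accept(messageUid, e);
        return true;
    }
}
```

Routing every state change through a single callback is one way to keep the item providers' views in sync with the stored state; the SDK may organize this differently.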

Class diagram:

`SpeechToTextInfo` is the IMLib SDK's data model for a message's transcription state. Its key properties are listed below; see the API reference for the complete list.

`SpeechToTextInfo` properties:

Property	Type	Description
`status`	`SpeechToTextStatus`	Conversion state: `NOT_CONVERTED`, `CONVERTING`, `SUCCESS`, or `FAILED`.
`text`	`String`	Transcribed text content.
`isVisible`	`boolean`	Whether the transcribed text is shown (default: `false`).
`errorCode`	`int`	Error code; valid only when `status` is `FAILED`.

tip

Once a conversion is started, `SpeechToTextInfo.isVisible` is set to `true` in the database.
On Android, additional UI state is kept in `UiMessage.businessState`.
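As an illustration of how the four properties combine when rendering a message, the sketch below maps a transcription record to the label shown under a voice bubble. `SttInfoView` is a hypothetical mirror of the documented fields (not the SDK class), and the label strings are placeholders.

```java
/** Hypothetical mirror of the documented status values. */
enum SpeechToTextStatus { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

/** Hypothetical mirror of the documented SpeechToTextInfo fields; not the SDK class. */
final class SttInfoView {
    final SpeechToTextStatus status;
    final String text;
    final boolean isVisible;
    final int errorCode;

    SttInfoView(SpeechToTextStatus status, String text, boolean isVisible, int errorCode) {
        this.status = status;
        this.text = text;
        this.isVisible = isVisible;
        this.errorCode = errorCode;
    }
}

final class TranscriptionLabel {
    private TranscriptionLabel() {}

    /** Returns the label to render under the voice bubble, or null to render nothing. */
    static String labelFor(SttInfoView info) {
        if (info == null || !info.isVisible) return null; // hidden or never converted
        switch (info.status) {
            case CONVERTING: return "Converting...";
            case SUCCESS:    return info.text;
            case FAILED:     return "Conversion failed (code " + info.errorCode + ")"; // errorCode valid only here
            default:         return null; // NOT_CONVERTED
        }
    }
}
```

Reading `errorCode` only in the `FAILED` branch follows the table's note that the code is meaningful only in that state.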

`SpeechToTextModel` is the IMKit SDK's UI-side model for a message's transcription state. Its key properties are listed below; see the API reference for the complete list.

`SpeechToTextModel` properties:

Property	Type	Description
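`status`	`SpeechToTextStatus`	Conversion state (same values as `SpeechToTextInfo.status`).
`sttInfo`	`SpeechToTextInfo`	The `SpeechToTextInfo` stored in the database for this message.
`isVisible`	`boolean`	Whether the UI currently shows the transcribed text (default: `false`).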
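The model pairs a UI-side visibility flag with the persisted `sttInfo`. One plausible way to rebuild that UI state when a conversation is reopened is sketched below; `PersistedSttInfo` and `SttUiModel` are hypothetical stand-ins, and restoring the UI flag from the persisted flag is an assumption made for illustration, not documented SDK behavior.

```java
/** Hypothetical mirror of the documented status values. */
enum SttStatus { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

/** Hypothetical stand-in for the persisted SpeechToTextInfo (database side). */
final class PersistedSttInfo {
    final SttStatus status;
    final String text;
    final boolean isVisible;

    PersistedSttInfo(SttStatus status, String text, boolean isVisible) {
        this.status = status;
        this.text = text;
        this.isVisible = isVisible;
    }
}

/** Hypothetical stand-in for SpeechToTextModel (UI side). */
final class SttUiModel {
    final SttStatus status;
    final PersistedSttInfo sttInfo;
    boolean isVisible; // UI flag; stays false unless restored below

    SttUiModel(PersistedSttInfo sttInfo) {
        this.sttInfo = sttInfo;
        this.status = (sttInfo == null) ? SttStatus.NOT_CONVERTED : sttInfo.status;
        // Assumption: on re-entering the conversation, the UI flag is restored
        // from the persisted flag so an open transcription stays open.
        this.isVisible = sttInfo != null && sttInfo.isVisible;
    }
}
```

Keeping the persisted record and the UI flag separate lets the UI change visibility immediately while the stored value remains the source for restoring state later.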

