Speech-to-Text

This document explains how to use the IMKit SDK to convert audio content from voice messages into text.

Prerequisites

Before getting started, ensure your SDK version is ≥ 5.22.0 and the Speech-to-Text feature is enabled in the RC Console.
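The version requirement can be checked with a numeric, component-by-component comparison. The following is a minimal sketch, not part of the SDK: `SdkVersionCheck` and `isAtLeast` are hypothetical names, and how you obtain the SDK version string in your app is left to the caller. It compares components as integers because a plain string comparison would wrongly rank "5.9.0" above "5.22.0".

```java
// Hypothetical helper for illustration; not an IMKit or IMLib API.
final class SdkVersionCheck {
    // Minimum SDK version for Speech-to-Text, as stated in the prerequisites.
    static final String MIN_STT_VERSION = "5.22.0";

    // Returns true when `version` is numerically greater than or equal to `minimum`.
    // Both arguments are expected to be dot-separated integers such as "5.22.0".
    static boolean isAtLeast(String version, String minimum) {
        String[] a = version.split("\\.");
        String[] b = minimum.split("\\.");
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Missing components are treated as 0, so "6.0" equals "6.0.0".
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) {
                return x > y;
            }
        }
        return true;
    }
}
```

For example, `SdkVersionCheck.isAtLeast("5.9.0", SdkVersionCheck.MIN_STT_VERSION)` evaluates to `false`, because 9 is less than 22 in the second component.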

tip

Speech-to-Text is available from SDK 5.22.0 onward and works only with voice messages sent using SDK 5.22.0 or later. High-definition voice messages must also meet these requirements: a sampling rate of 8000 Hz or 16000 Hz, a single (mono) channel, AAC format, and a duration of at most 60 seconds. Historical voice messages are not supported.
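The high-definition voice message requirements above can be expressed as a single predicate. This is a sketch under assumptions: `HqVoiceSttRequirements` and its parameters are hypothetical names, and the audio properties are assumed to be already known to the caller (the SDK's own validation may differ).

```java
// Hypothetical helper for illustration; not an IMKit or IMLib API.
final class HqVoiceSttRequirements {
    static final int MAX_DURATION_SECONDS = 60;

    // Returns true when the audio matches the documented constraints:
    // 8000 Hz or 16000 Hz sampling rate, mono, AAC, and at most 60 seconds.
    static boolean meetsRequirements(int sampleRateHz, int channelCount,
                                     String format, int durationSeconds) {
        boolean rateOk = sampleRateHz == 8000 || sampleRateHz == 16000;
        boolean monoOk = channelCount == 1;
        // Case-insensitive match; a null format is rejected.
        boolean formatOk = "aac".equalsIgnoreCase(format);
        boolean durationOk = durationSeconds <= MAX_DURATION_SECONDS;
        return rateOk && monoOk && formatOk && durationOk;
    }
}
```

A 16000 Hz mono AAC recording of 45 seconds passes this check, while a 44100 Hz or stereo recording does not.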

Overview

The IMKit SDK supports recording voice messages of up to 60 seconds. In the chat UI, a user can long-press a voice message bubble and select Convert to Text from the menu; IMKit then uses the IMLib SDK to transcribe the audio into text. The SDK tracks the visibility state of the converted text, so the UI stays consistent when the user re-enters the conversation.

tip

The Convert to Text option won't appear if:

  • The feature is disabled
  • The message failed to send
  • Conversion is in progress
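The three conditions above amount to a simple visibility rule for the menu item. The sketch below only illustrates that rule: `ConvertToTextMenuRule` and its boolean parameters are hypothetical, and the way IMKit actually evaluates these conditions internally is not shown in this document.

```java
// Hypothetical helper for illustration; not an IMKit API.
final class ConvertToTextMenuRule {
    // The option is shown only when none of the documented blocking
    // conditions holds: feature disabled, send failed, or conversion running.
    static boolean shouldShow(boolean featureEnabled,
                              boolean sendFailed,
                              boolean conversionInProgress) {
        return featureEnabled && !sendFailed && !conversionInProgress;
    }
}
```

With the feature enabled, a successfully sent message, and no conversion running, `shouldShow(true, false, false)` returns `true`; changing any one of the three inputs hides the option.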

Component Workflow

Key Classes

| Class | Purpose | Description |
| --- | --- | --- |
| SpeechToTextHandler | Core processor | Handles conversion requests, visibility toggles, and callbacks. |
| VoiceMessageItemProvider | Standard voice message UI | Manages VoiceMessage UI elements, including transcription display. |
| HQVoiceMessageItemProvider | HD voice message UI | Manages HQVoiceMessage UI elements, including transcription display. |
| MessageViewModel | Message view model | Integrates with SpeechToTextHandler for menu actions and UI state. |
| SpeechToTextViews | UI component group | Encapsulates transcription-related UI elements. |
| SpeechToTextInfo | Data model | Stores transcription status and results. |

Class diagram:

SpeechToTextInfo is the IMLib SDK's data model for tracking transcription states. Below are key properties (see full API docs for details).

SpeechToTextInfo properties:

| Property | Type | Description |
| --- | --- | --- |
| status | SpeechToTextStatus | Conversion state: NOT_CONVERTED, CONVERTING, SUCCESS, FAILED. |
| text | String | Transcribed text content. |
| isVisible | boolean | Visibility flag (default: false). |
| errorCode | int | Error code (valid only when status is FAILED). |

tip
  • Once conversion is initiated, SpeechToTextInfo.isVisible is stored as true in the database by default.
  • On Android, additional UI states are maintained via UiMessage.businessState.
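The properties and status values described above can be pictured with a small stand-in model. This is a sketch only: `SttStatusSketch` and `SttInfoSketch` are hypothetical names that mirror the documented fields, not the IMLib classes. The transition methods and the `shouldShowText` rule are assumptions made for illustration; the only behavior taken from the text is that visibility starts as false and becomes true once conversion is initiated.

```java
// Hypothetical stand-ins for illustration; not the IMLib classes themselves.
enum SttStatusSketch { NOT_CONVERTED, CONVERTING, SUCCESS, FAILED }

final class SttInfoSketch {
    SttStatusSketch status = SttStatusSketch.NOT_CONVERTED;
    String text = "";
    boolean visible = false; // documented default before any conversion
    int errorCode = 0;       // meaningful only when status is FAILED

    // Initiating conversion marks the transcription as visible (per the tip above).
    void beginConversion() {
        status = SttStatusSketch.CONVERTING;
        visible = true;
    }

    // Assumed transition: store the transcript on success.
    void complete(String transcript) {
        status = SttStatusSketch.SUCCESS;
        text = transcript;
    }

    // Assumed transition: record the error code on failure.
    void fail(int code) {
        status = SttStatusSketch.FAILED;
        errorCode = code;
    }

    // Assumed UI rule: show text only for a visible, successful conversion.
    boolean shouldShowText() {
        return visible && status == SttStatusSketch.SUCCESS;
    }
}
```

Keeping the visibility flag separate from the status means a user can hide a successful transcription without discarding the stored text.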

SpeechToTextModel is the IMKit SDK's data model for tracking transcription states. Below are key properties (see full API docs for details).

SpeechToTextModel properties:

| Property | Type | Description |
| --- | --- | --- |
| status | SpeechToTextStatus | Conversion state. |
| sttInfo | SpeechToTextInfo | Corresponding database sttInfo. |
| isVisible | boolean | UI visibility flag (default: false). |

tip
  • Once conversion is initiated, SpeechToTextInfo.isVisible is stored as true in the database by default.