# Streaming Voice Processing
In addition to providing voice processing capabilities via HTTP, the platform also offers streaming capabilities, including ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).
# Invocation Method
Use the WebSocket protocol; control messages are UTF-8 encoded JSON text. The WS interface requires the token either in the Authorization header field or concatenated into the URL; a token passed in the header takes precedence. (Example: see the FAQ.)
# Calling Process
- Establish a WebSocket connection;
- Send a Starter package, which contains generic configuration information for subsequent requests;
- Receive a response, indicating whether authentication was successful or failed;
- Send Data packets;
- Receive the corresponding results; steps 4 and 5 can be repeated;
- If there are no more tasks, you can disconnect directly (no disconnection message is defined);
# Calling Restrictions
- If a Starter package is not sent within 10 seconds of establishing the connection, the WebSocket connection will be closed;
- If no request is received within 60 seconds, the server will actively disconnect; it is recommended to send Ping frames at regular intervals to keep the connection alive;
# Request Message Format
# Starter
The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packets. The format is JSON text and contains the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
type | Workflow Type | string | Required | Fill in the service engine number or combination corresponding to the capability, e.g., "TTS3", "ASR5", "TTS3+ASR5" |
device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting |
session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates and fills in the Session ID for troubleshooting |
asr | ASR Config | object | Required for ASR use | ASR specific configuration, see below for details |
tts | TTS Config | object | Required for TTS use | TTS specific configuration, see below for details |
# Starter Message Example
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "ASR5",
"device": "device-weye",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"asr": {
"mic_volume": 0.67
}
}
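The Starter package above can be assembled programmatically. A minimal Python sketch (`build_starter` is a hypothetical helper, not part of any official SDK; field names follow the table above, and the `auth` field appears in the document's examples):

```python
import json
import uuid

# Hypothetical helper; field names follow the Starter table above.
def build_starter(workflow, device="", session=None, asr=None, tts=None, auth=None):
    msg = {"type": workflow, "device": device,
           "session": session or str(uuid.uuid4())}  # default: random UUIDv4
    if auth is not None:
        msg["auth"] = auth
    if asr is not None:
        msg["asr"] = asr
    if tts is not None:
        msg["tts"] = tts
    return msg

starter = build_starter("ASR5", device="device-weye", asr={"mic_volume": 0.67})
print(json.dumps(starter, ensure_ascii=False))
```

Generating the Session ID client-side, as the table recommends, makes server-side troubleshooting much easier.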
# Data
After sending the Starter package and successfully establishing a connection, multiple Data packets can be sent repeatedly. The format of the Data package is described in the corresponding capability document.
# Return Message Format
# Authentication Result
After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
session | Session ID | string | Yes | The current connection's Session ID |
status | Status Name | enum | Yes | The current session's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
# Authentication Result Example
{
"service": "auth",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
# Result Data
Depending on the capability, each request will return one or more result data packets. The format is JSON text and contains the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request |
session | Session ID | string | Yes | The current connection's Session ID |
trace | Trace ID | string | Yes | The current request's Trace ID |
status | Status Name | enum | Yes | The current session's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
asr | ASR Content | object | No | If successful, the recognition result returned |
nlp | NLP Content | object | No | If successful, the reply result returned |
tts | TTS Content | object | No | If successful, the synthesis result returned |
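Since every result packet carries service, session, and status, a client can route messages generically. A minimal Python sketch (`dispatch` is a hypothetical helper):

```python
import json

def dispatch(raw):
    """Route a server message by its service field (auth/asr/nlp/tts),
    raising on status == "fail" as described in the table above."""
    msg = json.loads(raw)
    if msg.get("status") == "fail":
        raise RuntimeError(f"{msg.get('service')} failed: {msg.get('error')}")
    service = msg["service"]
    if service == "auth":
        return ("auth", msg["session"])
    # asr / nlp / tts payloads sit under the field named after the service
    return (service, msg.get(service))

print(dispatch('{"service": "auth", "status": "ok", "session": "s-1"}'))
```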
# ASR (Automatic Speech Recognition) Capability Integration Instructions
Instructions for calling ASR over the central control WebSocket full-duplex interface. The connection uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. The WS interface requires the token either in the Authorization header field or concatenated into the URL; a token passed in the header takes precedence. (Example: see the FAQ.)
# Calling Process
- Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
- Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or the package is not sent within 10 seconds, the WebSocket connection will be closed;
- Receive a response, indicating whether authentication was successful or failed;
- Send Data binary data packets, containing PCM audio;
- After completing audio transmission, send an EOF package (optional; if not sent, an additional 500 milliseconds or more of ambient silence is needed to help VAD detect the end of speech);
- Receive corresponding ASR results, formatted as JSON text;
- If an EOF request package was sent, an EOF result package will be received, indicating that all recognition results have been sent;
- If there are no more speech recognition tasks, you can disconnect directly (no disconnection message is defined);
# Request Message Format
# Starter
The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text and contains the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
type | Workflow Type | string | Required | Fill in the service engine number for Chinese recognition, please choose: "ASR5" |
device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting |
session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates and fills in the Session ID for troubleshooting |
asr | ASR Config | object | Required | ASR specific configuration, see below for details |
ASR Config configuration is as follows:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
language | Language Code | string | zh-CN | Optional field, the language to be recognized |
mic_volume | Microphone Volume | float | 1.0 | Optional field, microphone volume for ASR to perform auto gain, supported range is 0 to 1 |
subtitle | Subtitle Format | string | Empty string | Optional field, the format of returned subtitles, empty means no return, supports: srt |
subtitle_max_length | Subtitle Max Length | int | 0 | Optional field, the maximum number of characters per line of subtitle, 0 means no limit |
intermediate | Return Intermediate Result | bool | false | Optional field, whether to return intermediate results |
sentence_time | Return Sentence-Level Timestamp | bool | false | Optional field, whether to return sentence-level timestamps |
word_time | Return Word-Level Timestamp | bool | false | Optional field, whether to return word-level timestamps |
cache_url | Return Cache URL for Data | bool | false | Optional field, whether to upload subtitle files to Object Store storage and return cache URL |
pause_time_msec | Speech Pause Time (msec) | int | 500 | Optional field, speech pause time, used to determine the boundaries and segmentation of speech, default is 500 milliseconds |
# Data
After sending the Starter package and successfully establishing a connection, multiple binary Data packages can be sent repeatedly to stream audio.
The audio stream format is PCM: 16 kHz sampling rate, 16-bit sample width, single channel, little endian. For example, use sox -t raw -r 16000 -e signed -b 16 -c 1 or ffmpeg -acodec pcm_s16le -ac 1 -ar 16000 -f s16le to convert formats.
The requirement for the audio sending rate is different between streaming and non-streaming ASR:
- Streaming ASR: Audio is sent at the rate it is read from the microphone, recommended to send 1280 bytes every 40 milliseconds, or 5120 bytes every 160 milliseconds.
- Non-streaming ASR: Audio data can be sent all at once, each Data package size is limited to within 1920KB (1 minute).
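The chunking implied by the streaming rate recommendation can be sketched as follows (`chunk_pcm` is a hypothetical helper; at 16 kHz, 16-bit mono, audio is 32 bytes per millisecond, so 1280 bytes correspond to 40 ms):

```python
# Split a PCM byte buffer into fixed-size frames. For streaming ASR each
# frame would be sent as one binary Data package, paced in real time
# (e.g. a 40 ms sleep between 1280-byte frames).
def chunk_pcm(pcm, frame_bytes=1280):
    for off in range(0, len(pcm), frame_bytes):
        yield pcm[off:off + frame_bytes]

one_second = bytes(32000)           # 1 s of silence at 16 kHz / 16-bit mono
frames = list(chunk_pcm(one_second))
print(len(frames))                  # 32000 / 1280 = 25 frames of 40 ms each
```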
# EOF
After completing the audio package transmission, send an EOF package to indicate the end of recognition.
The requirement for EOF package usage is different between streaming and non-streaming ASR:
- Streaming ASR: after completing the audio Data package transmission, sending an EOF package is optional; if not sent, an additional 500 milliseconds or more of ambient silence is needed to help VAD detect the end of speech.
- Non-streaming ASR: after completing the audio Data package transmission, an EOF package must be sent.
If subtitles are needed, regardless of whether it is streaming or not, an EOF package must be sent to notify the ASR server to generate subtitles.
The format is JSON text and contains the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
signal | End Mark | string | Required | Fixed as eof |
trace | Trace ID | string | Random UUIDv4 | Optional field, it is recommended that the caller generates and fills in the Trace ID for troubleshooting |
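A minimal sketch of constructing the EOF package (`build_eof` is a hypothetical helper; the fields follow the table above):

```python
import json
import uuid

# Hypothetical helper; signal is fixed as "eof", trace defaults to a UUIDv4.
def build_eof(trace=None):
    return {"signal": "eof", "trace": trace or str(uuid.uuid4())}

print(json.dumps(build_eof("52517513-875a-47b6-bd30-f11a75e26745")))
```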
# Return Message Format
# Authentication Result
After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
session | Session ID | string | Yes | The current connection's Session ID |
status | Status Name | enum | Yes | The current session's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
# ASR Result Data
ASR continuously returns multiple text recognition result packets and, after receiving the EOF request, returns subtitle and subtitle file address packets. If subtitles and subtitle addresses are not requested in the Starter request, only text recognition results are returned.
The return message format is JSON text, containing the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., asr |
session | Session ID | string | Yes | The current connection's Session ID |
trace | Trace ID | string | Yes | The current sentence's Trace ID |
status | Status Name | enum | Yes | The current session's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
asr | ASR Content | object | No | If successful, the recognition result returned, with specific fields described below |
Specific recognition results are located within the ASR Content:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
index | Index No. | int | Yes | Return package sequence number |
type | Package Type | enum | Yes | text for a text result package, intermediate for an intermediate result package, subtitle for a subtitle package, subtitle_url for a subtitle address package, eof indicating all transmissions are complete |
text | Text | string | Yes | Text recognition result; also present in subtitle packages, but empty there |
subtitle | Subtitle | string | No | Subtitle content, only in subtitle packages |
subtitle_url | Subtitle URL | string | No | Subtitle content download address, only in subtitle_url packages |
sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp |
word_times | Word-Level Timestamp | object | No | Word-level timestamp |
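A client typically branches on asr.type to consume these packages. A minimal Python sketch (`handle_asr_package` is a hypothetical helper):

```python
import json

def handle_asr_package(raw):
    """Classify an ASR result package by asr.type (text / intermediate /
    subtitle / subtitle_url / eof) and extract its payload."""
    msg = json.loads(raw)
    asr = msg.get("asr") or {}
    kind = asr.get("type")
    if kind in ("text", "intermediate"):
        return kind, asr.get("text", "")
    if kind == "subtitle":
        return kind, asr.get("subtitle", "")
    if kind == "subtitle_url":
        return kind, asr.get("subtitle_url", "")
    return kind, None  # eof or an unrecognized type

print(handle_asr_package('{"service":"asr","asr":{"index":1,"type":"text","text":"Hello."}}'))
```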
# Text Result
A message is returned for each completely recognized sentence. Intermediate recognition results are not returned (unless intermediate is enabled in the ASR Config); if the audio cannot be recognized, or the recognition result contains only blank characters, no message is returned either.
Text result example:
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "c9cb36d8-3ca9-4e2b-9034-29f2c4edc3de",
"asr": {
"index": 1,
"type": "text",
"text": "Hello."
}
}
# Subtitle Result
After requesting subtitles and sending EOF, the generated subtitle result is returned.
Subtitle result example:
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "9a971f17-f871-4b73-9084-b856b67537d5",
"asr": {
"index": 3,
"type": "subtitle",
"subtitle": "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
}
}
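The subtitle field carries a standard SRT string, which can be split into entries. A minimal sketch (`parse_srt` is a hypothetical helper; the sample string is the one from the example above):

```python
def parse_srt(srt):
    """Split an SRT string (as returned in subtitle packages) into
    (index, time_range, text) tuples."""
    entries = []
    for block in srt.strip().split("\n\n"):
        lines = block.split("\n")
        if len(lines) >= 3:
            entries.append((int(lines[0]), lines[1], "\n".join(lines[2:])))
    return entries

srt = "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
for idx, times, text in parse_srt(srt):
    print(idx, times, text)
```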
# Subtitle Address
If subtitles and Cache URL are requested, then after EOF is sent, the cache URL of the subtitle file uploaded to Object Store is returned.
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "25b30001-0b5a-4e9e-892c-4cc4d5bf134d",
"asr": {
"index": 4,
"type": "subtitle_url",
"subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/eab708a8-7aca-4237-a0a3-a6422ade8a23_25b30001-0b5a-4e9e-892c-4cc4d5bf134d.srt"
}
}
# EOF
When an EOF request is received, an EOF result packet is sent, indicating that all results have been sent.
EOF example:
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "16ff049a-41fb-4c7a-ac5e-b26dbc3218e5",
"asr": {
"index": 5,
"type": "eof"
}
}
# Practical Process Example Analysis
# Case 1: Minimum Configuration Process
Request: Starter
{
"type": "ASR5",
"asr": {}
}
Request: Binary Data
Response: 1
{
"service": "auth",
"status": "ok",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721"
}
Response: 2
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "e1c44bdc-4f9a-487c-806e-005679db7d0d",
"asr": {
"index": 1,
"type": "text",
"text": "早知道你喜欢十里春光"
}
}
Response: 3
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "f7551818-5025-4d83-b41b-136bb19b5b5f",
"asr": {
"index": 2,
"type": "text",
"text": "我一定会在麦田里种满玫瑰和山茶"
}
}
Response: 4
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "89d3a8b4-a291-4cdc-9b78-d3f912d06223",
"asr": {
"index": 3,
"type": "text",
"text": "你路过这片土地才算浪漫"
}
}
# Case 2: Complete Configuration Process
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "ASR5",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"asr": {
"subtitle": "srt",
"cache_url": true,
"intermediate": true,
"mic_volume": 0.67
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
Request: Binary Data
Request: EOF
{
"signal": "eof",
"trace": "52517513-875a-47b6-bd30-f11a75e26745"
}
Response: 2
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "dcf88fbe-6cda-452d-8f51-e316cb4a0943",
"asr": {
"index": 1,
"type": "intermediate",
"text": "介"
}
}
Response: 3
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "879d4700-746f-411b-954c-f83a2c6cd300",
"asr": {
"index": 2,
"type": "intermediate",
"text": "介绍下长"
}
}
Response: 4
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "20e6a5e5-6d5d-4284-9013-4d410f1a5d37",
"asr": {
"index": 3,
"type": "intermediate",
"text": "介绍下长宁图书"
}
}
Response: 5
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "cf7733e2-da28-442d-b5bb-282fc8f352f3",
"asr": {
"index": 4,
"type": "text",
"text": "介绍一下长宁图书馆。",
"sentence_time": {
"begin_ms": 2080,
"end_ms": 4640
},
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "介"
},
{
"begin_ms": 2560,
"end_ms": 2800,
"text": "绍"
},
{
"begin_ms": 2800,
"end_ms": 2920,
"text": "一"
},
{
"begin_ms": 2920,
"end_ms": 3040,
"text": "下"
},
{
"begin_ms": 3040,
"end_ms": 3280,
"text": "长"
},
{
"begin_ms": 3280,
"end_ms": 3480,
"text": "宁"
},
{
"begin_ms": 3480,
"end_ms": 3640,
"text": "图"
},
{
"begin_ms": 3640,
"end_ms": 3880,
"text": "书"
},
{
"begin_ms": 3880,
"end_ms": 4640,
"text": "馆"
}
]
}
}
Response: 6
{
"service": "asr",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
"asr": {
"index": 5,
"type": "subtitle",
"subtitle": "1\n00:00:00,000 --> 00:00:02,280\n介绍一下长宁图书馆\n\n"
}
}
Response: 7
{
"service": "asr",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
"asr": {
"index": 6,
"type": "subtitle_url",
"subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/8f97055c-bd29-41c7-92d1-3933fed566fa_3dcafe20-d6e0-4bce-a2ba-932b442e9e92.srt"
}
}
Response: 8
{
"service": "asr",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "2bd4cbce-0f72-402c-8e88-0f2704a22868",
"asr": {
"index": 7,
"type": "eof"
}
}
# TTS (Text-to-Speech) Integration Guide (Qid)
Instructions for calling TTS over the central control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all control messages are UTF-8 encoded JSON text.
# Calling Process
- Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v3?Authorization=Bearer {token};
- Send a Starter package, which contains the general configuration information for subsequent TTS requests. If the format is incorrect or the package is not sent within 10 seconds, the WebSocket connection will be closed;
- Receive a response, indicating whether authentication was successful or failed;
- Send a Task package, which contains specific text and format information that needs to be synthesized;
- Receive data packages corresponding to the Task;
- If there are no more speech synthesis tasks, you can disconnect directly (no disconnection message is defined);
# Request Message Format
# Starter
The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text, containing the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
type | Workflow Type | string | Required | Only supports TTS |
device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting and locating problems |
session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates and fills in the Session ID for troubleshooting and locating problems |
tts | TTS Config | object | Required | TTS dedicated configuration, see below for details |
TTS Config configuration details:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
qid | Qid | string | 8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg | Required field |
pitch_offset | Pitch Offset | float | 0.0 | Optional field, pitch, the higher the value, the sharper the voice, and the lower the value, the deeper the voice, supports range [-10, 10] |
speed_ratio | Speed Ratio | float | 1.0 | Optional field, speech speed, the higher the value, the slower the speech, supports range [0.5, 2] |
sample_rate | Sample Rate | int | 16000 | Optional field, sampling rate, supports: 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
volume | Volume | int | 100 | Optional field, volume, the higher the value, the louder the sound, supports range [1, 400] |
format | File Format | string | pcm | Optional field, the audio file format; supports pcm, wav, mp3, but only pcm supports streaming return |
omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages, defaults to returning |
audio | Return Audio Data | bool | true | Optional field, whether to return audio, defaults to returning |
phone | Return Phonetic Symbols | bool | false | Optional field, whether to return phonetic symbols, defaults not to return |
polyphone | Return Polyphone | bool | false | Optional field, whether to return polyphones in the query, defaults not to return |
subtitle | Subtitle Format | string | Empty string | Optional field, the format of the returned subtitles, empty means not to return, supports: srt |
subtitle_max_length | Subtitle Max Length | int | 0 | Optional field, the maximum number of characters for each line of subtitles/sentence-level timestamps, 0 means no limit on the number of characters, only valid when returning subtitles or sentence-level timestamps |
subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional field, whether to wrap subtitles/sentence-level timestamps at punctuation and remove the punctuation; only valid when returning subtitles or sentence-level timestamps. See Subtitle Wrapping Punctuation Range in the Quick Reference section for the set of punctuation marks |
sentence_time | Return Sentence-Level Timestamp | bool | false | Optional field, whether to return sentence-level timestamps |
word_time | Return Word-Level Timestamp | bool | false | Optional field, whether to return word-level timestamps |
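The value ranges in the table above can be validated client-side before sending the Starter. A minimal sketch (`make_tts_config` and the "demo-qid" value are hypothetical; ranges are taken from the table):

```python
def make_tts_config(qid, **opts):
    """Build a TTS Config dict, checking the ranges listed in the table above."""
    allowed_rates = {8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000}
    cfg = {"qid": qid, **opts}
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        raise ValueError("pitch_offset must be in [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        raise ValueError("speed_ratio must be in [0.5, 2]")
    if not 1 <= cfg.get("volume", 100) <= 400:
        raise ValueError("volume must be in [1, 400]")
    if cfg.get("sample_rate", 16000) not in allowed_rates:
        raise ValueError("unsupported sample_rate")
    return cfg

print(make_tts_config("demo-qid", volume=200, sample_rate=16000))  # demo-qid is a placeholder
```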
# Task
After sending the Starter package and successfully establishing a connection, multiple Task packages can be repeatedly sent to submit synthesis tasks. The Task package format is JSON text, containing the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
id | Task ID | string | Random UUIDv4 | Optional field, recommended that the caller generates and fills in, used to distinguish different requests in concurrent requests |
query | Query | string | Required | Text content to be synthesized |
ssml | Use SSML | bool | false | Optional field, whether to use SSML to mark the synthesis text, refer to ONES documentation for writing method |
no_cache | Disable Cache | bool | false | Optional field, whether to disable result caching for the current request, if enabled, results will neither use cache nor be stored in cache for the current request |
override | TTS Config | object | Empty | Optional field, independent configuration for a single TTS request, completely replaces the TTS configuration in the Starter message for the current task (Note: it's a direct replacement, not a merge) |
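Because override replaces rather than merges, fields set in the Starter are dropped whenever an override is present. A minimal sketch of this resolution (`effective_tts_config` is a hypothetical helper):

```python
def effective_tts_config(starter_tts, task):
    """Per the table above, a Task's override completely replaces the
    Starter's TTS Config; it is not merged with it."""
    override = task.get("override")
    return dict(override) if override else dict(starter_tts)

starter_tts = {"qid": "q1", "volume": 200, "subtitle": "srt"}
task = {"query": "你好。", "override": {"qid": "q1", "volume": 100}}
print(effective_tts_config(starter_tts, task))  # note: subtitle is NOT inherited
```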
# Response Message Format
# Authentication Result
After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
session | Session ID | string | Yes | The Session ID of the current connection |
status | Status Name | enum | Yes | The current session's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
# TTS Result Data
Each successful Task continuously returns multiple data packages, including audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packages. Packages of the same type are returned in logical order, but the order across different types of packages is not guaranteed. If phonetic symbols, subtitles, and Cache URL are not requested in the Starter request, only audio is returned.
The format of the return message is JSON text, containing the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., tts |
session | Session ID | string | Yes | The Session ID of the current connection |
trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
status | Status Name | enum | Yes | The current Task's status: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
tts | TTS Content | object | No | If successful, the synthesized result returned, see below for specific field meanings |
Specific synthesized results are located within TTS Content:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
id | Task ID | string | Yes | The ID corresponding to the current Task |
index | Index No. | int | Yes | Sequence number of the return package (audio, phonetic symbol, etc.) |
type | Package Type | enum | Yes | audio for an audio package, audio_url for an audio file URL package, phone for a phonetic symbol package, phone_url for a phonetic symbol file URL package, subtitle for a subtitle package, subtitle_url for a subtitle file URL package, polyphone for a polyphone package, timestamp for a timestamp package, eof indicating all data has been sent |
audio_data | Base64-encoded Audio Data | string | No | Audio data, only in audio package |
phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data, only in phonetic symbol package |
polyphones | Polyphone Data | object | No | Polyphone data, only in polyphone package |
subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data, only in subtitle package |
sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamps, only in timestamp package |
word_times | Word-Level Timestamp | object | No | Word-level timestamps, only in timestamp package |
facefeature_data | Base64-encoded Face Feature | string | No | Face Feature data, only in Face Feature package |
audio_url | Audio File URL | string | No | Audio file URL, only in audio package (deprecated or unused) |
phone_url | URL of JSON for Phonetic Symbols | string | No | Phonetic symbol file URL, only in phonetic symbol package |
subtitle_url | URL of Subtitle File | string | No | Subtitle file URL, only in subtitle package |
resource | Phonetic Symbols Info | object | No | Phonetic symbol information, only in phonetic symbol packages (returned when the TTS2 single-return interface is used) |
# Audio Package
Contains Base64 encoded synthesized audio data results.
When the requested audio format is pcm, audio is returned in multiple packages for streaming; other formats are returned in a single package after synthesis completes.
Audio package example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
}
}
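audio_data is plain Base64, so decoding a package back to raw PCM is a one-liner. A minimal sketch (the sample chunk below is fabricated for illustration, not taken from the document):

```python
import base64

def decode_audio_package(tts):
    """Decode the Base64 audio_data field of an audio package back to raw
    PCM bytes; streamed pcm chunks can be concatenated in index order."""
    return base64.b64decode(tts["audio_data"])

# Fabricated 16-byte chunk of silence, for illustration only.
chunk = {"index": 1, "type": "audio",
         "audio_data": base64.b64encode(bytes(16)).decode()}
print(len(decode_audio_package(chunk)))  # 16 bytes = 8 little-endian int16 samples
```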
# Phoneme Package
Includes the synthesis phoneme data results encoded in Base64.
Example of a phoneme package:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 2,
"type": "phone",
"phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
}
}
# Subtitle Package
Contains Base64 encoded synthesized subtitle data results.
Subtitle package example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 4,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
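subtitle_data is also plain Base64; decoding the payload from the example above recovers the SRT text:

```python
import base64

# subtitle_data taken verbatim from the subtitle package example above.
data = "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
srt = base64.b64decode(data).decode("utf-8")
print(srt)
```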
# Timestamp Package
Contains sentence-level and word-level timestamp information.
Timestamp package example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944",
"index": 7,
"type": "timestamp",
"sentence_time": {
"begin_ms": 7770,
"end_ms": 9140,
"text": "新人起步很不容易"
},
"word_times": [
{"begin_ms": 7770, "end_ms": 7960, "text": "新"},
{"begin_ms": 7960, "end_ms": 8120, "text": "人"},
{"begin_ms": 8120, "end_ms": 8310, "text": "起"},
{"begin_ms": 8310, "end_ms": 8430, "text": "步"},
{"begin_ms": 8430, "end_ms": 8630, "text": "很"},
{"begin_ms": 8630, "end_ms": 8720, "text": "不"},
{"begin_ms": 8720, "end_ms": 8920, "text": "容"},
{"begin_ms": 8920, "end_ms": 9140, "text": "易"}
]
}
}
# Polyphone Package
Contains polyphone information, with the recommended pronunciation listed first, followed by the other pronunciations.
Polyphone package example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
# EOF
EOF packet, indicating that all results have been sent.
EOF example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 8,
"type": "eof"
}
}
# Practical Process Example Analysis
# Case 1: Minimum Configuration Process
Request: Starter
{
"type": "TTS",
"tts": {}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}
Request: Task
{
"query": "大家好!"
}
Response: 2
{
"service": "tts",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260",
"trace": "f2e13c02-c629-4db8-a942-4393583a5182",
"tts": {
"id": "4b69geebj4septyxh72qy885f",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
# Case 2: Complete Configuration Process
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "TTS3",
"device": "device-wei",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"tts": {
"qid": "8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg",
"speed_ratio": 1.05,
"sample_rate": 16000,
"volume": 200,
"phone": true,
"polyphone": true,
"subtitle": "srt",
"sentence_time": true,
"word_time": true,
"cache_url": true
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}
Request: Task
{
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"query": "你好。",
"ssml": false
}
Response: 2 Audio
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
Response: 3 Phoneme
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 2,
"type": "phone",
"phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
}
}
Response: 4 Timestamp
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "de9a1066-d968-475a-ac38-b2da017b2a27",
"index": 3,
"type": "timestamp",
"sentence_time": {
"begin_ms": 500,
"end_ms": 1010,
"text": "你好。"
},
"word_times": [
{"begin_ms": 500, "end_ms": 590, "text": "你"},
{"begin_ms": 590, "end_ms": 1010, "text": "好"}
]
}
}
Response: 5 Polyphone
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
Response: 6 Subtitles
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 5,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
Response: 7 EOF
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 9,
"type": "eof"
}
}
# LLM (Large Language Model) Capability Integration Description
This section describes how to call LLM capabilities through the central control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and control messages are UTF-8 encoded JSON text. The WS interface requires the token either in the Authorization header field or appended to the URL; a token passed in the header takes priority. (Example: see FAQ.)
# Calling Process
- Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
- Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or it is not sent within 10 seconds, the WebSocket connection will be disconnected;
- Receive a response, indicating whether authentication was successful or failed;
- Send Data binary data packets, containing PCM audio;
- After completing audio transmission, send an EOF package (optional; if not sent, an additional 500 milliseconds or more of trailing silence is needed to help VAD detect the end of speech);
- Receive corresponding ASR results, formatted as JSON text;
- If an EOF request package is sent, receive an EOF result package, indicating that all recognition results have been sent;
- If there are no more voice recognition tasks, you can disconnect directly (no disconnection message is defined);
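The connection step above supports two auth styles; a minimal helper to build the connection parameters might look like this (`connect_params` is an illustrative name, not part of the API):

```python
WS_URL = "ws://aigc.softsugar.com/api/voice/stream/v1"

def connect_params(token: str, in_header: bool = True):
    """Return (url, headers) for the two supported auth styles;
    a token passed in the Authorization header takes priority."""
    if in_header:
        return WS_URL, {"Authorization": f"Bearer {token}"}
    return f"{WS_URL}?Authorization=Bearer {token}", {}
```

Either form works; the header form keeps the token out of URLs and logs, so it is generally preferable.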
# Request Message Format
# Starter
The first packet sent after establishing a connection, indicating the purpose of this connection and how subsequent data packets should be parsed. The format is JSON text, containing the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
type | Workflow Type | string | Required | Fill in the service engine number corresponding to the capability, e.g., "NLP7" (SenseChat) or "NLP10" (SenseTime humanoid model) |
device | Device ID | string | Empty string | Device ID, recommended to fill in for traceability and troubleshooting |
session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates a Session ID for traceability and troubleshooting |
nlp | NLP Config | object | Required | NLP-specific configuration, see details below |
NLP Config details are as follows:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages. By default, error messages are returned |
know_ids | Knowledge IDs | string list | Empty list | Optional field, list of knowledge base IDs, supported only by certain large language model engines |
prompt_header | System Role of Prompt | string | Empty string | Optional field, background description of the prompt. If empty, the preset value in the configuration is used. Supported only by certain large language model engines |
max_reply_token | Max Token in Reply | int | 500 | Optional field, maximum number of tokens in the reply. The actual maximum value depends on the model, supported only by certain large language model engines |
# Query
After the Starter packet is sent and the connection is successfully established, multiple Query text packets can be sent to submit user questions. The format is JSON text, containing the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
id | Trace ID | string | Random UUIDv4 | Optional field, it is recommended that the caller generates and fills in the Trace ID to distinguish between different concurrent requests |
query | Query | string | Required | The text content of the user's question |
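Putting the Starter and Query formats together, a minimal message builder might look like this (a sketch; the helper names are illustrative, not part of the API):

```python
import json
import uuid

def nlp_starter(engine: str = "NLP7", max_reply_token: int = 500) -> str:
    # Starter: workflow type plus the NLP-specific configuration block
    return json.dumps({
        "type": engine,
        "session": str(uuid.uuid4()),  # caller-generated, aids troubleshooting
        "nlp": {"omit_error": False, "max_reply_token": max_reply_token},
    }, ensure_ascii=False)

def nlp_query(text: str) -> str:
    # Query: a caller-generated trace id distinguishes concurrent requests
    return json.dumps({"id": str(uuid.uuid4()), "query": text},
                      ensure_ascii=False)
```

The Starter is sent once per connection; any number of Query packets may follow.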
# Response Message Format
# Authentication Result
After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:
Field | Name | Type | Required | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
session | Session ID | string | Yes | The Session ID of the current connection |
status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
error | Error Message | string | No | If failed, the returned error message |
# NLP Result Data
For each successfully processed Query, a message will be returned. When omit_error is false, error messages are also returned. The basic format of the return message is as follows:
Field | Name | Type | Required | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., nlp |
session | Session ID | string | Yes | The Session ID of the current connection |
trace | Trace ID | string | Yes | The Trace ID corresponding to the current sentence |
status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
error | Error Message | string | No | If failed, the returned error message |
nlp | NLP Content | object | No | If successful, the returned answer result. See the specific field meanings below |
The specific recognition results are located in NLP Content:
Field | Name | Type | Required | Description |
---|---|---|---|---|
index | Index No. | int | Yes | The sequence number of the return packet |
query | Query | string | Yes | The submitted question text |
answer | Answer | string | Yes | The returned broadcast text; the digital human front end passes it to TTS for playback |
text | Text | string | No | The returned display text, displayed as text on the digital human front end |
finish_reason | Finish Reason | string | Yes | The reason generation stopped. Enumerated values: stop (stopped by end token), length (reached the maximum generation length), sensitive (triggered sensitive words), context (triggered the model context length limit) |
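A receiver for NLP result messages can dispatch on status and finish_reason; a minimal sketch (function names are illustrative):

```python
import json

FINISH_REASONS = {"stop", "length", "sensitive", "context"}

def handle_nlp_message(raw: str):
    """Return (answer, finish_reason) for a successful NLP result;
    raise on a failed status or an unknown finish reason."""
    msg = json.loads(raw)
    if msg.get("status") != "ok":
        raise RuntimeError(msg.get("error", "NLP request failed"))
    nlp = msg["nlp"]
    reason = nlp["finish_reason"]
    if reason not in FINISH_REASONS:
        raise ValueError(f"unexpected finish_reason: {reason}")
    return nlp["answer"], reason
```

A finish_reason of length or context usually means the reply was cut off, so callers may want to surface that to the user rather than treat it as a complete answer.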
# TTS (Text-to-Speech) Integration Guide (Old)
Note: This interface is currently in maintenance mode and will not be updated with new features.
This document explains how to use the TTS capability via the Central Control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all messages are UTF-8 encoded JSON texts.
# Invocation Process
- Establish a WebSocket connection, usually at ws://aigc.softsugar.com/api/voice/stream/v1?Authorization={Token};
- Send a Starter packet, containing the general configuration information for subsequent TTS requests. If the format is incorrect or it is not sent within 10 seconds, the WebSocket connection will be terminated;
- Receive a response indicating whether authentication was successful or failed;
- Send a Task packet, containing specific text and format information to be synthesized;
- Receive data packets corresponding to the Task;
- If there are no more voice synthesis tasks, you can disconnect directly (no disconnect message is defined);
# Request Message Format
# Starter
The first packet sent after establishing a connection, indicating the purpose of this connection and the parsing method for subsequent data packets. The format is a JSON text, including the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
type | Workflow Type | string | Mandatory | Fill in the service engine number corresponding to the capability, e.g., "TTS3" |
device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting and tracking issues |
session | Session ID | string | Random UUIDv4 | It is recommended to generate and fill in the Session ID for troubleshooting and tracking issues |
tts | TTS Config | object | Mandatory | TTS-specific configuration, see below for details |
TTS Config details:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
language | Language Code | string | zh-CN | Optional field, the language to be synthesized, must be supported by the voice actor |
voice | Voice ID | string | Different default per engine | Optional field, selectable voice actor |
pitch_offset | Pitch Offset | float | 0.0 | Optional field, pitch, the higher the value, the sharper the voice, and vice versa. Range [-10, 10] |
style | Style | string | Empty | Optional field, indicates the emotion of the voice actor |
speed_ratio | Speed Ratio | float | 1.0 | Optional field, speech speed, the higher the value, the slower the speech. Range [0.5, 2] |
sample_rate | Sample Rate | int | 16000 | Optional field, sampling rate, supports: 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
volume | Volume | int | 100 | Optional field, volume, the higher the value, the louder the sound. Range [1, 400] |
format | File Format | string | pcm | Optional field, audio file and content format, may support pcm, wav, mp3, but only pcm supports streaming return |
omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages, by default, they will be returned |
audio | Return Audio Data | bool | true | Optional field, whether to return audio, by default, it will be returned |
phone | Return Phonetic Symbols | bool | false | Optional field, whether to return phonetic symbols, by default, they will not be returned |
polyphone | Return Polyphone | bool | false | Optional field, whether to return polyphones in the query, by default, they will not be returned |
subtitle | Subtitle Format | string | Empty string | Optional field, the format of the subtitles to be returned, empty means not returned, supports: srt |
subtitle_max_length | Subtitle Max Length | int | 0 | Optional field, the maximum number of characters per subtitle line/time-stamped sentence, 0 means unlimited, only effective when returning subtitles or sentence-level timestamps |
subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional field, whether to line-break and remove punctuation for subtitles/sentence-level timestamps based on punctuation, only effective when returning subtitles or sentence-level timestamps. See Quick Reference for the range of punctuation |
sentence_time | Return Sentence-Level Timestamp | bool | false | Optional field, whether to return sentence-level timestamps |
word_time | Return Word-Level Timestamp | bool | false | Optional field, whether to return word-level timestamps |
cache_url | Return Cache URL | bool | false | Optional field, whether to return URLs of cached result files (audio_url / phone_url / subtitle_url packets); by default, they will not be returned |
# Task
After sending the Starter packet and successfully establishing a connection, multiple Task packets can be sent to submit synthesis tasks. The Task packet format is a JSON text, including the following fields:
Field | Name | Type | Default Value | Description |
---|---|---|---|---|
id | Task ID | string | Random UUIDv4 | Optional field, it is recommended to generate and fill in to distinguish different requests in concurrent scenarios |
query | Query | string | Mandatory | The text content to be synthesized |
ssml | Use SSML | bool | false | Optional field, whether to use SSML to mark the synthesis text, refer to ONES documentation for writing methods |
no_cache | Disable Cache | bool | false | Optional field, whether to disable result caching for the current request, if enabled, neither cache results will be used nor will the result be stored in the cache |
override | TTS Config | object | Empty | Optional field, independent configuration for a single TTS request, completely replaces the TTS configuration in the Starter message for the current task (Note: this is a direct replacement, not a merger) |
# Response Message Format
# Authentication Result
After sending the Starter request, a message containing the authentication result will be returned. The format is a JSON text, including the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
session | Session ID | string | Yes | The Session ID of the current connection |
status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
# TTS Result Data
Each successful Task continuously returns multiple data packets, including audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packets. Data packets of the same type are returned in logical order, but the order of different types of data packets is not guaranteed. If phonetic symbols, subtitles, or cache URLs are not requested in the Starter request, only audio will be returned.
The format of the return message is a JSON text, including the following fields:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
service | Service Name | string | Yes | The service module corresponding to the current request, i.e., tts |
session | Session ID | string | Yes | The Session ID of the current connection |
trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
status | Status Name | enum | Yes | The status of the current Task: ok on success, fail on failure |
error | Error Message | string | No | If failed, the error message returned |
tts | TTS Content | object | No | If successful, the synthesis result returned, see below for specific field meanings |
Specific synthesis results are located within TTS Content:
Field | Name | Type | Mandatory | Description |
---|---|---|---|---|
id | Task ID | string | Yes | The ID corresponding to the current Task |
index | Index No. | int | Yes | Sequence number of audio packets, phonetic symbol packets |
type | Package Type | enum | Yes | audio for audio packets, audio_url for audio address packets, phone for phonetic symbol packets, phone_url for phonetic symbol address packets, subtitle for subtitle packets, subtitle_url for subtitle address packets, polyphone for polyphone packets, timestamp for timestamp packets, eof indicates all packets have been sent |
audio_data | Base64-encoded Audio Data | string | No | Audio data, only present in audio packets |
phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data, only present in phonetic symbol packets |
polyphones | Polyphone Data | object | No | Polyphone data, only present in polyphone packets |
subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data, only present in subtitle packets |
sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp, only present in timestamp packets |
word_times | Word-Level Timestamp | object | No | Word-level timestamp, only present in timestamp packets |
resource | Phonetic Symbols Info | object | No | Phonetic symbol information, only present in phonetic symbol packets (When using TTS2 single return interface, this field will always be returned) |
# Audio Packet
Contains Base64 encoded synthesized audio data results.
When the audio format is pcm, it is returned in multiple streaming packets; other formats are returned in a single packet after synthesis completes.
Audio packet example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
}
}
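Streamed pcm packets can be reassembled in index order and wrapped in a WAV container for playback; a sketch assuming 16-bit mono samples (`assemble_pcm` is an illustrative helper):

```python
import base64
import io
import wave

def assemble_pcm(packets: list[dict], sample_rate: int = 16000) -> bytes:
    """Concatenate audio packets in index order and wrap the raw
    16-bit mono pcm data in a WAV container."""
    audio = sorted((p for p in packets if p["type"] == "audio"),
                   key=lambda p: p["index"])
    pcm = b"".join(base64.b64decode(p["audio_data"]) for p in audio)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)    # mono
        w.setsampwidth(2)    # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

The sample rate passed here should match the sample_rate requested in the Starter's TTS config.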
# Phoneme Packet
Contains the Base64-encoded synthesized phoneme data.
Example of a phoneme packet:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 2,
"type": "phone",
"phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
}
}
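Since phone_data is just Base64-encoded, space-separated phoneme tokens, decoding it is a one-liner (a sketch; the helper name is illustrative):

```python
import base64

def decode_phones(phone_data: str) -> list[str]:
    # phone_data decodes to space-separated phoneme tokens
    # (e.g. initials/finals with tone digits, plus markers like "#1" and "end")
    return base64.b64decode(phone_data).decode("utf-8").split()
```

Applied to the example above, the sequence begins with the phonemes of 你 (n, i) and ends with a run of end markers.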
# Subtitle Packet
Contains the Base64-encoded synthesized subtitle data.
Example of a subtitle packet:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 4,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
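Decoding subtitle_data yields a standard SRT document; the sample above decodes to a single cue covering 你好。 (a sketch; the helper name is illustrative):

```python
import base64

def decode_subtitle(subtitle_data: str) -> str:
    # subtitle_data is a Base64-encoded SRT file
    return base64.b64decode(subtitle_data).decode("utf-8")
```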
# Timestamp Package
Includes sentence-level and character-level timestamp information.
Example of timestamp:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944",
"index": 7,
"type": "timestamp",
"sentence_time": {
"begin_ms": 7770,
"end_ms": 9140,
"text": "新人起步很不容易"
},
"word_times": [
{"begin_ms": 7770, "end_ms": 7960, "text": "新"},
{"begin_ms": 7960, "end_ms": 8120, "text": "人"},
{"begin_ms": 8120, "end_ms": 8310, "text": "起"},
{"begin_ms": 8310, "end_ms": 8430, "text": "步"},
{"begin_ms": 8430, "end_ms": 8630, "text": "很"},
{"begin_ms": 8630, "end_ms": 8720, "text": "不"},
{"begin_ms": 8720, "end_ms": 8920, "text": "容"},
{"begin_ms": 8920, "end_ms": 9140, "text": "易"}
]
}
}
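As the example shows, word-level timestamps are contiguous and tile the sentence-level span, so per-character durations can be derived directly (a sketch; `word_durations` is an illustrative helper):

```python
def word_durations(word_times: list[dict]) -> list[tuple[str, int]]:
    # each entry covers [begin_ms, end_ms); adjacent entries are contiguous,
    # so the durations sum to the sentence-level span
    return [(w["text"], w["end_ms"] - w["begin_ms"]) for w in word_times]
```

This is convenient for driving lip-sync or karaoke-style highlighting from the timestamp packet alone.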
# Polyphone Packet
Contains polyphone information, with the recommended pronunciation listed first, followed by the alternatives.
Example of a polyphone packet:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
# EOF
EOF result packet, indicating that all results have been sent.
EOF example:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 8,
"type": "eof"
}
}
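A client typically buffers each task's result packets until its eof packet arrives; since different packet types may interleave, grouping by type is convenient (a sketch; names are illustrative):

```python
import json

def collect_task_packets(messages: list[str], task_id: str) -> dict:
    """Group one task's TTS result packets by type, stopping at eof.
    Same-type packets arrive in index order; cross-type order is not
    guaranteed."""
    grouped: dict[str, list[dict]] = {}
    for raw in messages:
        msg = json.loads(raw)
        tts = msg.get("tts") or {}
        if tts.get("id") != task_id:
            continue  # packet belongs to a different concurrent task
        if tts["type"] == "eof":
            break
        grouped.setdefault(tts["type"], []).append(tts)
    return grouped
```

Filtering on the task id matters in concurrent scenarios, where packets from several outstanding Tasks can arrive interleaved on the same connection.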
# Practical Process Sample Analysis
# Case 1: Minimum Configuration Process
Request: Starter
{
"type": "TTS3",
"tts": {}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}
Request: Task
{
"query": "大家好!"
}
Response: 2
{
"service": "tts",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260",
"trace": "f2e13c02-c629-4db8-a942-4393583a5182",
"tts": {
"id": "4b69geebj4septyxh72qy885f",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
# Case 2: Complete Configuration Process
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "TTS3",
"device": "device-wei",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"tts": {
"language": "zh-CN",
"voice": "xiaoling",
"speed_ratio": 1.05,
"sample_rate": 16000,
"volume": 200,
"phone": true,
"polyphone": true,
"subtitle": "srt",
"sentence_time": true,
"word_time": true
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}
Request: Task
{
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"query": "你好。",
"ssml": false
}
Response: 2 Audio
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
Response: 3 Phoneme
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 2,
"type": "phone",
"phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
}
}
Response: 4 Timestamp
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "de9a1066-d968-475a-ac38-b2da017b2a27",
"index": 3,
"type": "timestamp",
"sentence_time": {
"begin_ms": 500,
"end_ms": 1010,
"text": "你好。"
},
"word_times": [
{"begin_ms": 500, "end_ms": 590, "text": "你"},
{"begin_ms": 590, "end_ms": 1010, "text": "好"}
]
}
}
Response: 5 Polyphone
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
Response: 6 Subtitle
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 5,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
Response: 7 EOF
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 9,
"type": "eof"
}
}