# Voice Processing
Beyond its core multi-element video synthesis capabilities, the platform provides voice processing capabilities that let users preprocess voice-related content for video synthesis. These include TTS, ASR, and polyphonic character handling. USSML (Unified Speech Synthesis Markup Language) is also provided to normalize the SSML (Speech Synthesis Markup Language) syntax across TTS vendors.
# Capability Introduction
- TTS (speech synthesis)
  - Reads out the text content in the user-selected speaker voice, with adjustable pitch, speech rate, and volume.
- ASR (speech recognition)
  - Parses the audio content uploaded by the user and transcribes it into text.
- Polyphonic character handling
  - Searches the user-supplied text for polyphonic characters and reports each candidate character with its alternative pronunciations.
# API Reference
To call any API service on the platform, send requests to the service endpoint aigc.softsugar.com and include the token in the request header.
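As a minimal sketch of the access pattern above, the following builds (without sending) an authenticated JSON POST request. The `Authorization: Bearer` header form is an assumption borrowed from the streaming URL example later in this document, and the `https` scheme and the `build_request` helper name are illustrative only.

```python
import json
import urllib.request

# Assumption: the source gives only the host name, not the scheme.
BASE_URL = "https://aigc.softsugar.com"


def build_request(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the token in a header."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header format; confirm against the platform's auth documentation.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_request("/api/voice/v1/nlp/language", {"query": "你好"}, "YOUR_TOKEN")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.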
# TTS Language Detection
# API Description
Identifies the language of the text content uploaded by the user and returns the corresponding language code.
# Request URL
POST
/api/voice/v1/nlp/language
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| query | String | True | Text content |
# Request Example
{
"query": "待合成语音的文本内容xxxxxx"
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | Language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |
# Response Example
case1: Request succeeded
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": "zh"
}
}
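A small sketch of how a caller might read this response: check the outer `code`, then `data.status`, and only then use `data.result`. The function name `extract_language` and the choice of `RuntimeError` are illustrative assumptions, not part of the API.

```python
def extract_language(response: dict) -> str:
    """Return the detected language code, or raise if the call did not succeed."""
    if response.get("code") != 0:
        raise RuntimeError(f"request failed: {response.get('message')}")
    data = response.get("data") or {}
    if data.get("status") != 0:
        # Non-zero statuses are listed in the status table (e.g. 3301).
        raise RuntimeError(f"detection failed with status {data.get('status')}")
    return data["result"]


sample = {"code": 0, "message": "ok", "data": {"status": 0, "result": "zh"}}
print(extract_language(sample))
```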
# TTS Language Validation (QID)
# API Description
Determines whether the text content, language, and QID provided by the user match each other, and returns the detected language code.
# Request URL
POST
/api/voice/v3/tts/validate
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| query | String | True | Text content |
| qid | String | True | Voice QID |
| ssml | Boolean | False | Whether to use SSML; defaults to false |
# Request Example
{
"query": "待合成语音的文本内容xxxxxx",
"qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK"
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | Validation result |
| data.result.valid | Boolean | Whether matched |
| data.result.language | String | Detected language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |
# Response Example
case1: Request succeeded
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": {
"valid": true,
"language": "zh"
}
}
}
# TTS Language Validation (Legacy)
Note: This API is currently in maintenance mode and will not receive new feature updates. It is recommended to use the QID API.
# API Description
Determines whether the text content, language, voice, and vendor ID provided by the user match each other, and returns the detected language code.
# Request URL
POST
/api/voice/v1/tts/validate
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| query | String | True | Text content |
| voice_id | String | True | Voice ID |
| language | String | True | Language, in BCP 47 format, e.g. "en-US", "zh-CN" |
| vendor_id | Integer | True | Vendor ID |
| ssml | Boolean | False | Whether to use SSML |
# Request Example
{
"query": "待合成语音的文本内容xxxxxx",
"voice_id": "xiaoling",
"language": "zh-CN",
"vendor_id": 3
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | Validation result |
| data.result.valid | Boolean | Whether matched |
| data.result.language | String | Detected language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |
# Response Example
case1: Request succeeded
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": {
"valid": true,
"language": "zh"
}
}
}
# Voice ID Migration to QID
# API Description
Actively migrates existing voices represented by Voice ID to new voices represented by QID, and returns the corresponding QID value. It is recommended to store the QID for subsequent inference requests.
# Request URL
POST
/api/voice/v3/tts/migrate
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| voice_id | String | True | Voice ID to be converted |
# Request Example
{
"voice_id": "xiaoning"
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | QID, representing the preconfigured QID. This field is long; storing it as VARCHAR(512) is recommended |
# Response Example
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"
}
}
# QID Details API
# API Description
Gets detailed information about a QID, including the adjustable parameter details supported by the voice.
# Request URL
GET
/api/voice/v3/tts/qid/{qid}
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| qid | String | True | QID to query for details |
# Request Example
https://domain/api/voice/v3/tts/qid/mwvA2f:AEBEvMy850Y_Z10Mqp9GUwTMr8xMyI3Tzk3Q
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | int | Response code |
| message | string | Status description |
| data | object | Result data |
| data.status | int | Status, see status table |
| data.result | object | QID details |
| data.result.pitch | boolean | Whether pitch adjustment is supported |
| data.result.speed | boolean | Whether speed adjustment is supported |
| data.result.volume | boolean | Whether volume adjustment is supported |
| data.result.subtitle | boolean | Whether subtitle, sentence-level timestamp, and word-level timestamp return is supported |
| data.result.ussml | object | USSML syntax support details |
| data.result.ussml.break | boolean | Whether USSML pauses take effect, i.e. whether the break tag is supported |
| data.result.ussml.phoneme | string list | Whether USSML pronunciation specification takes effect, i.e. whether the phoneme tag is supported. Possible values: pinyin, ipa |
| data.result.ussml.sub | boolean | Whether USSML text substitution takes effect, i.e. whether the sub tag is supported |
| data.result.ussml.sayas | string list | Whether USSML reading-style specification takes effect, i.e. whether the say-as tag is supported. Possible values: cardinal, digit, phone, address, date, clock |
| data.result.languages | string list | Supported language list |
# Response Example
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": {
"pitch": true,
"speed": true,
"volume": true,
"phone": true,
"subtitle": true,
"ussml": {
"break": true,
"phoneme": ["pinyin"],
"sub": true,
"sayas": ["cardinal", "digit", "phone", "address", "date", "clock"]
},
"languages": [
"zh-CN",
"en-US"
]
}
}
}
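One way to use these capability flags is to drop tuning parameters a voice does not support before submitting a TTS request. The sketch below assumes a mapping from request parameters to capability flags (`PARAM_TO_CAPABILITY`); in particular, tying `word_time` to the `subtitle` flag is an interpretation of the table above, not something the API states.

```python
# Assumed mapping from TTS request parameters to QID capability flags.
PARAM_TO_CAPABILITY = {
    "pitch_offset": "pitch",
    "speed_ratio": "speed",
    "volume": "volume",
    "word_time": "subtitle",
}


def filter_tts_params(params: dict, capabilities: dict) -> dict:
    """Keep only parameters that are ungated or whose capability flag is true."""
    kept = {}
    for name, value in params.items():
        capability = PARAM_TO_CAPABILITY.get(name)
        if capability is None or capabilities.get(capability, False):
            kept[name] = value
    return kept


capabilities = {"pitch": False, "speed": True, "volume": True, "subtitle": True}
params = {"query": "hello", "pitch_offset": 10.0, "speed_ratio": 1.2, "word_time": True}
print(filter_tts_params(params, capabilities))
```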
# Submit TTS Request (QID)
# API Description
Reads out the text content based on the user-selected speaker voice, supporting adjustment of pitch, speech rate, and volume.
# Request URL
POST
/api/voice/v3/tts/request
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| qid | String | True | Speaker QID |
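| query | String | True | Text content to synthesize |
| ssml | Boolean | False | Whether to use USSML |
| timeout | Integer | False | Timeout for waiting for the result, in ms. If the result is ready within the timeout, the TTS result is returned directly; otherwise a request_id is returned |
| pitch_offset | Float | False | Pitch; higher values sound sharper, lower values sound deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech rate; the larger the value, the slower the speech. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume; the larger the value, the louder the audio. Supported range [1, 400]; default 100 |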
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |
# Request Example
{
"qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK",
"query": "待合成语音的文本内容xxxxxx",
"timeout": 3000,
"word_time": true,
"pitch_offset": 0.0,
"speed_ratio": 1.0,
"volume": 100
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if synthesis finishes within the timeout; otherwise an empty object {} |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
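The documented parameter ranges can be checked on the client before a request is submitted. The following is a minimal sketch under that assumption; `build_tts_payload` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
from typing import Optional


def build_tts_payload(
    qid: str,
    query: str,
    *,
    pitch_offset: float = 0.0,
    speed_ratio: float = 1.0,
    volume: int = 100,
    timeout_ms: Optional[int] = None,
    word_time: bool = False,
) -> dict:
    """Validate the documented ranges and return a request body as a dict."""
    if not qid:
        raise ValueError("qid is required")
    if not query:
        raise ValueError("query is required")
    if not -60 <= pitch_offset <= 60:
        raise ValueError("pitch_offset must be within [-60, 60]")
    if not 0.5 <= speed_ratio <= 2:
        raise ValueError("speed_ratio must be within [0.5, 2]")
    if not 1 <= volume <= 400:
        raise ValueError("volume must be within [1, 400]")
    payload = {
        "qid": qid,
        "query": query,
        "pitch_offset": pitch_offset,
        "speed_ratio": speed_ratio,
        "volume": volume,
        "word_time": word_time,
    }
    if timeout_ms is not None:
        # The API field is named "timeout" and is expressed in milliseconds.
        payload["timeout"] = timeout_ms
    return payload


payload = build_tts_payload("EXAMPLE_QID", "hello", timeout_ms=3000, word_time=True)
print(payload)
```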
# TTS Status Query (QID)
# API Description
Returns the current status of a specified TTS request based on input parameters.
# Request URL
POST
/api/voice/v3/tts/result
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Parameters
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated once synthesis has completed; otherwise an empty object {} |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
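Because a submitted request may still be processing (status 1000), callers typically poll this endpoint until the status changes. The loop below is a sketch of that pattern; the `fetch` callable stands in for the actual HTTP call, and the interval and attempt limits are arbitrary assumptions.

```python
import time
from typing import Callable

PROCESSING = 1000  # "Processing" in the status table


def wait_for_result(
    fetch: Callable[[str], dict],
    request_id: str,
    interval_s: float = 0.5,
    max_attempts: int = 20,
) -> dict:
    """Poll until the request succeeds, fails, or the attempts run out."""
    for _ in range(max_attempts):
        response = fetch(request_id)
        status = response["data"]["status"]
        if status == 0:
            return response["data"]["result"]
        if status != PROCESSING:
            raise RuntimeError(f"request {request_id} failed with status {status}")
        time.sleep(interval_s)
    raise TimeoutError(f"request {request_id} still processing after {max_attempts} attempts")


# Fake responses so the sketch runs without network access.
responses = iter([
    {"code": 0, "data": {"status": 1000, "request_id": "r1", "result": {}}},
    {"code": 0, "data": {"status": 0, "request_id": "r1",
                         "result": {"audio_url": "https://example.invalid/a.mp3"}}},
])
result = wait_for_result(lambda request_id: next(responses), "r1", interval_s=0)
print(result)
```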
# Submit ASR Request
# API Description
Parses the audio content uploaded by the user and transcribes it into corresponding text content.
# Request URL
POST
/api/voice/v1/request/asr
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| audio_url | String | True | Audio file URL of the text to be recognized |
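| timeout | Integer | False | Timeout for waiting for the result, in ms. If the result is ready within the timeout, the ASR result is returned directly; otherwise a request_id is returned |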
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |
# Request Example
{
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"timeout": 3000
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR result. Populated if recognition finishes within the timeout; otherwise an empty object {} |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.text | String | Recognized text |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"text": "ASR识别结果",
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
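The `word_times` entries can be combined into a subtitle-style cue. The sketch below formats the millisecond timestamps the way SRT files do and joins the word texts without separators, which is an assumption that suits Chinese text; space-delimited languages would need a different join.

```python
def format_ms(ms: int) -> str:
    """Format milliseconds as an SRT-style HH:MM:SS,mmm timestamp."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_cue(word_times: list) -> str:
    """Merge consecutive word entries into one timed cue."""
    if not word_times:
        return ""
    start = word_times[0]["begin_ms"]
    end = word_times[-1]["end_ms"]
    text = "".join(word["text"] for word in word_times)
    return f"{format_ms(start)} --> {format_ms(end)}\n{text}"


word_times = [
    {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
    {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
    {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
]
print(words_to_cue(word_times))
```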
# ASR Status Query
# API Description
Returns the current status of a specified ASR request based on input parameters.
# Request URL
POST
/api/voice/v1/result/asr
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR result. Populated once recognition has completed; otherwise an empty object {} |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.text | String | Recognized text |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"text": "ASR识别结果",
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# Submit Polyphonic Character Request
# API Description
Returns whether the text contains polyphonic characters and their selectable pronunciations based on input parameters.
# Request URL
POST /api/voice/v1/request/polyphony
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| query | String | True | Request text |
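| timeout | Integer | False | Timeout for waiting for the result, in ms. If the result is ready within the timeout, the polyphonic character result is returned directly; otherwise a request_id is returned |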
# Request Example
{
"query": "待合成语音的文本内容xxxxxx",
"timeout": 3000
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphonic character result. Populated if processing finishes within the timeout; otherwise an empty list [] |
| data.result.text | String | Polyphonic character text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; recommended pronunciation first, alternatives after |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": []
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": [
{
"text":"待",
"polyphony":["dai4","dai1"]
},
{
"text":"的",
"polyphony":["de5","di4","di1","di2"]
}
]
}
}
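A polyphony result can be turned into USSML `<phoneme>` tags so that the chosen reading is enforced at synthesis time. The sketch below picks the first (recommended) pronunciation unless the caller overrides it; the helper names and the `overrides` mechanism are illustrative assumptions.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(char: str, pinyin: str) -> str:
    """Wrap one character in a USSML phoneme tag with the given pinyin."""
    return f"<phoneme ph={quoteattr(pinyin)}>{escape(char)}</phoneme>"


def choose_pronunciations(result: list, overrides: Optional[dict] = None) -> list:
    """Return one phoneme tag per polyphonic character in the result."""
    overrides = overrides or {}
    tags = []
    for item in result:
        candidates = item["polyphony"]
        # The first candidate is the recommended pronunciation.
        pick = overrides.get(item["text"], candidates[0])
        if pick not in candidates:
            raise ValueError(f"{pick!r} is not a listed pronunciation of {item['text']!r}")
        tags.append(phoneme_tag(item["text"], pick))
    return tags


result = [
    {"text": "待", "polyphony": ["dai4", "dai1"]},
    {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
]
print(choose_pronunciations(result, overrides={"的": "di4"}))
```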
# Polyphonic Character Status Query
# API Description
Returns the current status of a specified polyphonic character request based on input parameters.
# Request URL
POST /api/voice/v1/result/polyphony
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphonic character result. Populated once processing has completed; otherwise an empty list [] |
| data.result.text | String | Polyphonic character text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; recommended pronunciation first, alternatives after |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": []
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": [
{
"text":"待",
"polyphony":["dai4","dai1"]
},
{
"text":"的",
"polyphony":["de5","di4","di1","di2"]
}
]
}
}
# Submit TTS Request (Legacy)
Note: This API is currently in maintenance mode and will not receive new feature updates. It is recommended to use the QID API.
# API Description
Reads out the text content based on the user-selected speaker voice, supporting adjustment of pitch, speech rate, and volume.
# Request URL
POST
/api/voice/v1/request/tts
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| voice_id | String | True | Speaker ID |
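| language | String | False | Language |
| query | String | True | Text content to synthesize |
| ssml | Boolean | False | Whether to use SSML |
| timeout | Integer | False | Timeout for waiting for the result, in ms. If the result is ready within the timeout, the TTS result is returned directly; otherwise a request_id is returned |
| pitch_offset | Float | False | Pitch; higher values sound sharper, lower values sound deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech rate; the larger the value, the slower the speech. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume; the larger the value, the louder the audio. Supported range [1, 400]; default 100 |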
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |
# Request Example
{
"voice_id": "xiaoling",
"query": "待合成语音的文本内容xxxxxx",
"timeout": 3000,
"word_time": true,
"pitch_offset": 0.0,
"speed_ratio": 1.0,
"volume": 100
}
# Response Elements
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if synthesis finishes within the timeout; otherwise an empty object {} |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# TTS Status Query (Legacy)
Note: This API is currently in maintenance mode and will no longer receive new feature updates. It is recommended to use the QID API.
# API Description
Returns the current status of a specified TTS request based on input parameters.
# Request URL
POST
/api/voice/v1/result/tts
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Parameters
| Field | Type | Description |
|---|---|---|
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated once synthesis has completed; otherwise an empty object {} |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# Status Table
# Language Detection Related Status
# TTS Language Detection Request Response Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Processed successfully |
| 3201 | Rejected: required parameter missing |
| 3202 | Rejected: invalid parameter (request text is empty) |
| 3301 | Request failed: language could not be detected |
# Language Validation Related Status
# TTS Language Validation Request Response Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Processed successfully |
| 1201 | Rejected: required parameter missing |
| 1202 | Rejected: invalid parameter (voice ID is empty) |
| 1203 | Rejected: invalid parameter (request text is empty) |
| 1204 | Rejected: invalid parameter (language is empty) |
| 1205 | Rejected: invalid parameter (voice ID does not exist) |
| 1206 | Rejected: invalid parameter (unknown language) |
| 1207 | Rejected: invalid parameter (unknown vendor) |
| 1301 | Request failed: language could not be detected |
# TTS Related Status
# TTS Request Response Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Processed successfully |
| 1000 | Processing |
| 1002 | Rejected: required parameter missing |
| 1003 | Rejected: invalid parameter (speaker is empty) |
| 1004 | Rejected: invalid parameter (synthesis text is empty) |
| 1005 | Rejected: invalid parameter (speaker does not exist) |
| 1102 | Failure: not connected to the service |
| 1103 | Failure: service busy |
| 1104 | Failure: internal error |
| 1105 | Failure: request timed out |
# TTS Request Query Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Success |
| 1000 | Processing |
| 1102 | Failure: not connected to the service |
| 1103 | Failure: service busy |
| 1104 | Failure: internal error |
| 1105 | Failure: request timed out |
| 1106 | Failure: unknown request ID |
# ASR Related Status
# ASR Request Response Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Request succeeded |
| 1000 | Request timed out, new request created |
| 2001 | Rejected: required parameter missing |
| 2002 | Rejected: invalid parameter (audio URL is empty) |
| 2102 | Failure: not connected to the service |
| 2103 | Failure: service busy |
| 2104 | Failure: internal error |
| 2105 | Failure: request timed out |
| 2106 | Failure: audio download error |
# ASR Request Query Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Success |
| 1000 | Processing |
| 2102 | Failure: not connected to the service |
| 2103 | Failure: service busy |
| 2104 | Failure: internal error |
| 2105 | Failure: request timed out |
| 2106 | Failure: audio download error |
| 2107 | Failure: unknown request ID |
# Polyphonic Character Related Status
# Polyphonic Character Request Response Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Processed successfully |
| 1000 | Processing |
| 5002 | Rejected: required parameter missing |
| 5004 | Rejected: invalid parameter (synthesis text is empty) |
| 5102 | Failure: not connected to the service |
| 5103 | Failure: service busy |
| 5104 | Failure: internal error |
| 5105 | Failure: request timed out |
# Polyphonic Character Request Query Status
| code | Description |
|---|---|
| 0 | Success |
| x | Failure |

| status | Description |
|---|---|
| 0 | Success |
| 1000 | Processing |
| 5102 | Failure: not connected to the service |
| 5103 | Failure: service busy |
| 5104 | Failure: internal error |
| 5105 | Failure: request timed out |
| 5106 | Failure: unknown request ID |
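Client code often only needs to know which broad category a status falls into. The sketch below groups the status values listed in the tables above into four categories; the category names and the `classify_status` helper are illustrative, and any status not listed in this document is reported as unknown.

```python
# Status values copied from the tables above, grouped by their description prefix.
REJECTED = {
    1002, 1003, 1004, 1005,
    1201, 1202, 1203, 1204, 1205, 1206, 1207,
    2001, 2002, 3201, 3202, 5002, 5004,
}
FAILED = {
    1102, 1103, 1104, 1105, 1106, 1301,
    2102, 2103, 2104, 2105, 2106, 2107,
    3301, 5102, 5103, 5104, 5105, 5106,
}


def classify_status(status: int) -> str:
    """Map a data.status value to success, processing, rejected, failed, or unknown."""
    if status == 0:
        return "success"
    if status == 1000:
        return "processing"
    if status in REJECTED:
        return "rejected"
    if status in FAILED:
        return "failed"
    return "unknown"


for value in (0, 1000, 1003, 2106, 9999):
    print(value, classify_status(value))
```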
# USSML Syntax Reference
# Introduction
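USSML (Unified Speech Synthesis Markup Language) aims to provide a unified SSML (Speech Synthesis Markup Language) syntax. USSML supports the most commonly used SSML tags (pause, pronunciation, text substitution, and reading style), which cover most speech synthesis needs.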
# Usage
# Syntax Format
The USSML syntax is as follows:
<speak sttts:version="0.1">
<break time="string" />
<phoneme ph="string"></phoneme>
<sub alias="string"></sub>
<say-as interpret-as="string"></say-as>
</speak>
# Special Characters
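In USSML, the following special characters must be escaped as shown in the table below. Characters that are part of the USSML markup itself do not need escaping.
| Special character | Escaped form |
|---|---|
| & | `&amp;` |
| < | `&lt;` |
| > | `&gt;` |
| " | `&quot;` |
| ' | `&apos;` |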
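As an illustration of the escaping rule, the sketch below escapes plain text before it is placed inside USSML markup, using Python's standard `xml.sax.saxutils.escape`. The helper name `escape_ussml_text` is an assumption; the five entities are the standard XML ones.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ussml_text(text: str) -> str:
    """Escape the five special characters in text destined for a USSML document."""
    return escape(text, QUOTE_ENTITIES)


print(escape_ussml_text('Tom & "Jerry" < 3'))
```

Only the text content is escaped this way; the USSML tags themselves are written as-is.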
# Tag Description
# <speak>
The <speak> tag is the root tag of USSML and wraps all other USSML tags.
The sttts:version attribute specifies the USSML version; the current version is 0.1.
<speak sttts:version="0.1">
<!-- USSML tags -->
</speak>
# <break>
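The <break> tag inserts a pause; its time attribute specifies the pause duration. The value is a string in seconds or milliseconds, and the maximum pause is 5 seconds. A bare number is not accepted; a unit is required. For example:
<break time="5s" />
or
<break time="5000ms" />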
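The constraints on the time attribute (a unit is required, 5 seconds at most) can be checked before a document is sent. This is a sketch only: the regular expression, including its acceptance of decimal values such as `0.5s`, is an assumption about the accepted syntax.

```python
import re

# Assumed syntax: a non-negative number followed by "s" or "ms".
_BREAK_TIME = re.compile(r"(\d+(?:\.\d+)?)(ms|s)")
MAX_BREAK_MS = 5000


def parse_break_time(value: str) -> float:
    """Return the pause length in milliseconds, or raise ValueError."""
    match = _BREAK_TIME.fullmatch(value)
    if match is None:
        raise ValueError("time must be a number followed by 's' or 'ms'")
    number, unit = float(match.group(1)), match.group(2)
    millis = number * 1000 if unit == "s" else number
    if millis > MAX_BREAK_MS:
        raise ValueError("pause must not exceed 5 seconds")
    return millis


print(parse_break_time("5s"), parse_break_time("500ms"))
```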
# <phoneme>
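The <phoneme> tag specifies pronunciation; its ph attribute holds the pronunciation. Because vendors differ in language coverage, <phoneme> in USSML currently supports only Hanyu Pinyin. Usage: separate the pinyin of each character with a space, and the number of pinyin syllables must equal the number of characters. Each syllable consists of the pronunciation and a tone, where the tone is a digit from 1 to 5 and "5" denotes the neutral tone. For example: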
<phoneme ph="mai2 mo4">埋没</phoneme>
# Example
<speak sttts:version="0.1">
你说<phoneme ph="bo2">薄</phoneme>。
<break time="500ms" />
我说<phoneme ph="bao2">薄</phoneme>。
</speak>
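The pinyin rules for the ph attribute (one space-separated syllable per character, each ending in a tone digit from 1 to 5) lend themselves to a quick client-side check. The syllable pattern below is a simplifying assumption: it only checks for lowercase letters followed by a tone digit, not whether the syllable is real pinyin.

```python
import re

# Assumed shape of one syllable: lowercase letters (including ü) plus a tone digit 1-5.
_SYLLABLE = re.compile(r"[a-zü]+[1-5]")


def is_valid_ph(text: str, ph: str) -> bool:
    """Check that ph has one well-formed pinyin syllable per character of text."""
    syllables = ph.split(" ")
    if len(syllables) != len(text):
        return False
    return all(_SYLLABLE.fullmatch(syllable) for syllable in syllables)


print(is_valid_ph("埋没", "mai2 mo4"))
```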
# <sub>
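The <sub> tag replaces the text that is spoken during synthesis. Its alias attribute specifies the replacement text: during synthesis, the text in alias is read instead of the original text, while any subtitles show the original text inside the tag.
Neither the content of the <sub> tag nor the alias text may be empty.
For example:
<sub alias="World Wide Web Consortium">W3C</sub>
In the example above, the subtitle shows W3C and the spoken content is World Wide Web Consortium.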
# <say-as>
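The <say-as> tag synthesizes its content using a specific reading style, given by the interpret-as attribute. Different vendors may produce slightly different results for the same reading style.
The interpret-as attribute supports the following values:
| Value | Description | Example |
|---|---|---|
| cardinal | Read as a number | "1487" is read as "一千四百八十七" |
| digit | Read as a digit string | "12345" is read as "一二三四五" |
| phone | Read in the usual way for phone numbers | "1301001155" is read as "幺三零幺零零幺幺五五" |
| address | Read as an address | "市台路388-301号" is read as "市台路三八八杠三零幺号" |
| date | Read as a date | "1998-12-12" is read as "一九九八年十二月十二日" |
| clock | Read as a time of day | "12:00:12" is read as "十二点零分十二秒" |
# Example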
<say-as interpret-as="cardinal">12345</say-as>
Example 1:
<speak sttts:version="0.1">
你说<phoneme ph="bo2">薄</phoneme>。
<break time="500ms" />
我说<phoneme ph="bao2">薄</phoneme>。
</speak>
Example 2:
<speak sttts:version="0.1">
<sub alias="World Wide Web Consortium">W3C</sub>
是一个国际性的标准化组织。
</speak>
Example 3:
<speak sttts:version="0.1">
<sub alias="青岛啤酒">TsingTao</sub>
用河南话说就是,
<phoneme ph="qing2 dao1 pi4 jiu1">青岛啤酒</phoneme>。
<say-as interpret-as="cardinal">12345</say-as>
</speak>
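Documents like the examples above can also be assembled programmatically. The sketch below uses plain string building with escaping; the helper names (`text`, `sub`, `say_as`, `speak`) are hypothetical, and the empty-value and reading-style checks mirror the rules stated in the tag descriptions.

```python
from xml.sax.saxutils import escape, quoteattr

SAY_AS_KINDS = {"cardinal", "digit", "phone", "address", "date", "clock"}


def text(value: str) -> str:
    """Escape plain text for use inside a USSML document."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})


def sub(alias: str, original: str) -> str:
    """Speak `alias` while subtitles keep `original`; neither may be empty."""
    if not alias or not original:
        raise ValueError("sub content and alias must both be non-empty")
    return f"<sub alias={quoteattr(alias)}>{text(original)}</sub>"


def say_as(kind: str, content: str) -> str:
    """Read `content` using one of the documented reading styles."""
    if kind not in SAY_AS_KINDS:
        raise ValueError(f"unsupported interpret-as value: {kind}")
    return f'<say-as interpret-as="{kind}">{text(content)}</say-as>'


def speak(*parts: str) -> str:
    """Wrap already-built parts in the USSML root tag."""
    return '<speak sttts:version="0.1">' + "".join(parts) + "</speak>"


doc = speak(sub("World Wide Web Consortium", "W3C"), text("是一个国际性的标准化组织。"))
print(doc)
```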
# Streaming Voice Processing
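In addition to HTTP-based voice processing, the platform provides streaming capabilities, including ASR and TTS.
# Invocation Method
The WebSocket protocol is used, and control messages are UTF-8 encoded JSON text. The token must be placed in the Authorization header or appended to the URL; a token passed in the header takes priority. (For an example, see the FAQ.)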
# Invocation Flow
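- Establish a WebSocket connection;
- Send the Starter packet, which carries the common configuration for subsequent requests;
- Receive the response indicating whether authentication succeeded or failed;
- Send a Data packet;
- Receive the corresponding result; this step and the previous one can be repeated;
- If there are no more tasks, the connection can simply be closed (there is no dedicated disconnect message).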
# Invocation Limits
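- If no Starter packet is sent within 10 seconds of establishing the WebSocket connection, the connection is closed;
- If no request is received for 60 seconds, the server closes the connection; sending Ping packets at a regular interval is recommended to keep it alive.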
# Request Message Format
# Starter
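The first packet sent after each connection is established; it declares the purpose of the connection and how subsequent data packets are parsed. It is JSON text containing the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| type | Workflow Type | string | Required | Service type for the capability, e.g. "TTS", "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate problems |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate problems |
| asr | ASR Config | object | Required when using ASR | ASR-specific configuration; see below |
| tts | TTS Config | object | Required when using TTS | TTS-specific configuration; see below |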
# Starter Message Example
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "ASR5",
"device": "device-weye",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"asr": {
"language": "zh-CN"
}
}
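A Starter like the one above can be assembled programmatically. The sketch below is illustrative only: build_starter is a hypothetical helper, and it generates a UUIDv4 session ID when the caller does not supply one, following the recommendation in the field table.

```python
import json
import uuid


def build_starter(service_type, config_key, config=None, device="", session=None):
    """Return the Starter packet as JSON text (hypothetical helper).

    service_type: e.g. "ASR5" or "TTS"; config_key: "asr" or "tts".
    """
    starter = {
        "type": service_type,
        # Supplying our own session ID makes later troubleshooting easier.
        "session": session or str(uuid.uuid4()),
        config_key: dict(config or {}),
    }
    if device:
        starter["device"] = device
    return json.dumps(starter, ensure_ascii=False)


if __name__ == "__main__":
    print(build_starter("ASR5", "asr", {"language": "zh-CN"}, device="device-weye"))
```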
# Data
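After the Starter packet has been sent and the connection is successfully established, any number of Data packets may follow. The Data packet format is described in the documentation for each capability.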
# Response Message Format
# Authentication Result
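After the Starter request is sent, a message containing the authentication result is returned. It is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |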
# Authentication Result Example
{
"service": "auth",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
# Result Data
Depending on the capability, each request returns one or more result data packets. Each is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current request |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| asr | ASR Content | object | No | Recognition result, returned on success |
| nlp | NLP Content | object | No | Reply result, returned on success |
| tts | TTS Content | object | No | Synthesis result, returned on success |
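Because every response shares this envelope, a client can route packets with one small function. The sketch below is an assumption-laden illustration: parse_packet and StreamError are hypothetical names, and only an explicit "fail" status is treated as an error because some sample responses in this document omit the status field.

```python
import json


class StreamError(Exception):
    """Raised when a packet reports status "fail" (hypothetical type)."""


def parse_packet(raw: str):
    """Return (service, content) for one JSON response packet.

    content is the object stored under the key named after the service
    ("asr", "nlp" or "tts"); it is None for auth packets.
    """
    msg = json.loads(raw)
    if msg.get("status") == "fail":
        raise StreamError(msg.get("error", "unknown error"))
    service = msg["service"]
    return service, msg.get(service)


if __name__ == "__main__":
    print(parse_packet('{"service": "auth", "status": "ok", "session": "s1"}'))
```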
# ASR (Speech Recognition) Integration Guide
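This section describes how to call ASR through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. The token must be supplied either in the Authorization header field or appended to the URL as a query parameter; if both are present, the token in the header takes precedence. (For an example, see the FAQ.)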
# Invocation Flow
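- Establish the WebSocket connection; the address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
- Send the Starter packet, which carries the common configuration for subsequent requests; the connection is closed if the packet is malformed or is not sent within 10 seconds;
- Receive the response indicating whether authentication succeeded or failed;
- Send binary Data packets containing PCM audio;
- After all audio has been sent, send an EOF packet (optional; if it is not sent, send at least 500 ms of additional ambient silence to help VAD detect the end of speech);
- Receive the corresponding ASR results as JSON text;
- If an EOF request packet was sent, receive an EOF result packet indicating that all recognition results have been delivered;
- When there are no more recognition tasks, simply close the connection (no disconnect message is defined).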
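When the token is passed in the URL rather than in the Authorization header, the "Bearer {token}" value contains a space. The sketch below percent-encodes it; stream_url is a hypothetical helper, and whether the server requires or merely accepts the encoded form is an assumption to verify against the FAQ.

```python
from urllib.parse import quote


def stream_url(token: str, version: str = "v1", host: str = "aigc.softsugar.com") -> str:
    """Build the streaming WebSocket URL with the token in the query string."""
    # quote() turns the space in "Bearer <token>" into %20 (assumed acceptable).
    return f"ws://{host}/api/voice/stream/{version}?Authorization={quote('Bearer ' + token)}"


if __name__ == "__main__":
    print(stream_url("my-token"))
```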
# Request Message Format
# Starter
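The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are to be parsed. It is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| type | Workflow Type | string | Required | Service type for the capability; for Chinese recognition use "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate problems |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate problems |
| asr | ASR Config | object | Required | ASR-specific configuration; see below |
The ASR Config fields are:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| language | Language Code | string | zh-CN | Optional; language to recognize |
| mic_volume | Microphone Volume | float | 1.0 | Optional; microphone volume, used by ASR for automatic gain; range 0 to 1 |
| subtitle | Subtitle Format | string | Empty string | Optional; format of the returned subtitles; empty means none are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum number of characters per subtitle line; 0 means no limit |
| intermediate | Return Intermediate Result | bool | false | Optional; whether to return intermediate results |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional; whether to return word-level timestamps |
| pause_time_msec | Speech Pause Time (msec) | int | 500 | Optional; pause duration used to detect speech boundaries and segments; default 500 ms |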
# Data
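After the Starter packet has been sent and the connection is successfully established, audio is submitted as a stream of binary Data packets.
The input audio stream must be PCM: 16 kHz sample rate, 16-bit samples, mono, little-endian. Audio can be converted to this format with sox -t raw -r 16000 -e signed -b 16 -c 1, or with ffmpeg -acodec pcm_s16le -ac 1 -ar 16000 -f s16le.
Send rate:
Send audio at the rate it is read from the microphone. The recommended pacing is 1280 bytes every 40 ms, or 5120 bytes every 160 ms.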
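The recommended packet sizes follow from the audio format: 16000 samples per second at 2 bytes per sample is 32000 bytes per second, so 40 ms of audio is 1280 bytes and 160 ms is 5120 bytes. The sketch below splits a PCM buffer into such frames; frame_bytes and iter_frames are hypothetical helper names, and the caller is expected to sleep for the frame duration between sends to keep real-time pacing.

```python
from typing import Iterator

SAMPLE_RATE = 16000      # Hz, as required by the ASR input format
BYTES_PER_SAMPLE = 2     # 16-bit mono


def frame_bytes(ms: int) -> int:
    """Number of PCM bytes that represent `ms` milliseconds of audio."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000


def iter_frames(pcm: bytes, ms: int = 40) -> Iterator[bytes]:
    """Yield consecutive frames of `ms` milliseconds; the last may be shorter."""
    size = frame_bytes(ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


if __name__ == "__main__":
    print(frame_bytes(40), frame_bytes(160))
```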
# EOF
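After all audio packets have been sent, send an EOF packet to end recognition. This is optional; if it is not sent, send at least 500 ms of additional ambient silence to help VAD detect the end of speech.
If subtitles are required, the EOF packet must be sent so that the ASR server generates them.
The EOF packet is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| signal | End Marker | string | Required | Always eof |
| trace | Trace ID | string | Random UUIDv4 | Optional; callers are advised to generate and supply their own Trace ID, to help trace and locate problems |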
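The two ways of ending a recognition can be sketched as follows. build_eof and silence are hypothetical helpers; using all-zero samples as "ambient silence" is an assumption, since the text does not say whether digital silence is sufficient for VAD.

```python
import json
import uuid


def build_eof(trace=None) -> str:
    """Return the EOF request packet as JSON text."""
    return json.dumps({"signal": "eof", "trace": trace or str(uuid.uuid4())})


def silence(ms: int = 500, sample_rate: int = 16000) -> bytes:
    """Return `ms` milliseconds of 16-bit mono PCM zeros (assumed to count as silence)."""
    return b"\x00\x00" * (sample_rate * ms // 1000)


if __name__ == "__main__":
    print(build_eof("52517513-875a-47b6-bd30-f11a75e26745"))
    print(len(silence()))
```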
# Response Message Format
# Authentication Result
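After the Starter request is sent, a message containing the authentication result is returned. It is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |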
# ASR Result Data
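ASR continuously returns text recognition result packets and, after receiving the EOF request, returns the subtitle and subtitle file address packets. If the Starter request did not ask for subtitles or the subtitle address, only text recognition results are returned.
Each returned message is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. asr |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current sentence |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| asr | ASR Content | object | No | Recognition result, returned on success; fields are described below |
The recognition result itself is in ASR Content:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| index | Index No. | int | Yes | Sequence number of the returned packet |
| type | Package Type | enum | Yes | text for a text result packet, intermediate for an intermediate result packet, subtitle for a subtitle packet, eof when everything has been sent |
| text | Text | string | Yes | Recognized text; also present in subtitle packets, where it is empty |
| subtitle | Subtitle | string | No | Subtitle content; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp |
| word_times | Word-Level Timestamp | object | No | Word-level timestamps |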
# Text Result
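One message is returned for each fully recognized sentence. Intermediate results are not returned by default, and nothing is returned when the audio cannot be recognized or the result is only whitespace.
Text result sample: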
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "c9cb36d8-3ca9-4e2b-9034-29f2c4edc3de",
"asr": {
"index": 1,
"type": "text",
"text": "你好。"
}
}
# Subtitle Result
When subtitles were requested and EOF has been sent, the generated subtitle result is returned.
Subtitle result sample:
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "9a971f17-f871-4b73-9084-b856b67537d5",
"asr": {
"index": 3,
"type": "subtitle",
"subtitle": "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
}
}
# EOF
If an EOF request was received, the server sends an EOF result packet to indicate that all results have been delivered.
EOF sample:
{
"service": "asr",
"status": "ok",
"session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
"trace": "16ff049a-41fb-4c7a-ac5e-b26dbc3218e5",
"asr": {
"index": 5,
"type": "eof"
}
}
# Practical Flow Example Analysis
# Case 1: Minimum Configuration Flow
Request: Starter
{
"type": "ASR5",
"asr": {}
}
Request: binary Data
Response: 1
{
"service": "auth",
"status": "ok",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721"
}
Response: 2
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "e1c44bdc-4f9a-487c-806e-005679db7d0d",
"asr": {
"index": 1,
"type": "text",
"text": "早知道你喜欢十里春光"
}
}
Response: 3
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "f7551818-5025-4d83-b41b-136bb19b5b5f",
"asr": {
"index": 2,
"type": "text",
"text": "我一定会在麦田里种满玫瑰和山茶"
}
}
Response: 4
{
"service": "asr",
"session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
"trace": "89d3a8b4-a291-4cdc-9b78-d3f912d06223",
"asr": {
"index": 3,
"type": "text",
"text": "你路过这片土地才算浪漫"
}
}
# Case 2: Full Configuration Flow
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "ASR5",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"asr": {
"subtitle": "srt",
"intermediate": true,
"mic_volume": 0.67
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
Request: binary Data (omitted)
Request: EOF
{
"signal": "eof",
"trace": "52517513-875a-47b6-bd30-f11a75e26745"
}
Response: 2
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "dcf88fbe-6cda-452d-8f51-e316cb4a0943",
"asr": {
"index": 1,
"type": "intermediate",
"text": "介"
}
}
Response: 3
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "879d4700-746f-411b-954c-f83a2c6cd300",
"asr": {
"index": 2,
"type": "intermediate",
"text": "介绍下长"
}
}
Response: 4
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "20e6a5e5-6d5d-4284-9013-4d410f1a5d37",
"asr": {
"index": 3,
"type": "intermediate",
"text": "介绍下长宁图书"
}
}
Response: 5
{
"service": "asr",
"status": "ok",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "cf7733e2-da28-442d-b5bb-282fc8f352f3",
"asr": {
"index": 4,
"type": "text",
"text": "介绍一下长宁图书馆。",
"sentence_time": {
"begin_ms": 2080,
"end_ms": 4640
},
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "介"
},
{
"begin_ms": 2560,
"end_ms": 2800,
"text": "绍"
},
{
"begin_ms": 2800,
"end_ms": 2920,
"text": "一"
},
{
"begin_ms": 2920,
"end_ms": 3040,
"text": "下"
},
{
"begin_ms": 3040,
"end_ms": 3280,
"text": "长"
},
{
"begin_ms": 3280,
"end_ms": 3480,
"text": "宁"
},
{
"begin_ms": 3480,
"end_ms": 3640,
"text": "图"
},
{
"begin_ms": 3640,
"end_ms": 3880,
"text": "书"
},
{
"begin_ms": 3880,
"end_ms": 4640,
"text": "馆"
}
]
}
}
Response: 6
{
"service": "asr",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
"asr": {
"index": 5,
"type": "subtitle",
"subtitle": "1\n00:00:00,000 --> 00:00:02,280\n介绍一下长宁图书馆\n\n"
}
}
Response: 7
{
"service": "asr",
"session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
"trace": "2bd4cbce-0f72-402c-8e88-0f2704a22868",
"asr": {
"index": 6,
"type": "eof"
}
}
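In a flow like Case 2, a client that only wants the final transcript can ignore intermediate, subtitle, and eof packets and keep the text packets. The sketch below works on already-decoded packets; final_transcript is a hypothetical helper, and joining sentences without a separator is an assumption that suits Chinese text.

```python
from typing import Iterable


def final_transcript(packets: Iterable[dict]) -> str:
    """Concatenate the text of all final ("text") ASR packets in index order."""
    finals = []
    for packet in packets:
        asr = packet.get("asr") or {}
        if packet.get("service") == "asr" and asr.get("type") == "text":
            finals.append((asr["index"], asr["text"]))
    return "".join(text for _, text in sorted(finals))


if __name__ == "__main__":
    stream = [
        {"service": "auth", "status": "ok"},
        {"service": "asr", "asr": {"index": 3, "type": "intermediate", "text": "介绍下长宁图书"}},
        {"service": "asr", "asr": {"index": 4, "type": "text", "text": "介绍一下长宁图书馆。"}},
        {"service": "asr", "asr": {"index": 6, "type": "eof"}},
    ]
    print(final_transcript(stream))
```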
# TTS (Speech Synthesis) Integration Guide (QID)
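This section describes how to call TTS through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all messages are UTF-8 encoded JSON text.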
# Invocation Flow
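- Establish the WebSocket connection; the address is typically ws://aigc.softsugar.com/api/voice/stream/v3?Authorization=Bearer {token};
- Send the Starter packet, which carries the common configuration for subsequent TTS requests; the connection is closed if the packet is malformed or is not sent within 10 seconds;
- Receive the response indicating whether authentication succeeded or failed;
- Send a Task packet containing the text to synthesize and its format information;
- Receive the data packets for that Task;
- When there are no more synthesis tasks, simply close the connection (no disconnect message is defined).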
# Request Message Format
# Starter
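The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are to be parsed. It is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| type | Workflow Type | string | Required | Only TTS is supported |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate problems |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate problems |
| tts | TTS Config | object | Required | TTS-specific configuration; see below |
The TTS Config fields are:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| qid | Qid | string | - | Required |
| pitch_offset | Pitch Offset | float | 0.0 | Optional; pitch; higher values sound sharper, lower values sound deeper; range [-10, 10] |
| speed_ratio | Speed Ratio | float | 1.0 | Optional; speech rate; the larger the value, the slower the speech; range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional; sample rate; supported: 8000, 16000, 22050, 24000, 44100 |
| volume | Volume | int | 100 | Optional; volume; the larger the value, the louder the audio; range [1, 400] |
| format | File Format | string | pcm | Optional; audio file format; depending on the selected voice, pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional; whether to omit error messages; by default they are returned |
| polyphone | Return Polyphone | bool | false | Optional; whether to return the polyphonic characters found in query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional; format of the returned subtitles; empty means none are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum number of characters per subtitle line or sentence-level timestamp; 0 means no limit; only effective when subtitles or sentence-level timestamps are returned |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional; whether to break subtitles or sentence-level timestamps at punctuation marks and strip the punctuation; only effective when subtitles or sentence-level timestamps are returned |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional; whether to return word-level timestamps |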
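Since a malformed Starter causes the server to close the connection, it can help to check the TTS Config against the documented ranges before sending it. The sketch below is a client-side convenience only; validate_tts_config is a hypothetical helper, and the server remains the authority on what it accepts.

```python
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100}


def validate_tts_config(cfg: dict) -> list:
    """Return a list of problems found in a TTS Config dict (empty if none)."""
    problems = []
    if not cfg.get("qid"):
        problems.append("qid is required")
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        problems.append("pitch_offset must be in [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        problems.append("speed_ratio must be in [0.5, 2]")
    if cfg.get("sample_rate", 16000) not in SUPPORTED_SAMPLE_RATES:
        problems.append("sample_rate is not supported")
    if not 1 <= cfg.get("volume", 100) <= 400:
        problems.append("volume must be in [1, 400]")
    if cfg.get("subtitle", "") not in ("", "srt"):
        problems.append("subtitle must be empty or srt")
    return problems


if __name__ == "__main__":
    print(validate_tts_config({"qid": "example-qid", "volume": 500}))
```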
# Task
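After the Starter packet has been sent and the connection is successfully established, any number of Task packets may be sent to submit synthesis tasks. A Task packet is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| id | Task ID | string | Random UUIDv4 | Optional; callers are advised to generate and supply it so that responses to concurrent requests can be told apart |
| query | Query | string | Required | Text content to synthesize |
| ssml | Use SSML | bool | false | Optional; whether the text is marked up with SSML; for the syntax, see the ONES usage documentation |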
# Response Message Format
# Authentication Result
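After the Starter request is sent, a message containing the authentication result is returned. It is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |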
# TTS Result Data
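Each successful Task returns a stream of packets: audio, subtitle, timestamp, and polyphone packets. Packets of the same type are returned in logical order; the relative order of packets of different types is not guaranteed.
Each returned message is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. tts |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current Task |
| status | Status Name | enum | Yes | Status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| tts | TTS Content | object | No | Synthesis result, returned on success; fields are described below |
The synthesis result itself is in TTS Content:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| id | Task ID | string | Yes | ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the returned audio or phoneme packet |
| type | Package Type | enum | Yes | audio for an audio packet, subtitle for a subtitle packet, polyphone for a polyphone packet, timestamp for a timestamp packet, eof when everything has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; present only in audio packets |
| polyphones | Polyphone Data | object | No | Polyphone data; present only in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp; present only in timestamp packets |
| word_times | Word-Level Timestamp | object | No | Word-level timestamps; present only in timestamp packets |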
# Audio Package
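Contains the Base64-encoded synthesized audio data.
When the requested audio format is pcm, the audio is streamed back in multiple packets; other formats are returned in a single packet after synthesis completes.
Audio packet sample: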
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
}
}
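For pcm output, the client has to decode each audio_data chunk and concatenate the chunks in index order. The sketch below does that and wraps the result in a WAV container with the standard wave module. assemble_pcm and pcm_to_wav are hypothetical helpers, and the assumption that the PCM is 16-bit mono at the requested sample_rate is not stated in this section and should be verified.

```python
import base64
import io
import wave
from typing import Iterable


def assemble_pcm(packets: Iterable[dict]) -> bytes:
    """Decode and join the audio packets of one Task in index order."""
    audio = [p["tts"] for p in packets
             if p.get("service") == "tts" and (p.get("tts") or {}).get("type") == "audio"]
    audio.sort(key=lambda content: content["index"])
    return b"".join(base64.b64decode(content["audio_data"]) for content in audio)


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw PCM (assumed 16-bit mono) in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    chunk = base64.b64encode(b"\x00\x00" * 160).decode("ascii")
    packets = [{"service": "tts", "tts": {"index": 1, "type": "audio", "audio_data": chunk}}]
    print(len(pcm_to_wav(assemble_pcm(packets))))
```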
# Subtitles
Contains the Base64-encoded synthesized subtitle data.
Subtitle packet sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 4,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
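Unlike the ASR subtitle field, subtitle_data is Base64-encoded. Decoding the sample above yields ordinary SRT text; the sketch below assumes the payload is UTF-8, which holds for this sample, and decode_subtitle is a hypothetical helper name.

```python
import base64


def decode_subtitle(subtitle_data: str) -> str:
    """Decode the Base64 subtitle_data field into SRT text (UTF-8 assumed)."""
    return base64.b64decode(subtitle_data).decode("utf-8")


if __name__ == "__main__":
    sample = "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
    print(decode_subtitle(sample))
```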
# Timestamp Package
Contains sentence-level and word-level timestamp information.
Timestamp sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944",
"index": 7,
"type": "timestamp",
"sentence_time": {
"begin_ms": 7770,
"end_ms": 9140,
"text": "新人起步很不容易"
},
"word_times": [
{"begin_ms": 7770, "end_ms": 7960, "text": "新"},
{"begin_ms": 7960, "end_ms": 8120, "text": "人"},
{"begin_ms": 8120, "end_ms": 8310, "text": "起"},
{"begin_ms": 8310, "end_ms": 8430, "text": "步"},
{"begin_ms": 8430, "end_ms": 8630, "text": "很"},
{"begin_ms": 8630, "end_ms": 8720, "text": "不"},
{"begin_ms": 8720, "end_ms": 8920, "text": "容"},
{"begin_ms": 8920, "end_ms": 9140, "text": "易"}
]
}
}
# Polyphonic Character Package
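Contains polyphonic character information; the recommended pronunciation comes first, followed by the alternatives.
Polyphone sample: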
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
# EOF
The EOF result packet indicates that all results have been delivered.
EOF sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 8,
"type": "eof"
}
}
# Practical Flow Example Analysis
# Case 1: Minimum Configuration Flow
Request: Starter
{
"type": "TTS",
"tts": {}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}
Request: Task
{
"query": "大家好!"
}
Response: 2
{
"service": "tts",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260",
"trace": "f2e13c02-c629-4db8-a942-4393583a5182",
"tts": {
"id": "4b69geebj4septyxh72qy885f",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
# Case 2: Full Configuration Flow
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "TTS3",
"device": "device-wei",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"tts": {
"qid": "8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg",
"speed_ratio": 1.05,
"sample_rate": 16000,
"volume": 200,
"polyphone": true,
"subtitle": "srt",
"sentence_time": true,
"word_time": true
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}
Request: Task
{
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"query": "你好。",
"ssml": false
}
Response: 2 (audio)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
Response: 3 (timestamp)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "de9a1066-d968-475a-ac38-b2da017b2a27",
"index": 3,
"type": "timestamp",
"sentence_time": {
"begin_ms": 500,
"end_ms": 1010,
"text": "你好。"
},
"word_times": [
{"begin_ms": 500, "end_ms": 590, "text": "你"},
{"begin_ms": 590, "end_ms": 1010, "text": "好"}
]
}
}
Response: 4 (polyphone)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
Response: 5 (subtitle)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 5,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
Response: 6 EOF
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 9,
"type": "eof"
}
}
# LLM (Large Language Model) Integration Guide
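This section describes how to call the LLM through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. The token must be supplied either in the Authorization header field or appended to the URL as a query parameter; if both are present, the token in the header takes precedence. (For an example, see the FAQ.)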
# Invocation Flow
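- Establish the WebSocket connection; the address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
- Send the Starter packet, which carries the common configuration for subsequent requests; the connection is closed if the packet is malformed or is not sent within 10 seconds;
- Receive the response indicating whether authentication succeeded or failed;
- Send a Query text packet containing a single conversational turn;
- Receive the corresponding reply as JSON;
- When there are no more questions, simply close the connection (no disconnect message is defined).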
# Request Message Format
# Starter
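The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are to be parsed. It is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| type | Workflow Type | string | Required | Service engine ID for the capability, e.g. "NLP7". Available engines: NLP7 (SenseChat), NLP10 (SenseTime anthropomorphic LLM) |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate problems |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate problems |
| nlp | NLP Config | object | Required | NLP-specific configuration; see below |
The NLP Config fields are:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| omit_error | Omit Error Message in Response | bool | false | Optional; whether to omit error messages; by default they are returned |
| know_ids | Knowledge IDs | string list | Empty list | Optional; list of knowledge base IDs; supported only by some LLM engines |
| prompt_header | System Role of Prompt | string | Empty string | Optional; background description for the prompt; when empty, the preset value from the configuration is used; supported only by some LLM engines. NLP10 (anthropomorphic LLM) has its own JSON definition, which must be followed |
| max_reply_token | Max Tokens in Reply | int | 500 | Optional; maximum number of tokens in the reply; the actual upper limit depends on the model; supported only by some LLM engines |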
# Query
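After the Starter packet has been sent and the connection is successfully established, any number of Query text packets may be sent to submit user questions. A Query packet is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| id | Trace ID | string | Random UUIDv4 | Optional; callers are advised to generate and supply it so that responses to concurrent requests can be told apart |
| query | Query | string | Required | Text content of the user's question |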
# Response Message Format
# Authentication Result
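After the Starter request is sent, a message containing the authentication result is returned. It is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |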
# NLP Result Data
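One message is returned for every successfully processed Query; when omit_error is false, error messages are returned as well. The basic format of a returned message is:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. nlp |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current sentence |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| nlp | NLP Content | object | No | Reply result, returned on success; fields are described below |
The reply itself is in NLP Content:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| index | Index No. | int | Yes | Sequence number of the returned packet |
| query | Query | string | Yes | The submitted question text |
| answer | Answer | string | Yes | Text to be spoken; the digital human front end calls TTS to speak it |
| text | Text | string | No | Text to be displayed; the digital human front end shows it on screen |
| finish_reason | Finish Reason | string | Yes | Reason generation stopped; one of: stop (stopped at an end token), length (reached the maximum generation length), sensitive (triggered a sensitive-word filter), context (hit the model's context length limit; to receive the remaining content, send “请继续” in the next query) |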
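The context case is the only finish_reason that calls for a follow-up request. The sketch below builds that follow-up Query when needed; follow_up_query is a hypothetical helper, and it handles only the decision logic, not the WebSocket transport.

```python
import json
from typing import Optional

CONTINUE_QUERY = "请继续"  # continuation phrase given in the finish_reason description


def follow_up_query(nlp_content: dict) -> Optional[str]:
    """Return a Query packet asking the model to continue, or None.

    A follow-up is only needed when generation stopped because the
    model's context length limit was reached (finish_reason "context").
    """
    if nlp_content.get("finish_reason") == "context":
        return json.dumps({"query": CONTINUE_QUERY}, ensure_ascii=False)
    return None


if __name__ == "__main__":
    print(follow_up_query({"index": 1, "answer": "...", "finish_reason": "context"}))
```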
# Anthropomorphic LLM (NLP10) Parameter Definition
# Data Definition
| Name | Type | Required | Default | Allowed Values | Description |
|---|---|---|---|---|---|
| name | string | Yes | - | - | Character name; at most 50 Unicode characters |
| gender | string | Yes | - | - | Character gender; at most 50 Unicode characters |
| identity | string | No | - | - | Character identity; at most 200 Unicode characters |
| nickname | string | No | - | - | Character alias; at most 50 Unicode characters |
| feeling_toward | object[] | No | - | - | Affinity settings |
| detail_setting | string | No | - | - | Detailed settings; at most 500 Unicode characters |
| other_setting | json string | No | - | - | Other settings; at most 3000 Unicode characters |
feeling_toward definition:
| Name | Type | Required | Default | Allowed Values | Description |
|---|---|---|---|---|---|
| name | string | Yes | - | - | Character name; must be a name already defined in character_settings |
| level | int | Yes | - | [1,3] | Affinity toward that character; a larger number means greater affinity |
Reference example:
"prompt_header": "[{\"name\":\"周梓柔\",\"gender\":\"女\",\"identity\":\"我一直信赖的姐姐\",\"nickname\":\"\",\"feeling_toward\":[{\"name\":\"弟弟\",\"level\":3}],\"detail_setting\":\"周梓柔具有卓越成就感,学业和职场表现都极为杰出,从小就是个学霸,经常被长辈提及为榜样。外表冷艳,给人以远离尘嚣的印象,令人印象深刻的气质既独立又自信。对外或许保持距离,但在我面前总是展现出无限的温柔与包容,耐心倾听我的烦恼,用细腻的关怀化解我的困惑。MBTI人格是ENTJ。\",\"other_setting\":\"\"},{\"name\":\"弟弟\",\"gender\":\"男\",\"identity\":\"弟弟\",\"nickname\":\"\",\"detail_setting\":\"周梓柔总是在弟弟面前展现出无限的温柔与包容,耐心倾听我的烦恼,用细腻的关怀化解我的困惑。\",\"other_setting\":\"\"}]"
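Because prompt_header for NLP10 is a JSON array serialized into a string, it is easier to build it from structured data than to escape it by hand. The sketch below also applies the length and affinity limits from the tables above. build_prompt_header is a hypothetical helper, and treating the array itself as the character_settings that feeling_toward names must refer to is an assumption.

```python
import json

# Maximum lengths in Unicode characters, taken from the data definition table.
MAX_LENGTHS = {"name": 50, "gender": 50, "identity": 200,
               "nickname": 50, "detail_setting": 500, "other_setting": 3000}


def build_prompt_header(characters: list) -> str:
    """Validate character settings and serialize them for prompt_header."""
    known_names = {c.get("name") for c in characters}
    for character in characters:
        for required in ("name", "gender"):
            if not character.get(required):
                raise ValueError(f"{required} is required")
        for field, limit in MAX_LENGTHS.items():
            # len() on str counts Unicode code points.
            if len(character.get(field, "")) > limit:
                raise ValueError(f"{field} exceeds {limit} characters")
        for feeling in character.get("feeling_toward", []):
            if feeling["name"] not in known_names:
                raise ValueError(f"unknown character: {feeling['name']}")
            if not 1 <= feeling["level"] <= 3:
                raise ValueError("level must be in [1, 3]")
    return json.dumps(characters, ensure_ascii=False, separators=(",", ":"))


if __name__ == "__main__":
    header = build_prompt_header([
        {"name": "周梓柔", "gender": "女", "feeling_toward": [{"name": "弟弟", "level": 3}]},
        {"name": "弟弟", "gender": "男"},
    ])
    print(json.dumps({"prompt_header": header}, ensure_ascii=False))
```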
# TTS (Speech Synthesis) Integration Guide (Legacy)
Note: This API is currently in maintenance mode and will not receive new feature updates.
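This section describes how to call TTS through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all messages are UTF-8 encoded JSON text.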
# Invocation Flow
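- Establish the WebSocket connection; the address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization={Token};
- Send the Starter packet, which carries the common configuration for subsequent TTS requests; the connection is closed if the packet is malformed or is not sent within 10 seconds;
- Receive the response indicating whether authentication succeeded or failed;
- Send a Task packet containing the text to synthesize and its format information;
- Receive the data packets for that Task;
- When there are no more synthesis tasks, simply close the connection (no disconnect message is defined).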
# Request Message Format
# Starter
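The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are to be parsed. It is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| auth | AuthN Token | string | Empty string | Device authentication token; required if authentication is enabled on the server |
| type | Workflow Type | string | Required | Service engine ID for the capability, e.g. "TTS3" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate problems |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate problems |
| tts | TTS Config | object | Required | TTS-specific configuration; see below |
The TTS Config fields are:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| language | Language Code | string | zh-CN | Optional; language to synthesize; must be supported by the speaker |
| voice | Voice ID | string | Depends on the service engine | Optional; the speaker to use |
| pitch_offset | Pitch Offset | float | 0.0 | Optional; pitch; higher values sound sharper, lower values sound deeper; range [-10, 10] |
| style | Style | string | Empty | Optional; the speaker's emotion |
| speed_ratio | Speed Ratio | float | 1.0 | Optional; speech rate; the larger the value, the slower the speech; range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional; sample rate; supported: 8000, 16000, 22050, 24000, 44100 |
| volume | Volume | int | 100 | Optional; volume; the larger the value, the louder the audio; range [1, 400] |
| format | File Format | string | pcm | Optional; audio file format; depending on the selected voice, pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional; whether to omit error messages; by default they are returned |
| polyphone | Return Polyphone | bool | false | Optional; whether to return the polyphonic characters found in query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional; format of the returned subtitles; empty means none are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum number of characters per subtitle line or sentence-level timestamp; 0 means no limit; only effective when subtitles or sentence-level timestamps are returned |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional; whether to break subtitles or sentence-level timestamps at punctuation marks and strip the punctuation; only effective when subtitles or sentence-level timestamps are returned |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional; whether to return word-level timestamps |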
# Task
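After the Starter packet has been sent and the connection is successfully established, any number of Task packets may be sent to submit synthesis tasks. A Task packet is JSON text with the following fields:
| Field | Name | Type | Default | Description |
|---|---|---|---|---|
| id | Task ID | string | Random UUIDv4 | Optional; callers are advised to generate and supply it so that responses to concurrent requests can be told apart |
| query | Query | string | Required | Text content to synthesize |
| ssml | Use SSML | bool | false | Optional; whether the text is marked up with SSML; for the syntax, see the ONES usage documentation |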
# Response Message Format
# Authentication Result
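After the Starter request is sent, a message containing the authentication result is returned. It is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |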
# TTS Result Data
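Each successful Task returns a stream of packets: audio, subtitle, timestamp, and polyphone packets. Packets of the same type are returned in logical order; the relative order of packets of different types is not guaranteed.
Each returned message is JSON text with the following fields:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| service | Service Name | string | Yes | Service module for the current request, i.e. tts |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current Task |
| status | Status Name | enum | Yes | Status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| tts | TTS Content | object | No | Synthesis result, returned on success; fields are described below |
The synthesis result itself is in TTS Content:
| Field | Name | Type | Required | Description |
|---|---|---|---|---|
| id | Task ID | string | Yes | ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the returned audio or phoneme packet |
| type | Package Type | enum | Yes | audio for an audio packet, polyphone for a polyphone packet, timestamp for a timestamp packet, eof when everything has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; present only in audio packets |
| polyphones | Polyphone Data | object | No | Polyphone data; present only in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp; present only in timestamp packets |
| word_times | Word-Level Timestamp | object | No | Word-level timestamps; present only in timestamp packets |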
# Audio Package
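Contains the Base64-encoded synthesized audio data.
When the requested audio format is pcm, the audio is streamed back in multiple packets; other formats are returned in a single packet after synthesis completes.
Audio packet sample: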
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
}
}
# Subtitles
Contains the Base64-encoded synthesized subtitle data.
Subtitle packet sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 4,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
# Timestamp Package
Contains sentence-level and word-level timestamp information.
Timestamp sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944",
"index": 7,
"type": "timestamp",
"sentence_time": {
"begin_ms": 7770,
"end_ms": 9140,
"text": "新人起步很不容易"
},
"word_times": [
{"begin_ms": 7770, "end_ms": 7960, "text": "新"},
{"begin_ms": 7960, "end_ms": 8120, "text": "人"},
{"begin_ms": 8120, "end_ms": 8310, "text": "起"},
{"begin_ms": 8310, "end_ms": 8430, "text": "步"},
{"begin_ms": 8430, "end_ms": 8630, "text": "很"},
{"begin_ms": 8630, "end_ms": 8720, "text": "不"},
{"begin_ms": 8720, "end_ms": 8920, "text": "容"},
{"begin_ms": 8920, "end_ms": 9140, "text": "易"}
]
}
}
# Polyphonic Character Package
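Contains polyphonic character information; the recommended pronunciation comes first, followed by the alternatives.
Polyphone sample: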
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 4,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
# EOF
The EOF result packet indicates that all results have been delivered.
EOF sample:
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
"index": 8,
"type": "eof"
}
}
# Practical Flow Example Analysis
# Case 1: Minimum Configuration Flow
Request: Starter
{
"type": "TTS3",
"tts": {}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}
Request: Task
{
"query": "大家好!"
}
Response: 2
{
"service": "tts",
"status": "ok",
"session": "49d3af81-f344-4ccf-8231-574ceac1a260",
"trace": "f2e13c02-c629-4db8-a942-4393583a5182",
"tts": {
"id": "4b69geebj4septyxh72qy885f",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
# Case 2: Full Configuration Flow
Request: Starter
{
"auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
"type": "TTS3",
"device": "device-wei",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"tts": {
"language": "zh-CN",
"voice": "xiaoling",
"speed_ratio": 1.05,
"sample_rate": 16000,
"volume": 200,
"polyphone": true,
"subtitle": "srt",
"sentence_time": true,
"word_time": true
}
}
Response: 1
{
"service": "auth",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}
Request: Task
{
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"query": "你好。",
"ssml": false
}
Response: 2 (audio)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 1,
"type": "audio",
"audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
}
}
Response: 3 (timestamp)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "de9a1066-d968-475a-ac38-b2da017b2a27",
"index": 2,
"type": "timestamp",
"sentence_time": {
"begin_ms": 500,
"end_ms": 1010,
"text": "你好。"
},
"word_times": [
{"begin_ms": 500, "end_ms": 590, "text": "你"},
{"begin_ms": 590, "end_ms": 1010, "text": "好"}
]
}
}
Response: 4 (polyphone)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "08a5785a-a6a2-4140-b587-a6cead592531",
"tts": {
"id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
"index": 3,
"type": "polyphone",
"polyphones": [
{
"word": "好",
"phones": ["hao3", "hao4"]
}
]
}
}
Response: 5 (subtitle)
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 4,
"type": "subtitle",
"subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
}
}
Response: 6 EOF
{
"service": "tts",
"status": "ok",
"session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
"trace": "d923181d-9d9b-4be1-9370-40f456be3771",
"tts": {
"id": "bf3qmpuuk18ktv7cv4b6kzhs9",
"index": 9,
"type": "eof"
}
}
# Create TTS Personal Voice Model Generation Task (QID)
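# API Description
The TTS personal voice model generation (QID) service takes voice material recorded from a real person, together with a voice-cloning consent file, and trains a digital human TTS voice model whose speech matches that of the person who provided the material. To ensure training quality, follow the SenseTime digital human voice cloning recording specification when recording; it covers environment, equipment, pronunciation, and authorization requirements as well as the reading script. For details, see the Recording Specification. The PaaS platform keeps generated content online for 7 days; transfer it to your own storage promptly, because it can no longer be downloaded after 7 days.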
# Request URL
POST
/api/2dvh/v1/material/voice/clone/qid/create
# Request Headers
Content-Type:
application/json
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| audioUrl | String | True | Training audio file URL. Supported formats: wav, mp3, m4a, mp4, mov, aac |
| audioLanguage | String | True | Primary language used in the audio file: zh-CN (Mandarin Chinese) or en-US (American English). Follows the BCP 47 standard |
| consent | Object | True | User consent declaration information |
| consent.audioUrl | String | True | URL of the user consent audio file. The consent file should be recorded in the same environment and in the same language as the training audio. Chinese consent statement: ”我(发音人姓名)确认我的声音将会被(公司名称)使用于创建合成版本语音。”。 English: "I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice." Japanese: "私(姓名を記入)は自身の音声を(会社名を記入)が使用し、合成音声を作り使用されることに同意します。" Korean: "나는 [본인의 이름을 말씀하세요] 내 목소리의 녹음을 이용해 합성 버전을 만들어 사용된다는 것을 [회사 이름을 말씀하세요]알고 있습니다." Supported formats: wav, mp3, m4a, mp4, mov, aac |
| consent.speakerName | String | True | Speaker name used in the consent audio file; must match the speaker name spoken in the audio. At most 64 characters |
| consent.companyName | String | True | Company name used in the consent file; must match the company name spoken in the audio. At most 64 characters |
| taskType | String | True | Training algorithm type: TTS3, TTS6, TTS7, TTS8, or TTS101. Use TTS3 by default. For other needs, contact technical support |
| voice | Object | True | Speaker information |
| voice.name | String | True | Speaker name. At most 64 characters |
| voice.gender | Integer | True | Speaker gender (1: Male, 2: Female) |
| musicSep | Boolean | False | Whether to perform audio background music removal (source separation) |
| trainMode | String | False | Training mode; only effective for TTS3. common: regular training mode (the default); backend_only: fast training mode, which greatly shortens training time at some cost in quality |
# Request Example
{
"audioUrl": "http://oss.com/abc/object.mp3",
"audioLanguage": "zh-CN",
"consent": {
"audioUrl":"http://oss.com/abc/xx.mp3",
"speakerName": "xiaowang",
"companyName": "XXXX"
},
"taskType": "TTS3",
"voice": {
"name": "xiaotang0",
"gender": 2
},
"musicSep": false,
"trainMode": "common"
}
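The parameter constraints above can be checked on the client before the request is sent. The sketch below is illustrative only: `build_qid_request` and the constants are hypothetical names, and rejecting empty names is an assumption based on the fields being required. Sending the payload is not shown, because it depends on the token header described in the API Reference section.

```python
import json

ALLOWED_TASK_TYPES = {"TTS3", "TTS6", "TTS7", "TTS8", "TTS101"}
ALLOWED_TRAIN_MODES = {"common", "backend_only"}
MAX_NAME_LENGTH = 64  # limit stated for speakerName, companyName, and the voice name


def _check_name(field, value):
    # Assumption: required names must be non-empty as well as within the limit.
    if not value or len(value) > MAX_NAME_LENGTH:
        raise ValueError(f"{field} must be 1 to {MAX_NAME_LENGTH} characters long")


def build_qid_request(audio_url, audio_language, consent_audio_url, speaker_name,
                      company_name, voice_name, gender, task_type="TTS3",
                      music_sep=None, train_mode=None):
    """Return the JSON body for the QID create request as a dict."""
    _check_name("speakerName", speaker_name)
    _check_name("companyName", company_name)
    _check_name("voice name", voice_name)
    if gender not in (1, 2):
        raise ValueError("gender must be 1 (male) or 2 (female)")
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"unsupported taskType: {task_type}")

    payload = {
        "audioUrl": audio_url,
        "audioLanguage": audio_language,
        "consent": {
            "audioUrl": consent_audio_url,
            "speakerName": speaker_name,
            "companyName": company_name,
        },
        "taskType": task_type,
        "voice": {"name": voice_name, "gender": gender},
    }
    # Optional fields are only included when the caller sets them.
    if music_sep is not None:
        payload["musicSep"] = bool(music_sep)
    if train_mode is not None:
        if train_mode not in ALLOWED_TRAIN_MODES:
            raise ValueError(f"unsupported trainMode: {train_mode}")
        payload["trainMode"] = train_mode
    return payload


if __name__ == "__main__":
    body = build_qid_request(
        "http://oss.com/abc/object.mp3", "zh-CN", "http://oss.com/abc/xx.mp3",
        "xiaowang", "XXXX", "xiaotang0", 2, music_sep=False, train_mode="common",
    )
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Called with the values shown, the helper reproduces the request example above.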
# Response Elements
| Field | Type | Required | Description |
|---|---|---|---|
| code | Integer | True | 0: success; any other value: error |
| message | String | True | Error details |
| data | Object | False | Task ID |
# Response Example
{
"code": 0,
"message": "success",
"data": 11890
}
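A caller will typically check `code` before using `data`. The following sketch assumes the response body is the JSON shown above; `extract_task_id` and `VoiceCloneError` are hypothetical names. Note that the table lists `data` as an Object while the example carries a bare integer, so the helper returns whatever value is present without further interpretation.

```python
import json


class VoiceCloneError(RuntimeError):
    """Raised when the platform reports a non-zero `code`."""


def extract_task_id(response_text):
    """Return the task ID from a create-task response, or raise on failure."""
    body = json.loads(response_text)
    code = body.get("code")
    if code != 0:
        raise VoiceCloneError(f"request failed with code {code}: {body.get('message')}")
    return body.get("data")


if __name__ == "__main__":
    print(extract_task_id('{"code": 0, "message": "success", "data": 11890}'))
```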
# TTS Voice Training Audio Duration Requirements
| Training algorithm type | Duration requirement |
|---|---|
| TTS3 | At least 5 minutes; 20+ minutes for better results |
| TTS6 | 30-90 seconds |
| TTS7 | 30-300 seconds |
| TTS8 | 30-300 seconds |
| TTS101 | At least 5 minutes; 20+ minutes for better results |
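The duration requirements can be encoded as a lookup table and checked before a training task is submitted. This is a sketch under stated assumptions: the function and constant names are hypothetical, the range bounds are treated as inclusive, and the caller is expected to measure the audio duration itself.

```python
# (minimum_seconds, maximum_seconds); None means no upper bound is documented.
DURATION_LIMITS_SECONDS = {
    "TTS3": (300, None),
    "TTS6": (30, 90),
    "TTS7": (30, 300),
    "TTS8": (30, 300),
    "TTS101": (300, None),
}


def check_training_duration(task_type, duration_seconds):
    """Return True if the audio duration satisfies the requirement for task_type."""
    if task_type not in DURATION_LIMITS_SECONDS:
        raise ValueError(f"unknown training algorithm type: {task_type}")
    minimum, maximum = DURATION_LIMITS_SECONDS[task_type]
    if duration_seconds < minimum:
        return False
    # Assumption: the documented upper bound is inclusive.
    if maximum is not None and duration_seconds > maximum:
        return False
    return True


if __name__ == "__main__":
    for task_type, seconds in [("TTS3", 240), ("TTS3", 1200), ("TTS6", 60), ("TTS6", 120)]:
        print(task_type, seconds, check_training_duration(task_type, seconds))
```

For TTS3 and TTS101 the check only enforces the 5-minute minimum; the recommendation of 20 minutes or more is a quality guideline rather than a hard limit.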
# TTS Language Standards (BCP 47)
| Code | Language (Region) |
|---|---|
| en-US | English (United States) |
| zh-CN | Chinese (China) |
| af-ZA | Afrikaans (South Africa) |
| am-ET | Amharic (Ethiopia) |
| ar-EG | Arabic (Egypt) |
| ar-SA | Arabic (Saudi Arabia) |
| az-AZ | Azerbaijani (Azerbaijan) |
| bg-BG | Bulgarian (Bulgaria) |
| bn-BD | Bengali (Bangladesh) |
| bn-IN | Bengali (India) |
| bs-BA | Bosnian (Bosnia and Herzegovina) |
| ca-ES | Catalan (Spain) |
| cs-CZ | Czech (Czechia) |
| cy-GB | Welsh (United Kingdom) |
| da-DK | Danish (Denmark) |
| de-AT | German (Austria) |
| de-CH | German (Switzerland) |
| de-DE | German (Germany) |
| el-GR | Greek (Greece) |
| en-AU | English (Australia) |
| en-CA | English (Canada) |
| en-GB | English (United Kingdom) |
| en-IE | English (Ireland) |
| en-IN | English (India) |
| es-ES | Spanish (Spain) |
| es-MX | Spanish (Mexico) |
| et-EE | Estonian (Estonia) |
| eu-ES | Basque (Spain) |
| fa-IR | Persian (Iran) |
| fi-FI | Finnish (Finland) |
| fil-PH | Filipino (Philippines) |
| fr-BE | French (Belgium) |
| fr-CA | French (Canada) |
| fr-CH | French (Switzerland) |
| fr-FR | French (France) |
| ga-IE | Irish (Ireland) |
| gl-ES | Galician (Spain) |
| he-IL | Hebrew (Israel) |
| hi-IN | Hindi (India) |
| hr-HR | Croatian (Croatia) |
| hu-HU | Hungarian (Hungary) |
| hy-AM | Armenian (Armenia) |
| id-ID | Indonesian (Indonesia) |
| is-IS | Icelandic (Iceland) |
| it-IT | Italian (Italy) |
| ja-JP | Japanese (Japan) |
| jv-ID | Javanese (Indonesia) |
| ka-GE | Georgian (Georgia) |
| kk-KZ | Kazakh (Kazakhstan) |
| km-KH | Khmer (Cambodia) |
| kn-IN | Kannada (India) |
| ko-KR | Korean (South Korea) |
| lo-LA | Lao (Laos) |
| lt-LT | Lithuanian (Lithuania) |
| lv-LV | Latvian (Latvia) |
| mk-MK | Macedonian (North Macedonia) |
| ml-IN | Malayalam (India) |
| mn-MN | Mongolian (Mongolia) |
| ms-MY | Malay (Malaysia) |
| mt-MT | Maltese (Malta) |
| my-MM | Burmese (Myanmar) |
| nb-NO | Norwegian Bokmål (Norway) |
| ne-NP | Nepali (Nepal) |
| nl-BE | Dutch (Belgium) |
| nl-NL | Dutch (Netherlands) |
| pl-PL | Polish (Poland) |
| ps-AF | Pashto (Afghanistan) |
| pt-BR | Portuguese (Brazil) |
| pt-PT | Portuguese (Portugal) |
| ro-RO | Romanian (Romania) |
| ru-RU | Russian (Russia) |
| si-LK | Sinhala (Sri Lanka) |
| sk-SK | Slovak (Slovakia) |
| sl-SI | Slovenian (Slovenia) |
| so-SO | Somali (Somalia) |
| sq-AL | Albanian (Albania) |
| sr-RS | Serbian (Serbia) |
| su-ID | Sundanese (Indonesia) |
| sv-SE | Swedish (Sweden) |
| sw-KE | Swahili (Kenya) |
| ta-IN | Tamil (India) |
| te-IN | Telugu (India) |
| th-TH | Thai (Thailand) |
| tr-TR | Turkish (Turkey) |
| uk-UA | Ukrainian (Ukraine) |
| ur-PK | Urdu (Pakistan) |
| uz-UZ | Uzbek (Uzbekistan) |
| vi-VN | Vietnamese (Vietnam) |
| zh-HK | Chinese (Hong Kong) |
| zh-TW | Chinese (Taiwan) |
| zu-ZA | Zulu (South Africa) |
# Create TTS Personal Voice Model Generation Task (Legacy API, Not Recommended)
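# API Description
The TTS personal voice model generation service trains a digital human TTS voice model from voice material that the user uploads (captured or recorded from a real person). The resulting model's pronunciation matches that of the person who provided the material. To ensure training quality, the training audio must be at least 5 minutes long. Follow the SenseTime digital human voice cloning recording specification when capturing audio; it covers environment, equipment, pronunciation, authorization, and reading script requirements. For details, see the recording specification. The PaaS platform keeps generated content online for 7 days; transfer it to your own storage promptly, because it can no longer be downloaded after 7 days.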
# Request URL
POST
/api/2dvh/v1/material/voice/clone/create
# Request Headers
Content-Type: application/json
# Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| url | String | True | Training audio file URL; the audio must be at least 5 minutes long |
| voice | Object | True | Voice parameters |
| voice.name | String | True | Speaker name |
| voice.gender | Integer | True | Speaker gender (1: Male, 2: Female) |
| voice.language | String | True | Speaker language (currently only zh-CN, Mandarin Chinese, is supported) |
| musicSep | Boolean | False | Whether to remove background music from the audio |
| sampleAudioMsg | String | False | Text for the sample audio. No sample audio is generated by default. Maximum 500 characters |
| trainMode | String | False | Training mode. common: regular training mode (default). backend_only: fast training mode, which greatly shortens model training time at some cost to quality |
# Request Example
{
"url": "http://oss.com/abc/object.zip",
"voice": {
"name": "xiaotang0",
"gender": 2,
"language": "zh-CN"
},
"sampleAudioMsg": "我是商汤数字人!",
"musicSep": true,
"trainMode": "common"
}
# Response Elements
| Field | Type | Required | Description |
|---|---|---|---|
| code | Integer | True | 0: success; any other value: error |
| message | String | True | Error details |
| data | Object | False | Task ID |
# Response Example
{
"code": 0,
"message": "success",
"data": 11890
}
The above covers all voice processing capabilities provided by the platform.