# Speech Processing
Beyond integrating multiple elements for video synthesis, the platform provides speech-processing capabilities that let users preprocess the speech-related content of a video. These include TTS, ASR, NLP, polyphone processing, and timbre transfer. The platform also provides USSML (Unified Speech Synthesis Markup Language), which normalizes the SSML (Speech Synthesis Markup Language) syntax of the various TTS providers.
# Capability Introduction
- TTS (Text-to-Speech)
- Reads out the text content passed in by the user in the voice of the selected speaker, supporting the adjustment of pitch, speed, and volume of the reading voice. Note that speakers of all languages can synthesize English text; speakers of all languages can synthesize text in their own language; speakers of Chinese dialects such as Cantonese and Shanghainese can synthesize Chinese text.
- ASR (Automatic Speech Recognition)
- Analyzes the audio content provided by the user and transcribes it into corresponding text content.
- NLP (Natural Language Processing)
- Analyzes the semantics of the text content provided by the user, then interprets and expands it to return the text content the user expects.
- Polyphone Processing
- Searches the text content provided by the user for polyphones and returns each candidate polyphone with its pronunciation options.
- Timbre Transfer
- Converts the original audio content provided by the user to the specified timbre and returns the resulting audio.
# API Description
To call all API services of the platform, users need to access the service entry point: aigc.softsugar.com, and add token information in the request header.
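As a minimal sketch of the calling convention described above, the following builds (but does not send) a JSON POST request to the service entry point. The `https` scheme, the `Authorization` header name, and the helper name `build_request` are assumptions for illustration only; the text above says only that token information goes in the request header, so use the header name your account documentation specifies.

```python
import json
import urllib.request

# Service entry point from the text above; the https scheme is an assumption.
BASE_URL = "https://aigc.softsugar.com"


def build_request(path, payload, token, token_header="Authorization"):
    """Build a JSON POST request without sending it.

    token_header is a hypothetical default; the actual header name
    is not stated in this section.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", token_header: token}
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method="POST"
    )


req = build_request(
    "/api/voice/v1/nlp/language", {"query": "hello"}, token="<your-token>"
)
# Sending would be: urllib.request.urlopen(req), omitted here on purpose.
```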
# TTS Language Detection
# Interface Description
Identifies the language of the text content provided by the user and returns the corresponding language code.
# Request URL
POST
/api/voice/v1/nlp/language
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
query | String | True | Text content |
# Request Example
{
"query": "Text content to be synthesized into speechxxxxxx"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.result | String | Language code, refer to ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |
# Response Example
case1: Request successful
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": "zh"
}
}
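A short sketch of how a client might read the detected language from a response shaped like the example above. It assumes that `code` 0 and `data.status` 0 indicate success, as in the example; the full status table is not part of this section, and `detected_language` is a hypothetical helper name.

```python
def detected_language(response):
    """Return the language code from a language-detection response dict."""
    data = response.get("data") or {}
    # Assumption: code 0 and status 0 mean success, as in the example above.
    if response.get("code") != 0 or data.get("status") != 0:
        raise RuntimeError(f"language detection failed: {response.get('message')}")
    return data["result"]


sample = {"code": 0, "message": "ok", "data": {"status": 0, "result": "zh"}}
language = detected_language(sample)
```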
# TTS Language Validation (Qid)
# Interface Description
Based on the text content, language, and QID provided by the user, it determines whether the elements match each other and returns the detection result's language code.
# Request URL
POST
/api/voice/v3/tts/validate
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
query | String | True | Text content |
qid | String | True | Voice-Qid |
ssml | Boolean | False | Whether to use SSML; default false |
# Request Example
{
"query": "Text content to be synthesized into speechxxxxxx",
"qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.result | Object | Validation result |
data.result.valid | Boolean | Whether it matches |
data.result.language | String | Detected language code, refer to ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |
# Response Example
case1: Request successful
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": {
"valid": true,
"language": "zh"
}
}
}
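One possible use of this interface is as a pre-flight check before a TTS request. The sketch below takes a response dict shaped like the example above and raises if the text does not match the voice. The helper name `ensure_text_matches_voice` and the choice to raise `ValueError` are illustrative assumptions, not platform behavior.

```python
def ensure_text_matches_voice(response):
    """Return the detected language, or raise if validation reports a mismatch."""
    result = response["data"]["result"]
    if not result["valid"]:
        # The detected language helps the caller pick a different voice.
        raise ValueError(
            f"text (detected language: {result['language']}) does not match the Qid"
        )
    return result["language"]


ok = {"code": 0, "message": "ok",
      "data": {"status": 0, "result": {"valid": True, "language": "zh"}}}
language = ensure_text_matches_voice(ok)
```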
# Voice ID Migration to QID
# Interface Description
Migrates the old timbre identified by an existing Voice ID to the new timbre identified by a Qid and returns the corresponding Qid value. Store the returned Qid and use it for subsequent inference requests.
# Request URL
POST
/api/voice/v3/tts/migrate
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
voice_id | String | True | The voice id to be converted |
# Request Example
{
"voice_id": "xiaoning"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.result | String | The pre-configured Qid. The value is long; storing it as VARCHAR(512) is recommended |
# Response Example
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"
}
}
# QID Details Interface
# Interface Description
Obtains detailed information about the QID, including the parameters that the timbre supports for adjustment.
# Request URL
GET
/api/voice/v3/tts/qid/{qid}
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
qid | String | True | The QID whose details are requested (passed as a path parameter) |
# Request Example
https://domain/api/voice/v3/tts/qid/mwvA2f:AEBEvMy850Y_Z10Mqp9GUwTMr8xMyI3Tzk3Q
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.result | Object | QID details |
data.result.pitch | Boolean | Whether pitch adjustment is supported |
data.result.speed | Boolean | Whether speed adjustment is supported |
data.result.volume | Boolean | Whether volume adjustment is supported |
data.result.phone | Boolean | Whether returning phonemes is supported |
data.result.subtitle | Boolean | Whether returning subtitles, sentence-level timestamps, and word-level timestamps is supported |
data.result.ussml | Object | USSML syntax features supported by the timbre |
data.result.ussml.break | Boolean | Whether the <break> tag in USSML takes effect |
data.result.ussml.phoneme | String List | Pronunciation tags that take effect in USSML; corresponds to the <pinyin> and <ipa> tags |
data.result.ussml.sub | Boolean | Whether text substitution in USSML (the <sub> tag) takes effect |
data.result.ussml.sayas | String List | Reading-method tags that take effect in USSML; corresponds to the <cardinal>, <digit>, <phone>, <address>, <date>, and <clock> tags |
data.result.languages | String List | Supported language list |
# Response Example
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"result": {
"pitch": true,
"speed": true,
"volume": true,
"phone": true,
"subtitle": true,
"ussml": {
"break": true,
"phoneme": ["pinyin"],
"sub": true,
"sayas": ["cardinal", "digit", "phone", "address", "date", "clock"]
},
"languages": [
"zh-CN",
"en-US"
]
}
}
}
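The capability flags above can be used to avoid sending adjustments a timbre does not support. The following sketch drops such options from a prospective TTS payload. The mapping from the request fields `pitch_offset`, `speed_ratio`, and `volume` to the flags `pitch`, `speed`, and `volume` is an assumption based on the field descriptions, and `supported_tts_options` is a hypothetical helper.

```python
# Assumed mapping from TTS request fields to QID capability flags.
OPTION_TO_FLAG = {"pitch_offset": "pitch", "speed_ratio": "speed", "volume": "volume"}


def supported_tts_options(qid_details, requested):
    """Return a copy of `requested` without adjustments the timbre does not support."""
    kept = {}
    for name, value in requested.items():
        flag = OPTION_TO_FLAG.get(name)
        # Fields that are not adjustments (qid, query, ...) always pass through.
        if flag is None or qid_details.get(flag, False):
            kept[name] = value
    return kept


details = {"pitch": False, "speed": True, "volume": True}
payload = supported_tts_options(
    details, {"qid": "q", "query": "hi", "pitch_offset": 5.0, "speed_ratio": 1.2}
)
```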
# Initiate a TTS Request (Qid)
# Interface Description
Reads the input text aloud in the voice of the selected speaker. The pitch, speed, and volume of the voice can be adjusted.
# Request URL
POST
/api/voice/v3/tts/request
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
qid | String | True | Speaker's Qid |
query | String | True | Text content to be synthesized into speech |
ssml | Boolean | False | Whether to use USSML |
phoneme | Boolean | False | Whether to return the URL of the phoneme file |
timeout | Integer | False | Timeout duration in ms. If returned within the timeout, the TTS result is directly returned; otherwise, return request_id |
pitch_offset | Float | False | Pitch, the higher the value, the sharper the voice, and the lower the value, the deeper the voice, supported range [-60, 60]; default 0 |
speed_ratio | Float | False | Speech speed, the higher the value, the slower the speed, supported range [0.5, 2]; default 1.0 |
volume | Integer | False | Volume, the higher the value, the louder the sound, supported range [1, 400]; default 100 |
subtitle_max_length | Integer | False | Maximum length of each subtitle line, default 0, i.e., no length limit |
subtitle_cut_by_punc | Boolean | False | Whether to split and line break subtitles based on punctuation, default false, i.e., no split |
word_time | Boolean | False | Whether to return word-level timestamps, default false, i.e., not returned |
# Request Example
{
"qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK",
"query": "Text content to be synthesized into speechxxxxxx",
"phoneme": true,
"timeout": 3000,
"word_time": true,
"pitch_offset": 0.0,
"speed_ratio": 1.0,
"volume": 100
}
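The documented ranges for `pitch_offset`, `speed_ratio`, and `volume` can be checked on the client before a request is sent. This is a sketch only: `build_tts_payload` is a hypothetical helper, and the platform's own behavior for out-of-range values is not described in this section.

```python
def build_tts_payload(qid, query, pitch_offset=0.0, speed_ratio=1.0, volume=100, **extra):
    """Build a TTS request body, rejecting values outside the documented ranges."""
    if not -60 <= pitch_offset <= 60:
        raise ValueError("pitch_offset must be within [-60, 60]")
    if not 0.5 <= speed_ratio <= 2:
        raise ValueError("speed_ratio must be within [0.5, 2]")
    if not 1 <= volume <= 400:
        raise ValueError("volume must be within [1, 400]")
    payload = {
        "qid": qid,
        "query": query,
        "pitch_offset": pitch_offset,
        "speed_ratio": speed_ratio,
        "volume": volume,
    }
    # Optional fields such as timeout, phoneme, or word_time pass through unchanged.
    payload.update(extra)
    return payload


payload = build_tts_payload("some-qid", "Hello", timeout=3000, word_time=True)
```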
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | TTS synthesis result. Populated if synthesis completes within the timeout; otherwise an empty object {} |
data.result.audio_url | String | TTS synthesized audio MP3 file URL |
data.result.srt_url | String | TTS audio subtitles SRT file URL |
data.result.phone_url | String | TTS audio phoneme file URL |
data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file in ms |
data.result.word_times | List | Word-level timestamps of the TTS audio file in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of the TTS audio file word in ms |
data.result.word_times.end_ms | Integer | End timestamp of the TTS audio file word in ms |
data.result.word_times.text | String | Text content of the TTS audio file word |
# Response Example
case1: Request Timeout
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# Initiate TTS Request (Legacy)
Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.
# Interface Description
Reads the input text aloud in the voice of the voice actor chosen by the user. The pitch, speed, and volume of the reading voice can be adjusted. Note that voice actors of all languages can synthesize English text; voice actors of all languages can synthesize text in their own language; and voice actors of Chinese dialects such as Cantonese and Shanghainese can synthesize Chinese text.
# Request URL
POST
/api/voice/v1/request/tts
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
voice_id | String | True | Voice actor ID |
language | String | False | Language |
query | String | True | Text content for voice synthesis |
ssml | Boolean | False | Whether to use SSML |
phoneme | Boolean | False | Whether to return the URL of the phoneme file |
timeout | Integer | False | Timeout in ms. If a response is received within this time, return the TTS result directly; otherwise, return request_id |
pitch_offset | Float | False | Pitch, the higher the value, the sharper it is, the lower the value, the deeper it is, supported range [-60, 60]; default is 0 |
speed_ratio | Float | False | Speech rate, the higher the value, the slower the speech, supported range [0.5, 2]; default is 1.0 |
volume | Integer | False | Volume, the higher the value, the louder the sound, supported range [1, 400]; default is 100 |
subtitle_max_length | Integer | False | Maximum length of each subtitle line, default is 0, i.e., no limit |
subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation for line breaks, default is false, i.e., no splitting |
word_time | Boolean | False | Whether to return word-level timestamps, default is false, i.e., not returned |
# Request Example
{
"voice_id": "xiaoling",
"query": "Text content to be synthesized xxxxxx",
"phoneme": true,
"timeout": 3000,
"word_time": true,
"pitch_offset": 0.0,
"speed_ratio": 1.0,
"volume": 100
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | TTS synthesis result. Populated if synthesis completes within the timeout; otherwise an empty object {} |
data.result.audio_url | String | TTS synthesized audio MP3 file URL |
data.result.srt_url | String | TTS audio subtitle SRT file URL |
data.result.phone_url | String | TTS audio phoneme file URL |
data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file in ms |
data.result.word_times | List | Word-level timestamps of the TTS audio file in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of the word in the TTS audio file in ms |
data.result.word_times.end_ms | Integer | End timestamp of the word in the TTS audio file in ms |
data.result.word_times.text | String | Text content of the word in the TTS audio file |
# Response Example
case1: Request Timeout
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request completed within the timeout
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# TTS Status Query (Qid)
# Interface Description
Returns the current status of a specified TTS request based on input parameters.
# Request URL
POST
/api/voice/v3/tts/result
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Parameters
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | TTS synthesis result. Populated once processing has completed; otherwise an empty object {} |
data.result.audio_url | String | TTS synthesized audio MP3 file URL |
data.result.srt_url | String | TTS audio subtitle SRT file URL |
data.result.phone_url | String | TTS audio phoneme file URL |
data.result.duration_ms | Integer | Duration of TTS synthesized audio MP3 file, in ms |
data.result.word_times | List | Timestamps of TTS audio file at word level, in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of TTS audio file word, in ms |
data.result.word_times.end_ms | Integer | End timestamp of TTS audio file word, in ms |
data.result.word_times.text | String | Text content of TTS audio file word |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
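When the initial request returns only a `request_id`, a client can poll this interface until the result is ready. The loop below is a sketch under two assumptions drawn from the examples: status 1000 means still processing and status 0 means completed. The `fetch_result` callable is a hypothetical function that the caller supplies to perform the HTTP call, which keeps the sketch independent of any HTTP library.

```python
import time


def poll_until_done(fetch_result, request_id, interval_s=1.0, max_attempts=30,
                    sleep=time.sleep):
    """Call fetch_result(request_id) until the task completes; return data.result."""
    for _ in range(max_attempts):
        response = fetch_result(request_id)
        data = response["data"]
        if data["status"] == 0:       # assumed: completed
            return data["result"]
        if data["status"] != 1000:    # assumed: anything else is an error
            raise RuntimeError(
                f"unexpected status {data['status']}: {response.get('message')}"
            )
        sleep(interval_s)             # still processing, wait and retry
    raise TimeoutError(f"request {request_id} not finished after {max_attempts} polls")
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.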
# TTS Status Query (Old)
Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.
# Interface Description
Returns the current status of the specified TTS request based on input parameters.
# Request URL
POST
/api/voice/v1/result/tts
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Parameters
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | TTS synthesis result. Populated once processing has completed; otherwise an empty object {} |
data.result.audio_url | String | TTS synthesis audio MP3 file URL |
data.result.srt_url | String | TTS audio subtitles SRT file URL |
data.result.phone_url | String | TTS audio phoneme file URL |
data.result.duration_ms | Integer | Duration of TTS synthesis audio MP3 file, in ms |
data.result.word_times | List | Timestamps at word level for TTS audio file, in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of TTS audio file word, in ms |
data.result.word_times.end_ms | Integer | End timestamp of TTS audio file word, in ms |
data.result.word_times.text | String | Text content of TTS audio file word |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
"duration_ms": 1000,
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# Initiate ASR Request
# Interface Description
Analyzes the audio content provided by the user and transcribes it into the corresponding text content.
# Request URL
POST
/api/voice/v1/request/asr
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
audio_url | String | True | URL of the audio file for text recognition |
timeout | Integer | False | Timeout for waiting for a return, in ms. If returned within the timeout, the ASR result is directly returned, otherwise return request_id |
subtitle_max_length | Integer | False | Maximum length of each subtitle line, default is 0, meaning no length limit |
subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation for line breaks, default is false, meaning no split |
word_time | Boolean | False | Whether to return word-level timestamps, default is false, meaning not returned |
# Request Example
{
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"timeout": 3000
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | ASR request result. Populated if recognition completes within the timeout; otherwise an empty object {} |
data.result.srt_url | String | ASR audio subtitles SRT file URL |
data.result.text | String | ASR recognition result |
data.result.word_times | List | Timestamps at word level for ASR audio file, in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of ASR audio file word, in ms |
data.result.word_times.end_ms | Integer | End timestamp of ASR audio file word, in ms |
data.result.word_times.text | String | Text content of ASR audio file word |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Request not timed out, request successful
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"text": "ASR recognition result",
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
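The `word_times` list can be turned into subtitle cues on the client when finer control is needed than the returned SRT file offers. The sketch below emits one SRT cue per word using the standard `HH:MM:SS,mmm` timestamp form. Both helper names are hypothetical, and grouping words into longer lines is left out.

```python
def ms_to_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def word_times_to_srt(word_times):
    """Build SRT text with one numbered cue per word_times entry."""
    cues = []
    for index, word in enumerate(word_times, start=1):
        start = ms_to_srt_time(word["begin_ms"])
        end = ms_to_srt_time(word["end_ms"])
        cues.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(cues)


srt_text = word_times_to_srt([
    {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
    {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
])
```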
# ASR Status Query
# Interface Description
Returns the current status of the specified ASR request based on input parameters.
# Request URL
POST
/api/voice/v1/result/asr
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | ASR request result. Populated once processing has completed; otherwise an empty object {} |
data.result.srt_url | String | ASR audio subtitles SRT file URL |
data.result.text | String | ASR recognition result |
data.result.word_times | List | Timestamps at word level for ASR audio file, in ms |
data.result.word_times.begin_ms | Integer | Start timestamp of ASR audio file word, in ms |
data.result.word_times.end_ms | Integer | End timestamp of ASR audio file word, in ms |
data.result.word_times.text | String | Text content of ASR audio file word |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
"text": "ASR recognition result",
"word_times": [
{
"begin_ms": 2080,
"end_ms": 2560,
"text": "字"
},
{
"begin_ms": 2560,
"end_ms": 3040,
"text": "级"
},
{
"begin_ms": 3040,
"end_ms": 3520,
"text": "别"
},
{
"begin_ms": 3520,
"end_ms": 4000,
"text": "时"
},
{
"begin_ms": 4000,
"end_ms": 4480,
"text": "间"
},
{
"begin_ms": 4480,
"end_ms": 4960,
"text": "戳"
} ]
}
}
}
# Initiate Polyphony Request
# Interface Description
Returns whether the text contains polyphonic characters and the selectable pronunciations for the polyphonic characters based on input parameters.
# Request URL
POST /api/voice/v1/request/polyphony
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
query | String | True | Request text |
format | String | False | Return format; supports "pinyin" and "bopomofo", default "pinyin" |
timeout | Integer | False | Timeout for waiting for a return, in ms. If returned within the timeout, directly return the polyphony result, otherwise return request_id |
# Request Example
{
"query": "Text content for voice synthesis xxxxxx",
"timeout": 3000
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object List | Polyphony request result. Populated if the request completes within the timeout; otherwise an empty list [] |
data.result.text | String | Polyphonic text |
data.result.polyphony | String List | Selectable pronunciations for the polyphonic character, recommended pronunciation first, followed by candidate pronunciations |
data.result.polyphony_assist | List of String List | Selectable pronunciations for the polyphonic character, each as a pair of Pinyin and Zhuyin (bopomofo) annotations, with the recommended pronunciation first and the alternatives after it. Returned only when format is "bopomofo" |
# Response Example
case1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": []
}
}
case2: Request not timed out, request successful
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": [
{
"text":"待",
"polyphony":["dai4","dai1"]
},
{
"text":"的",
"polyphony":["de5","di4","di1","di2"]
}
]
}
}
case3: Bopomofo format requested, request completed within the timeout
{
"code": 0,
"data": {
"request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
"result": [
{
"polyphony": [
"lv4",
"lu4"
],
"polyphony_assist": [
["lv4", "ㄌㄩˋ"],
["lu4", "ㄌㄨˋ"]
],
"text": "绿"
},
{
"polyphony": [
"le5",
"liao3"
],
"polyphony_assist": [
["le5", "˙ㄌㄜ"],
["liao3", "ㄌㄧㄠˇ"]
],
"text": "了"
}
],
"status": 0
},
"message": "ok"
}
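Because the recommended pronunciation is listed first in `polyphony`, a client that only needs defaults can take the first entry for each character. A minimal sketch, with `recommended_pronunciations` as a hypothetical helper name:

```python
def recommended_pronunciations(result):
    """Map each polyphonic character to its first (recommended) pronunciation."""
    return {
        item["text"]: item["polyphony"][0]
        for item in result
        if item.get("polyphony")  # skip entries without candidates
    }


result = [
    {"text": "待", "polyphony": ["dai4", "dai1"]},
    {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
]
defaults = recommended_pronunciations(result)
```

If the same character appears more than once in the text, this mapping keeps only the last occurrence; a list of pairs would preserve every occurrence.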
# Polyphony Status Query
# Interface Description
Returns the current status of the specified polyphony request based on input parameters.
# Request URL
POST /api/voice/v1/result/polyphony
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Code number, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object List | Polyphony request result. Populated once processing has completed; otherwise an empty list [] |
data.result.text | String | Polyphonic text |
data.result.polyphony | String List | Selectable pronunciations for the polyphonic character, recommended pronunciation first, followed by candidate pronunciations |
data.result.polyphony_assist | List of String List | Selectable pronunciations for the polyphonic character, each as a pair of Pinyin and Zhuyin (bopomofo) annotations, with the recommended pronunciation first and the alternatives after it. Returned only when format is "bopomofo" |
# Response Example
case1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": []
}
}
case2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": [
{
"text":"待",
"polyphony":["dai4","dai1"]
},
{
"text":"的",
"polyphony":["de5","di4","di1","di2"]
}
]
}
}
case3: Bopomofo format requested, processing completed
{
"code": 0,
"data": {
"request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
"result": [
{
"polyphony": [
"lv4",
"lu4"
],
"polyphony_assist": [
["lv4", "ㄌㄩˋ"],
["lu4", "ㄌㄨˋ"]
],
"text": "绿"
},
{
"polyphony": [
"le5",
"liao3"
],
"polyphony_assist": [
["le5", "˙ㄌㄜ"],
["liao3", "ㄌㄧㄠˇ"]
],
"text": "了"
}
],
"status": 0
},
"message": "ok"
}
# Initiate Synthesis Request for Singing Audio
# API Description
Returns a new pure vocal audio and background music (if applicable) based on the input singing audio, singing audio attributes, and a specified singing tone ID.
# Request URL
POST /api/voice/v3/svc/request
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
audio_url | String | True | URL of the original singing audio |
sid | String | True | Singing tone ID |
with_bgm | Boolean | False | Indicates whether the audio file contains background music. If it includes background music, source separation will be performed, and the dry vocal and background music will be returned separately |
pitch | Integer | False | Pitch, the higher the value the sharper, the lower the value the deeper. Supported range is [-12, 12]; default is 0 |
timeout | Integer | False | Time in milliseconds to synchronously wait for task completion. If the task completes within the time limit, it directly returns the result; otherwise, it returns the request_id |
# Request Sample
{
"audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"sid": "sid1",
"with_bgm": true,
"pitch": 0,
"timeout": 0
}
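The singing-voice request takes an integer `pitch` in [-12, 12], a different range and type from the TTS `pitch_offset`. The sketch below builds the request body and checks that range on the client. `build_svc_payload` is a hypothetical helper, and omitting `timeout` when it is not given is an assumption about how optional fields may be left out.

```python
def build_svc_payload(audio_url, sid, with_bgm=False, pitch=0, timeout=None):
    """Build a singing-voice conversion request body with a client-side pitch check."""
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer within [-12, 12]")
    payload = {"audio_url": audio_url, "sid": sid, "with_bgm": with_bgm, "pitch": pitch}
    if timeout is not None:
        # Milliseconds to wait synchronously before only a request_id is returned.
        payload["timeout"] = timeout
    return payload


payload = build_svc_payload("https://example.com/song.mp3", "sid1", with_bgm=True)
```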
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Response code, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | SVC request result. Populated if the task completes within the timeout period; otherwise an empty object {} |
data.result.vocal_track_url | String | URL of the SVC result audio file |
data.result.original_instrumental_track_url | String | URL of the separated instrumental track, returned only if with_bgm is true |
data.result.original_vocal_track_url | String | URL of the separated original vocal track, returned only if with_bgm is true |
data.result.original_reverb_url | String | URL of the separated reverb track, returned only if with_bgm is true |
# Response Sample
case 1: Request Timeout
{
"code": 0,
"message": "created",
"data": {
"status": 1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case 2: Request Completed Within the Timeout
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
"original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
"original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
}
}
}
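The two response cases can be told apart by `data.status`. The sketch below shows one possible way to branch on them; the function name `interpret_svc_response` is hypothetical, and treating every status other than 0 and 1000 as an error is an assumption drawn from the status tables.

```python
def interpret_svc_response(resp):
    """Classify a parsed response body (a dict).

    Returns ("done", result) when the result is ready, or
    ("pending", request_id) when the task is still running.
    Raises RuntimeError for any other code or status.
    """
    if resp.get("code") != 0:
        raise RuntimeError(
            f"request failed: code={resp.get('code')} message={resp.get('message')}"
        )
    data = resp.get("data") or {}
    status = data.get("status")
    if status == 0:
        # Completed within the timeout: result holds the track URLs.
        return "done", data.get("result", {})
    if status == 1000:
        # Not finished yet: keep the request_id for a later status query.
        return "pending", data.get("request_id")
    raise RuntimeError(f"task failed with status {status}")
```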
# Query Synthesis Request Status for Singing Audio
# API Description
Returns the current status of the specified synthesis request for singing audio based on the input parameters.
# Request URL
POST /api/voice/v3/svc/result
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Sample
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Response code, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | Object | SVC request result. If processing is complete, this field is populated; otherwise, it is an empty object {} |
data.result.vocal_track_url | String | URL of the SVC result audio file |
data.result.original_instrumental_track_url | String | URL of the separated instrumental track, returned only if with_bgm is true |
data.result.original_vocal_track_url | String | URL of the separated original vocal track, returned only if with_bgm is true |
data.result.original_reverb_url | String | URL of the separated reverb track, returned only if with_bgm is true |
# Response Sample
case 1: Still Processing
{
"code": 0,
"message": "created",
"data": {
"status": 1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {}
}
}
case 2: Processing Complete
{
"code": 0,
"message": "ok",
"data": {
"status": 0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": {
"vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
"original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
"original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
"original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
}
}
}
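When the initial request returns only a `request_id`, the client can query this endpoint repeatedly until the status leaves 1000. The sketch below shows one possible polling loop. The `fetch_result` callable is a hypothetical stand-in for whatever HTTP client posts the body to `/api/voice/v3/svc/result` with the token header; the interval and attempt limit are arbitrary assumptions.

```python
import time


def poll_svc_result(fetch_result, request_id, interval_s=2.0, max_attempts=30,
                    sleep=time.sleep):
    """Query the status endpoint until the task finishes.

    fetch_result: hypothetical callable taking the request body (a dict)
    and returning the parsed JSON response (a dict).
    """
    for _ in range(max_attempts):
        resp = fetch_result({"request_id": request_id})
        data = resp.get("data") or {}
        status = data.get("status")
        if resp.get("code") == 0 and status == 0:
            return data.get("result", {})  # processing complete
        if resp.get("code") != 0 or status != 1000:
            # Any status other than 0 or 1000 is treated as a failure.
            raise RuntimeError(f"task failed: code={resp.get('code')} status={status}")
        sleep(interval_s)  # still processing, wait before the next query
    raise TimeoutError(f"request {request_id} did not finish in time")
```

Passing the HTTP call in as a parameter keeps the loop independent of any particular client library and makes it easy to exercise with canned responses.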
# Initiate Translation Request (Not Supported Yet)
# Interface Description
Returns the translated text content based on the input text and the specified target language.
# Request URL
POST /api/voice/v1/request/translate
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
query | String | True | Text to be translated; at most 200,000 characters |
to | String | True | Target language code for the output text. For the supported languages, see the Language List |
timeout | Integer | False | Time in milliseconds to wait synchronously for the result. If the translation completes within this time, the result is returned directly; otherwise, the request_id is returned |
# Request Example
{
"query": "Text content to be translated xxxxxx",
"to": "en",
"timeout": 3000
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Response code, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | String | Translation request result. If returned within the timeout, this field is populated; otherwise, this field is empty {} |
# Response Example
case 1: Request timed out
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": []
}
}
case 2: Request completed within the timeout, request successful
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": "The text to translate"
}
}
# Translation Status Query (Not Supported Yet)
# Interface Description
Returns the current status of the specified translation request based on input parameters.
# Request URL
POST /api/voice/v1/result/translate
# Request Parameters
Field | Type | Required | Description |
---|---|---|---|
request_id | String | True | Request ID |
# Request Example
{
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}
# Response Elements
Field | Type | Description |
---|---|---|
code | Integer | Response code, see status table |
message | String | Status description |
data | Object | Result data |
data.status | Integer | Status, see status table |
data.request_id | String | Request ID |
data.result | String | Translation request result. If processing is complete, this field is populated; otherwise, this field is empty {} |
# Response Example
case 1: Still processing
{
"code": 0,
"message": "created",
"data": {
"status":1000,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": ""
}
}
case 2: Processing completed
{
"code": 0,
"message": "ok",
"data": {
"status":0,
"request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
"result": "The text to translate"
}
}
# Status Table
# Language Detection Related Status
# TTS Language Detection Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Successfully processed |
3201 | Rejected: Required parameter missing |
3202 | Rejected: Illegal parameter--Request text is empty |
3301 | Request failed: Language not detected |
# Language Validation Related Status
# TTS Language Validation Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Successfully processed |
1201 | Rejected: Required parameter missing |
1202 | Rejected: Illegal parameter--Voice ID is empty |
1203 | Rejected: Illegal parameter--Request text is empty |
1204 | Rejected: Illegal parameter--Language is empty |
1205 | Rejected: Illegal parameter--Voice ID does not exist |
1206 | Rejected: Illegal parameter--Unknown language |
1207 | Rejected: Illegal parameter--Unknown supplier |
1301 | Request failed: Language not detected |
# TTS Related Status
# TTS Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Successfully processed |
1000 | Processing |
1002 | Rejected: Required parameter missing |
1003 | Rejected: Illegal parameter--Voice is empty |
1004 | Rejected: Illegal parameter--Text to synthesize is empty |
1005 | Rejected: Illegal parameter--Voice does not exist |
1102 | Request failed, not timed out: Service not connected |
1103 | Request failed, not timed out: Busy |
1104 | Request failed, not timed out: Internal error |
1105 | Request failed, not timed out: Request timed out |
# TTS Request Query Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Success |
1000 | Processing |
1102 | Failed: Service not connected |
1103 | Failed: Busy |
1104 | Failed: Internal error |
1105 | Failed: Request timed out |
1106 | Failed: Unknown request ID |
# ASR Related Status
# ASR Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Request successful |
1000 | Request timed out, a new request created |
2001 | Rejected: Required parameter missing |
2002 | Rejected: Illegal parameter--Audio URL is empty |
2102 | Request failed, not timed out: Service not connected |
2103 | Request failed, not timed out: Busy |
2104 | Request failed, not timed out: Internal error |
2105 | Request failed, not timed out: Request timed out |
2106 | Request failed, not timed out: Audio download error |
# ASR Request Query Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Success |
1000 | Processing |
2102 | Failed: Service not connected |
2103 | Failed: Busy |
2104 | Failed: Internal error |
2105 | Failed: Request timed out |
2106 | Failed: Audio download error |
2107 | Failed: Unknown request ID |
# Polyphone Related Status
# Polyphone Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Successfully processed |
1000 | Processing |
5002 | Rejected: Required parameter missing |
5004 | Rejected: Illegal parameter--Text to synthesize is empty |
5102 | Request failed, not timed out: Service not connected |
5103 | Request failed, not timed out: Busy |
5104 | Request failed, not timed out: Internal error |
5105 | Request failed, not timed out: Request timed out |
# Polyphone Request Query Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Success |
1000 | Processing |
5102 | Failed: Service not connected |
5103 | Failed: Busy |
5104 | Failed: Internal error |
5105 | Failed: Request timed out |
5106 | Failed: Unknown request ID |
# Translation Related Status
# Translation Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Request successful |
1000 | Processing |
3401 | Rejected: Required parameter missing |
3402 | Rejected: Illegal parameter--Request text is empty |
3403 | Rejected: Illegal parameter--Output language is empty |
3404 | Rejected: Illegal parameter--Translation text exceeds 200,000 characters |
3501 | Request failed, internal error |
3502 | Request failed, request timed out |
# Voice Conversion Request Response Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Successfully processed |
1000 | Processing |
1401 | Rejected: Required parameter missing |
1402 | Rejected: Illegal parameter--Audio URL is empty |
1403 | Rejected: Illegal parameter--vc_id does not exist |
1404 | Rejected: Illegal parameter--Audio duration exceeds 10 minutes |
1502 | Request failed: VC service not connected |
1503 | Request failed: VC service internal error |
1504 | Request failed: Request timed out |
1505 | Request failed: Audio download error |
1506 | Request failed: Incorrect audio format |
# Voice Conversion Request Query Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Success |
1000 | Processing |
1404 | Rejected: Illegal parameter--Audio duration exceeds 10 minutes |
1502 | Failed: VC service not connected |
1503 | Failed: VC service internal error |
1504 | Failed: Request timed out |
1505 | Failed: Audio download error |
1506 | Failed: Incorrect audio format |
1507 | Failed: Unknown request ID |
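A client may find it convenient to turn these numeric statuses into a coarse category plus a readable message. The sketch below hard-codes the Voice Conversion query statuses from the table above; the names `VC_QUERY_STATUS` and `describe_vc_status` and the category labels are hypothetical, not part of the platform.

```python
# Status values copied from the Voice Conversion Request Query Status table.
VC_QUERY_STATUS = {
    0: ("success", "Success"),
    1000: ("processing", "Processing"),
    1404: ("rejected", "Illegal parameter: audio duration exceeds 10 minutes"),
    1502: ("failed", "VC service not connected"),
    1503: ("failed", "VC service internal error"),
    1504: ("failed", "Request timed out"),
    1505: ("failed", "Audio download error"),
    1506: ("failed", "Incorrect audio format"),
    1507: ("failed", "Unknown request ID"),
}


def describe_vc_status(status):
    """Return (category, description) for a status value.

    Statuses missing from the table map to the "unknown" category
    instead of raising, so new server-side codes do not crash the client.
    """
    return VC_QUERY_STATUS.get(status, ("unknown", f"Undocumented status {status}"))
```

The same pattern applies to the other status tables by swapping in their values.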
# Translation Request Query Status
code | Description |
---|---|
0 | Success |
x | Failure |
status | Description |
---|---|
0 | Success |
1000 | Processing |
3501 | Failed: Internal error |
3502 | Failed: Request timed out |
3503 | Failed: Unknown request ID |
# USSML Grammar Explanation
# Introduction
USSML (Unified Speech Synthesis Markup Language) aims to provide a unified SSML (Speech Synthesis Markup Language) syntax format. USSML supports the most commonly used SSML features: pauses, specifying pronunciation, replacing synthesized text, and specifying reading methods. These cover most speech synthesis needs.
# How to Use
# Syntax Format
The USSML syntax format is as follows:
<speak sttts:version="0.1">
<break time="string" />
<phoneme ph="string"></phoneme>
<sub alias="string"></sub>
<say-as interpret-as="string"></say-as>
</speak>
Except for the <speak> tag, which wraps the other tags, tags cannot be nested.
# Special Characters
In USSML, the following special characters must be escaped when they appear in text content, as shown in the table below. Characters that are part of the USSML markup itself do not need to be escaped.
Special Character | Escape Character |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
" | &quot; |
' | &apos; |
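The five special characters correspond to the predefined XML entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), so text can be escaped with standard tooling before it is placed inside USSML. The sketch below uses Python's `xml.sax.saxutils.escape`; the wrapper name `escape_ussml_text` is hypothetical.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are passed as extra entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ussml_text(text):
    """Escape user text so it can be embedded inside USSML tags."""
    return escape(text, _QUOTE_ENTITIES)


print(escape_ussml_text('Tom & Jerry <3 "cheese"'))
# Tom &amp; Jerry &lt;3 &quot;cheese&quot;
```

Only the text content should be escaped this way; the tags themselves are left as-is.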
# Tag Explanation
# <speak>
The <speak> tag is the root tag of USSML and wraps all other USSML tags.
The sttts:version attribute specifies the USSML version number, which is currently 0.1.
<speak sttts:version="0.1">
<!-- USSML Tags -->
</speak>
# <break>
The <break> tag specifies a pause, and its time attribute specifies the duration of the pause. The value of the time attribute is a string expressed in seconds (s) or milliseconds (ms). The maximum pause duration is 5 seconds. A bare number is not accepted; a unit is required. For example:
<break time="5s" />
or
<break time="5000ms" />
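The unit and maximum-duration rules can be enforced when the tag is generated. The sketch below always emits milliseconds; the helper name `break_tag` is hypothetical, and rejecting zero or negative durations is an assumption, since the text only states the 5-second upper limit.

```python
MAX_BREAK_MS = 5000  # documented maximum pause of 5 seconds


def break_tag(milliseconds):
    """Return a <break> tag with an explicit "ms" unit."""
    if not 0 < milliseconds <= MAX_BREAK_MS:
        raise ValueError(f"pause must be between 1 and {MAX_BREAK_MS} ms")
    return f'<break time="{int(milliseconds)}ms" />'


print(break_tag(500))  # <break time="500ms" />
```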
# <phoneme>
The <phoneme> tag specifies pronunciation, and its ph attribute specifies the pronunciation to use. Since different providers support different language ranges, <phoneme> in USSML currently supports only Chinese Pinyin. Pinyin usage: separate the pinyin of each character with a space; the number of pinyin syllables must equal the number of characters. Each pinyin consists of a pronunciation and a tone, where the tone is a number from 1 to 5, with "5" representing the neutral tone. For example:
<phoneme ph="mai2 mo4">埋没</phoneme>
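The pinyin rules (one syllable per character, tone digit from 1 to 5) are easy to get wrong by hand, so a small check before building the tag can help. This is a sketch only: the helper name `phoneme_tag` is hypothetical, and restricting syllables to lowercase ASCII letters followed by a tone digit is an assumption about the accepted spelling.

```python
import re

# Assumed syllable shape: lowercase letters followed by a tone digit 1-5.
_PINYIN_SYLLABLE = re.compile(r"^[a-z]+[1-5]$")


def phoneme_tag(text, pinyin):
    """Return a <phoneme> tag after checking the documented pinyin rules."""
    syllables = pinyin.split()
    if len(syllables) != len(text):
        raise ValueError("number of pinyin syllables must equal number of characters")
    for syllable in syllables:
        if not _PINYIN_SYLLABLE.match(syllable):
            raise ValueError(f"invalid pinyin syllable: {syllable!r}")
    ph = " ".join(syllables)
    return f'<phoneme ph="{ph}">{text}</phoneme>'


print(phoneme_tag("埋没", "mai2 mo4"))  # <phoneme ph="mai2 mo4">埋没</phoneme>
```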
# Example
<speak sttts:version="0.1">
You say <phoneme ph="bo2">薄</phoneme>.
<break time="500ms" />
I say <phoneme ph="bao2">薄</phoneme>.
</speak>
# <sub>
The <sub> tag replaces text during synthesis. Its alias attribute specifies the replacement text: during synthesis, the alias text is read in place of the original text, while any subtitles show the original text inside the tag.
Neither the content of the <sub> tag nor the alias text may be empty.
For example:
<sub alias="World Wide Web Consortium">W3C</sub>
In the example above, the subtitle shows "W3C", and the spoken text is "World Wide Web Consortium".
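To preview what a `<sub>` tag does to subtitles and speech, both versions of the text can be derived on the client. The sketch below uses a regular expression and is only an illustration: the function name `subtitle_and_spoken` is hypothetical, and it assumes the alias is written with double quotes exactly as in the example.

```python
import re

# Matches <sub alias="...">...</sub>; assumes double-quoted alias values.
_SUB_TAG = re.compile(r'<sub alias="([^"]*)">(.*?)</sub>', re.DOTALL)


def subtitle_and_spoken(fragment):
    """Return (subtitle_text, spoken_text) for a fragment containing <sub> tags."""
    subtitle = _SUB_TAG.sub(lambda m: m.group(2), fragment)  # keep the tag content
    spoken = _SUB_TAG.sub(lambda m: m.group(1), fragment)    # keep the alias text
    return subtitle, spoken


print(subtitle_and_spoken('<sub alias="World Wide Web Consortium">W3C</sub> is a standards body.'))
# ('W3C is a standards body.', 'World Wide Web Consortium is a standards body.')
```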
# <say-as>
The <say-as> tag synthesizes its content using a specific reading method. Its interpret-as attribute specifies the reading method. Different providers may produce slightly different results for the same reading method.
The interpret-as attribute supports multiple values, including:
Value | Description | Example |
---|---|---|
cardinal | Pronounced as a numerical value | “1487” is read as “one thousand four hundred eighty-seven” |
digit | Pronounced as a series of digits | “12345” is read as “one two three four five” |
phone | Pronounced as a telephone number | “1301001155” is read as “one three zero one zero zero one one five five” |
address | Pronounced as an address | “市台路388-301号” is read as “Shi Tai Road three eight eight dash three zero one number” |
date | Pronounced as a date | “1998-12-12” is read as “nineteen ninety-eight December twelfth” |
clock | Pronounced as a time | “12:00:12” is read as “twelve o'clock twelve seconds” |
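The values in the table above can be used as an allow-list when generating `<say-as>` tags. The sketch below is a hypothetical helper (`say_as_tag`); it only checks that the value is one of the six listed and that the content is non-empty, which is an assumption rather than a documented requirement.

```python
from xml.sax.saxutils import escape

# interpret-as values listed in the table above.
INTERPRET_AS_VALUES = {"cardinal", "digit", "phone", "address", "date", "clock"}


def say_as_tag(text, interpret_as):
    """Return a <say-as> tag for one of the listed reading methods."""
    if interpret_as not in INTERPRET_AS_VALUES:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    if not text:
        raise ValueError("tag content must not be empty")  # assumption
    return f'<say-as interpret-as="{interpret_as}">{escape(text)}</say-as>'


print(say_as_tag("12345", "digit"))  # <say-as interpret-as="digit">12345</say-as>
```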
# Example
<say-as interpret-as="cardinal">12345</say-as>
Example 1:
<speak sttts:version="0.1">
You say <phoneme ph="bo2">薄</phoneme>.
<break time="500ms" />
I say <phoneme ph="bao2">薄</phoneme>.
</speak>
Example 2:
<speak sttts:version="0.1">
<sub alias="World Wide Web Consortium">W3C</sub>
is an international standardization organization.
</speak>
Example 3:
<speak sttts:version="0.1">
<sub alias="TsingTao Beer">青岛啤酒</sub>
in Henan dialect, it is,
<phoneme ph="qing2 dao1 pi4 jiu1">青岛啤酒</phoneme>.
<say-as interpret-as="cardinal">12345</say-as>
</speak>
The above content relates to the platform's voice processing capabilities.