# Speech Processing

Beyond its core ability to combine multiple elements during video synthesis, the platform provides speech processing capabilities that let users preprocess the speech-related content of a video. These include TTS, ASR, NLP, polyphone processing, and timbre transfer. The platform also provides USSML (Unified Speech Synthesis Markup Language), which normalizes the SSML (Speech Synthesis Markup Language) syntax of the various TTS providers.

# Capability Introduction

TTS (Text-to-Speech)
- Reads the text supplied by the user aloud in the voice of the selected speaker, with adjustable pitch, speed, and volume. Note that speakers of every language can synthesize English text as well as text in their own language; speakers of Chinese dialects such as Cantonese and Shanghainese can also synthesize Chinese text.
ASR (Automatic Speech Recognition)
- Analyzes the audio content provided by the user and transcribes it into corresponding text content.
NLP (Natural Language Processing)
- Analyzes the semantics of the text provided by the user, then interprets and expands it to return the text the user expects.
Polyphone Processing
- Finds polyphones in the text provided by the user and returns each one with its candidate pronunciations.
Timbre Transfer
- Converts the audio provided by the user to the specified timbre and returns the synthesized audio.

# API Description

To call any API service of the platform, send requests to the service entry point aigc.softsugar.com and include token information in the request header.
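As a rough sketch of the call pattern described above, the following Python helper builds a JSON POST request against the entry point without sending it. The `https` scheme and the `Authorization` header name are assumptions: the text only says that token information goes in the request header, so substitute whichever header your account documentation specifies.

```python
import json
import urllib.request

# Entry point taken from the text; the https scheme is an assumption.
BASE_URL = "https://aigc.softsugar.com"


def build_post(path, payload, token, token_header="Authorization"):
    """Build (but do not send) a JSON POST request for a platform endpoint.

    token_header is a hypothetical default; the real header name is not
    stated in this document.
    """
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json", token_header: token}
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method="POST"
    )


req = build_post("/api/voice/v1/nlp/language", {"query": "你好"}, token="YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it; the helper stops short of that so the request can be inspected first.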

# TTS Language Detection

# Interface Description

Identifies the language of the text content provided by the user and returns the corresponding language code.

# Request URL

POST /api/voice/v1/nlp/language

# Request Parameters

Field	Type	Required	Description
query	String	True	Text content

# Request Example

{
    "query": "Text content to be synthesized into speechxxxxxx"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.result	String	Language code, see ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

# Response Example

case1: Request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": "zh"
    }
}
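A minimal sketch of reading this response in Python. The function name is hypothetical, and treating any non-zero `code` or `data.status` as a failure is an assumption drawn from the success example, since the status table is not reproduced in this section.

```python
import json


def detected_language(response_text):
    """Return data.result (the language code) from a language-detection response.

    Raises ValueError when code or data.status is non-zero; treating every
    non-zero value as a failure is an assumption.
    """
    body = json.loads(response_text)
    data = body.get("data") or {}
    if body.get("code") != 0 or data.get("status") != 0:
        raise ValueError(f"language detection failed: {body.get('message')}")
    return data["result"]


sample = '{"code": 0, "message": "ok", "data": {"status": 0, "result": "zh"}}'
print(detected_language(sample))
```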

# TTS Language Validation (Qid)

# Interface Description

Determines whether the text content, language, and Qid provided by the user match one another, and returns the match result along with the detected language code.

# Request URL

POST /api/voice/v3/tts/validate

# Request Parameters

Field	Type	Required	Description
query	String	True	Text content
qid	String	True	Voice-Qid
ssml	Boolean	False	Whether to use SSML, default is false

# Request Example

{
    "query": "Text content to be synthesized into speechxxxxxx",
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.result	Object	Validation result
data.result.valid	Boolean	Whether it matches
data.result.language	String	Detected language code, see ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

# Response Example

case1: Request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": {
            "valid": true,
            "language": "zh"
        }
    }
}
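One way to consume this response before submitting a synthesis request is sketched below. `Validation` and `parse_validation` are hypothetical names introduced for illustration; only the `data.result.valid` and `data.result.language` fields come from the table above.

```python
from typing import NamedTuple


class Validation(NamedTuple):
    valid: bool
    language: str


def parse_validation(body):
    """Extract data.result.valid and data.result.language from a parsed response."""
    result = body["data"]["result"]
    return Validation(valid=bool(result["valid"]), language=result["language"])


response = {
    "code": 0,
    "message": "ok",
    "data": {"status": 0, "result": {"valid": True, "language": "zh"}},
}
check = parse_validation(response)
if not check.valid:
    print(f"text (detected as {check.language}) does not match the selected Qid")
```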

# Voice ID Migration to QID

# Interface Description

Migrates the legacy timbre identified by an existing Voice ID to the new timbre identified by a Qid and returns the corresponding Qid value. Storing the returned Qid for subsequent inference requests is recommended.

# Request URL

POST /api/voice/v3/tts/migrate

# Request Parameters

Field	Type	Required	Description
voice_id	String	True	The voice id to be converted

# Request Example

{
  "voice_id": "xiaoning"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.result	String	The pre-configured Qid. The value is long; storing it as `VARCHAR(512)` is recommended

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"
  }
}

# QID Details Interface

# Interface Description

Obtains detailed information about the QID, including the parameters that the timbre supports for adjustment.

# Request URL

GET /api/voice/v3/tts/qid/{qid}

# Request Parameters

Field	Type	Required	Description
qid	String	True	The QID whose details are requested

# Request Example

  https://domain/api/voice/v3/tts/qid/mwvA2f:AEBEvMy850Y_Z10Mqp9GUwTMr8xMyI3Tzk3Q

# Response Elements

Field	Type	Description
code	Integer	Code number
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.result	Object	QID details
data.result.pitch	Boolean	Whether pitch adjustment is supported
data.result.speed	Boolean	Whether speed adjustment is supported
data.result.volume	Boolean	Whether volume adjustment is supported
data.result.phone	Boolean	Whether returning phonemes is supported
data.result.subtitle	Boolean	Whether returning subtitles, sentence-level timestamps, and word-level timestamps is supported
data.result.ussml	Object	Whether using USSML syntax is supported
data.result.ussml.break	Boolean	Whether the `<break>` tag in USSML is effective
data.result.ussml.phoneme	String List	Whether specifying pronunciation in USSML is effective; supports the `<pinyin>` and `<ipa>` tags
data.result.ussml.sub	Boolean	Whether text substitution in USSML is effective; supports the `<sub>` tag
data.result.ussml.sayas	String List	Whether specifying the reading method in USSML is effective; supports the `<cardinal>`, `<digit>`, `<phone>`, `<address>`, `<date>`, and `<clock>` tags
data.result.languages	String List	Supported language list

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": {
      "pitch": true,
      "speed": true,
      "volume": true,
      "phone": true,
      "subtitle": true,
      "ussml": {
        "break": true,
        "phoneme": ["pinyin"],
        "sub": true,
        "sayas": ["cardinal", "digit", "phone", "address", "date", "clock"]
      },
      "languages": [
        "zh-CN",
        "en-US"
      ]
    }
  }
}
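The capability flags can be used to drop optional TTS parameters that a given Qid does not support. The sketch below is illustrative only: the function name is hypothetical, and the pairing of TTS request fields with capability flags (in particular `phoneme` with `phone`, and `word_time` with `subtitle`) is an assumption inferred from the field descriptions, not something the interface states.

```python
def supported_tts_options(details, wanted):
    """Keep only the optional TTS parameters that the Qid reports as supported.

    details is data.result from the QID details response; wanted maps TTS
    request field names to the values the caller would like to send.
    """
    # Assumed pairing of TTS request fields with capability flags.
    flag_for = {
        "pitch_offset": "pitch",
        "speed_ratio": "speed",
        "volume": "volume",
        "phoneme": "phone",
        "word_time": "subtitle",
    }
    kept = {}
    for field, value in wanted.items():
        flag = flag_for.get(field)
        # Fields without a known flag are passed through unchanged.
        if flag is None or details.get(flag, False):
            kept[field] = value
    return kept


details = {"pitch": True, "speed": False, "volume": True, "phone": True, "subtitle": True}
options = supported_tts_options(
    details, {"pitch_offset": 2.0, "speed_ratio": 1.2, "volume": 120}
)
print(options)
```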

# Initiate a TTS Request (Qid)

# Interface Description

Based on the selected voice of the speaker combined with the input text content, the text is read aloud. It supports adjusting the pitch, speed, and volume of the voice.

# Request URL

POST /api/voice/v3/tts/request

# Request Parameters

Field	Type	Required	Description
qid	String	True	Speaker's Qid
query	String	True	Text content to be synthesized into speech
ssml	Boolean	False	Whether to use USSML
phoneme	Boolean	False	Whether to return the URL of the phoneme file
timeout	Integer	False	Timeout duration in ms. If returned within the timeout, the TTS result is directly returned; otherwise, return request_id
pitch_offset	Float	False	Pitch, the higher the value, the sharper the voice, and the lower the value, the deeper the voice, supported range [-60, 60]; default 0
speed_ratio	Float	False	Speech speed, the higher the value, the slower the speed, supported range [0.5, 2]; default 1.0
volume	Integer	False	Volume, the higher the value, the louder the sound, supported range [1, 400]; default 100
subtitle_max_length	Integer	False	Maximum length of each subtitle line, default 0, i.e., no length limit
subtitle_cut_by_punc	Boolean	False	Whether to split and line break subtitles based on punctuation, default false, i.e., no split
word_time	Boolean	False	Whether to return word-level timestamps, default false, i.e., not returned

# Request Example

{
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK",
    "query": "Text content to be synthesized into speechxxxxxx",
    "phoneme": true,
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}
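Because the numeric parameters have documented ranges, it can help to check them on the client before sending. The following is a minimal sketch; `build_tts_payload` is a hypothetical helper, and only the field names, ranges, and defaults are taken from the parameter table above.

```python
def build_tts_payload(qid, query, pitch_offset=0.0, speed_ratio=1.0,
                      volume=100, timeout_ms=None):
    """Build the JSON body for a TTS request, enforcing the documented ranges."""
    if not -60 <= pitch_offset <= 60:
        raise ValueError("pitch_offset must be within [-60, 60]")
    if not 0.5 <= speed_ratio <= 2:
        raise ValueError("speed_ratio must be within [0.5, 2]")
    if not 1 <= volume <= 400:
        raise ValueError("volume must be within [1, 400]")
    payload = {
        "qid": qid,
        "query": query,
        "pitch_offset": float(pitch_offset),
        "speed_ratio": float(speed_ratio),
        "volume": int(volume),
    }
    # timeout is optional, so it is only included when the caller sets it.
    if timeout_ms is not None:
        payload["timeout"] = int(timeout_ms)
    return payload


payload = build_tts_payload("EXAMPLE_QID", "hello", volume=120, timeout_ms=3000)
print(payload)
```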

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	TTS synthesis result. Populated if the result is returned within the timeout; otherwise empty {}
data.result.audio_url	String	TTS synthesized audio MP3 file URL
data.result.srt_url	String	TTS audio subtitles SRT file URL
data.result.phone_url	String	TTS audio phoneme file URL
data.result.duration_ms	Integer	Duration of the TTS synthesized audio MP3 file in ms
data.result.word_times	List	Word-level timestamps of the TTS audio file in ms
data.result.word_times.begin_ms	Integer	Start timestamp of the TTS audio file word in ms
data.result.word_times.end_ms	Integer	End timestamp of the TTS audio file word in ms
data.result.word_times.text	String	Text content of the TTS audio file word

# Response Example

case1: Request Timeout

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Request did not time out and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}
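The `word_times` entries can be turned into per-word cue lines on the client. The sketch below formats millisecond offsets in the `HH:MM:SS,mmm` style used by SRT files; both function names are hypothetical, and the one-line cue layout is an illustration rather than the format of the file behind `srt_url`.

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT-style timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def word_cues(word_times):
    """Return one 'begin --> end text' line per word_times entry."""
    return [
        f"{ms_to_srt(w['begin_ms'])} --> {ms_to_srt(w['end_ms'])} {w['text']}"
        for w in word_times
    ]


cues = word_cues([
    {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
    {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
])
for line in cues:
    print(line)
```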

# Initiate TTS Request (Legacy)

Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.

# Interface Description

Reads the input text aloud in the voice of the voice actor selected by the user, with adjustable pitch, speed, and volume. Note that voice actors of every language can synthesize English text as well as text in their own language; dialect voice actors such as Cantonese and Shanghainese can also synthesize Chinese text.

# Request URL

POST /api/voice/v1/request/tts

# Request Parameters

Field	Type	Required	Description
voice_id	String	True	Voice actor ID
language	String	False	Language
query	String	True	Text content for voice synthesis
ssml	Boolean	False	Whether to use SSML
phoneme	Boolean	False	Whether to return the URL of the phoneme file
timeout	Integer	False	Timeout in ms. If a response is received within this time, return the TTS result directly; otherwise, return request_id
pitch_offset	Float	False	Pitch, the higher the value, the sharper it is, the lower the value, the deeper it is, supported range [-60, 60]; default is 0
speed_ratio	Float	False	Speech rate, the higher the value, the slower the speech, supported range [0.5, 2]; default is 1.0
volume	Integer	False	Volume, the higher the value, the louder the sound, supported range [1, 400]; default is 100
subtitle_max_length	Integer	False	Maximum length of each subtitle line, default is 0, i.e., no limit
subtitle_cut_by_punc	Boolean	False	Whether to split subtitles by punctuation for line breaks, default is false, i.e., no splitting
word_time	Boolean	False	Whether to return word-level timestamps, default is false, i.e., not returned

# Request Example

{
    "voice_id": "xiaoling",
    "query": "Text content to be synthesized xxxxxx",
    "phoneme": true,
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}

# Response Elements

Field	Type	Description
code	Integer	Code, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	TTS synthesis result. Populated if the result is returned within the timeout; otherwise empty {}
data.result.audio_url	String	TTS synthesized audio MP3 file URL
data.result.srt_url	String	TTS audio subtitle SRT file URL
data.result.phone_url	String	TTS audio phoneme file URL
data.result.duration_ms	Integer	Duration of the TTS synthesized audio MP3 file in ms
data.result.word_times	List	Word-level timestamps of the TTS audio file in ms
data.result.word_times.begin_ms	Integer	Start timestamp of the word in the TTS audio file in ms
data.result.word_times.end_ms	Integer	End timestamp of the word in the TTS audio file in ms
data.result.word_times.text	String	Text content of the word in the TTS audio file

# Response Example

case1: Request Timeout

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Request did not time out and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}

# TTS Status Query (Qid)

# Interface Description

Returns the current status of a specified TTS request based on input parameters.

# Request URL

POST /api/voice/v3/tts/result

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	TTS synthesis result. Populated once processing has completed; otherwise empty {}
data.result.audio_url	String	TTS synthesized audio MP3 file URL
data.result.srt_url	String	TTS audio subtitle SRT file URL
data.result.phone_url	String	TTS audio phoneme file URL
data.result.duration_ms	Integer	Duration of TTS synthesized audio MP3 file, in ms
data.result.word_times	List	Timestamps of TTS audio file at word level, in ms
data.result.word_times.begin_ms	Integer	Start timestamp of TTS audio file word, in ms
data.result.word_times.end_ms	Integer	End timestamp of TTS audio file word, in ms
data.result.word_times.text	String	Text content of TTS audio file word

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}
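When the initial request returns only a `request_id`, the client has to poll this endpoint. A minimal polling sketch follows, under two assumptions read off the examples rather than the status table: `data.status` 1000 means still processing and 0 means completed, with any other value treated as a failure. `fetch` is a caller-supplied function (hypothetical here) that POSTs `{"request_id": ...}` to the status endpoint and returns the parsed JSON body.

```python
import time


def wait_for_result(fetch, request_id, interval_s=1.0, max_attempts=30):
    """Poll until data.status is no longer 1000, then return data.result."""
    for _ in range(max_attempts):
        data = fetch(request_id)["data"]
        if data["status"] == 0:
            return data["result"]
        if data["status"] != 1000:
            raise RuntimeError(
                f"request {request_id} failed with status {data['status']}"
            )
        time.sleep(interval_s)
    raise TimeoutError(
        f"request {request_id} still processing after {max_attempts} polls"
    )


# Stand-in for the network call: first reply is pending, second is complete.
replies = iter([
    {"code": 0, "message": "created",
     "data": {"status": 1000, "request_id": "r1", "result": {}}},
    {"code": 0, "message": "ok",
     "data": {"status": 0, "request_id": "r1",
              "result": {"audio_url": "https://example.invalid/audio1.mp3"}}},
])
result = wait_for_result(lambda _id: next(replies), "r1", interval_s=0)
print(result["audio_url"])
```

Injecting `fetch` keeps the loop independent of any particular HTTP client, and the same loop applies to the other status-query endpoints that share this response shape.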

# TTS Status Query (Legacy)

Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.

# Interface Description

Returns the current status of the specified TTS request based on input parameters.

# Request URL

POST /api/voice/v1/result/tts

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	TTS synthesis result. Populated once processing has completed; otherwise empty {}
data.result.audio_url	String	TTS synthesis audio MP3 file URL
data.result.srt_url	String	TTS audio subtitles SRT file URL
data.result.phone_url	String	TTS audio phoneme file URL
data.result.duration_ms	Integer	Duration of TTS synthesis audio MP3 file, in ms
data.result.word_times	List	Timestamps at word level for TTS audio file, in ms
data.result.word_times.begin_ms	Integer	Start timestamp of TTS audio file word, in ms
data.result.word_times.end_ms	Integer	End timestamp of TTS audio file word, in ms
data.result.word_times.text	String	Text content of TTS audio file word

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}

# Initiate ASR Request

# Interface Description

Analyzes the audio content provided by the user and transcribes it into the corresponding text content.

# Request URL

POST /api/voice/v1/request/asr

# Request Parameters

Field	Type	Required	Description
audio_url	String	True	URL of the audio file for text recognition
timeout	Integer	False	Timeout for waiting for a return, in ms. If returned within the timeout, the ASR result is directly returned, otherwise return request_id
subtitle_max_length	Integer	False	Maximum length of each subtitle line, default is 0, meaning no length limit
subtitle_cut_by_punc	Boolean	False	Whether to split subtitles by punctuation for line breaks, default is false, meaning no split
word_time	Boolean	False	Whether to return word-level timestamps, default is false, meaning not returned

# Request Example

{
    "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
    "timeout": 3000
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	ASR request result. Populated if the result is returned within the timeout; otherwise empty {}
data.result.srt_url	String	ASR audio subtitles SRT file URL
data.result.text	String	ASR recognition result
data.result.word_times	List	Timestamps at word level for ASR audio file, in ms
data.result.word_times.begin_ms	Integer	Start timestamp of ASR audio file word, in ms
data.result.word_times.end_ms	Integer	End timestamp of ASR audio file word, in ms
data.result.word_times.text	String	Text content of ASR audio file word

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR recognition result",
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}

# ASR Status Query

# Interface Description

Returns the current status of the specified ASR request based on input parameters.

# Request URL

POST /api/voice/v1/result/asr

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	ASR request result. Populated once processing has completed; otherwise empty {}
data.result.srt_url	String	ASR audio subtitles SRT file URL
data.result.text	String	ASR recognition result
data.result.word_times	List	Timestamps at word level for ASR audio file, in ms
data.result.word_times.begin_ms	Integer	Start timestamp of ASR audio file word, in ms
data.result.word_times.end_ms	Integer	End timestamp of ASR audio file word, in ms
data.result.word_times.text	String	Text content of ASR audio file word

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR recognition result",
            "word_times": [
                {
                    "begin_ms": 2080,
                    "end_ms": 2560,
                    "text": "字"
                },
                {
                    "begin_ms": 2560,
                    "end_ms": 3040,
                    "text": "级"
                },
                {
                    "begin_ms": 3040,
                    "end_ms": 3520,
                    "text": "别"
                },
                {
                    "begin_ms": 3520,
                    "end_ms": 4000,
                    "text": "时"
                },
                {
                    "begin_ms": 4000,
                    "end_ms": 4480,
                    "text": "间"
                },
                {
                    "begin_ms": 4480,
                    "end_ms": 4960,
                    "text": "戳"
                }
            ]
        }
    }
}

# Initiate Polyphony Request

# Interface Description

Returns whether the text contains polyphonic characters and the selectable pronunciations for the polyphonic characters based on input parameters.

# Request URL

POST /api/voice/v1/request/polyphony

# Request Parameters

Field	Type	Required	Description
query	String	True	Request text
format	String	False	Return format; supports "pinyin" and "bopomofo", default is "pinyin"
timeout	Integer	False	Timeout for waiting for a return, in ms. If returned within the timeout, directly return the polyphony result, otherwise return request_id

# Request Example

{
    "query": "Text content for voice synthesis xxxxxx",
    "timeout": 3000
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object List	Polyphony request result. Populated if the result is returned within the timeout; otherwise an empty list []
data.result.text	String	Polyphonic text
data.result.polyphony	String List	Selectable pronunciations for the polyphonic character, recommended pronunciation first, followed by candidate pronunciations
data.result.polyphony_assist	List of String List	Selectable pronunciations for the polyphonic character, each paired with its Pinyin and Zhuyin annotation; recommended pronunciation first, followed by candidate pronunciations. Returned only when format is "bopomofo"

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": []
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}

case3: Bopomofo format requested, request did not time out and succeeded

{
    "code": 0,
    "data": {
        "request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
        "result": [
            {
                "polyphony": [
                    "lv4",
                    "lu4"
                ],
                "polyphony_assist": [
                    ["lv4", "ㄌㄩˋ"],
                    ["lu4", "ㄌㄨˋ"]                    
                ],
                "text": "绿"
            },
            {
                "polyphony": [
                    "le5",
                    "liao3"
                ],
                "polyphony_assist": [
                    ["le5", "˙ㄌㄜ"],
                    ["liao3", "ㄌㄧㄠˇ"]
                ],
                "text": "了"
            }
        ],
        "status": 0
    },
    "message": "ok"
}
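Since the recommended pronunciation is listed first, a client can pick a default for each polyphonic character directly from the result list. The helper below is a hypothetical sketch; note that if the same character appears more than once in `result`, a plain dictionary keeps only the last entry.

```python
def recommended_pronunciations(result):
    """Map each polyphonic character to its first (recommended) pronunciation."""
    return {
        item["text"]: item["polyphony"][0]
        for item in result
        if item.get("polyphony")
    }


result = [
    {"text": "待", "polyphony": ["dai4", "dai1"]},
    {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
]
defaults = recommended_pronunciations(result)
print(defaults)
```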

# Polyphony Status Query

# Interface Description

Returns the current status of the specified polyphony request based on input parameters.

# Request URL

POST /api/voice/v1/result/polyphony

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object List	Polyphony request result. Populated once processing has completed; otherwise an empty list []
data.result.text	String	Polyphonic text
data.result.polyphony	String List	Selectable pronunciations for the polyphonic character, recommended pronunciation first, followed by candidate pronunciations
data.result.polyphony_assist	List of String List	Selectable pronunciations for the polyphonic character, each paired with its Pinyin and Zhuyin annotation; recommended pronunciation first, followed by candidate pronunciations. Returned only when format is "bopomofo"

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": []
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}

case3: Bopomofo format requested, processing completed

{
    "code": 0,
    "data": {
        "request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
        "result": [
            {
                "polyphony": [
                    "lv4",
                    "lu4"
                ],
                "polyphony_assist": [
                    ["lv4", "ㄌㄩˋ"],
                    ["lu4", "ㄌㄨˋ"]                    
                ],
                "text": "绿"
            },
            {
                "polyphony": [
                    "le5",
                    "liao3"
                ],
                "polyphony_assist": [
                    ["le5", "˙ㄌㄜ"],
                    ["liao3", "ㄌㄧㄠˇ"]
                ],
                "text": "了"
            }
        ],
        "status": 0
    },
    "message": "ok"
}
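
As the field table notes, the first entry of each `polyphony` list is the recommended pronunciation. A minimal Python sketch, assuming the response has already been decoded into a dict shaped like the examples above, could collect those recommendations. The function name `recommended_pronunciations` is hypothetical and not part of the platform API.

```python
def recommended_pronunciations(response: dict) -> dict:
    """Map each polyphonic character to its first (recommended) pronunciation."""
    data = response.get("data", {})
    # Status 0 means processing finished; 1000 means still processing.
    if data.get("status") != 0:
        return {}
    return {
        item["text"]: item["polyphony"][0]
        for item in data.get("result", [])
        if item.get("polyphony")
    }


# Sample data copied from the "Processing completed" response above.
sample = {
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {"text": "待", "polyphony": ["dai4", "dai1"]},
            {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
        ],
    },
}
print(recommended_pronunciations(sample))
```

A response that is still processing yields an empty dict, so the caller can decide whether to query again.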

# Initiate Synthesis Request for Singing Audio

# API Description

Returns a new pure vocal audio and background music (if applicable) based on the input singing audio, singing audio attributes, and a specified singing tone ID.

# Request URL

POST /api/voice/v3/svc/request

# Request Parameters

Field	Type	Required	Description
audio_url	String	True	URL of the original singing audio
sid	String	True	Singing tone ID
with_bgm	Boolean	False	Indicates whether the audio file contains background music. If it includes background music, source separation will be performed, and the dry vocal and background music will be returned separately
pitch	Integer	False	Pitch adjustment. Higher values make the voice sharper; lower values make it deeper. Supported range is [-12, 12]; default is 0
timeout	Integer	False	Time in milliseconds to synchronously wait for task completion. If the task completes within the time limit, it directly returns the result; otherwise, it returns the request_id

# Request Sample

{
    "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
    "sid": "sid1",
    "with_bgm": true,
    "pitch": 0,
    "timeout": 0
}
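
The parameter constraints above can be checked on the client before sending the request. The following is a sketch only: `build_svc_request` is a hypothetical helper, and it validates just what the parameter table states (the two required fields and the pitch range). Sending the body, along with the token header, is left to the caller.

```python
import json


def build_svc_request(audio_url: str, sid: str, with_bgm: bool = False,
                      pitch: int = 0, timeout_ms: int = 0) -> str:
    """Return a JSON body for POST /api/voice/v3/svc/request."""
    if not audio_url or not sid:
        raise ValueError("audio_url and sid are required")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be within [-12, 12]")
    body = {
        "audio_url": audio_url,
        "sid": sid,
        "with_bgm": with_bgm,
        "pitch": pitch,
        "timeout": timeout_ms,  # milliseconds to wait synchronously
    }
    return json.dumps(body)


svc_body = build_svc_request(
    "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3", "sid1", with_bgm=True
)
print(svc_body)
```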

# Response Elements

Field	Type	Description
code	Integer	Response code, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	SVC request result. If the task completes within the timeout period, this field is populated; otherwise, it is empty {}
data.result.vocal_track_url	String	URL of the SVC result audio file
data.result.original_instrumental_track_url	String	URL of the separated instrumental track, returned only if with_bgm is true
data.result.original_vocal_track_url	String	URL of the separated original vocal track, returned only if with_bgm is true
data.result.original_reverb_url	String	URL of the separated reverb track, returned only if with_bgm is true

# Response Sample

case 1: Request Timed Out

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case 2: Request Completed Within Timeout

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
            "original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
            "original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
        }
    }
}

# Query Synthesis Request Status for Singing Audio

# API Description

Returns the current status of the specified synthesis request for singing audio based on the input parameters.

# Request URL

POST /api/voice/v3/svc/result

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Sample

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Response code, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	Object	SVC request result. Populated once processing has completed; otherwise, it is empty {}
data.result.vocal_track_url	String	URL of the SVC result audio file
data.result.original_instrumental_track_url	String	URL of the separated instrumental track, returned only if with_bgm is true
data.result.original_vocal_track_url	String	URL of the separated original vocal track, returned only if with_bgm is true
data.result.original_reverb_url	String	URL of the separated reverb track, returned only if with_bgm is true

# Response Sample

case 1: Still Processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case 2: Processing Complete

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
            "original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
            "original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
        }
    }
}
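
When the initial request returns status 1000, the caller typically polls this query endpoint until the task finishes. The loop below is one possible sketch of that pattern. The `fetch` callable is a hypothetical stand-in for the real HTTP call (a POST to /api/voice/v3/svc/result with the token header that returns the decoded JSON), and the polling interval and attempt limit are arbitrary assumptions rather than platform recommendations.

```python
import time
from typing import Callable

STATUS_DONE = 0
STATUS_PROCESSING = 1000


def wait_for_svc_result(request_id: str,
                        fetch: Callable[[str], dict],
                        interval_s: float = 1.0,
                        max_attempts: int = 30) -> dict:
    """Poll until the task leaves the processing state, then return data.result."""
    for _ in range(max_attempts):
        data = fetch(request_id)["data"]
        status = data["status"]
        if status == STATUS_DONE:
            return data["result"]
        if status != STATUS_PROCESSING:
            raise RuntimeError(f"SVC request {request_id} failed with status {status}")
        time.sleep(interval_s)
    raise TimeoutError(f"SVC request {request_id} still processing after {max_attempts} polls")


# Stand-in for the HTTP call: the first reply is "processing", the second is "done".
_replies = iter([
    {"code": 0, "message": "created",
     "data": {"status": 1000, "request_id": "r1", "result": {}}},
    {"code": 0, "message": "ok",
     "data": {"status": 0, "request_id": "r1",
              "result": {"vocal_track_url": "https://example.com/out.mp3"}}},
])
result = wait_for_svc_result("r1", lambda _id: next(_replies), interval_s=0.0)
print(result["vocal_track_url"])
```

Injecting `fetch` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned replies.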

# Initiate Translation Request (Not Supported Yet)

# Interface Description

Returns the translated text content based on the input text and the specified target language.

# Request URL

POST /api/voice/v1/request/translate

# Request Parameters

Field	Type	Required	Description
query	String	True	Text to be translated, must be within 200,000 characters
to	String	True	Target language code for the output text. For the supported languages, see the Language List
timeout	Integer	False	Timeout for waiting for a return, in ms. If returned within the timeout, directly return the translation result, otherwise return request_id

# Request Example

{
    "query": "Text content to be translated xxxxxx",
    "to": "en",
    "timeout": 3000
}

# Response Elements

Field	Type	Description
code	Integer	Code number, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	String	Translation request result, if returned within timeout, this field is returned, otherwise this field is empty{}

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": []
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": "The text to translate"
    }
}

# Translation Status Query (Not Supported Yet)

# Interface Description

Returns the current status of the specified translation request based on input parameters.

# Request URL

POST /api/voice/v1/result/translate

# Request Parameters

Field	Type	Required	Description
request_id	String	True	Request ID

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

Field	Type	Description
code	Integer	Code, see status table
message	String	Status description
data	Object	Result data
data.status	Integer	Status, see status table
data.request_id	String	Request ID
data.result	String	Translation request result, if returned within the timeout, this field is returned, otherwise this field is empty {}

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": ""
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": "The text to translate"
    }
}

# Status Table

# TTS Language Detection Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Successfully processed
3201	Rejected: Required parameter missing
3202	Rejected: Illegal parameter--Request text is empty
3301	Request failed: Language not detected

# TTS Language Validation Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Successfully processed
1201	Rejected: Required parameter missing
1202	Rejected: Illegal parameter--Voice ID is empty
1203	Rejected: Illegal parameter--Request text is empty
1204	Rejected: Illegal parameter--Language is empty
1205	Rejected: Illegal parameter--Voice ID does not exist
1206	Rejected: Illegal parameter--Unknown language
1207	Rejected: Illegal parameter--Unknown supplier
1301	Request failed: Language not detected

# TTS Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Successfully processed
1000	Processing
1002	Rejected: Required parameter missing
1003	Rejected: Illegal parameter--Voice is empty
1004	Rejected: Illegal parameter--Text to synthesize is empty
1005	Rejected: Illegal parameter--Voice does not exist
1102	Request failed, not timed out: Service not connected
1103	Request failed, not timed out: Busy
1104	Request failed, not timed out: Internal error
1105	Request failed, not timed out: Request timed out

# TTS Request Query Status

code	Description
0	Success
x	Failure

status	Description
0	Success
1000	Processing
1102	Failed: Service not connected
1103	Failed: Busy
1104	Failed: Internal error
1105	Failed: Request timed out
1106	Failed: Unknown request ID

# ASR Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Request successful
1000	Request timed out, a new request created
2001	Rejected: Required parameter missing
2002	Rejected: Illegal parameter--Audio URL is empty
2102	Request failed, not timed out: Service not connected
2103	Request failed, not timed out: Busy
2104	Request failed, not timed out: Internal error
2105	Request failed, not timed out: Request timed out
2106	Request failed, not timed out: Audio download error

# ASR Request Query Status

code	Description
0	Success
x	Failure

status	Description
0	Success
1000	Processing
2102	Failed: Service not connected
2103	Failed: Busy
2104	Failed: Internal error
2105	Failed: Request timed out
2106	Failed: Audio download error
2107	Failed: Unknown request ID

# Polyphone Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Successfully processed
1000	Processing
5002	Rejected: Required parameter missing
5004	Rejected: Illegal parameter--Text to synthesize is empty
5102	Request failed, not timed out: Service not connected
5103	Request failed, not timed out: Busy
5104	Request failed, not timed out: Internal error
5105	Request failed, not timed out: Request timed out

# Polyphone Request Query Status

code	Description
0	Success
x	Failure

status	Description
0	Success
1000	Processing
5102	Failed: Service not connected
5103	Failed: Busy
5104	Failed: Internal error
5105	Failed: Request timed out
5106	Failed: Unknown request ID

# Translation Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Request successful
1000	Processing
3401	Rejected: Required parameter missing
3402	Rejected: Illegal parameter--Request text is empty
3403	Rejected: Illegal parameter--Output language is empty
3404	Rejected: Illegal parameter--Translation text exceeds 200,000 characters
3501	Request failed, internal error
3502	Request failed, request timed out

# Voice Conversion Request Response Status

code	Description
0	Success
x	Failure

status	Description
0	Successfully processed
1000	Processing
1401	Rejected: Required parameter missing
1402	Rejected: Illegal parameter--Audio URL is empty
1403	Rejected: Illegal parameter--vc_id does not exist
1404	Rejected: Illegal parameter--Audio duration exceeds 10 minutes
1502	Request failed: VC service not connected
1503	Request failed: VC service internal error
1504	Request failed: Request timed out
1505	Request failed: Audio download error
1506	Request failed: Incorrect audio format

# Voice Conversion Request Query Status

code	Description
0	Success
x	Failure

status	Description
0	Success
1000	Processing
1404	Rejected: Illegal parameter--Audio duration exceeds 10 minutes
1502	Failed: VC service not connected
1503	Failed: VC service internal error
1504	Failed: Request timed out
1505	Failed: Audio download error
1506	Failed: Incorrect audio format
1507	Failed: Unknown request ID
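
A client can turn a status table like the one above into a lookup. The sketch below copies the Voice Conversion query statuses into a dict and reduces them to three states; the names `VC_QUERY_STATUS` and `describe_vc_status` and the state labels are illustrative assumptions, not identifiers defined by the platform.

```python
from typing import Tuple

# Copied from the Voice Conversion Request Query Status table.
VC_QUERY_STATUS = {
    0: "Success",
    1000: "Processing",
    1404: "Rejected: Illegal parameter--Audio duration exceeds 10 minutes",
    1502: "Failed: VC service not connected",
    1503: "Failed: VC service internal error",
    1504: "Failed: Request timed out",
    1505: "Failed: Audio download error",
    1506: "Failed: Incorrect audio format",
    1507: "Failed: Unknown request ID",
}


def describe_vc_status(status: int) -> Tuple[str, str]:
    """Return (state, description); state is 'done', 'pending', or 'error'."""
    description = VC_QUERY_STATUS.get(status, "Unknown status")
    if status == 0:
        return "done", description
    if status == 1000:
        return "pending", description
    return "error", description


print(describe_vc_status(1505))
```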

# Translation Request Query Status

code	Description
0	Success
x	Failure

status	Description
0	Success
1000	Processing
3501	Failed: Internal error
3502	Failed: Request timed out
3503	Failed: Unknown request ID

# USSML Grammar Explanation

# Introduction

USSML (Unified Speech Synthesis Markup Language) aims to provide a unified SSML (Speech Synthesis Markup Language) syntax format. USSML supports the most commonly used SSML tags: pauses, specifying pronunciation, replacing synthesized text, and specifying reading methods, which can meet most of the voice synthesis needs.

# How to Use

# Syntax Format

The USSML syntax format is as follows:

<speak sttts:version="0.1">
    <break time="string" />
    <phoneme ph="string"></phoneme>
    <sub alias="string"></sub>
    <say-as interpret-as="string"></say-as>
</speak>

Except for the <speak> tag, which is used to wrap other child tags, the remaining tags cannot be nested.

# Special Characters

In USSML, the following special characters must be escaped when they appear in text content, as shown in the table below. Characters that form the USSML markup itself need no escaping.

Special Character	Escape Character
&	`&amp;`
<	`&lt;`
>	`&gt;`
"	`&quot;`
'	`&apos;`
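
Escaping can be done programmatically before text is placed inside a USSML tag. The sketch below uses Python's standard `xml.sax.saxutils.escape` and assumes that USSML expects the standard XML entities for these five characters. The helper name `escape_ussml_text` is hypothetical.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are passed as extra entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ussml_text(text: str) -> str:
    """Escape the five special characters in plain text destined for USSML."""
    return escape(text, _QUOTE_ENTITIES)


print(escape_ussml_text('Tom & Jerry say "1 < 2"'))
```

Only the text content should be passed through this helper; escaping an already assembled USSML string would also escape the tags themselves.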

# Tag Explanation

# `<speak>`

The <speak> tag is the root tag of USSML, used to wrap all USSML tags.

The attribute sttts:version is used to specify the version number of USSML, which is currently 0.1.

<speak sttts:version="0.1">
    <!-- USSML Tags -->
</speak>

# `<break>`

The <break> tag specifies a pause, and its time attribute sets the duration of the pause. The value of the time attribute is a string expressed in seconds (s) or milliseconds (ms). The maximum pause duration is 5 seconds. A bare number is not accepted; the unit is required. For example:

<break time="5s" />

<break time="5000ms" />
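
The constraints on the time attribute can be enforced when the tag is generated. This is a minimal sketch with a hypothetical helper `break_tag`. It always emits milliseconds so that the unit is never omitted, and it treats zero or negative durations as invalid, which is an assumption the text above does not state.

```python
MAX_BREAK_MS = 5000  # the maximum pause is 5 seconds


def break_tag(duration_ms: int) -> str:
    """Build a <break> tag with an explicit millisecond unit."""
    # Rejecting non-positive values is an assumption, not a documented rule.
    if not 0 < duration_ms <= MAX_BREAK_MS:
        raise ValueError(f"pause must be between 1 and {MAX_BREAK_MS} ms")
    return f'<break time="{duration_ms}ms" />'


print(break_tag(500))
```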

# `<phoneme>`

The <phoneme> tag specifies pronunciation, and its ph attribute holds the pronunciation itself. Since different providers support different language ranges, <phoneme> in USSML currently supports only Chinese Pinyin. Pinyin usage: separate the pinyin of each character with a space; the number of pinyin syllables must equal the number of characters. Each syllable consists of a pronunciation and a tone, where the tone is a number from 1 to 5, with "5" representing the neutral tone. For example:

<phoneme ph="mai2 mo4">埋没</phoneme>
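
The pinyin rules above (one syllable per character, tone digit 1 to 5) can be checked before the tag is built. The sketch below is illustrative: `phoneme_tag` is a hypothetical helper, and the pattern assumes that each syllable is written in lowercase ASCII letters followed by the tone digit.

```python
import re

# Assumption: a syllable is lowercase ASCII letters plus a tone digit 1-5.
_PINYIN = re.compile(r"^[a-z]+[1-5]$")


def phoneme_tag(text: str, pinyin: list) -> str:
    """Build a <phoneme> tag after checking the pinyin against the text."""
    if len(pinyin) != len(text):
        raise ValueError("the number of pinyin syllables must equal the number of characters")
    for syllable in pinyin:
        if not _PINYIN.match(syllable):
            raise ValueError(f"invalid pinyin syllable: {syllable!r}")
    ph = " ".join(pinyin)
    return f'<phoneme ph="{ph}">{text}</phoneme>'


print(phoneme_tag("埋没", ["mai2", "mo4"]))
```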

# Example

<speak sttts:version="0.1">
    You say <phoneme ph="bo2">薄</phoneme>.
    <break time="500ms" />
    I say <phoneme ph="bao2">薄</phoneme>.
</speak>

# `<sub>`

The <sub> tag substitutes the text that is read aloud during synthesis. Its 'alias' attribute specifies the replacement text: during synthesis, the text in the 'alias' attribute is read in place of the original text, while any subtitles show the original text inside the tag.

The content of the <sub> tag and the text of the 'alias' attribute must not be empty.

For example:

<sub alias="World Wide Web Consortium">W3C</sub>

In the example above, the subtitle displays as: W3C, and the reading content is: World Wide Web Consortium.
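
To illustrate the distinction between subtitle text and spoken text, the sketch below derives both from a string that contains <sub> tags. It is a client-side illustration using a simple regular expression, not a description of how the platform processes USSML; the helper name `subtitle_and_spoken` is hypothetical, and other tags are left untouched.

```python
import re

# Captures the alias (group 1) and the original text (group 2) of each <sub> tag.
_SUB = re.compile(r'<sub alias="([^"]*)">(.*?)</sub>', re.DOTALL)


def subtitle_and_spoken(ussml: str):
    """Return (subtitle_text, spoken_text) for a string containing <sub> tags."""
    subtitle = _SUB.sub(lambda m: m.group(2), ussml)  # keep the original text
    spoken = _SUB.sub(lambda m: m.group(1), ussml)    # read the alias instead
    return subtitle, spoken


subtitle, spoken = subtitle_and_spoken(
    'The <sub alias="World Wide Web Consortium">W3C</sub> sets web standards.'
)
print(subtitle)
print(spoken)
```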

# `<say-as>`

The <say-as> tag allows you to synthesize the content of the tag using a specific reading method. The 'interpret-as' attribute of this tag is used to specify the reading method. Different providers may produce slightly different results for the same reading method. The interpret-as attribute supports multiple values, including:

Value	Description	Example
cardinal	Pronounced as a numerical value	“1487” is read as “one thousand four hundred eighty-seven”
digit	Pronounced as a series of digits	“12345” is read as “one two three four five”
phone	Pronounced as a telephone number	“1301001155” is read as “one three zero one zero zero one one five five”
address	Pronounced as an address	“市台路388-301号” is read as “Shi Tai Road three eight eight dash three zero one number”
date	Pronounced as a date	“1998-12-12” is read as “nineteen ninety-eight December twelfth”
clock	Pronounced as a time	“12:00:12” is read as “twelve o'clock twelve seconds”

# Example

<say-as interpret-as="cardinal">12345</say-as>

Example 1:

<speak sttts:version="0.1">
    You say <phoneme ph="bo2">薄</phoneme>.
    <break time="500ms" />
    I say <phoneme ph="bao2">薄</phoneme>.
</speak>

Example 2:

<speak sttts:version="0.1">
    <sub alias="World Wide Web Consortium">W3C</sub>
    is an international standardization organization.
</speak>

Example 3:

<speak sttts:version="0.1">
    <sub alias="TsingTao Beer">青岛啤酒</sub>
    in Henan dialect, it is,
    <phoneme ph="qing2 dao1 pi4 jiu1">青岛啤酒</phoneme>.
    <say-as interpret-as="cardinal">12345</say-as>
</speak>

The above content relates to the platform's voice processing capabilities.
