# Speech Processing

Beyond integrating multiple elements in video synthesis, the platform provides speech processing capabilities that let users preprocess the speech-related content of a video. These include TTS, ASR, NLP, polyphone processing, and timbre transfer. The platform also provides USSML (Unified Speech Synthesis Markup Language), which normalizes the SSML (Speech Synthesis Markup Language) syntax of the various TTS providers.

# Capability Introduction

  • TTS (Text-to-Speech)
    • Reads the text passed in by the user aloud in the voice of the selected speaker, with adjustable pitch, speed, and volume. Note that speakers of all languages can synthesize English text; each speaker can synthesize text in its own language; and speakers of Chinese dialects such as Cantonese and Shanghainese can synthesize Chinese text.
  • ASR (Automatic Speech Recognition)
    • Analyzes the audio content provided by the user and transcribes it into corresponding text content.
  • NLP (Natural Language Processing)
    • Analyzes the semantics of the text provided by the user, then interprets and expands it to return the text the user expects.
  • Polyphone Processing
    • Searches the text provided by the user for polyphones and returns the polyphones found together with their pronunciation options.
  • Timbre Transfer
    • Converts the original audio provided by the user to the specified timbre and returns the synthesized audio.

# API Description

To call any API service of the platform, send requests to the service entry point aigc.softsugar.com and include the token information in the request header.
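
As a minimal sketch of this call pattern, the Python helper below builds (but does not send) a JSON POST request against the entry point using only the standard library. The `https` scheme, the header name `Authorization`, and the helper name `build_request` are assumptions made for illustration; the text only states that token information goes in the request header, so the actual header name should be taken from the platform's authentication documentation.

```python
import json
import urllib.request

BASE_URL = "https://aigc.softsugar.com"  # entry point from the text; the https scheme is assumed
TOKEN_HEADER = "Authorization"           # hypothetical header name, not confirmed by the source


def build_request(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an API path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json", TOKEN_HEADER: token},
        method="POST",
    )


req = build_request("/api/voice/v1/nlp/language", {"query": "hello"}, "my-token")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request; that step is left out so the sketch runs without network access.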

# TTS Language Detection

# Interface Description

Identifies the language of the text content provided by the user and returns the corresponding language code.

# Request URL

POST /api/voice/v1/nlp/language

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text content |

# Request Example

{
    "query": "Text content to be synthesized into speechxxxxxx"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | Language code, refer to ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |

# Response Example

case1: Request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": "zh"
    }
}
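
A small sketch of how a client might read this response. It assumes that `code` 0 together with `data.status` 0 signals success, which is inferred from the example above rather than from the status table; the function name `extract_language` is illustrative.

```python
def extract_language(response: dict) -> str:
    """Return the detected language code, or raise if the call did not succeed."""
    data = response.get("data") or {}
    # Assumption: code 0 and status 0 mean success (taken from the example response).
    if response.get("code") != 0 or data.get("status") != 0:
        raise ValueError(f"language detection failed: {response.get('message')}")
    return data["result"]


sample = {"code": 0, "message": "ok", "data": {"status": 0, "result": "zh"}}
print(extract_language(sample))  # prints "zh"
```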

# TTS Language Validation (Qid)

# Interface Description

Determines whether the text content provided by the user matches the voice identified by the Qid, and returns the validation result together with the detected language code.

# Request URL

POST /api/voice/v3/tts/validate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text content |
| qid | String | True | Voice Qid |
| ssml | Boolean | False | Whether to use SSML; default is false |

# Request Example

{
    "query": "Text content to be synthesized into speechxxxxxx",
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | Validation result |
| data.result.valid | Boolean | Whether it matches |
| data.result.language | String | Detected language code, refer to ISO 639-1. Currently supports detection of: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |

# Response Example

case1: Request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": {
            "valid": true,
            "language": "zh"
        }
    }
}
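
The following sketch shows one way to act on the validation response before sending a synthesis request. The function name `validation_outcome` and the printed messages are illustrative only.

```python
from typing import Tuple


def validation_outcome(response: dict) -> Tuple[bool, str]:
    """Return (valid, detected_language) from a validate response."""
    result = response["data"]["result"]
    return bool(result["valid"]), result["language"]


sample = {
    "code": 0,
    "message": "ok",
    "data": {"status": 0, "result": {"valid": True, "language": "zh"}},
}
valid, language = validation_outcome(sample)
if valid:
    print(f"ok, detected language: {language}")
else:
    print(f"text (detected as {language}) does not match the selected Qid")
```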

# Voice ID Migration to QID

# Interface Description

Migrates the old timbre identified by an existing Voice ID to the new timbre identified by a Qid and returns the corresponding Qid value. It is recommended to store the Qid and reuse it for subsequent inference requests.

# Request URL

POST /api/voice/v3/tts/migrate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| voice_id | String | True | The voice ID to be converted |

# Request Example

{
  "voice_id": "xiaoning"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | The pre-configured Qid. This value is long; storing it as VARCHAR(512) is recommended |

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"
  }
}
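
Since the Qid should be stored after migration, here is a minimal sketch of persisting it with SQLite from the Python standard library. The table name `voice_qid`, the helper names, and the in-memory database are assumptions for illustration; only the VARCHAR(512) column type comes from the recommendation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, used only for this sketch
conn.execute(
    "CREATE TABLE voice_qid (voice_id TEXT PRIMARY KEY, qid VARCHAR(512) NOT NULL)"
)


def save_qid(voice_id: str, migrate_response: dict) -> str:
    """Persist the Qid returned by the migrate endpoint and return it."""
    qid = migrate_response["data"]["result"]
    conn.execute(
        "INSERT OR REPLACE INTO voice_qid (voice_id, qid) VALUES (?, ?)",
        (voice_id, qid),
    )
    conn.commit()
    return qid


def load_qid(voice_id: str):
    """Return the stored Qid for a voice ID, or None if none was saved."""
    row = conn.execute(
        "SELECT qid FROM voice_qid WHERE voice_id = ?", (voice_id,)
    ).fetchone()
    return row[0] if row else None


response = {
    "code": 0,
    "message": "ok",
    "data": {"status": 0, "result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"},
}
save_qid("xiaoning", response)
print(load_qid("xiaoning"))
```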

# QID Details Interface

# Interface Description

Obtains detailed information about the QID, including the parameters that the timbre supports for adjustment.

# Request URL

GET /api/voice/v3/tts/qid/{qid}

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| qid | String | True | The QID whose details are requested, passed in the URL path |

# Request Example

  https://domain/api/voice/v3/tts/qid/mwvA2f:AEBEvMy850Y_Z10Mqp9GUwTMr8xMyI3Tzk3Q

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | QID details |
| data.result.pitch | Boolean | Whether pitch adjustment is supported |
| data.result.speed | Boolean | Whether speed adjustment is supported |
| data.result.volume | Boolean | Whether volume adjustment is supported |
| data.result.phone | Boolean | Whether returning phonemes is supported |
| data.result.subtitle | Boolean | Whether returning subtitles, sentence-level timestamps, and word-level timestamps is supported |
| data.result.ussml | Object | USSML syntax support |
| data.result.ussml.break | Boolean | Whether the <break> tag in USSML is effective |
| data.result.ussml.phoneme | String List | Pronunciation tags that are effective in USSML; possible tags are <pinyin> and <ipa> |
| data.result.ussml.sub | Boolean | Whether text substitution in USSML is effective, supports the <sub> tag |
| data.result.ussml.sayas | String List | Reading-method tags that are effective in USSML; possible tags are <cardinal>, <digit>, <phone>, <address>, <date>, and <clock> |
| data.result.languages | String List | Supported language list |

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": {
      "pitch": true,
      "speed": true,
      "volume": true,
      "phone": true,
      "subtitle": true,
      "ussml": {
        "break": true,
        "phoneme": ["pinyin"],
        "sub": true,
        "sayas": ["cardinal", "digit", "phone", "address", "date", "clock"]
      },
      "languages": [
        "zh-CN",
        "en-US"
      ]
    }
  }
}
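
One possible use of these details is to drop request fields that the voice does not support before calling the TTS endpoint. The sketch below does that; the mapping from capability flags to TTS request fields is an assumption based on matching names, and `drop_unsupported` is a hypothetical helper.

```python
# Assumed mapping from capability flags in the QID details to TTS request fields.
CAPABILITY_TO_PARAM = {
    "pitch": "pitch_offset",
    "speed": "speed_ratio",
    "volume": "volume",
    "phone": "phoneme",
}


def drop_unsupported(params: dict, details: dict) -> dict:
    """Return a copy of params without the fields the voice does not support."""
    unsupported = {
        param
        for capability, param in CAPABILITY_TO_PARAM.items()
        if not details.get(capability, False)
    }
    return {key: value for key, value in params.items() if key not in unsupported}


details = {"pitch": True, "speed": False, "volume": True, "phone": True}
params = {"query": "hello", "pitch_offset": 2.0, "speed_ratio": 1.2, "volume": 100}
print(drop_unsupported(params, details))
```

Here `speed_ratio` is removed because the example details report `speed` as false.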

# Initiate a TTS Request (Qid)

# Interface Description

Reads the input text aloud in the voice of the selected speaker. The pitch, speed, and volume of the voice can be adjusted.

# Request URL

POST /api/voice/v3/tts/request

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| qid | String | True | Speaker's Qid |
| query | String | True | Text content to be synthesized into speech |
| ssml | Boolean | False | Whether to use USSML |
| phoneme | Boolean | False | Whether to return the URL of the phoneme file |
| timeout | Integer | False | Timeout in ms. If synthesis finishes within the timeout, the TTS result is returned directly; otherwise, a request_id is returned |
| pitch_offset | Float | False | Pitch. Higher values make the voice sharper, lower values make it deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech speed. Higher values make the speech slower. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume. Higher values make the sound louder. Supported range [1, 400]; default 100 |
| subtitle_max_length | Integer | False | Maximum length of each subtitle line; default 0, i.e., no length limit |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles into lines at punctuation; default false, i.e., no split |
| word_time | Boolean | False | Whether to return word-level timestamps; default false, i.e., not returned |

# Request Example

{
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK",
    "query": "Text content to be synthesized into speechxxxxxx",
    "phoneme": true,
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}
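
The documented ranges can be checked on the client before the request is sent. The sketch below builds the request body as a dict and rejects out-of-range values; `build_tts_payload` is a hypothetical helper, and the Qid in the usage line is a placeholder.

```python
from typing import Optional


def build_tts_payload(
    qid: str,
    query: str,
    pitch_offset: float = 0.0,
    speed_ratio: float = 1.0,
    volume: int = 100,
    timeout: Optional[int] = None,
) -> dict:
    """Validate the documented ranges and return the JSON body as a dict."""
    if not -60 <= pitch_offset <= 60:
        raise ValueError("pitch_offset must be within [-60, 60]")
    if not 0.5 <= speed_ratio <= 2:
        raise ValueError("speed_ratio must be within [0.5, 2]")
    if not 1 <= volume <= 400:
        raise ValueError("volume must be within [1, 400]")
    payload = {
        "qid": qid,
        "query": query,
        "pitch_offset": pitch_offset,
        "speed_ratio": speed_ratio,
        "volume": volume,
    }
    if timeout is not None:
        payload["timeout"] = timeout  # in ms
    return payload


payload = build_tts_payload("example-qid", "Hello", speed_ratio=1.2, timeout=3000)
print(payload)
```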

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if synthesis finishes within the timeout; otherwise empty {} |
| data.result.audio_url | String | URL of the TTS synthesized audio MP3 file |
| data.result.srt_url | String | URL of the TTS audio subtitle SRT file |
| data.result.phone_url | String | URL of the TTS audio phoneme file |
| data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file, in ms |
| data.result.word_times | List | Word-level timestamps of the TTS audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the TTS audio file |

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": {}
    }
}

case2: Request did not time out and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}
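
The two cases can be told apart by `data.status`. The sketch below assumes, based only on the examples in this section, that status 1000 means the request is still being processed and status 0 means the result is ready; the full list of values lives in the status table. `interpret_tts_response` is an illustrative name.

```python
PENDING_STATUS = 1000  # value seen in the case1 example; assumed to mean "still processing"
DONE_STATUS = 0        # value seen in the case2 example; assumed to mean "finished"


def interpret_tts_response(response: dict):
    """Return ("done", audio_url) or ("pending", request_id) for a TTS response."""
    data = response["data"]
    if data["status"] == DONE_STATUS:
        return "done", data["result"]["audio_url"]
    if data["status"] == PENDING_STATUS:
        return "pending", data["request_id"]
    raise RuntimeError(f"unexpected status {data['status']}: {response.get('message')}")


pending = {
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {},
    },
}
print(interpret_tts_response(pending))
```

A "pending" outcome means the `request_id` should be passed to the TTS status query interface later.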

# Initiate TTS Request (Legacy)

Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.

# Interface Description

Reads the input text aloud in the voice of the voice actor chosen by the user. The pitch, speed, and volume of the reading voice can be adjusted. Note that voice actors of all languages can synthesize English text; each voice actor can synthesize text in its own language; and dialect voice actors (Cantonese, Shanghainese, etc.) can synthesize Chinese text.

# Request URL

POST /api/voice/v1/request/tts

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| voice_id | String | True | Voice actor ID |
| language | String | False | Language |
| query | String | True | Text content for voice synthesis |
| ssml | Boolean | False | Whether to use SSML |
| phoneme | Boolean | False | Whether to return the URL of the phoneme file |
| timeout | Integer | False | Timeout in ms. If synthesis finishes within the timeout, the TTS result is returned directly; otherwise, a request_id is returned |
| pitch_offset | Float | False | Pitch. Higher values make the voice sharper, lower values make it deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech rate. Higher values make the speech slower. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume. Higher values make the sound louder. Supported range [1, 400]; default 100 |
| subtitle_max_length | Integer | False | Maximum length of each subtitle line; default 0, i.e., no limit |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles into lines at punctuation; default false, i.e., no splitting |
| word_time | Boolean | False | Whether to return word-level timestamps; default false, i.e., not returned |

# Request Example

{
    "voice_id": "xiaoling",
    "query": "Text content to be synthesized xxxxxx",
    "phoneme": true,
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if synthesis finishes within the timeout; otherwise empty {} |
| data.result.audio_url | String | URL of the TTS synthesized audio MP3 file |
| data.result.srt_url | String | URL of the TTS audio subtitle SRT file |
| data.result.phone_url | String | URL of the TTS audio phoneme file |
| data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file, in ms |
| data.result.word_times | List | Word-level timestamps of the TTS audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the TTS audio file |

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": {}
    }
}

case2: Request did not time out and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}

# TTS Status Query (Qid)

# Interface Description

Returns the current status of a specified TTS request based on input parameters.

# Request URL

POST /api/voice/v3/tts/result

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated once processing has completed; otherwise empty {} |
| data.result.audio_url | String | URL of the TTS synthesized audio MP3 file |
| data.result.srt_url | String | URL of the TTS audio subtitle SRT file |
| data.result.phone_url | String | URL of the TTS audio phoneme file |
| data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file, in ms |
| data.result.word_times | List | Word-level timestamps of the TTS audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the TTS audio file |

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}
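
A typical client polls this interface until the result is ready. The sketch below shows one such loop. It assumes status 1000 means "still processing" and status 0 means "completed", as in the examples above; the polling interval and attempt limit are arbitrary choices, and the HTTP call is injected as a function (`query_result`, a hypothetical name) so the logic can be run without network access.

```python
import time
from typing import Callable


def wait_for_tts(
    query_result: Callable[[str], dict],
    request_id: str,
    interval_s: float = 1.0,
    max_attempts: int = 30,
) -> dict:
    """Poll the status query until the TTS result is ready, then return data.result."""
    for _ in range(max_attempts):
        data = query_result(request_id)["data"]
        if data["status"] == 0:       # assumed: processing completed
            return data["result"]
        if data["status"] != 1000:    # assumed: anything other than "still processing" is an error
            raise RuntimeError(f"TTS request failed with status {data['status']}")
        time.sleep(interval_s)
    raise TimeoutError(f"request {request_id} not finished after {max_attempts} attempts")


# Stand-in for the real HTTP call: first reply is pending, second is complete.
replies = iter([
    {"code": 0, "message": "created",
     "data": {"status": 1000, "request_id": "r1", "result": {}}},
    {"code": 0, "message": "ok",
     "data": {"status": 0, "request_id": "r1",
              "result": {"audio_url": "https://example.invalid/audio1.mp3"}}},
])
result = wait_for_tts(lambda _request_id: next(replies), "r1", interval_s=0)
print(result["audio_url"])
```

In a real client, `query_result` would POST `{"request_id": ...}` to `/api/voice/v3/tts/result` and return the decoded JSON.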

# TTS Status Query (Legacy)

Note: This interface is in maintenance mode and will not receive new features. It is recommended to use the Qid interface instead.

# Interface Description

Returns the current status of the specified TTS request based on input parameters.

# Request URL

POST /api/voice/v1/result/tts

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated once processing has completed; otherwise empty {} |
| data.result.audio_url | String | URL of the TTS synthesized audio MP3 file |
| data.result.srt_url | String | URL of the TTS audio subtitle SRT file |
| data.result.phone_url | String | URL of the TTS audio phoneme file |
| data.result.duration_ms | Integer | Duration of the TTS synthesized audio MP3 file, in ms |
| data.result.word_times | List | Word-level timestamps of the TTS audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the TTS audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the TTS audio file |

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}

# Initiate ASR Request

# Interface Description

Analyzes the audio content provided by the user and transcribes it into the corresponding text content.

# Request URL

POST /api/voice/v1/request/asr

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio_url | String | True | URL of the audio file to be transcribed |
| timeout | Integer | False | Timeout in ms. If recognition finishes within the timeout, the ASR result is returned directly; otherwise, a request_id is returned |
| subtitle_max_length | Integer | False | Maximum length of each subtitle line; default 0, i.e., no length limit |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles into lines at punctuation; default false, i.e., no split |
| word_time | Boolean | False | Whether to return word-level timestamps; default false, i.e., not returned |

# Request Example

{
    "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
    "timeout": 3000
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR request result. Populated if recognition finishes within the timeout; otherwise empty {} |
| data.result.srt_url | String | URL of the ASR audio subtitle SRT file |
| data.result.text | String | ASR recognition result |
| data.result.word_times | List | Word-level timestamps of the ASR audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the ASR audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the ASR audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the ASR audio file |

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR recognition result",
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}
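
The word-level timestamps can be turned into subtitle cues on the client. The sketch below formats millisecond offsets in the `HH:MM:SS,mmm` form used by SRT files and merges a run of words into a single cue. The helper names are illustrative, and joining words without a separator is an assumption that suits Chinese text; space-delimited languages would need a different separator.

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def cue_from_words(index: int, words: list) -> str:
    """Merge consecutive word_times entries into one SRT cue."""
    text = "".join(word["text"] for word in words)  # no separator: assumes Chinese text
    start = srt_timestamp(words[0]["begin_ms"])
    end = srt_timestamp(words[-1]["end_ms"])
    return f"{index}\n{start} --> {end}\n{text}\n"


word_times = [
    {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
    {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
    {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
    {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
    {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
    {"begin_ms": 4480, "end_ms": 4960, "text": "戳"},
]
print(cue_from_words(1, word_times))
```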

# ASR Status Query

# Interface Description

Returns the current status of the specified ASR request based on input parameters.

# Request URL

POST /api/voice/v1/result/asr

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR request result. Populated once processing has completed; otherwise empty {} |
| data.result.srt_url | String | URL of the ASR audio subtitle SRT file |
| data.result.text | String | ASR recognition result |
| data.result.word_times | List | Word-level timestamps of the ASR audio file, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word in the ASR audio file, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word in the ASR audio file, in ms |
| data.result.word_times.text | String | Text content of the word in the ASR audio file |

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR recognition result",
            "word_times": [
                { "begin_ms": 2080, "end_ms": 2560, "text": "字" },
                { "begin_ms": 2560, "end_ms": 3040, "text": "级" },
                { "begin_ms": 3040, "end_ms": 3520, "text": "别" },
                { "begin_ms": 3520, "end_ms": 4000, "text": "时" },
                { "begin_ms": 4000, "end_ms": 4480, "text": "间" },
                { "begin_ms": 4480, "end_ms": 4960, "text": "戳" }
            ]
        }
    }
}

# Initiate Polyphony Request

# Interface Description

Detects polyphonic characters in the input text and returns the selectable pronunciations for each of them.

# Request URL

POST /api/voice/v1/request/polyphony

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Request text |
| format | String | False | Return format; supports "pinyin" and "bopomofo". Default is "pinyin" |
| timeout | Integer | False | Timeout in ms. If processing finishes within the timeout, the polyphony result is returned directly; otherwise, a request_id is returned |

# Request Example

{
    "query": "Text content for voice synthesis xxxxxx",
    "timeout": 3000
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphony request result. Populated if processing finishes within the timeout; otherwise an empty list [] |
| data.result.text | String | Polyphonic text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; the recommended pronunciation comes first, followed by the candidate pronunciations |
| data.result.polyphony_assist | List of String List | Selectable pronunciations paired with their Pinyin and Zhuyin annotations; the recommended pronunciation comes first, followed by the alternatives. Returned only when format is bopomofo |

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": []
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}

case3: Request with format set to bopomofo, did not time out and succeeded

{
    "code": 0,
    "data": {
        "request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
        "result": [
            {
                "polyphony": [
                    "lv4",
                    "lu4"
                ],
                "polyphony_assist": [
                    ["lv4", "ㄌㄩˋ"],
                    ["lu4", "ㄌㄨˋ"]                    
                ],
                "text": "绿"
            },
            {
                "polyphony": [
                    "le5",
                    "liao3"
                ],
                "polyphony_assist": [
                    ["le5", "˙ㄌㄜ"],
                    ["liao3", "ㄌㄧㄠˇ"]
                ],
                "text": "了"
            }
        ],
        "status": 0
    },
    "message": "ok"
}
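
Because the recommended pronunciation is listed first, a client can pre-select it for each polyphonic character and let the user override it. The sketch below extracts those defaults as (text, pronunciation) pairs; `recommended_pronunciations` is an illustrative name.

```python
def recommended_pronunciations(result: list) -> list:
    """Return (text, recommended pronunciation) pairs from data.result."""
    return [
        (item["text"], item["polyphony"][0])  # first entry is the recommended one
        for item in result
        if item.get("polyphony")
    ]


result = [
    {"text": "待", "polyphony": ["dai4", "dai1"]},
    {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
]
print(recommended_pronunciations(result))
```

Pairs are returned as a list rather than a dict so that a character appearing more than once in the text keeps one entry per occurrence.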

# Polyphony Status Query

# Interface Description

Returns the current status of the specified polyphony request based on input parameters.

# Request URL

POST /api/voice/v1/result/polyphony

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code number, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphony request result. Populated once processing has completed; otherwise an empty list [] |
| data.result.text | String | Polyphonic text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; the recommended pronunciation comes first, followed by the candidate pronunciations |
| data.result.polyphony_assist | List of String List | Selectable pronunciations paired with their Pinyin and Zhuyin annotations; the recommended pronunciation comes first, followed by the alternatives. Returned only when format is bopomofo |

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": []
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}

case3: Processing completed for a request with format set to bopomofo

{
    "code": 0,
    "data": {
        "request_id": "66d043c7-adbd-4a12-a52e-aa7c3c226a05",
        "result": [
            {
                "polyphony": [
                    "lv4",
                    "lu4"
                ],
                "polyphony_assist": [
                    ["lv4", "ㄌㄩˋ"],
                    ["lu4", "ㄌㄨˋ"]                    
                ],
                "text": "绿"
            },
            {
                "polyphony": [
                    "le5",
                    "liao3"
                ],
                "polyphony_assist": [
                    ["le5", "˙ㄌㄜ"],
                    ["liao3", "ㄌㄧㄠˇ"]
                ],
                "text": "了"
            }
        ],
        "status": 0
    },
    "message": "ok"
}
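
The `result` list in the samples above can be reduced to a mapping from each polyphonic character to its recommended pronunciation, which according to the field description is the first entry of `polyphony`. The following is a minimal Python sketch under the assumption that the response body has already been decoded into a dict; the function name `recommended_pronunciations` is illustrative and not part of the platform API.

```python
from typing import Dict, List


def recommended_pronunciations(response: dict) -> Dict[str, str]:
    """Map each polyphonic character to its first (recommended) pronunciation."""
    data = response.get("data", {})
    # status 0 means processing completed; other values carry no usable result.
    if data.get("status") != 0:
        return {}
    picks: Dict[str, str] = {}
    for item in data.get("result", []):
        options: List[str] = item.get("polyphony", [])
        if options:
            picks[item["text"]] = options[0]
    return picks


# Decoded form of the "Processing completed" sample.
sample = {
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {"text": "待", "polyphony": ["dai4", "dai1"]},
            {"text": "的", "polyphony": ["de5", "di4", "di1", "di2"]},
        ],
    },
}

print(recommended_pronunciations(sample))
```

If the same character appears more than once in `result`, this sketch keeps only the last occurrence; a caller that needs per-position choices would keep the list order instead.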

# Initiate Synthesis Request for Singing Audio

# API Description

Returns a new pure vocal audio and background music (if applicable) based on the input singing audio, singing audio attributes, and a specified singing tone ID.

# Request URL

POST /api/voice/v3/svc/request

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio_url | String | True | URL of the original singing audio |
| sid | String | True | Singing tone ID |
| with_bgm | Boolean | False | Whether the audio file contains background music. If it does, source separation is performed, and the dry vocal and background music are returned separately |
| pitch | Integer | False | Pitch. Higher values sound sharper, lower values sound deeper. Supported range is [-12, 12]; default is 0 |
| timeout | Integer | False | Time in milliseconds to wait synchronously for the task to complete. If the task completes within this time, the result is returned directly; otherwise the request_id is returned |

# Request Sample

{
    "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
    "sid": "sid1",
    "with_bgm": true,
    "pitch": 0,
    "timeout": 0
}
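
The request body above can be assembled with a small amount of client-side validation before it is sent. The sketch below is a hedged Python illustration: the helper name `build_svc_request` is hypothetical, only the field names and the pitch range come from the parameter table, and optional fields that the caller does not set are simply omitted. Sending the body (including the token header) is left to whatever HTTP client the caller already uses.

```python
import json
from typing import Optional


def build_svc_request(audio_url: str, sid: str,
                      with_bgm: Optional[bool] = None,
                      pitch: int = 0,
                      timeout_ms: Optional[int] = None) -> str:
    """Return a JSON body for POST /api/voice/v3/svc/request."""
    if not audio_url or not sid:
        raise ValueError("audio_url and sid are required")
    # The documented range for pitch is [-12, 12], default 0.
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be within [-12, 12]")
    body = {"audio_url": audio_url, "sid": sid, "pitch": pitch}
    if with_bgm is not None:
        body["with_bgm"] = with_bgm
    if timeout_ms is not None:
        body["timeout"] = timeout_ms
    return json.dumps(body)


print(build_svc_request("https://example.com/audio1.mp3", "sid1", with_bgm=True))
```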

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Response code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | SVC request result. Populated if the task completes within the timeout; otherwise an empty object {} |
| data.result.vocal_track_url | String | URL of the SVC result audio file |
| data.result.original_instrumental_track_url | String | URL of the separated instrumental track, returned only if with_bgm is true |
| data.result.original_vocal_track_url | String | URL of the separated original vocal track, returned only if with_bgm is true |
| data.result.original_reverb_url | String | URL of the separated reverb track, returned only if with_bgm is true |

# Response Sample

case 1: Request Timed Out

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case 2: Request Completed Within Timeout

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
            "original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
            "original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
        }
    }
}

# Query Synthesis Request Status for Singing Audio

# API Description

Returns the current status of the specified synthesis request for singing audio based on the input parameters.

# Request URL

POST /api/voice/v3/svc/result

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Sample

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Response code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | SVC request result. Populated once processing has completed; otherwise an empty object {} |
| data.result.vocal_track_url | String | URL of the SVC result audio file |
| data.result.original_instrumental_track_url | String | URL of the separated instrumental track, returned only if with_bgm is true |
| data.result.original_vocal_track_url | String | URL of the separated original vocal track, returned only if with_bgm is true |
| data.result.original_reverb_url | String | URL of the separated reverb track, returned only if with_bgm is true |

# Response Sample

case 1: Still Processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status": 1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

case 2: Processing Complete

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "original_instrumental_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_acc.mp3",
            "original_vocal_track_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_vocal.mp3",
            "original_reverb_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1_reverb.mp3"
        }
    }
}
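
The request and query interfaces are typically combined into a polling loop: keep querying while `data.status` is 1000, stop when it becomes 0, and treat any other value as a failure. The Python sketch below illustrates that flow under stated assumptions: `fetch` is a hypothetical caller-supplied function that POSTs `{"request_id": ...}` to /api/voice/v3/svc/result and returns the decoded JSON, and the interval and attempt limit are arbitrary values, not platform recommendations.

```python
import time
from typing import Callable

PROCESSING = 1000  # data.status value shown in the "Still Processing" sample


def wait_for_svc_result(request_id: str,
                        fetch: Callable[[str], dict],
                        interval_s: float = 1.0,
                        max_attempts: int = 30) -> dict:
    """Poll until the SVC task finishes and return data.result."""
    for _ in range(max_attempts):
        data = fetch(request_id)["data"]
        if data["status"] == 0:
            return data["result"]
        if data["status"] != PROCESSING:
            # Any status other than 0 or 1000 is treated as a failure here.
            raise RuntimeError(f"SVC request failed with status {data['status']}")
        time.sleep(interval_s)
    raise TimeoutError(f"no result for {request_id} after {max_attempts} attempts")
```

Passing the transport in as `fetch` keeps the loop independent of any particular HTTP library and makes it straightforward to exercise with canned responses.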

# Initiate Translation Request (Not Supported Yet)

# Interface Description

Returns the translated text content based on the input text and the specified target language.

# Request URL

POST /api/voice/v1/request/translate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text to be translated, at most 200,000 characters |
| to | String | True | Target language code for the output text. For supported languages, see: Language List |
| timeout | Integer | False | Time in milliseconds to wait for a result. If the translation completes within this time, the result is returned directly; otherwise the request_id is returned |

# Request Example

{
    "query": "Text content to be translated xxxxxx",
    "to": "en",
    "timeout": 3000
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Response code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | String | Translation request result. Populated if the translation completes within the timeout; otherwise empty |

# Response Example

case1: Request timed out

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": []
    }
}

case2: Request not timed out, request successful

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": "The text to translate"
    }
}

# Translation Status Query (Not Supported Yet)

# Interface Description

Returns the current status of the specified translation request based on input parameters.

# Request URL

POST /api/voice/v1/result/translate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Response code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | String | Translation request result. Populated once processing has completed; otherwise empty |

# Response Example

case1: Still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": ""
    }
}

case2: Processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": "The text to translate"
    }
}

# Status Table

# TTS Language Detection Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Successfully processed |
| 3201 | Rejected: Required parameter missing |
| 3202 | Rejected: Illegal parameter--Request text is empty |
| 3301 | Request failed: Language not detected |

# TTS Language Validation Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Successfully processed |
| 1201 | Rejected: Required parameter missing |
| 1202 | Rejected: Illegal parameter--Voice ID is empty |
| 1203 | Rejected: Illegal parameter--Request text is empty |
| 1204 | Rejected: Illegal parameter--Language is empty |
| 1205 | Rejected: Illegal parameter--Voice ID does not exist |
| 1206 | Rejected: Illegal parameter--Unknown language |
| 1207 | Rejected: Illegal parameter--Unknown supplier |
| 1301 | Request failed: Language not detected |

# TTS Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Successfully processed |
| 1000 | Processing |
| 1002 | Rejected: Required parameter missing |
| 1003 | Rejected: Illegal parameter--Voice is empty |
| 1004 | Rejected: Illegal parameter--Text to synthesize is empty |
| 1005 | Rejected: Illegal parameter--Voice does not exist |
| 1102 | Request failed, not timed out: Service not connected |
| 1103 | Request failed, not timed out: Busy |
| 1104 | Request failed, not timed out: Internal error |
| 1105 | Request failed, not timed out: Request timed out |

# TTS Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 1102 | Failed: Service not connected |
| 1103 | Failed: Busy |
| 1104 | Failed: Internal error |
| 1105 | Failed: Request timed out |
| 1106 | Failed: Unknown request ID |

# ASR Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Request successful |
| 1000 | Request timed out, a new request created |
| 2001 | Rejected: Required parameter missing |
| 2002 | Rejected: Illegal parameter--Audio URL is empty |
| 2102 | Request failed, not timed out: Service not connected |
| 2103 | Request failed, not timed out: Busy |
| 2104 | Request failed, not timed out: Internal error |
| 2105 | Request failed, not timed out: Request timed out |
| 2106 | Request failed, not timed out: Audio download error |

# ASR Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 2102 | Failed: Service not connected |
| 2103 | Failed: Busy |
| 2104 | Failed: Internal error |
| 2105 | Failed: Request timed out |
| 2106 | Failed: Audio download error |
| 2107 | Failed: Unknown request ID |

# Polyphone Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Successfully processed |
| 1000 | Processing |
| 5002 | Rejected: Required parameter missing |
| 5004 | Rejected: Illegal parameter--Text to synthesize is empty |
| 5102 | Request failed, not timed out: Service not connected |
| 5103 | Request failed, not timed out: Busy |
| 5104 | Request failed, not timed out: Internal error |
| 5105 | Request failed, not timed out: Request timed out |

# Polyphone Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 5102 | Failed: Service not connected |
| 5103 | Failed: Busy |
| 5104 | Failed: Internal error |
| 5105 | Failed: Request timed out |
| 5106 | Failed: Unknown request ID |

# Translation Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Request successful |
| 1000 | Processing |
| 3401 | Rejected: Required parameter missing |
| 3402 | Rejected: Illegal parameter--Request text is empty |
| 3403 | Rejected: Illegal parameter--Output language is empty |
| 3404 | Rejected: Illegal parameter--Translation text exceeds 200,000 characters |
| 3501 | Request failed, internal error |
| 3502 | Request failed, request timed out |

# Voice Conversion Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Successfully processed |
| 1000 | Processing |
| 1401 | Rejected: Required parameter missing |
| 1402 | Rejected: Illegal parameter--Audio URL is empty |
| 1403 | Rejected: Illegal parameter--vc_id does not exist |
| 1404 | Rejected: Illegal parameter--Audio duration exceeds 10 minutes |
| 1502 | Request failed: VC service not connected |
| 1503 | Request failed: VC service internal error |
| 1504 | Request failed: Request timed out |
| 1505 | Request failed: Audio download error |
| 1506 | Request failed: Incorrect audio format |

# Voice Conversion Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 1404 | Rejected: Illegal parameter--Audio duration exceeds 10 minutes |
| 1502 | Failed: VC service not connected |
| 1503 | Failed: VC service internal error |
| 1504 | Failed: Request timed out |
| 1505 | Failed: Audio download error |
| 1506 | Failed: Incorrect audio format |
| 1507 | Failed: Unknown request ID |

# Translation Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 3501 | Failed: Internal error |
| 3502 | Failed: Request timed out |
| 3503 | Failed: Unknown request ID |
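
The status tables above lend themselves to a simple lookup keyed by `data.status`. As a hedged illustration, the Python sketch below copies the entries of the TTS Request Query Status table into a dict; the names `TTS_QUERY_STATUS`, `describe_tts_query_status`, and `is_terminal` are hypothetical helpers, and the other status families would need their own tables.

```python
# Entries copied from the "TTS Request Query Status" table.
TTS_QUERY_STATUS = {
    0: "Success",
    1000: "Processing",
    1102: "Failed: Service not connected",
    1103: "Failed: Busy",
    1104: "Failed: Internal error",
    1105: "Failed: Request timed out",
    1106: "Failed: Unknown request ID",
}


def describe_tts_query_status(status: int) -> str:
    """Return the documented description, or a fallback for unlisted values."""
    return TTS_QUERY_STATUS.get(status, f"Unrecognized status {status}")


def is_terminal(status: int) -> bool:
    """True once polling can stop: every status except 1000 (Processing)."""
    return status != 1000


print(describe_tts_query_status(1106))
```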

# USSML Grammar Explanation

# Introduction

USSML (Unified Speech Synthesis Markup Language) aims to provide a unified SSML (Speech Synthesis Markup Language) syntax format. USSML supports the most commonly used SSML features: pauses (<break>), explicit pronunciation (<phoneme>), text substitution (<sub>), and reading methods (<say-as>), which together cover most speech synthesis needs.

# How to Use

# Syntax Format

The USSML syntax format is as follows:

<speak sttts:version="0.1">
    <break time="string" />
    <phoneme ph="string"></phoneme>
    <sub alias="string"></sub>
    <say-as interpret-as="string"></say-as>
</speak>

Except for the <speak> tag, which is used to wrap other child tags, the remaining tags cannot be nested.
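
The no-nesting rule can be checked on the client before a request is sent. The sketch below is one possible Python approach using the standard library's `html.parser`, chosen here because it tolerates the `sttts:version` attribute without requiring a namespace declaration; the class and function names are illustrative only, and the check covers nesting alone, not the rest of the USSML grammar.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Flags any tag that is opened inside another non-<speak> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags = []  # currently open tags other than <speak>
        self.nested = False

    def handle_starttag(self, tag, attrs):
        if tag == "speak":
            return
        if self.open_tags:
            self.nested = True
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag != "speak" and self.open_tags:
            self.open_tags.pop()


def has_nested_tags(ussml: str) -> bool:
    checker = NestingChecker()
    checker.feed(ussml)
    checker.close()
    return checker.nested


flat = '<speak sttts:version="0.1"><sub alias="World Wide Web Consortium">W3C</sub><break time="500ms" /></speak>'
print(has_nested_tags(flat))
```

Self-closing tags such as `<break ... />` are reported by the parser as a start tag immediately followed by an end tag, so they never remain on the open-tag stack.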

# Special Characters

In USSML, if the following special characters are used, they need to be escaped as shown in the table below. For the characters related to the USSML markup itself, no escape is required.

| Special Character | Escape Character |
| --- | --- |
| & | &amp; |
| < | &lt; |
| > | &gt; |
| " | &quot; |
| ' | &apos; |
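
As an illustration of the escaping rules above, the Python sketch below uses `xml.sax.saxutils.escape` from the standard library, which handles `&`, `<`, and `>` by default and accepts an extra mapping for the two quote characters. The helper name `escape_ussml_text` is hypothetical; it is meant for user-supplied text only, not for the USSML tags themselves.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > itself; quotes are added through the extra mapping.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_ussml_text(text: str) -> str:
    """Escape the five special characters in plain text destined for USSML."""
    return escape(text, _QUOTES)


print(escape_ussml_text('Tom & "Jerry" < 3'))
# Tom &amp; &quot;Jerry&quot; &lt; 3
```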
# Tag Explanation
# <speak>

The <speak> tag is the root tag of USSML, used to wrap all USSML tags.

The attribute sttts:version is used to specify the version number of USSML, which is currently 0.1.

<speak sttts:version="0.1">
    <!-- USSML Tags -->
</speak>
# <break>

The <break> tag inserts a pause, and its time attribute specifies the pause duration. The value of the time attribute is a string expressed in seconds (s) or milliseconds (ms). The maximum pause duration is 5 seconds. A bare number is not accepted; a unit is required. For example:

<break time="5s" />

or

<break time="5000ms" />
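
The constraints on the time attribute (a unit is mandatory, and the pause may not exceed 5 seconds) can be enforced when the tag is generated. The Python sketch below is a minimal illustration; the helper name `break_tag` is hypothetical, and it assumes whole-number values, since the text does not say whether decimals such as "0.5s" are accepted.

```python
import re

_BREAK_TIME = re.compile(r"(\d+)(ms|s)")
MAX_BREAK_MS = 5000  # documented maximum pause: 5 seconds


def break_tag(time_value: str) -> str:
    """Return a <break> tag after validating its time attribute."""
    match = _BREAK_TIME.fullmatch(time_value)
    if match is None:
        # Covers bare numbers such as "5", which lack the required unit.
        raise ValueError("time must be a whole number followed by 's' or 'ms'")
    amount, unit = int(match.group(1)), match.group(2)
    millis = amount * 1000 if unit == "s" else amount
    if millis > MAX_BREAK_MS:
        raise ValueError("pause duration may not exceed 5 seconds")
    return f'<break time="{time_value}" />'


print(break_tag("500ms"))
```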
# <phoneme>

The <phoneme> tag is used to specify pronunciation, and its ph attribute is used to specify the content of the pronunciation. Since different providers support different language ranges, the <phoneme> in USSML currently only supports Chinese Pinyin. Pinyin usage: separate the pinyin of each character with a space, and the number of pinyin must equal the number of characters. Each pinyin consists of pronunciation and tone, where the tone is a number from 1 to 5, with "5" representing the neutral tone. For example:

<phoneme ph="mai2 mo4">埋没</phoneme>
# Example

<speak sttts:version="0.1">
    You say <phoneme ph="bo2">薄</phoneme>.
    <break time="500ms" />
    I say <phoneme ph="bao2">薄</phoneme>.
</speak>
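
The two Pinyin rules described for <phoneme> (one syllable per character, and a tone digit from 1 to 5) can be checked before the tag is emitted. The Python sketch below is a hedged illustration: `phoneme_tag` is a hypothetical helper, and the syllable pattern (lowercase letters followed by one tone digit) is an assumption about the accepted spelling rather than a documented grammar.

```python
import re

# Assumed syllable shape: lowercase letters plus a tone digit 1-5 (5 = neutral tone).
_SYLLABLE = re.compile(r"[a-z]+[1-5]")


def phoneme_tag(text: str, pinyin: str) -> str:
    """Return a <phoneme> tag after checking the Pinyin against the text."""
    syllables = pinyin.split()
    if len(syllables) != len(text):
        raise ValueError("the number of pinyin syllables must equal the number of characters")
    for syllable in syllables:
        if not _SYLLABLE.fullmatch(syllable):
            raise ValueError(f"invalid pinyin syllable: {syllable!r}")
    ph = " ".join(syllables)
    return f'<phoneme ph="{ph}">{text}</phoneme>'


print(phoneme_tag("埋没", "mai2 mo4"))
```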
# <sub>

The <sub> tag substitutes the text that is read aloud during synthesis. Its alias attribute specifies the replacement text: during synthesis, the alias text is read in place of the tag's content, while any subtitles show the original text inside the tag.

Neither the content of the <sub> tag nor its alias attribute may be empty.

For example:

<sub alias="World Wide Web Consortium">W3C</sub>

In the example above, the subtitle displays "W3C" and the spoken text is "World Wide Web Consortium".

# <say-as>

The <say-as> tag allows you to synthesize the content of the tag using a specific reading method. The 'interpret-as' attribute of this tag is used to specify the reading method. Different providers may produce slightly different results for the same reading method. The interpret-as attribute supports multiple values, including:

| Value | Description | Example |
| --- | --- | --- |
| cardinal | Pronounced as a numerical value | “1487” is read as “one thousand four hundred eighty-seven” |
| digit | Pronounced as a series of digits | “12345” is read as “one two three four five” |
| phone | Pronounced as a telephone number | “1301001155” is read as “one three zero one zero zero one one five five” |
| address | Pronounced as an address | “市台路388-301号” is read as “Shi Tai Road three eight eight dash three zero one number” |
| date | Pronounced as a date | “1998-12-12” is read as “nineteen ninety-eight December twelfth” |
| clock | Pronounced as a time | “12:00:12” is read as “twelve o'clock twelve seconds” |
# Example
<say-as interpret-as="cardinal">12345</say-as>
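
When <say-as> tags are generated programmatically, the interpret-as value can be checked against the values listed in the table. The Python sketch below is a minimal illustration with a hypothetical helper `say_as_tag`; the table is introduced with "including", so the tuple here may not be exhaustive and should be treated as an assumption.

```python
from xml.sax.saxutils import escape

# Values taken from the table above; the platform may accept others.
INTERPRET_AS_VALUES = ("cardinal", "digit", "phone", "address", "date", "clock")


def say_as_tag(text: str, interpret_as: str) -> str:
    """Return a <say-as> tag, escaping special characters in the content."""
    if interpret_as not in INTERPRET_AS_VALUES:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    content = escape(text, {'"': "&quot;", "'": "&apos;"})
    return f'<say-as interpret-as="{interpret_as}">{content}</say-as>'


print(say_as_tag("12345", "cardinal"))
```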

Example 1:

<speak sttts:version="0.1">
    You say <phoneme ph="bo2">薄</phoneme>.
    <break time="500ms" />
    I say <phoneme ph="bao2">薄</phoneme>.
</speak>

Example 2:

<speak sttts:version="0.1">
    <sub alias="World Wide Web Consortium">W3C</sub>
    is an international standardization organization.
</speak>

Example 3:

<speak sttts:version="0.1">
    <sub alias="TsingTao Beer">青岛啤酒</sub>
    in Henan dialect, it is,
    <phoneme ph="qing2 dao1 pi4 jiu1">青岛啤酒</phoneme>.
    <say-as interpret-as="cardinal">12345</say-as>
</speak>

The above content relates to the platform's voice processing capabilities.

Last Updated: 9/4/2024, 9:32:01 PM