# Voice Processing

Beyond its core multi-element video synthesis capabilities, the platform provides voice processing capabilities that let users preprocess voice-related content for video synthesis. These include TTS, ASR, and polyphonic character handling. The platform also provides USSML (Unified Speech Synthesis Markup Language), which normalizes the SSML (Speech Synthesis Markup Language) syntax used by the various TTS vendors.

# Capability Introduction

  • TTS (speech synthesis)
    • Reads out the text content based on the user-selected speaker voice, supporting adjustment of pitch, speech rate, and volume.
  • ASR (speech recognition)
    • Parses the audio content uploaded by the user and transcribes it into corresponding text content.
  • Polyphonic character handling
    • Searches for polyphonic characters based on the text content uploaded by the user, providing feedback on possible polyphonic characters and their alternative pronunciations.

# API Reference

All API services on the platform are called through the service endpoint aigc.softsugar.com, and every request must include the token in the request header.
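
As a minimal sketch of what such a call could look like, the snippet below builds (without sending) a JSON POST request using only the Python standard library. The HTTPS scheme and the `Authorization: Bearer <token>` header format are assumptions (the source only says the token goes in the request header), and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "https://aigc.softsugar.com"  # assumption: HTTPS on the documented host


def build_request(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a platform API path."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: bearer-style token in the Authorization header.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(BASE_URL + path, data=body, headers=headers, method="POST")


req = build_request("/api/voice/v1/nlp/language", {"query": "你好"}, token="YOUR_TOKEN")
# To actually send it: urllib.request.urlopen(req), then json.load() the response.
```

Separating request construction from sending keeps the header and body logic easy to inspect before any network traffic happens.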

# TTS Language Detection

# API Description

Identifies the language of the text content uploaded by the user and returns the corresponding language code.

# Request URL

POST /api/voice/v1/nlp/language

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text content |

# Request Example

{
    "query": "待合成语音的文本内容xxxxxx"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | Language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |

# Response Example

Case 1: request succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": "zh"
    }
}
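
Every response in this document shares the same envelope (`code`, `message`, `data.status`, `data.result`). The following sketch shows one way a client might unwrap it; `VoiceApiError` and `unwrap` are hypothetical names, and treating any non-zero `code` or `status` as an error is an assumption that suits synchronous endpoints like this one (asynchronous endpoints also use status 1000 for "processing").

```python
import json


class VoiceApiError(Exception):
    """Raised when the response envelope reports a failure."""


def unwrap(response_text: str):
    """Return data.result from a response envelope, or raise VoiceApiError."""
    body = json.loads(response_text)
    if body.get("code") != 0:
        raise VoiceApiError(f"code={body.get('code')}: {body.get('message')}")
    data = body.get("data") or {}
    if data.get("status") != 0:
        raise VoiceApiError(f"status={data.get('status')}: {body.get('message')}")
    return data.get("result")


sample = '{"code": 0, "message": "ok", "data": {"status": 0, "result": "zh"}}'
language = unwrap(sample)
```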

# TTS Language Validation (QID)

# API Description

Determines whether the text content, language, and QID provided by the user match each other, and returns the detected language code.

# Request URL

POST /api/voice/v3/tts/validate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text content |
| qid | String | True | Voice QID |
| ssml | Boolean | False | Whether to use SSML; defaults to false |

# Request Example

{
    "query": "待合成语音的文本内容xxxxxx",
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | Validation result |
| data.result.valid | Boolean | Whether matched |
| data.result.language | String | Detected language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |

# Response Example

Case 1: request succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": {
            "valid": true,
            "language": "zh"
        }
    }
}

# TTS Language Validation (Legacy)

Note: This API is currently in maintenance mode and will not receive new feature updates. It is recommended to use the QID API.

# API Description

Determines whether the text content, language, voice, and vendor ID provided by the user match each other, and returns the detected language code.

# Request URL

POST /api/voice/v1/tts/validate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Text content |
| voice_id | String | True | Voice ID |
| language | String | True | Language, in BCP 47 format, e.g. "en-US", "zh-CN" |
| vendor_id | Integer | True | Vendor ID |
| ssml | Boolean | False | Whether to use SSML |

# Request Example

{
    "query": "待合成语音的文本内容xxxxxx",
    "voice_id": "xiaoling",
    "language": "zh-CN",
    "vendor_id": 3
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | Validation result |
| data.result.valid | Boolean | Whether matched |
| data.result.language | String | Detected language code, per ISO 639-1. Currently detectable: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu |

# Response Example

Case 1: request succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "result": {
            "valid": true,
            "language": "zh"
        }
    }
}

# Voice ID Migration to QID

# API Description

Actively migrates existing voices represented by Voice ID to new voices represented by QID, and returns the corresponding QID value. It is recommended to store the QID for subsequent inference requests.

# Request URL

POST /api/voice/v3/tts/migrate

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| voice_id | String | True | Voice ID to be converted |

# Request Example

{
  "voice_id": "xiaoning"
}

# Response Elements

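| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | String | QID, representing the preconfigured QID. This field is fairly long; storing it as VARCHAR(512) is recommended |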

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": "JQb7Qv:AEA_Z10Mqp9GYwDGdLzMvPzEzIqwo"
  }
}

# QID Details API

# API Description

Gets detailed information about a QID, including the adjustable parameter details supported by the voice.

# Request URL

GET /api/voice/v3/tts/qid/{qid}

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| qid | String | True | QID to query for details |

# Request Example

  https://domain/api/voice/v3/tts/qid/mwvA2f:AEBEvMy850Y_Z10Mqp9GUwTMr8xMyI3Tzk3Q

# Response Elements

Field Type Description
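| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.result | Object | QID details |
| data.result.pitch | Boolean | Whether pitch adjustment is supported |
| data.result.speed | Boolean | Whether speed adjustment is supported |
| data.result.volume | Boolean | Whether volume adjustment is supported |
| data.result.subtitle | Boolean | Whether subtitle, sentence-level timestamp, and word-level timestamp return is supported |
| data.result.ussml | Object | USSML syntax support |
| data.result.ussml.break | Boolean | Whether USSML pauses take effect, i.e. whether the `<break>` tag is supported |
| data.result.ussml.phoneme | String list | Whether USSML pronunciation specification takes effect, i.e. whether the `<phoneme>` tag is supported. Two values are possible: pinyin, ipa |
| data.result.ussml.sub | Boolean | Whether USSML text substitution takes effect, i.e. whether the `<sub>` tag is supported |
| data.result.ussml.sayas | String list | Whether USSML reading-style specification takes effect, i.e. whether the `<say-as>` tag is supported. Six values are possible: cardinal, digit, phone, address, date, clock |
| data.result.languages | String list | Supported language list |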

# Response Example

{
  "code": 0,
  "message": "ok",
  "data": {
    "status": 0,
    "result": {
      "pitch": true,
      "speed": true,
      "volume": true,
      "phone": true,
      "subtitle": true,
      "ussml": {
        "break": true,
        "phoneme": ["pinyin"],
        "sub": true,
        "sayas": ["cardinal", "digit", "phone", "address", "date", "clock"]
      },
      "languages": [
        "zh-CN",
        "en-US"
      ]
    }
  }
}
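
One plausible use of these details is to drop tuning parameters that a voice does not support before submitting a TTS request. The sketch below assumes that the `pitch`, `speed`, and `volume` flags correspond to the `pitch_offset`, `speed_ratio`, and `volume` request fields; that mapping and the helper name `apply_supported_tuning` are illustrative assumptions, not something the source states.

```python
def apply_supported_tuning(details: dict, wanted: dict) -> dict:
    """Keep only the tuning parameters that the QID details mark as supported."""
    # Assumed mapping from QID detail flags to v3 TTS request fields.
    flag_to_field = {"pitch": "pitch_offset", "speed": "speed_ratio", "volume": "volume"}
    return {
        field: wanted[field]
        for flag, field in flag_to_field.items()
        if details.get(flag) and field in wanted
    }


# `details` stands in for data.result of a QID details response.
details = {"pitch": True, "speed": False, "volume": True, "languages": ["zh-CN", "en-US"]}
params = apply_supported_tuning(details, {"pitch_offset": 5.0, "speed_ratio": 1.2, "volume": 120})
```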

# Submit TTS Request (QID)

# API Description

Reads out the text content based on the user-selected speaker voice, supporting adjustment of pitch, speech rate, and volume.

# Request URL

POST /api/voice/v3/tts/request

# Request Parameters

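| Field | Type | Required | Description |
| --- | --- | --- | --- |
| qid | String | True | Speaker QID |
| query | String | True | Text content to synthesize |
| ssml | Boolean | False | Whether to use USSML |
| timeout | Integer | False | Time to wait for the result, in ms. If the result is ready within the timeout, the TTS result is returned directly; otherwise a request_id is returned |
| pitch_offset | Float | False | Pitch; higher values sound sharper, lower values sound deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech rate; the larger the value, the slower the speech. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume; the larger the value, the louder the sound. Supported range [1, 400]; default 100 |
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |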
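
Since the tuning parameters have documented ranges, a client can check them before sending. The following is a minimal sketch of such a check; `build_tts_request` and `RANGES` are hypothetical names, and rejecting out-of-range values on the client side is a design choice rather than platform behavior.

```python
from typing import Optional

# Ranges copied from the parameter table above.
RANGES = {"pitch_offset": (-60, 60), "speed_ratio": (0.5, 2), "volume": (1, 400)}


def build_tts_request(qid: str, query: str, timeout_ms: Optional[int] = None, **tuning) -> dict:
    """Build a v3 TTS request body, rejecting tuning values outside the documented ranges."""
    if not qid or not query:
        raise ValueError("qid and query are required")
    body = {"qid": qid, "query": query}
    if timeout_ms is not None:
        body["timeout"] = timeout_ms
    for name, value in tuning.items():
        if name not in RANGES:
            raise ValueError(f"unknown tuning parameter: {name}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        body[name] = value
    return body


body = build_tts_request("EXAMPLE_QID", "你好", timeout_ms=3000, speed_ratio=1.5, volume=100)
```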

# Request Example

{
    "qid": "AEBEvMy850Y_Z10Mqp9GUwDGHMSi0tS_TMr8xMyI3Tzk3QyqsK",
    "query": "待合成语音的文本内容xxxxxx",
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}

# Response Elements

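| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if the result is returned within the timeout; otherwise empty (`{}`) |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |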

# Response Example

Case 1: request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": {}
    }
}

Case 2: request completed within the timeout and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}

# TTS Status Query (QID)

# API Description

Returns the current status of a specified TTS request based on input parameters.

# Request URL

POST /api/voice/v3/tts/result

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Parameters

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated when the result is available; otherwise empty (`{}`) |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |

# Response Example

Case 1: still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

Case 2: processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}
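
The submit and status-query endpoints together form a submit-then-poll pattern: if the submit call returns status 1000, the client queries the result endpoint with the `request_id` until the status changes. The sketch below illustrates that loop. The `post` callable is a hypothetical transport (it takes a path and a payload and returns the decoded response), and the poll interval and retry cap are arbitrary assumptions.

```python
import time
from typing import Callable

PROCESSING = 1000  # "Processing" in the status table


def wait_for_tts(post: Callable[[str, dict], dict], request: dict,
                 poll_interval_s: float = 1.0, max_polls: int = 30) -> dict:
    """Submit a TTS request, then poll the result endpoint until it finishes."""
    data = post("/api/voice/v3/tts/request", request)["data"]
    polls = 0
    while data["status"] == PROCESSING:
        if polls >= max_polls:
            raise TimeoutError(f"request {data['request_id']} is still processing")
        time.sleep(poll_interval_s)
        data = post("/api/voice/v3/tts/result", {"request_id": data["request_id"]})["data"]
        polls += 1
    if data["status"] != 0:
        raise RuntimeError(f"TTS failed with status {data['status']}")
    return data["result"]
```

Passing the transport in as a parameter keeps the polling logic independent of any particular HTTP client and makes it easy to exercise with a stub.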

# Submit ASR Request

# API Description

Parses the audio content uploaded by the user and transcribes it into corresponding text content.

# Request URL

POST /api/voice/v1/request/asr

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio_url | String | True | URL of the audio file to be recognized |
| timeout | Integer | False | Time to wait for the result, in ms. If the result is ready within the timeout, the ASR result is returned directly; otherwise a request_id is returned |
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |

# Request Example

{
    "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
    "timeout": 3000
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR result. Populated if the result is returned within the timeout; otherwise empty (`{}`) |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.text | String | ASR recognition result |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |

# Response Example

Case 1: request timed out

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

Case 2: request completed within the timeout and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR识别结果",
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}

# ASR Status Query

# API Description

Returns the current status of a specified ASR request based on input parameters.

# Request URL

POST /api/voice/v1/result/asr

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | ASR result. Populated when the result is available; otherwise empty (`{}`) |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.text | String | ASR recognition result |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |

# Response Example

Case 1: still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

Case 2: processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "text": "ASR识别结果",
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}
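
The `word_times` entries can be regrouped on the client when a different subtitle layout is needed than the one in `srt_url`. The sketch below groups consecutive words into cues and renders them in SRT form. The function names and the grouping rule (a fixed number of characters per cue) are illustrative assumptions, and joining words without spaces suits Chinese text but would need adjusting for space-separated languages.

```python
def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(word_times: list, max_chars: int = 3) -> str:
    """Group consecutive words into cues of at least max_chars characters each."""
    cues, current = [], []
    for word in word_times:
        current.append(word)
        if sum(len(w["text"]) for w in current) >= max_chars:
            cues.append(current)
            current = []
    if current:  # leftover words form the final cue
        cues.append(current)
    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = "".join(w["text"] for w in cue)
        span = f"{ms_to_srt(cue[0]['begin_ms'])} --> {ms_to_srt(cue[-1]['end_ms'])}"
        blocks.append(f"{index}\n{span}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [
    {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
    {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
    {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
    {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
    {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
    {"begin_ms": 4480, "end_ms": 4960, "text": "戳"},
]
srt = words_to_srt(words, max_chars=3)
```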

# Submit Polyphonic Character Request

# API Description

Returns whether the text contains polyphonic characters and their selectable pronunciations based on input parameters.

# Request URL

POST /api/voice/v1/request/polyphony

# Request Parameters

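| Field | Type | Required | Description |
| --- | --- | --- | --- |
| query | String | True | Request text |
| timeout | Integer | False | Time to wait for the result, in ms. If the result is ready within the timeout, the polyphonic character result is returned directly; otherwise a request_id is returned |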

# Request Example

{
    "query": "待合成语音的文本内容xxxxxx",
    "timeout": 3000
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphonic character results. Populated if the result is returned within the timeout; otherwise empty (`[]`) |
| data.result.text | String | Polyphonic character text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; recommended pronunciation first, alternatives after |

# Response Example

Case 1: request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": []
    }
}

Case 2: request completed within the timeout and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}

# Polyphonic Character Status Query

# API Description

Returns the current status of a specified polyphonic character request based on input parameters.

# Request URL

POST /api/voice/v1/result/polyphony

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Elements

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object List | Polyphonic character results. Populated when the result is available; otherwise empty (`[]`) |
| data.result.text | String | Polyphonic character text |
| data.result.polyphony | String List | Selectable pronunciations for the polyphonic character; recommended pronunciation first, alternatives after |

# Response Example

Case 1: still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": []
    }
}

Case 2: processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status":0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": [
            {
                "text":"待",
                "polyphony":["dai4","dai1"]
            },
            {
                "text":"的",
                "polyphony":["de5","di4","di1","di2"]
            }
        ]
    }
}
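
A polyphony result pairs naturally with the USSML `<phoneme>` tag described later in this document: the client picks a pronunciation for each reported character and wraps it. The sketch below takes the first (recommended) pronunciation. The helper name `to_ussml` is hypothetical, and applying one pronunciation to every occurrence of a character is a simplifying assumption.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ussml(text: str, choices: dict) -> str:
    """Wrap characters listed in `choices` in <phoneme> tags; escape everything else."""
    parts = []
    for char in text:
        if char in choices:
            parts.append(f"<phoneme ph={quoteattr(choices[char])}>{escape(char)}</phoneme>")
        else:
            parts.append(escape(char))
    inner = "".join(parts)
    return f'<speak sttts:version="0.1">{inner}</speak>'


# `result` stands in for data.result of a polyphony response.
result = [{"text": "待", "polyphony": ["dai4", "dai1"]}]
# Take the first (recommended) pronunciation for each reported character.
choices = {item["text"]: item["polyphony"][0] for item in result}
ussml = to_ussml("等待", choices)
```

The resulting string would then be submitted as the `query` of a TTS request with `ssml` set to true.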

# Submit TTS Request (Legacy)

Note: This API is currently in maintenance mode and will not receive new feature updates. It is recommended to use the QID API.

# API Description

Reads out the text content based on the user-selected speaker voice, supporting adjustment of pitch, speech rate, and volume.

# Request URL

POST /api/voice/v1/request/tts

# Request Parameters

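| Field | Type | Required | Description |
| --- | --- | --- | --- |
| voice_id | String | True | Speaker ID |
| language | String | False | Language |
| query | String | True | Text content to synthesize |
| ssml | Boolean | False | Whether to use SSML |
| timeout | Integer | False | Time to wait for the result, in ms. If the result is ready within the timeout, the TTS result is returned directly; otherwise a request_id is returned |
| pitch_offset | Float | False | Pitch; higher values sound sharper, lower values sound deeper. Supported range [-60, 60]; default 0 |
| speed_ratio | Float | False | Speech rate; the larger the value, the slower the speech. Supported range [0.5, 2]; default 1.0 |
| volume | Integer | False | Volume; the larger the value, the louder the sound. Supported range [1, 400]; default 100 |
| subtitle_max_length | Integer | False | Maximum length per subtitle line, default is 0 (no limit) |
| subtitle_cut_by_punc | Boolean | False | Whether to split subtitles by punctuation, default is false (no splitting) |
| word_time | Boolean | False | Whether to return word-level timestamps, default is false (not returned) |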

# Request Example

{
    "voice_id": "xiaoling",
    "query": "待合成语音的文本内容xxxxxx",
    "timeout": 3000,
    "word_time": true,
    "pitch_offset": 0.0,
    "speed_ratio": 1.0,
    "volume": 100
}

# Response Elements

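| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated if the result is returned within the timeout; otherwise empty (`{}`) |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |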

# Response Example

Case 1: request timed out

{
    "code": 0,
    "message": "created",
    "data": {
       "status":1000,
       "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
       "result": {}
    }
}

Case 2: request completed within the timeout and succeeded

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}

# TTS Status Query (Legacy)

Note: This API is currently in maintenance mode and will no longer receive new feature updates. It is recommended to use the QID API.

# API Description

Returns the current status of a specified TTS request based on input parameters.

# Request URL

POST /api/voice/v1/result/tts

# Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| request_id | String | True | Request ID |

# Request Example

{
    "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7"
}

# Response Parameters

| Field | Type | Description |
| --- | --- | --- |
| code | Integer | Code, see status table |
| message | String | Status description |
| data | Object | Result data |
| data.status | Integer | Status, see status table |
| data.request_id | String | Request ID |
| data.result | Object | TTS synthesis result. Populated when the result is available; otherwise empty (`{}`) |
| data.result.audio_url | String | URL of the synthesized MP3 audio file |
| data.result.srt_url | String | URL of the SRT subtitle file for the audio |
| data.result.duration_ms | Integer | Duration of the synthesized MP3 audio, in ms |
| data.result.word_times | List | Word-level timestamps for the audio, in ms |
| data.result.word_times.begin_ms | Integer | Start timestamp of the word, in ms |
| data.result.word_times.end_ms | Integer | End timestamp of the word, in ms |
| data.result.word_times.text | String | Text content of the word |

# Response Example

Case 1: still processing

{
    "code": 0,
    "message": "created",
    "data": {
        "status":1000,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {}
    }
}

Case 2: processing completed

{
    "code": 0,
    "message": "ok",
    "data": {
        "status": 0,
        "request_id": "99c3f13f-c2b7-4388-91ff-8f3244d2c5b7",
        "result": {
            "audio_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.mp3",
            "srt_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.srt",
            "phone_url": "https://xxx.oss-cn-hangzhou.aliyuncs.com/xxx/audio1.phone",
            "duration_ms": 1000,
            "word_times": [
                {"begin_ms": 2080, "end_ms": 2560, "text": "字"},
                {"begin_ms": 2560, "end_ms": 3040, "text": "级"},
                {"begin_ms": 3040, "end_ms": 3520, "text": "别"},
                {"begin_ms": 3520, "end_ms": 4000, "text": "时"},
                {"begin_ms": 4000, "end_ms": 4480, "text": "间"},
                {"begin_ms": 4480, "end_ms": 4960, "text": "戳"}
            ]
        }
    }
}

# Status Table

# TTS Language Detection Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Processed successfully |
| 3201 | Rejected: required parameter missing |
| 3202 | Rejected: invalid parameter (request text is empty) |
| 3301 | Request failed: language could not be detected |

# TTS Language Validation Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Processed successfully |
| 1201 | Rejected: required parameter missing |
| 1202 | Rejected: invalid parameter (voice ID is empty) |
| 1203 | Rejected: invalid parameter (request text is empty) |
| 1204 | Rejected: invalid parameter (language is empty) |
| 1205 | Rejected: invalid parameter (voice ID does not exist) |
| 1206 | Rejected: invalid parameter (unknown language) |
| 1207 | Rejected: invalid parameter (unknown vendor) |
| 1301 | Request failed: language could not be detected |

# TTS Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Processed successfully |
| 1000 | Processing |
| 1002 | Rejected: required parameter missing |
| 1003 | Rejected: invalid parameter (speaker is empty) |
| 1004 | Rejected: invalid parameter (synthesis text is empty) |
| 1005 | Rejected: invalid parameter (speaker does not exist) |
| 1102 | Failure: not connected to the service |
| 1103 | Failure: busy |
| 1104 | Failure: internal error |
| 1105 | Failure: request timed out |

# TTS Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 1102 | Failure: not connected to the service |
| 1103 | Failure: busy |
| 1104 | Failure: internal error |
| 1105 | Failure: request timed out |
| 1106 | Failure: unknown request ID |

# ASR Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Request succeeded |
| 1000 | Request timed out, new request created |
| 2001 | Rejected: required parameter missing |
| 2002 | Rejected: invalid parameter (audio URL is empty) |
| 2102 | Failure: not connected to the service |
| 2103 | Failure: busy |
| 2104 | Failure: internal error |
| 2105 | Failure: request timed out |
| 2106 | Failure: audio download error |

# ASR Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 2102 | Failure: not connected to the service |
| 2103 | Failure: busy |
| 2104 | Failure: internal error |
| 2105 | Failure: request timed out |
| 2106 | Failure: audio download error |
| 2107 | Failure: unknown request ID |

# Polyphonic Character Request Response Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Processed successfully |
| 1000 | Processing |
| 5002 | Rejected: required parameter missing |
| 5004 | Rejected: invalid parameter (synthesis text is empty) |
| 5102 | Failure: not connected to the service |
| 5103 | Failure: busy |
| 5104 | Failure: internal error |
| 5105 | Failure: request timed out |

# Polyphonic Character Request Query Status

| code | Description |
| --- | --- |
| 0 | Success |
| x | Failure |

| status | Description |
| --- | --- |
| 0 | Success |
| 1000 | Processing |
| 5102 | Failure: not connected to the service |
| 5103 | Failure: busy |
| 5104 | Failure: internal error |
| 5105 | Failure: request timed out |
| 5106 | Failure: unknown request ID |
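
The status values listed in this section appear to follow a pattern: apart from 0 and 1000, codes whose hundreds digit is 0 or 2 are rejections (bad input), and codes whose hundreds digit is 1 or 3 are failures. The sketch below encodes that observation. It is a heuristic inferred from the tables, not a documented rule, and `classify_status` is a hypothetical helper.

```python
def classify_status(status: int) -> str:
    """Classify a data.status value using the pattern visible in the status tables."""
    if status == 0:
        return "success"
    if status == 1000:
        return "processing"
    # Observed pattern: hundreds digit 0 or 2 means "Rejected" (bad input),
    # hundreds digit 1 or 3 means "Failure" (the request could not be served).
    hundreds = (status // 100) % 10
    if hundreds in (0, 2):
        return "rejected"
    if hundreds in (1, 3):
        return "failure"
    return "unknown"
```

A client might fix its input on "rejected", retry later on "failure" values such as busy, and keep polling on "processing".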

# USSML Syntax Reference

# Introduction

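USSML (Unified Speech Synthesis Markup Language) aims to provide a unified SSML (Speech Synthesis Markup Language) syntax. USSML supports the most commonly used SSML tags: pauses, pronunciation specification, synthesis text substitution, and reading-style specification, which cover most speech synthesis needs.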

# Usage

# Syntax Format

The USSML syntax format is as follows:

<speak sttts:version="0.1">
    <break time="string" />
    <phoneme ph="string"></phoneme>
    <sub alias="string"></sub>
    <say-as interpret-as="string"></say-as>
</speak>

Apart from the <speak> tag, which wraps the other child tags, none of the remaining tags may be nested.

# Special Characters

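In USSML, the following special characters must be escaped when they appear in text, as shown in the table below. Characters that are part of the USSML markup itself do not need to be escaped.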

| Special character | Escape sequence |
| --- | --- |
| `&` | `&amp;` |
| `<` | `&lt;` |
| `>` | `&gt;` |
| `"` | `&quot;` |
| `'` | `&apos;` |
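
As an illustration of the escaping rules in the table, the sketch below uses `xml.sax.saxutils.escape` from the Python standard library, which handles `&`, `<`, and `>` by default and accepts extra mappings for the two quote characters. The helper name `escape_ussml_text` is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_ussml_text(text: str) -> str:
    """Escape the five special characters for use as text inside USSML tags."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


escaped = escape_ussml_text('Tom & Jerry say "1 < 2"')
# escaped == 'Tom &amp; Jerry say &quot;1 &lt; 2&quot;'
```

Only the text content should be passed through this function; the USSML tags themselves are added afterwards, unescaped.
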
# Tag Description
# <speak>

The <speak> tag is the root tag of USSML and wraps all other USSML tags.

The sttts:version attribute specifies the USSML version. The current USSML version is 0.1.

<speak sttts:version="0.1">
    <!-- USSML tags -->
</speak>
# <break>

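The <break> tag specifies a pause, and its time attribute sets the pause duration. The value of time is a string with a unit of seconds (s) or milliseconds (ms). The maximum pause duration is 5 seconds. A bare number is not accepted; the unit is required. For example: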

<break time="5s" />

<break time="5000ms" />
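
The rules for the time attribute (a unit is required, maximum 5 seconds) can be checked on the client before a request is submitted. The sketch below is one way to do so; `parse_break_time` is a hypothetical helper, and accepting decimal values such as "1.5s" is an assumption, since the source shows only whole numbers.

```python
import re

_BREAK_TIME = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")
MAX_BREAK_MS = 5000  # documented maximum pause of 5 seconds


def parse_break_time(value: str) -> float:
    """Return the pause length in ms, or raise ValueError if the value is not allowed."""
    match = _BREAK_TIME.match(value)
    if match is None:
        raise ValueError(f"break time needs a number followed by 's' or 'ms': {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    millis = number * 1000 if unit == "s" else number
    if millis > MAX_BREAK_MS:
        raise ValueError(f"break time {value!r} exceeds the 5 second maximum")
    return millis
```
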
# <phoneme>

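The <phoneme> tag specifies pronunciation, and its ph attribute holds the pronunciation. Because vendors differ in the languages they support, <phoneme> in USSML currently supports only Hanyu Pinyin. Pinyin usage: separate the pinyin of each character with a space, and make sure the number of pinyin syllables equals the number of characters. Each pinyin syllable consists of a pronunciation and a tone, where the tone is a digit from 1 to 5 and "5" denotes the neutral tone. For example: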

<phoneme ph="mai2 mo4">埋没</phoneme>
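
The pinyin rules above (one syllable per character, tone digit 1 to 5) lend themselves to a quick client-side check. The sketch below is illustrative only: `pinyin_problems` is a hypothetical helper, and the exact set of letters the platform accepts in a syllable (for example how "ü" is written) is an assumption.

```python
import re
from typing import List

# Assumption: a syllable is lowercase letters (optionally "ü") plus a tone digit 1-5.
_SYLLABLE = re.compile(r"^[a-zü]+[1-5]$")


def pinyin_problems(ph: str, text: str) -> List[str]:
    """Return rule violations for a <phoneme> ph value; an empty list means it looks valid."""
    syllables = ph.split()
    problems = []
    if len(syllables) != len(text):
        problems.append(f"{len(syllables)} syllables for {len(text)} characters")
    for syllable in syllables:
        if not _SYLLABLE.match(syllable):
            problems.append(f"bad syllable: {syllable}")
    return problems
```
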
# Example
<speak sttts:version="0.1">
    你说<phoneme ph="bo2">薄</phoneme><break time="500ms" />
    我说<phoneme ph="bao2">薄</phoneme>
</speak>
# <sub>

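The <sub> tag substitutes the text that is spoken during synthesis. Its alias attribute specifies the replacement text. During synthesis, the text in alias is spoken in place of the original text; if subtitles are produced, they show the original text inside the tag.

Neither the content of the <sub> tag nor the text of the alias attribute may be empty.

For example: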

<sub alias="World Wide Web Consortium">W3C</sub>

In the example above, the subtitle shows "W3C" and the spoken content is "World Wide Web Consortium".

# <say-as>

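The <say-as> tag lets you synthesize the tag content using a specific reading style. Its interpret-as attribute specifies the reading style. Different vendors may produce slightly different results for the same reading style. The interpret-as attribute supports several values, including:

| Value | Description | Example |
| --- | --- | --- |
| cardinal | Read as a cardinal number | "1487" is read as "一千四百八十七" |
| digit | Read digit by digit | "12345" is read as "一二三四五" |
| phone | Read in the usual telephone-number style | "1301001155" is read as "幺三零幺零零幺幺五五" |
| address | Read as an address | "市台路388-301号" is read as "市台路三八八杠三零幺号" |
| date | Read as a date | "1998-12-12" is read as "一九九八年十二月十二日" |
| clock | Read as a time of day | "12:00:12" is read as "十二点零分十二秒" |

# Example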
<say-as interpret-as="cardinal">12345</say-as>

Example 1:

<speak sttts:version="0.1">
    你说<phoneme ph="bo2">薄</phoneme><break time="500ms" />
    我说<phoneme ph="bao2">薄</phoneme>
</speak>

Example 2:

<speak sttts:version="0.1">
    <sub alias="World Wide Web Consortium">W3C</sub>
    是一个国际性的标准化组织。
</speak>

Example 3:

<speak sttts:version="0.1">
    <sub alias="青岛啤酒">TsingTao</sub>
    用河南话说就是,
    <phoneme ph="qing2 dao1 pi4 jiu1">青岛啤酒</phoneme><say-as interpret-as="cardinal">12345</say-as>
</speak>

# Streaming Voice Processing

In addition to HTTP-based voice processing, the platform also provides streaming processing capabilities, including ASR and TTS.

# Invocation Method

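The streaming interface uses the WebSocket protocol, and control messages are UTF-8 encoded JSON text. For WS interfaces, the token must be placed in the Authorization header field or appended to the URL; a token passed in the header takes priority. (For an example, see the FAQ.)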
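
The two token placements can be sketched as follows. The endpoint and the `Authorization=Bearer {token}` query form are taken from the ASR integration guide later in this document; percent-encoding the value and using the same `Bearer` prefix in the header are assumptions, and both function names are hypothetical.

```python
from urllib.parse import quote

WS_ENDPOINT = "ws://aigc.softsugar.com/api/voice/stream/v1"


def ws_url_with_token(token: str) -> str:
    """Append the token to the URL as an Authorization query parameter."""
    # Assumption: the value is percent-encoded, so the space becomes %20.
    return f"{WS_ENDPOINT}?Authorization={quote('Bearer ' + token)}"


def ws_auth_header(token: str) -> dict:
    """Alternative: pass the token in the Authorization header (takes priority)."""
    return {"Authorization": f"Bearer {token}"}


url = ws_url_with_token("YOUR_TOKEN")
```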

# Invocation Flow

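  1. Establish the WebSocket connection;
  2. Send the Starter packet, which carries the common configuration for subsequent requests;
  3. Receive the response, which indicates whether authentication succeeded or failed;
  4. Send a Data packet;
  5. Receive the corresponding result; steps 4 and 5 may be repeated;
  6. When there are no more tasks, simply disconnect (no disconnect message is defined).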
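
Steps 2 to 5 can be sketched independently of any particular WebSocket library by passing in `send` and `recv` callables. This is only an outline under stated assumptions: `run_session` is a hypothetical helper, Data packets are treated as text, and exactly one result is read per Data packet, whereas some capabilities may return several.

```python
import json
from typing import Callable, Iterable, Iterator


def run_session(send: Callable[[str], None], recv: Callable[[], str],
                starter: dict, data_packets: Iterable[str]) -> Iterator[dict]:
    """Drive steps 2-5: send Starter, check auth, then exchange Data and result packets."""
    send(json.dumps(starter, ensure_ascii=False))  # step 2: Starter packet
    auth = json.loads(recv())                      # step 3: authentication result
    if auth.get("status") != "ok":
        raise PermissionError(auth.get("error", "authentication failed"))
    for packet in data_packets:                    # steps 4 and 5, repeated
        send(packet)
        yield json.loads(recv())
```

With a real client, `send` and `recv` would wrap the connection's send and receive methods; step 6 is simply closing the connection once the generator is exhausted.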

# Invocation Limits

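  1. After the WebSocket connection is established, it is closed if no Starter packet is sent within 10 seconds;
  2. If no request is received for 60 seconds, the server closes the connection; sending Ping packets at regular intervals is recommended to keep it alive.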

# Request Message Format

# Starter

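The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are parsed. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Service type for the capability, for example "TTS" or "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate issues |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate issues |
| asr | ASR Config | object | Required when using ASR | ASR-specific configuration; see below |
| tts | TTS Config | object | Required when using TTS | TTS-specific configuration; see below |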

# Starter Message Example

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "ASR5",
  "device": "device-weye",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "asr": {
    "language": "zh-CN"
  }
}
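
A Starter message like the one above can be assembled programmatically. The sketch below is illustrative only: `build_starter` is a hypothetical helper, and it generates the Session ID on the client as the field table recommends. It covers only the fields listed in the table.

```python
import json
import uuid


def build_starter(service_type, config_key, config=None, device="", session=None):
    """Return the Starter packet as JSON text.

    service_type: value for "type", e.g. "ASR5" or "TTS".
    config_key:   "asr" or "tts", matching the capability in use.
    """
    starter = {
        "type": service_type,
        "device": device,
        # Generate a UUIDv4 locally so the session can be traced in client logs.
        "session": session or str(uuid.uuid4()),
        config_key: config or {},
    }
    # ensure_ascii=False keeps non-ASCII text readable; the wire format is UTF-8 JSON.
    return json.dumps(starter, ensure_ascii=False)


print(build_starter("ASR5", "asr", {"language": "zh-CN"}, device="device-weye"))
```
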

# Data

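After the Starter packet has been sent and the connection is successfully established, multiple Data packets can be sent. See the documentation of each capability for the Data packet format.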

# Response Message Format

# Authentication Result

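After the Starter request is sent, a message containing the authentication result is returned. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |

# Authentication Result Example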
{
  "service": "auth",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
# Result Data

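Depending on the capability, each request returns one or more result data packets. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current request |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| asr | ASR Content | object | No | Recognition result returned on success |
| nlp | NLP Content | object | No | Reply result returned on success |
| tts | TTS Content | object | No | Synthesis result returned on success |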

# ASR (Speech Recognition) Integration Guide

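This section describes how to call ASR through the central-control full-duplex WebSocket interface. The connection uses the WebSocket protocol, and control messages are JSON text encoded in UTF-8. For WS endpoints, place the token in the Authorization header field or append it to the URL; a token passed in the header takes precedence. (For an example, see the FAQ.)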

# Invocation Flow

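  1. Establish a WebSocket connection. The address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token}
  2. Send a Starter packet containing the common configuration for subsequent requests. If the format is invalid or the packet is not sent within 10 seconds, the WebSocket connection is closed.
  3. Receive a response indicating whether authentication succeeded or failed.
  4. Send binary Data packets containing PCM audio.
  5. After the audio has been sent, send an EOF packet. (Optional; if it is not sent, send at least 500 ms of additional ambient silence to help the VAD detect the end of speech.)
  6. Receive the corresponding ASR results as JSON text.
  7. If an EOF request packet was sent, receive an EOF result packet indicating that all recognition results have been delivered.
  8. When there are no more recognition tasks, simply disconnect (the protocol defines no disconnect message).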

# Request Message Format

# Starter

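The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are parsed. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Service type for the capability; for Chinese recognition use "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate issues |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate issues |
| asr | ASR Config | object | Required | ASR-specific configuration; see below |

The ASR Config fields are:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional. Language to recognize |
| mic_volume | Microphone Volume | float | 1.0 | Optional. Microphone volume, used by ASR for automatic gain; supported range 0 to 1 |
| subtitle | Subtitle Format | string | Empty string | Optional. Format of the returned subtitles; empty means no subtitles are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional. Maximum number of characters per subtitle line; 0 means no limit |
| intermediate | Return Intermediate Result | bool | false | Optional. Whether to return intermediate results |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional. Whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional. Whether to return word-level timestamps |
| pause_time_msec | Speech Pause Time (msec) | int | 500 | Optional. Pause duration used to detect speech boundaries and segments; defaults to 500 ms |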

# Data

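After the Starter packet has been sent and the connection is successfully established, audio is submitted as a stream of binary Data packets.

The input audio stream is PCM: 16 kHz sample rate, 16-bit samples, mono, little-endian. Audio can be converted to this format with sox -t raw -r 16000 -e signed -b 16 -c 1, or with ffmpeg -acodec pcm_s16le -ac 1 -ar 16000 -f s16le.

Send rate:

Send audio at the rate it is read from the microphone. Sending 1280 bytes every 40 ms, or 5120 bytes every 160 ms, is recommended.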
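
The recommended packet sizes follow directly from the audio format: 16000 samples per second times 2 bytes per sample times 1 channel gives 32000 bytes per second, so 40 ms is 1280 bytes and 160 ms is 5120 bytes. A minimal sketch of that arithmetic and of slicing a PCM buffer into send-sized chunks follows; `chunk_bytes` and `iter_chunks` are hypothetical helper names, and pacing the sends is left to the caller.

```python
def chunk_bytes(duration_ms, sample_rate=16000, bits=16, channels=1):
    """Number of PCM bytes covering `duration_ms` of audio."""
    return sample_rate * (bits // 8) * channels * duration_ms // 1000


def iter_chunks(pcm: bytes, duration_ms=40):
    """Yield consecutive slices of `pcm`, each covering at most `duration_ms`."""
    size = chunk_bytes(duration_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


print(chunk_bytes(40), chunk_bytes(160))
```

A sender would typically wait `duration_ms` between chunks so that audio arrives at roughly real-time rate, as the text above recommends.
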

# EOF

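After all audio packets have been sent, send an EOF packet to end recognition. The EOF packet is optional; if it is not sent, send at least 500 ms of additional ambient silence to help the VAD detect the end of speech.

To obtain subtitles, the EOF packet must be sent so that the ASR server generates them.

The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| signal | End Marker | string | Required | Always eof |
| trace | Trace ID | string | Random UUIDv4 | Optional. Callers are advised to generate and supply their own Trace ID, to help trace and locate issues |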

# Response Message Format

# Authentication Result

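After the Starter request is sent, a message containing the authentication result is returned. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |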

# ASR Result Data

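ASR continuously returns text recognition result packets and, after receiving the EOF request, returns the subtitle and subtitle file address packets. If the Starter request did not ask for subtitles or a subtitle address, only text recognition results are returned.

The returned message is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. asr |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current sentence |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| asr | ASR Content | object | No | Recognition result returned on success; fields are described below |

The recognition result itself is in ASR Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | Sequence number of the returned packet |
| type | Package Type | enum | Yes | text for a text result packet, intermediate for an intermediate result packet, subtitle for a subtitle packet, eof when everything has been sent |
| text | Text | string | Yes | Text recognition result; also present in subtitle packets, where it is empty |
| subtitle | Subtitle | string | No | Subtitle content; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp |
| word_times | Word-Level Timestamp | object[] | No | Word-level timestamps |

# Text Result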

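Each fully recognized sentence produces one message. Intermediate recognition results are not returned as text results, and nothing is returned when the speech cannot be recognized or the result contains only whitespace.

Text result example: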

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "c9cb36d8-3ca9-4e2b-9034-29f2c4edc3de",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "你好。"
    }
}
# Subtitle Result

When subtitles were requested and EOF has been sent, the generated subtitles are returned.

Subtitle result example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "9a971f17-f871-4b73-9084-b856b67537d5",
    "asr": {
        "index": 3,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
    }
}
# EOF

When the server has received an EOF request, it sends an EOF result packet to indicate that all results have been delivered.

EOF example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "16ff049a-41fb-4c7a-ac5e-b26dbc3218e5",
    "asr": {
        "index": 5,
        "type": "eof"
    }
}
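
On the client side, the result packets above can be handled by dispatching on asr.type. The following is a minimal sketch that consumes already-received JSON text frames; `collect_asr` is a hypothetical helper, and it deliberately ignores intermediate packets and stops at the first eof packet.

```python
import json


def collect_asr(messages):
    """Gather final sentences and the subtitle from an iterable of JSON text frames."""
    sentences, subtitle = [], None
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("service") != "asr":
            continue  # e.g. the authentication result
        if msg.get("status") == "fail":
            raise RuntimeError(msg.get("error", "ASR request failed"))
        asr = msg.get("asr", {})
        kind = asr.get("type")
        if kind == "text":
            sentences.append(asr["text"])
        elif kind == "subtitle":
            subtitle = asr.get("subtitle")
        elif kind == "eof":
            break  # all results have been delivered
        # "intermediate" packets are partial hypotheses and are skipped here.
    return sentences, subtitle
```

In a real client the `messages` iterable would be fed by the WebSocket receive loop; that part depends on the WebSocket library in use and is not shown.
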

# Practical Flow Example Analysis

# Case 1: Minimum Configuration Flow

Request: Starter

{
    "type": "ASR5",
    "asr": {}
}

Request: binary Data

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721"
}

Response: 2

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "e1c44bdc-4f9a-487c-806e-005679db7d0d",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "早知道你喜欢十里春光"
    }
}

Response: 3

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "f7551818-5025-4d83-b41b-136bb19b5b5f",
    "asr": {
        "index": 2,
        "type": "text",
        "text": "我一定会在麦田里种满玫瑰和山茶"
    }
}

Response: 4

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "89d3a8b4-a291-4cdc-9b78-d3f912d06223",
    "asr": {
        "index": 3,
        "type": "text",
        "text": "你路过这片土地才算浪漫"
    }
}

# Case 2: Full Configuration Flow

Request: Starter

{
    "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
    "type": "ASR5",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "asr": {
        "subtitle": "srt",
        "intermediate": true,
        "mic_volume": 0.67
    }
}

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}

Request: binary Data (omitted)

Request: EOF

{
    "signal": "eof",
    "trace": "52517513-875a-47b6-bd30-f11a75e26745"
}

Response: 2

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "dcf88fbe-6cda-452d-8f51-e316cb4a0943",
  "asr": {
    "index": 1,
    "type": "intermediate",
    "text": "介"
  }
}

Response: 3

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "879d4700-746f-411b-954c-f83a2c6cd300",
  "asr": {
    "index": 2,
    "type": "intermediate",
    "text": "介绍下长"
  }
}

Response: 4

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "20e6a5e5-6d5d-4284-9013-4d410f1a5d37",
  "asr": {
    "index": 3,
    "type": "intermediate",
    "text": "介绍下长宁图书"
  }
}

Response: 5

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "cf7733e2-da28-442d-b5bb-282fc8f352f3",
  "asr": {
    "index": 4,
    "type": "text",
    "text": "介绍一下长宁图书馆。",
    "sentence_time": {
      "begin_ms": 2080,
      "end_ms": 4640
    },
    "word_times": [
      {
        "begin_ms": 2080,
        "end_ms": 2560,
        "text": "介"
      },
      {
        "begin_ms": 2560,
        "end_ms": 2800,
        "text": "绍"
      },
      {
        "begin_ms": 2800,
        "end_ms": 2920,
        "text": "一"
      },
      {
        "begin_ms": 2920,
        "end_ms": 3040,
        "text": "下"
      },
      {
        "begin_ms": 3040,
        "end_ms": 3280,
        "text": "长"
      },
      {
        "begin_ms": 3280,
        "end_ms": 3480,
        "text": "宁"
      },
      {
        "begin_ms": 3480,
        "end_ms": 3640,
        "text": "图"
      },
      {
        "begin_ms": 3640,
        "end_ms": 3880,
        "text": "书"
      },
      {
        "begin_ms": 3880,
        "end_ms": 4640,
        "text": "馆"
      }
    ]
  }
}

Response: 6

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
    "asr": {
        "index": 5,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:02,280\n介绍一下长宁图书馆\n\n"
    }
}

Response: 7

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "2bd4cbce-0f72-402c-8e88-0f2704a22868",
    "asr": {
        "index": 6,
        "type": "eof"
    }
}

# TTS (Speech Synthesis) Integration Guide (QID)

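This section describes how to call TTS through the central-control full-duplex WebSocket interface. The connection uses the WebSocket protocol, and all messages are JSON text encoded in UTF-8.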

# Invocation Flow

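  1. Establish a WebSocket connection. The address is typically ws://aigc.softsugar.com/api/voice/stream/v3?Authorization=Bearer {token}
  2. Send a Starter packet containing the common configuration for subsequent TTS requests. If the format is invalid or the packet is not sent within 10 seconds, the WebSocket connection is closed.
  3. Receive a response indicating whether authentication succeeded or failed.
  4. Send a Task packet containing the text to synthesize and its format information.
  5. Receive the data packets for that Task.
  6. When there are no more synthesis tasks, simply disconnect (the protocol defines no disconnect message).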

# Request Message Format

# Starter

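The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are parsed. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Only TTS is supported |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate issues |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate issues |
| tts | TTS Config | object | Required | TTS-specific configuration; see below |

The TTS Config fields are:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| qid | Qid | string | - | Required |
| pitch_offset | Pitch Offset | float | 0.0 | Optional. Pitch; higher values sound sharper and lower values sound deeper; supported range [-10, 10] |
| speed_ratio | Speed Ratio | float | 1.0 | Optional. Speech rate; larger values give slower speech; supported range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional. Sample rate; supported: 8000, 16000, 22050, 24000, 44100 |
| volume | Volume | int | 100 | Optional. Volume; larger values are louder; supported range [1, 400] |
| format | File Format | string | pcm | Optional. Audio file format; depending on the selected voice, pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional. Whether to omit error messages; by default they are returned |
| polyphone | Return Polyphone | bool | false | Optional. Whether to return the polyphonic characters found in query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional. Format of the returned subtitles; empty means no subtitles are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional. Maximum number of characters per subtitle line or sentence-level timestamp; 0 means no limit; effective only when subtitles or sentence-level timestamps are returned |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional. Whether to break subtitles and sentence-level timestamps at punctuation and remove the punctuation; effective only when subtitles or sentence-level timestamps are returned |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional. Whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional. Whether to return word-level timestamps |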

# Task

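After the Starter packet has been sent and the connection is successfully established, multiple Task packets can be sent to submit synthesis tasks. A Task packet is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional. Callers are advised to generate and supply it; used to tell apart the responses to concurrent requests |
| query | Query | string | Required | Text content to synthesize |
| ssml | Use SSML | bool | false | Optional. Whether the text is marked up with SSML; see the ONES usage documentation for the syntax |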

# Response Message Format

# Authentication Result

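After the Starter request is sent, a message containing the authentication result is returned. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |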

# TTS Result Data

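Each successful Task returns multiple packets: audio, subtitle, timestamp, and polyphone packets. Packets of the same type are returned in logical order; the relative order of packets of different types is not guaranteed.

The returned message is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. tts |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current Task |
| status | Status Name | enum | Yes | Status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| tts | TTS Content | object | No | Synthesis result returned on success; fields are described below |

The synthesis result itself is in TTS Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the returned audio and phoneme packets |
| type | Package Type | enum | Yes | audio for an audio packet, subtitle for a subtitle packet, polyphone for a polyphone packet, timestamp for a timestamp packet, eof when everything has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; present only in audio packets |
| polyphones | Polyphone Data | object[] | No | Polyphone data; present only in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp; present only in timestamp packets |
| word_times | Word-Level Timestamp | object[] | No | Word-level timestamps; present only in timestamp packets |

# Audio Package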

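Contains the synthesized audio data, Base64-encoded.

When the requested audio format is pcm, the audio is streamed back in multiple packets; other formats are returned in a single packet after synthesis completes.

Audio packet example: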

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
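
For pcm output, a client has to Base64-decode each audio packet and concatenate the chunks. The sketch below does this in arrival order (the text above states that packets of the same type arrive in logical order) and wraps the result in a WAV container using the standard library. `pcm_to_wav` is a hypothetical helper; 16-bit mono samples are an assumption, since this section specifies only the sample rate.

```python
import base64
import io
import json
import wave


def pcm_to_wav(messages, sample_rate=16000):
    """Join the audio packets of a TTS response stream into WAV bytes."""
    chunks = []
    for raw in messages:
        tts = json.loads(raw).get("tts", {})
        if tts.get("type") == "audio":
            chunks.append(base64.b64decode(tts["audio_data"]))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # assumption: mono
        wav.setsampwidth(2)       # assumption: 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()
```

The `sample_rate` argument should match the sample_rate requested in the Starter packet.
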
# Subtitles

Contains the synthesized subtitle data, Base64-encoded.

Subtitle packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
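
The subtitle_data value in the sample above is ordinary Base64 over SRT text. A minimal sketch of decoding it and splitting it into cues follows; `decode_subtitle` and `parse_srt` are hypothetical helpers, and UTF-8 is assumed as the text encoding (the sample decodes cleanly as UTF-8).

```python
import base64


def decode_subtitle(subtitle_data: str) -> str:
    """Base64-decode subtitle_data into SRT text (UTF-8 assumed)."""
    return base64.b64decode(subtitle_data).decode("utf-8")


def parse_srt(srt: str):
    """Split SRT text into (start, end, text) tuples."""
    cues = []
    for block in srt.strip().split("\n\n"):
        lines = block.split("\n")
        # lines[0] is the cue number, lines[1] the time range, the rest is text.
        start, end = lines[1].split(" --> ")
        cues.append((start, end, "\n".join(lines[2:])))
    return cues


sample = "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
print(parse_srt(decode_subtitle(sample)))
# prints [('00:00:00,000', '00:00:00,528', '你好。')]
```
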
# Timestamp Package

Contains sentence-level and word-level timestamp information.

Timestamp example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
# Polyphonic Character Package

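Contains polyphonic character information, with the recommended pronunciation first and the other pronunciations after it.

Polyphone example: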

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}
# EOF

The EOF result packet indicates that all results have been delivered.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}

# Practical Flow Example Analysis

# Case 1: Minimum Configuration Flow

Request: Starter

{
  "type": "TTS",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Full Configuration Flow

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "qid": "8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 3, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 4 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 5 Subtitle

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 5,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 6 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}

# LLM (Large Language Model) Integration Guide

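This section describes how to call the LLM through the central-control full-duplex WebSocket interface. The connection uses the WebSocket protocol, and control messages are JSON text encoded in UTF-8. For WS endpoints, place the token in the Authorization header field or append it to the URL; a token passed in the header takes precedence. (For an example, see the FAQ.)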

# Invocation Flow

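  1. Establish a WebSocket connection. The address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token}
  2. Send a Starter packet containing the common configuration for subsequent requests. If the format is invalid or the packet is not sent within 10 seconds, the WebSocket connection is closed.
  3. Receive a response indicating whether authentication succeeded or failed.
  4. Send a Query text packet containing a single conversation turn.
  5. Receive the corresponding reply as JSON.
  6. When there are no more conversation turns, simply disconnect (the protocol defines no disconnect message).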

# Request Message Format

# Starter

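The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are parsed. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Service engine ID for the capability, for example "NLP7". Available engines: NLP7 (SenseChat), NLP10 (SenseTime anthropomorphic LLM) |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate issues |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate issues |
| nlp | NLP Config | object | Required | NLP-specific configuration; see below |

The NLP Config fields are:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| omit_error | Omit Error Message in Response | bool | false | Optional. Whether to omit error messages; by default they are returned |
| know_ids | Knowledge IDs | string list | Empty list | Optional. List of knowledge base IDs; supported only by some LLM engines |
| prompt_header | System Role of Prompt | string | Empty string | Optional. Background description for the prompt; when empty, the preset value from the configuration is used; supported only by some LLM engines. NLP10 (anthropomorphic LLM) has a dedicated JSON definition that must be followed |
| max_reply_token | Max Tokens in Reply | int | 500 | Optional. Maximum number of tokens in the reply; the actual maximum depends on the model; supported only by some LLM engines |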

# Query

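After the Starter packet has been sent and the connection is successfully established, multiple Query text packets can be sent to submit user questions. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| id | Trace ID | string | Random UUIDv4 | Optional. Callers are advised to generate and supply it; used to tell apart the responses to concurrent requests |
| query | Query | string | Required | Text content of the user question |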

# Response Message Format

# Authentication Result

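After the Starter request is sent, a message containing the authentication result is returned. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |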

# NLP Result Data

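Each successfully processed Query returns one message; when omit_error is false, error messages are returned as well. The basic format of the returned message is:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. nlp |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current sentence |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| nlp | NLP Content | object | No | Reply result returned on success; fields are described below |

The reply itself is in NLP Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | Sequence number of the returned packet |
| query | Query | string | Yes | The submitted question text |
| answer | Answer | string | Yes | Text to be spoken; the digital human front end calls TTS to speak it |
| text | Text | string | No | Text to be displayed; the digital human front end shows it on screen |
| finish_reason | Finish Reason | string | Yes | Why generation stopped. One of: stop (an end token was reached), length (the maximum generation length was reached), sensitive (a sensitive word was triggered), context (the model context length limit was reached; to receive the rest, send "请继续" in the next query) |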

# Anthropomorphic LLM (NLP10) Parameter Definition

# Data Definition

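| Name | Type | Required | Default | Allowed Values | Description |
| --- | --- | --- | --- | --- | --- |
| name | string | | - | - | Character name; at most 50 Unicode characters |
| gender | string | | - | - | Character gender; at most 50 Unicode characters |
| identity | string | | - | - | Character identity; at most 200 Unicode characters |
| nickname | string | | - | - | Character alias; at most 50 Unicode characters |
| feeling_toward | object[] | | - | - | Affinity settings |
| detail_setting | string | | - | - | Detailed settings; at most 500 Unicode characters |
| other_setting | json string | | - | - | Other settings; at most 3000 Unicode characters |

Definition of feeling_toward:

| Name | Type | Required | Default | Allowed Values | Description |
| --- | --- | --- | --- | --- | --- |
| name | string | | - | - | Character name; must be a name already defined in character_settings |
| level | int | | - | [1,3] | Affinity toward that character; a larger number means higher affinity |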
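
Because prompt_header for NLP10 is a JSON array serialized into a string, it is convenient to build it from plain dictionaries and check the documented limits before sending. The sketch below is an illustration under assumptions: `build_prompt_header` is a hypothetical helper, and it interprets "already defined in character_settings" as "the name of some character in the same array".

```python
import json

# Maximum lengths in Unicode characters, taken from the table above.
LIMITS = {
    "name": 50, "gender": 50, "identity": 200,
    "nickname": 50, "detail_setting": 500, "other_setting": 3000,
}


def build_prompt_header(characters):
    """Validate a list of character dicts and serialize it to a JSON string."""
    known_names = {c["name"] for c in characters}
    for character in characters:
        for field, limit in LIMITS.items():
            if len(character.get(field, "")) > limit:
                raise ValueError(f"{field} exceeds {limit} characters")
        for feeling in character.get("feeling_toward", []):
            if feeling["name"] not in known_names:
                raise ValueError(f"unknown character: {feeling['name']!r}")
            if not 1 <= feeling["level"] <= 3:
                raise ValueError("level must be in [1, 3]")
    # ensure_ascii=False keeps Chinese text as-is instead of \u escapes.
    return json.dumps(characters, ensure_ascii=False, separators=(",", ":"))
```

The returned string would be placed in the prompt_header field of the NLP Config; serializing the whole Starter with `json.dumps` then produces the escaped form shown in the reference example.
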

Reference example:

"prompt_header": "[{\"name\":\"周梓柔\",\"gender\":\"女\",\"identity\":\"我一直信赖的姐姐\",\"nickname\":\"\",\"feeling_toward\":[{\"name\":\"弟弟\",\"level\":3}],\"detail_setting\":\"周梓柔具有卓越成就感,学业和职场表现都极为杰出,从小就是个学霸,经常被长辈提及为榜样。外表冷艳,给人以远离尘嚣的印象,令人印象深刻的气质既独立又自信。对外或许保持距离,但在我面前总是展现出无限的温柔与包容,耐心倾听我的烦恼,用细腻的关怀化解我的困惑。MBTI人格是ENTJ。\",\"other_setting\":\"\"},{\"name\":\"弟弟\",\"gender\":\"男\",\"identity\":\"弟弟\",\"nickname\":\"\",\"detail_setting\":\"周梓柔总是在弟弟面前展现出无限的温柔与包容,耐心倾听我的烦恼,用细腻的关怀化解我的困惑。\",\"other_setting\":\"\"}]"

# TTS (Speech Synthesis) Integration Guide (Legacy)

Note: This API is currently in maintenance mode and will not receive new feature updates.

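This section describes how to call TTS through the central-control full-duplex WebSocket interface. The connection uses the WebSocket protocol, and all messages are JSON text encoded in UTF-8.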

# Invocation Flow

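  1. Establish a WebSocket connection. The address is typically ws://aigc.softsugar.com/api/voice/stream/v1?Authorization={Token}
  2. Send a Starter packet containing the common configuration for subsequent TTS requests. If the format is invalid or the packet is not sent within 10 seconds, the WebSocket connection is closed.
  3. Receive a response indicating whether authentication succeeded or failed.
  4. Send a Task packet containing the text to synthesize and its format information.
  5. Receive the data packets for that Task.
  6. When there are no more synthesis tasks, simply disconnect (the protocol defines no disconnect message).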

# Request Message Format

# Starter

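The Starter is the first packet sent after each connection is established. It states the purpose of the connection and how subsequent packets are parsed. The format is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| auth | AuthN Token | string | Empty string | Device authentication token; required if the server has authentication enabled |
| type | Workflow Type | string | Required | Service engine ID for the capability, for example "TTS3" |
| device | Device ID | string | Empty string | Device ID; recommended, to help trace and locate issues |
| session | Session ID | string | Random UUIDv4 | Callers are advised to generate and supply their own Session ID, to help trace and locate issues |
| tts | TTS Config | object | Required | TTS-specific configuration; see below |

The TTS Config fields are:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional. Language to synthesize; must be supported by the speaker |
| voice | Voice ID | string | Depends on the service engine | Optional. Speaker to use |
| pitch_offset | Pitch Offset | float | 0.0 | Optional. Pitch; higher values sound sharper and lower values sound deeper; supported range [-10, 10] |
| style | Style | string | | Optional. Emotion of the speaker |
| speed_ratio | Speed Ratio | float | 1.0 | Optional. Speech rate; larger values give slower speech; supported range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional. Sample rate; supported: 8000, 16000, 22050, 24000, 44100 |
| volume | Volume | int | 100 | Optional. Volume; larger values are louder; supported range [1, 400] |
| format | File Format | string | pcm | Optional. Audio file format; depending on the selected voice, pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional. Whether to omit error messages; by default they are returned |
| polyphone | Return Polyphone | bool | false | Optional. Whether to return the polyphonic characters found in query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional. Format of the returned subtitles; empty means no subtitles are returned; supported: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional. Maximum number of characters per subtitle line or sentence-level timestamp; 0 means no limit; effective only when subtitles or sentence-level timestamps are returned |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional. Whether to break subtitles and sentence-level timestamps at punctuation and remove the punctuation; effective only when subtitles or sentence-level timestamps are returned |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional. Whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional. Whether to return word-level timestamps |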

# Task

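After the Starter packet has been sent and the connection is successfully established, multiple Task packets can be sent to submit synthesis tasks. A Task packet is JSON text with the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional. Callers are advised to generate and supply it; used to tell apart the responses to concurrent requests |
| query | Query | string | Required | Text content to synthesize |
| ssml | Use SSML | bool | false | Optional. Whether the text is marked up with SSML; see the ONES usage documentation for the syntax |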

# Response Message Format

# Authentication Result

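After the Starter request is sent, a message containing the authentication result is returned. The format is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. auth |
| session | Session ID | string | Yes | Session ID of the current connection |
| status | Status Name | enum | Yes | Status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |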

# TTS Result Data

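Each successful Task returns multiple packets: audio, subtitle, timestamp, and polyphone packets. Packets of the same type are returned in logical order; the relative order of packets of different types is not guaranteed.

The returned message is JSON text with the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | Service module for the current request, i.e. tts |
| session | Session ID | string | Yes | Session ID of the current connection |
| trace | Trace ID | string | Yes | Trace ID of the current Task |
| status | Status Name | enum | Yes | Status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | Error message returned on failure |
| tts | TTS Content | object | No | Synthesis result returned on success; fields are described below |

The synthesis result itself is in TTS Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the returned audio and phoneme packets |
| type | Package Type | enum | Yes | audio for an audio packet, polyphone for a polyphone packet, timestamp for a timestamp packet, eof when everything has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; present only in audio packets |
| polyphones | Polyphone Data | object[] | No | Polyphone data; present only in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; present only in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp; present only in timestamp packets |
| word_times | Word-Level Timestamp | object[] | No | Word-level timestamps; present only in timestamp packets |

# Audio Package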

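Contains the synthesized audio data, Base64-encoded.

When the requested audio format is pcm, the audio is streamed back in multiple packets; other formats are returned in a single packet after synthesis completes.

Audio packet example: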

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
# Subtitles

Contains the synthesized subtitle data, Base64-encoded.

Subtitle packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
# Timestamp Package

Contains sentence-level and word-level timestamp information.

Timestamp example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
# Polyphonic Character Package

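Contains polyphonic character information, with the recommended pronunciation first and the other pronunciations after it.

Polyphone example: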

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}
# EOF

The EOF result packet indicates that all results have been delivered.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}

# Practical Flow Example Analysis

# Case 1: Minimum Configuration Flow

Request: Starter

{
  "type": "TTS3",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Full Configuration Flow

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS3",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "language": "zh-CN",
    "voice": "xiaoling",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 2, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 4 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 3,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 5 Subtitle

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 6 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}

# Create TTS Personal Voice Model Generation Task (QID)

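# API Description

The TTS personal voice model generation (QID) service takes voice material collected or recorded from a real person, together with a voice cloning consent file, and trains a digital human TTS voice model whose speech matches that of the voice provider. To ensure training quality, follow the SenseTime digital human voice cloning collection and production specification when recording; it covers environment, equipment, pronunciation, and authorization requirements as well as the reading script. For details, see the Collection Specification. The PaaS platform keeps generated content online for 7 days; transfer it to your own storage in time, because it can no longer be downloaded after 7 days.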

# Request URL

POST /api/2dvh/v1/material/voice/clone/qid/create

# Request Headers

Content-Type: application/json

# Request Parameters

Field Type Required Description
audioUrl String True Training audio file URL. Supported formats: wav, mp3, m4a, mp4, mov, aac
audioLanguage String True Primary language used in the audio file, as a BCP 47 tag: zh-CN (Mandarin Chinese) or en-US (American English)
consent Object True User consent declaration information
consent.audioUrl String True URL of the user consent audio file. Record the consent file in the same environment and in the same language as the training audio file. Supported formats: wav, mp3, m4a, mp4, mov, aac. The consent statement to read aloud, by language:
Chinese: "我(发音人姓名)确认我的声音将会被(公司名称)使用于创建合成版本语音。"
English: "I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice."
Japanese: "私(姓名を記入)は自身の音声を(会社名を記入)が使用し、合成音声を作り使用されることに同意します。"
Korean: "나는 [본인의 이름을 말씀하세요] 내 목소리의 녹음을 이용해 합성 버전을 만들어 사용된다는 것을 [회사 이름을 말씀하세요]알고 있습니다."
consent.speakerName String True Speaker name used in the user consent audio file; must match the speaker name spoken in the audio file. Maximum 64 characters
consent.companyName String True Company name used in the user consent file; must match the company name spoken in the audio file. Maximum 64 characters
taskType String True Training algorithm type: TTS3, TTS6, TTS7, TTS8, or TTS101. Use TTS3 by default. For other requirements, contact technical support
voice Object True Speaker information
voice.name String True Speaker name. Maximum 64 characters
voice.gender Integer True Speaker gender (1: Male, 2: Female)
musicSep Boolean False Whether to perform audio background music removal (source separation)
trainMode String False Training mode; only effective for TTS3. common: standard training mode (default); backend_only: fast training mode, which greatly shortens training time at some cost to quality

# Request Example

{
  "audioUrl": "http://oss.com/abc/object.mp3",
  "audioLanguage": "zh-CN",
  "consent": {
      "audioUrl":"http://oss.com/abc/xx.mp3",
      "speakerName": "xiaowang",
      "companyName": "XXXX"
  },
  "taskType": "TTS3",
  "voice": {
    "name": "xiaotang0",
    "gender": 2
  },
  "musicSep": false,
  "trainMode": "common"
}
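
The documented constraints on this request (allowed `taskType` values, gender codes, 64-character name limits) can be checked on the client before the call is made. The following is a minimal sketch that validates the payload and builds the POST request without sending it. The HTTPS scheme and the `Authorization` header name are assumptions: this document only states that the token goes in the request header. `build_qid_request` is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "https://aigc.softsugar.com"  # assumption: HTTPS on the documented endpoint
QID_CREATE_PATH = "/api/2dvh/v1/material/voice/clone/qid/create"
TASK_TYPES = {"TTS3", "TTS6", "TTS7", "TTS8", "TTS101"}


def build_qid_request(payload: dict, token: str) -> urllib.request.Request:
    """Check the documented constraints and build (but do not send) the POST."""
    if payload["taskType"] not in TASK_TYPES:
        raise ValueError(f"unsupported taskType: {payload['taskType']}")
    if payload["voice"]["gender"] not in (1, 2):
        raise ValueError("voice.gender must be 1 (male) or 2 (female)")
    limited = {
        "consent.speakerName": payload["consent"]["speakerName"],
        "consent.companyName": payload["consent"]["companyName"],
        "voice.name": payload["voice"]["name"],
    }
    for field, value in limited.items():
        if len(value) > 64:
            raise ValueError(f"{field} must be at most 64 characters")
    return urllib.request.Request(
        BASE_URL + QID_CREATE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": token,  # hypothetical header name for the token
        },
        method="POST",
    )


# The request example above, as a Python dict.
payload = {
    "audioUrl": "http://oss.com/abc/object.mp3",
    "audioLanguage": "zh-CN",
    "consent": {
        "audioUrl": "http://oss.com/abc/xx.mp3",
        "speakerName": "xiaowang",
        "companyName": "XXXX",
    },
    "taskType": "TTS3",
    "voice": {"name": "xiaotang0", "gender": 2},
    "musicSep": False,
    "trainMode": "common",
}
request = build_qid_request(payload, token="YOUR_TOKEN")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs offline.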

# Response Elements

Field Type Required Description
code Integer True 0 - Success, any other value - error
message String True Error details
data Object False Task ID

# Response Example

{
    "code": 0,
    "message": "success",
    "data": 11890
}
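
A caller typically needs only the task ID from this response. The following is a minimal sketch of reading it, assuming the response body is the JSON text shown above. The table lists `data` as an Object while the example carries a bare integer, so the sketch follows the example. `extract_task_id` is a hypothetical helper name.

```python
import json


def extract_task_id(body: str) -> int:
    """Return the task ID from a create-task response, or raise on a non-zero code."""
    response = json.loads(body)
    if response["code"] != 0:
        raise RuntimeError(
            f"task creation failed ({response['code']}): {response['message']}"
        )
    return response["data"]


task_id = extract_task_id('{"code": 0, "message": "success", "data": 11890}')
print(task_id)
```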

# TTS Voice Training Audio Duration Requirements

Training algorithm type Duration requirement
TTS3 At least 5 minutes; 20+ minutes for better results
TTS6 30-90 seconds
TTS7 30-300 seconds
TTS8 30-300 seconds
TTS101 At least 5 minutes; 20+ minutes for better results
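
The duration requirements above can be encoded as a lookup table and checked before a training task is submitted. The following is a minimal sketch, assuming the ranges are inclusive and that the audio duration in seconds is already known to the caller. `DURATION_LIMITS` and `check_duration` are hypothetical names.

```python
# task type -> (minimum seconds, maximum seconds or None when no upper bound is documented)
DURATION_LIMITS = {
    "TTS3": (300, None),
    "TTS6": (30, 90),
    "TTS7": (30, 300),
    "TTS8": (30, 300),
    "TTS101": (300, None),
}


def check_duration(task_type: str, seconds: float) -> bool:
    """Return True if `seconds` of audio satisfies the requirement for `task_type`."""
    low, high = DURATION_LIMITS[task_type]
    return seconds >= low and (high is None or seconds <= high)


print(check_duration("TTS3", 20 * 60))
print(check_duration("TTS6", 120))
```

The recommendation of 20 or more minutes for TTS3 and TTS101 is advisory, so the check enforces only the 5-minute minimum.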
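
# TTS Language Standards (BCP 47)

Code Language (Region)
en-US English (United States)
zh-CN Chinese (China)
af-ZA Afrikaans (South Africa)
am-ET Amharic (Ethiopia)
ar-EG Arabic (Egypt)
ar-SA Arabic (Saudi Arabia)
az-AZ Azerbaijani (Azerbaijan)
bg-BG Bulgarian (Bulgaria)
bn-BD Bengali (Bangladesh)
bn-IN Bengali (India)
bs-BA Bosnian (Bosnia and Herzegovina)
ca-ES Catalan (Spain)
cs-CZ Czech (Czechia)
cy-GB Welsh (United Kingdom)
da-DK Danish (Denmark)
de-AT German (Austria)
de-CH German (Switzerland)
de-DE German (Germany)
el-GR Greek (Greece)
en-AU English (Australia)
en-CA English (Canada)
en-GB English (United Kingdom)
en-IE English (Ireland)
en-IN English (India)
es-ES Spanish (Spain)
es-MX Spanish (Mexico)
et-EE Estonian (Estonia)
eu-ES Basque (Spain)
fa-IR Persian (Iran)
fi-FI Finnish (Finland)
fil-PH Filipino (Philippines)
fr-BE French (Belgium)
fr-CA French (Canada)
fr-CH French (Switzerland)
fr-FR French (France)
ga-IE Irish (Ireland)
gl-ES Galician (Spain)
he-IL Hebrew (Israel)
hi-IN Hindi (India)
hr-HR Croatian (Croatia)
hu-HU Hungarian (Hungary)
hy-AM Armenian (Armenia)
id-ID Indonesian (Indonesia)
is-IS Icelandic (Iceland)
it-IT Italian (Italy)
ja-JP Japanese (Japan)
jv-ID Javanese (Indonesia)
ka-GE Georgian (Georgia)
kk-KZ Kazakh (Kazakhstan)
km-KH Khmer (Cambodia)
kn-IN Kannada (India)
ko-KR Korean (South Korea)
lo-LA Lao (Laos)
lt-LT Lithuanian (Lithuania)
lv-LV Latvian (Latvia)
mk-MK Macedonian (North Macedonia)
ml-IN Malayalam (India)
mn-MN Mongolian (Mongolia)
ms-MY Malay (Malaysia)
mt-MT Maltese (Malta)
my-MM Burmese (Myanmar)
nb-NO Norwegian Bokmål (Norway)
ne-NP Nepali (Nepal)
nl-BE Dutch (Belgium)
nl-NL Dutch (Netherlands)
pl-PL Polish (Poland)
ps-AF Pashto (Afghanistan)
pt-BR Portuguese (Brazil)
pt-PT Portuguese (Portugal)
ro-RO Romanian (Romania)
ru-RU Russian (Russia)
si-LK Sinhala (Sri Lanka)
sk-SK Slovak (Slovakia)
sl-SI Slovenian (Slovenia)
so-SO Somali (Somalia)
sq-AL Albanian (Albania)
sr-RS Serbian (Serbia)
su-ID Sundanese (Indonesia)
sv-SE Swedish (Sweden)
sw-KE Swahili (Kenya)
ta-IN Tamil (India)
te-IN Telugu (India)
th-TH Thai (Thailand)
tr-TR Turkish (Turkey)
uk-UA Ukrainian (Ukraine)
ur-PK Urdu (Pakistan)
uz-UZ Uzbek (Uzbekistan)
vi-VN Vietnamese (Vietnam)
zh-HK Chinese (Hong Kong)
zh-TW Chinese (Taiwan)
zu-ZA Zulu (South Africa)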

# Create TTS Personal Voice Model Generation Task

# API Description

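The TTS Personal Voice Model Generation service trains a digital human TTS voice model whose pronunciation matches that of the voice provider, using voice material captured or recorded from a real person and uploaded by the user. To ensure training quality, the training audio must be at least 5 minutes long. Follow the SenseTime Digital Human Voice Cloning Recording Specification when recording; it covers environment requirements, equipment requirements, pronunciation requirements, authorization requirements, and the reading script. For details, see: Recording Specification. The PaaS platform stores generated content online for 7 days. Transfer it to your own storage promptly, because it can no longer be downloaded after 7 days.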

# Request URL

POST /api/2dvh/v1/material/voice/clone/create

# Request Headers

Content-Type: application/json

# Request Parameters

Field Type Required Description
url String True Training audio file URL, duration must be at least 5 minutes
voice Object True Voice parameters
voice.name String True Speaker name
voice.gender Integer True Speaker gender (1: Male, 2: Female)
voice.language String True Speaker language (currently only supports zh-CN: Mandarin Chinese)
musicSep Boolean False Whether to perform audio background music removal
sampleAudioMsg String False Sample audio content text. No sample audio generated by default. Maximum 500 characters.
trainMode String False Training mode. common: standard training mode (default); backend_only: fast training mode, which greatly shortens training time at some cost to quality.

# Request Example

{
  "url": "http://oss.com/abc/object.zip",
  "voice": {
    "name": "xiaotang0",
    "gender": 2,
    "language": "zh-CN"
  },
  "sampleAudioMsg": "我是商汤数字人!",
  "musicSep": true,
  "trainMode": "common"
}
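
The parameter table for this endpoint implies several checks that can run before the request is sent. The following is a minimal sketch that collects every problem instead of stopping at the first one. `validate_clone_payload` is a hypothetical helper name, and the 5-minute minimum duration is not checked because it cannot be derived from the payload alone.

```python
def validate_clone_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks all pass."""
    problems = []
    voice = payload.get("voice") or {}
    if not payload.get("url"):
        problems.append("url is required")
    if not voice.get("name"):
        problems.append("voice.name is required")
    if voice.get("gender") not in (1, 2):
        problems.append("voice.gender must be 1 (male) or 2 (female)")
    if voice.get("language") != "zh-CN":
        problems.append("voice.language must be zh-CN")
    if len(payload.get("sampleAudioMsg", "")) > 500:
        problems.append("sampleAudioMsg must be at most 500 characters")
    if payload.get("trainMode", "common") not in ("common", "backend_only"):
        problems.append("trainMode must be common or backend_only")
    return problems


# The request example above, as a Python dict.
example = {
    "url": "http://oss.com/abc/object.zip",
    "voice": {"name": "xiaotang0", "gender": 2, "language": "zh-CN"},
    "sampleAudioMsg": "我是商汤数字人!",
    "musicSep": True,
    "trainMode": "common",
}
print(validate_clone_payload(example))
```

Returning a list rather than raising lets a caller report all invalid fields to the user in one pass.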

# Response Elements

Field Type Required Description
code Integer True 0 - Success, any other value - error
message String True Error details
data Object False Task ID

# Response Example

{
    "code": 0,
    "message": "success",
    "data": 11890
}

The above covers all voice processing capabilities provided by the platform.

Last Updated: 4/10/2026, 3:13:22 PM