# Streaming Voice Processing

In addition to the HTTP voice processing interfaces, the platform offers streaming capabilities, including ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).

# Invocation Method

The interface uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. Pass the token either in the Authorization header field or concatenated in the URL; the token in the header takes precedence. (Example: see the FAQ.)

# Calling Process

  1. Establish a WebSocket connection;
  2. Send a Starter package, which contains generic configuration information for subsequent requests;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Data packets;
  5. Receive the corresponding results; steps 4 and 5 can be repeated;
  6. If there are no more tasks, simply disconnect (no dedicated disconnect message is defined);
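The control messages in steps 2 and 3 are plain JSON, so they can be built and checked independently of the transport. A minimal Python sketch (field names follow the tables in this document; the helper names are our own, and no WebSocket I/O is shown):

```python
import json
import uuid

def make_starter(workflow, device="", session=None, asr=None, tts=None):
    """Build the Starter package (step 2) as UTF-8 JSON text."""
    msg = {"type": workflow, "device": device,
           "session": session or str(uuid.uuid4())}
    if asr is not None:
        msg["asr"] = asr          # required when the workflow includes ASR
    if tts is not None:
        msg["tts"] = tts          # required when the workflow includes TTS
    return json.dumps(msg, ensure_ascii=False)

def auth_ok(reply_text):
    """Check the authentication result (step 3)."""
    reply = json.loads(reply_text)
    return reply.get("service") == "auth" and reply.get("status") == "ok"
```

A caller would send `make_starter(...)` as the first text frame and feed the first reply to `auth_ok` before streaming any Data packets.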

# Calling Restrictions

  1. If a Starter package is not sent within 10 seconds after establishing a WebSocket connection, the WebSocket connection will be disconnected;
  2. If no request is received within 60 seconds, the server will actively disconnect. It is recommended to send Ping packets at regular intervals to keep the connection alive;
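The two timeouts above can be handled with a simple client-side timer; a sketch (the 30-second ping interval is our own choice, anything comfortably under 60 seconds works):

```python
STARTER_DEADLINE_S = 10  # Starter must arrive within 10 s of connecting
IDLE_DISCONNECT_S = 60   # server disconnects after 60 s without a request
PING_INTERVAL_S = 30     # our choice: ping well inside the 60 s idle window

def ping_due(last_request_ts, now):
    """True when a keepalive Ping should be sent to avoid the idle timeout."""
    return now - last_request_ts >= PING_INTERVAL_S
```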

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packets. The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | The service engine number or combination corresponding to the capability, e.g. "TTS3", "ASR5", "TTS3+ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| asr | ASR Config | object | Required for ASR use | ASR-specific configuration, see below |
| tts | TTS Config | object | Required for TTS use | TTS-specific configuration, see below |

# Starter Message Example

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "ASR5",
  "device": "device-weye",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "asr": {
    "mic_volume": 0.67
  }
}

# Data

After sending the Starter package and successfully establishing a connection, multiple Data packets can be sent repeatedly. The format of the Data package is described in the corresponding capability document.

# Return Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
# Authentication Result Example
{
  "service": "auth",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
# Result Data

Depending on the capability, each request will return one or more result data packets. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID of the current request |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| asr | ASR Content | object | No | The recognition result, returned on success |
| nlp | NLP Content | object | No | The reply result, returned on success |
| tts | TTS Content | object | No | The synthesis result, returned on success |

# ASR (Automatic Speech Recognition) Capability Integration Instructions

Instructions for calling ASR through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. Pass the token either in the Authorization header field or concatenated in the URL; the token in the header takes precedence. (Example: see the FAQ.)

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
  2. Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Data binary data packets, containing PCM audio;
  5. After completing audio transmission, send an EOF package (optional; if it is not sent, an additional 500 milliseconds or more of trailing silence is needed for VAD to detect the end of speech);
  6. Receive the corresponding ASR results, formatted as JSON text;
  7. If an EOF request package was sent, receive an EOF result package, indicating that all recognition results have been sent;
  8. If there are no more speech recognition tasks, simply disconnect (no dedicated disconnect message is defined);

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | The service engine number; for Chinese recognition choose "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| asr | ASR Config | object | Required | ASR-specific configuration, see below |

ASR Config configuration is as follows:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional; the language to be recognized |
| mic_volume | Microphone Volume | float | 1.0 | Optional; microphone volume used by ASR for auto gain; supported range 0 to 1 |
| subtitle | Subtitle Format | string | Empty string | Optional; format of returned subtitles; empty means none; supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum number of characters per subtitle line; 0 means no limit |
| intermediate | Return Intermediate Results | bool | false | Optional; whether to return intermediate results |
| sentence_time | Return Sentence-Level Timestamps | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamps | bool | false | Optional; whether to return word-level timestamps |
| cache_url | Return Cache URL | bool | false | Optional; whether to upload subtitle files to Object Store and return a cache URL |
| pause_time_msec | Speech Pause Time (ms) | int | 500 | Optional; pause length used to determine speech boundaries and segmentation; default 500 milliseconds |

# Data

After sending the Starter package and successfully establishing the connection, multiple binary Data packages can be sent repeatedly to stream audio.

The audio stream format is PCM: 16 kHz sample rate, 16-bit sample width, single channel, little endian. For example, convert with `sox -t raw -r 16000 -e signed -b 16 -c 1`, or with `ffmpeg -acodec pcm_s16le -ac 1 -ar 16000 -f s16le`.

The audio sending rate requirements differ between streaming and non-streaming ASR:

  • Streaming ASR: Send audio at the rate it is read from the microphone; the recommendation is 1280 bytes every 40 milliseconds, or 5120 bytes every 160 milliseconds.
  • Non-streaming ASR: Audio data can be sent all at once; each Data package is limited to 1920 KB (1 minute of audio).
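The recommended chunk sizes follow directly from the PCM format: 16000 samples/s × 2 bytes × 1 channel = 32000 bytes per second, so 40 ms is 1280 bytes, 160 ms is 5120 bytes, and one minute is 1,920,000 bytes (the 1920 KB limit). A sketch for slicing a PCM buffer into Data packets (helper names are illustrative; pacing and sending are left to the caller):

```python
SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 2     # 16-bit
CHANNELS = 1
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 32000

def chunk_size(ms):
    """Bytes of PCM covering `ms` milliseconds of audio."""
    return BYTES_PER_SECOND * ms // 1000

def chunks(pcm, ms=40):
    """Slice a PCM buffer into Data packets of `ms` milliseconds each."""
    size = chunk_size(ms)
    for off in range(0, len(pcm), size):
        yield pcm[off:off + size]
```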

# EOF

After completing the audio package transmission, send an EOF package to indicate the end of recognition.

The requirement for EOF package usage is different between streaming and non-streaming ASR:

  • Streaming ASR: After completing the audio Data package transmission, send an EOF package. Optional, if not sent, additional 500 milliseconds or more of ambient silence is needed to help VAD end.
  • Non-streaming ASR: After completing the audio Data package transmission, must send an EOF package.

If subtitles are needed, regardless of whether it is streaming or not, an EOF package must be sent to notify the ASR server to generate subtitles.

The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| signal | End Mark | string | Required | Fixed as eof |
| trace | Trace ID | string | Random UUIDv4 | Optional; it is recommended that the caller generate and fill in the Trace ID for troubleshooting |

# Return Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |

# ASR Result Data

ASR continuously returns text recognition result packets; after receiving the EOF request it also returns subtitle and subtitle file address packets. If subtitles and the subtitle address are not requested in the Starter, only text recognition results are returned.

The return message format is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. asr |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID of the current sentence |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| asr | ASR Content | object | No | The recognition result, returned on success; see below for specific fields |

Specific recognition results are located within the ASR Content:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | Sequence number of the returned package |
| type | Package Type | enum | Yes | text: text result; intermediate: intermediate result; subtitle: subtitle package; subtitle_url: subtitle address package; eof: all transmissions are complete |
| text | Text | string | Yes | Text recognition result; also present in subtitle packages, but empty there |
| subtitle | Subtitle | string | No | Subtitle content; only in subtitle packages |
| subtitle_url | Subtitle URL | string | No | Subtitle download address; only in subtitle_url packages |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp |
| word_times | Word-Level Timestamps | object | No | Word-level timestamps |
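A receiver typically branches on asr.type; a minimal dispatcher matching the table above (the function name and return shape are our own):

```python
import json

def handle_asr_packet(text):
    """Classify one ASR result packet; returns a (kind, payload) pair."""
    pkt = json.loads(text)
    if pkt.get("status") == "fail":
        return "error", pkt.get("error")
    asr = pkt.get("asr", {})
    kind = asr.get("type")  # text / intermediate / subtitle / subtitle_url / eof
    if kind in ("text", "intermediate"):
        return kind, asr.get("text")
    if kind == "subtitle":
        return kind, asr.get("subtitle")
    if kind == "subtitle_url":
        return kind, asr.get("subtitle_url")
    return kind, None       # eof, or an unknown package type
```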
# Text Result

Each completely recognized sentence returns one message. Intermediate recognition results are not returned, nor is anything returned when the audio cannot be recognized or the result is only blank characters.

Text result example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "c9cb36d8-3ca9-4e2b-9034-29f2c4edc3de",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "Hello."
    }
}
# Subtitle Result

After requesting subtitles and sending EOF, the generated subtitle result is returned.

Subtitle result example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "9a971f17-f871-4b73-9084-b856b67537d5",
    "asr": {
        "index": 3,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
    }
}
# Subtitle Address

If both subtitles and the Cache URL are requested, then after EOF is sent the cache URL of the subtitle file uploaded to Object Store is returned.

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "25b30001-0b5a-4e9e-892c-4cc4d5bf134d",
    "asr": {
        "index": 4,
        "type": "subtitle_url",
        "subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/eab708a8-7aca-4237-a0a3-a6422ade8a23_25b30001-0b5a-4e9e-892c-4cc4d5bf134d.srt"
    }
}
# EOF

When an EOF request is received, send an EOF result packet, indicating that all results have been sent.

EOF example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "16ff049a-41fb-4c7a-ac5e-b26dbc3218e5",
    "asr": {
        "index": 5,
        "type": "eof"
    }
}

# Practical Process Example Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
    "type": "ASR5",
    "asr": {}
}

Request: Binary Data

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721"
}

Response: 2

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "e1c44bdc-4f9a-487c-806e-005679db7d0d",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "早知道你喜欢十里春光"
    }
}

Response: 3

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "f7551818-5025-4d83-b41b-136bb19b5b5f",
    "asr": {
        "index": 2,
        "type": "text",
        "text": "我一定会在麦田里种满玫瑰和山茶"
    }
}

Response: 4

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "89d3a8b4-a291-4cdc-9b78-d3f912d06223",
    "asr": {
        "index": 3,
        "type": "text",
        "text": "你路过这片土地才算浪漫"
    }
}

# Case 2: Complete Configuration Process

Request: Starter

{
    "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
    "type": "ASR5",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "asr": {
        "subtitle": "srt",
        "cache_url": true,
        "intermediate": true,
        "mic_volume": 0.67
    }
}

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}

Request: Binary Data

Request: EOF

{
    "signal": "eof",
    "trace": "52517513-875a-47b6-bd30-f11a75e26745"
}

Response: 2

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "dcf88fbe-6cda-452d-8f51-e316cb4a0943",
  "asr": {
    "index": 1,
    "type": "intermediate",
    "text": "介"
  }
}

Response: 3

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "879d4700-746f-411b-954c-f83a2c6cd300",
  "asr": {
    "index": 2,
    "type": "intermediate",
    "text": "介绍下长"
  }
}

Response: 4

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "20e6a5e5-6d5d-4284-9013-4d410f1a5d37",
  "asr": {
    "index": 3,
    "type": "intermediate",
    "text": "介绍下长宁图书"
  }
}

Response: 5

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "cf7733e2-da28-442d-b5bb-282fc8f352f3",
  "asr": {
    "index": 4,
    "type": "text",
    "text": "介绍一下长宁图书馆。",
    "sentence_time": {
      "begin_ms": 2080,
      "end_ms": 4640
    },
    "word_times": [
      {
        "begin_ms": 2080,
        "end_ms": 2560,
        "text": "介"
      },
      {
        "begin_ms": 2560,
        "end_ms": 2800,
        "text": "绍"
      },
      {
        "begin_ms": 2800,
        "end_ms": 2920,
        "text": "一"
      },
      {
        "begin_ms": 2920,
        "end_ms": 3040,
        "text": "下"
      },
      {
        "begin_ms": 3040,
        "end_ms": 3280,
        "text": "长"
      },
      {
        "begin_ms": 3280,
        "end_ms": 3480,
        "text": "宁"
      },
      {
        "begin_ms": 3480,
        "end_ms": 3640,
        "text": "图"
      },
      {
        "begin_ms": 3640,
        "end_ms": 3880,
        "text": "书"
      },
      {
        "begin_ms": 3880,
        "end_ms": 4640,
        "text": "馆"
      }
    ]
  }
}

Response: 6

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
    "asr": {
        "index": 5,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:02,280\n介绍一下长宁图书馆\n\n"
    }
}

Response: 7

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
    "asr": {
        "index": 6,
        "type": "subtitle_url",
        "subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/8f97055c-bd29-41c7-92d1-3933fed566fa_3dcafe20-d6e0-4bce-a2ba-932b442e9e92.srt"
    }
}

Response: 8

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "2bd4cbce-0f72-402c-8e88-0f2704a22868",
    "asr": {
        "index": 7,
        "type": "eof"
    }
}

# TTS (Text-to-Speech) Integration Guide (Qid)

Instructions for calling TTS through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all control messages are UTF-8 encoded JSON text.

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v3?Authorization=Bearer {token};
  2. Send a Starter package, which contains the general configuration information for subsequent TTS requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send a Task package, which contains specific text and format information that needs to be synthesized;
  5. Receive data packages corresponding to the Task;
  6. If there are no more speech synthesis tasks, simply disconnect (no dedicated disconnect message is defined);

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text, containing the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Only supports TTS |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| tts | TTS Config | object | Required | TTS-specific configuration, see below |

TTS Config configuration details:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| qid | Qid | string | 8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg | Required field |
| pitch_offset | Pitch Offset | float | 0.0 | Optional; pitch; higher values make the voice sharper, lower values deeper; supported range [-10, 10] |
| speed_ratio | Speed Ratio | float | 1.0 | Optional; speech speed; the higher the value, the slower the speech; supported range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional; sampling rate; supports 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
| volume | Volume | int | 100 | Optional; the higher the value, the louder the sound; supported range [1, 400] |
| format | File Format | string | pcm | Optional; audio file format; pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message | bool | false | Optional; whether to omit error messages; defaults to returning them |
| audio | Return Audio Data | bool | true | Optional; whether to return audio; defaults to returning it |
| phone | Return Phonetic Symbols | bool | false | Optional; whether to return phonetic symbols; defaults to not returning them |
| polyphone | Return Polyphones | bool | false | Optional; whether to return polyphones found in the query; defaults to not returning them |
| subtitle | Subtitle Format | string | Empty string | Optional; format of returned subtitles; empty means none; supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum characters per subtitle line / sentence-level timestamp; 0 means no limit; only valid when returning subtitles or sentence-level timestamps |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional; whether to break subtitle / sentence-level timestamp lines at punctuation and strip the punctuation; only valid when returning subtitles or sentence-level timestamps. See "Subtitle Wrapping Punctuation Range" in the Quick Reference section for the punctuation set |
| sentence_time | Return Sentence-Level Timestamps | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamps | bool | false | Optional; whether to return word-level timestamps |
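The numeric ranges in the table can be checked client-side before sending the Starter; a sketch (ranges copied from the table above; the server remains authoritative):

```python
SUPPORTED_SAMPLE_RATES = {8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000}

def validate_tts_config(cfg):
    """Return a list of problems found in a TTS Config dict (empty if none)."""
    problems = []
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        problems.append("pitch_offset out of [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        problems.append("speed_ratio out of [0.5, 2]")
    if cfg.get("sample_rate", 16000) not in SUPPORTED_SAMPLE_RATES:
        problems.append("unsupported sample_rate")
    if not 1 <= cfg.get("volume", 100) <= 400:
        problems.append("volume out of [1, 400]")
    return problems
```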

# Task

After sending the Starter package and successfully establishing a connection, multiple Task packages can be repeatedly sent to submit synthesis tasks. The Task package format is JSON text, containing the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional; recommended that the caller generate and fill in, used to distinguish requests when sending concurrently |
| query | Query | string | Required | Text content to be synthesized |
| ssml | Use SSML | bool | false | Optional; whether the synthesis text is marked up with SSML; see the ONES documentation for the syntax |
| no_cache | Disable Cache | bool | false | Optional; whether to disable result caching for the current request; if enabled, results neither use the cache nor are stored in it |
| override | TTS Config | object | Empty | Optional; an independent configuration for a single TTS request that completely replaces the TTS configuration from the Starter message for the current Task (note: a direct replacement, not a merge) |
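The note on override deserves emphasis: it replaces the Starter's TTS configuration wholesale rather than merging with it key-by-key. A sketch of the resulting semantics (helper name is our own):

```python
def effective_tts_config(starter_cfg, override=None):
    """Config in effect for one Task: `override`, when present, replaces
    the Starter's TTS configuration entirely (it is NOT merged)."""
    return dict(override) if override is not None else dict(starter_cfg)
```

So a Task carrying only {"speed_ratio": 1.2} in override would also drop any volume or subtitle settings made in the Starter for that Task.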

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |

# TTS Result Data

Each successful Task continuously returns multiple data packages, including audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packages. Packages of the same type are returned in logical order, but ordering across different package types is not guaranteed. If phonetic symbols, subtitles, and the Cache URL are not requested in the Starter, only audio is returned.

The format of the return message is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. tts |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
| status | Status Name | enum | Yes | The status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| tts | TTS Content | object | No | The synthesis result, returned on success; see below for specific fields |

Specific synthesized results are located within TTS Content:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | The ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the audio / phonetic symbol package |
| type | Package Type | enum | Yes | audio: audio package; audio_url: audio file URL package; phone: phonetic symbol package; phone_url: phonetic symbol file URL package; subtitle: subtitle package; subtitle_url: subtitle file URL package; polyphone: polyphone package; timestamp: timestamp package; eof: all data has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; only in audio packages |
| phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data; only in phone packages |
| polyphones | Polyphone Data | object | No | Polyphone data; only in polyphone packages |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; only in subtitle packages |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamps; only in timestamp packages |
| word_times | Word-Level Timestamps | object | No | Word-level timestamps; only in timestamp packages |
| facefeature_data | Base64-encoded Face Feature | string | No | Face feature data; only in face feature packages |
| audio_url | Audio File URL | string | No | Audio file URL; only in audio_url packages (marked as deprecated or unused in the original table) |
| phone_url | Phonetic Symbol File URL | string | No | Phonetic symbol file URL; only in phone_url packages |
| subtitle_url | Subtitle File URL | string | No | Subtitle file URL; only in subtitle_url packages |
| resource | Phonetic Symbols Info | object | No | Phonetic symbol information; returned when the TTS2 single-return interface is used |
# Audio Package

Contains Base64-encoded synthesized audio data.

When the requested audio format is pcm, the audio is streamed back across multiple packages; other formats are returned in a single package once synthesis completes.

Audio package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
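For pcm output, the audio arrives Base64-encoded across several audio packages; a sketch that reassembles them in index order and wraps the result in a WAV container using only the standard library (helper names are our own):

```python
import base64
import io
import wave

def collect_audio(packets):
    """Concatenate Base64 audio_data from audio packages in index order."""
    audio = [p["tts"] for p in packets
             if p.get("tts", {}).get("type") == "audio"]
    audio.sort(key=lambda a: a["index"])
    return b"".join(base64.b64decode(a["audio_data"]) for a in audio)

def pcm_to_wav(pcm, sample_rate=16000):
    """Wrap raw 16-bit mono little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```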
# Phoneme Package

Contains the Base64-encoded phoneme (phonetic symbol) data for the synthesized speech.

Example of a phoneme package:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 2,
    "type": "phone",
    "phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
  }
}
# Subtitle Package

Contains Base64-encoded synthesized subtitle data.

Subtitle package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
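subtitle_data is simply the SRT text Base64-encoded; decoding it recovers the same format the ASR subtitle packages carry inline. A one-line sketch (helper name is our own):

```python
import base64

def decode_subtitle(packet):
    """Decode the Base64 subtitle_data of a subtitle package into SRT text."""
    return base64.b64decode(packet["tts"]["subtitle_data"]).decode("utf-8")
```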
# Timestamp Package

Contains sentence-level and word-level timestamp information.

Example of a timestamp:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
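In the examples in this document the word-level timestamps tile the sentence span exactly, end to end; a sketch of that consistency check (real output may differ, e.g. with gaps for pauses between words, so treat this as illustrative):

```python
def word_times_consistent(sentence, words):
    """Check word timestamps are ordered, contiguous, and span the sentence."""
    if not words:
        return False
    if words[0]["begin_ms"] != sentence["begin_ms"]:
        return False
    if words[-1]["end_ms"] != sentence["end_ms"]:
        return False
    # each word must start exactly where the previous one ended
    return all(a["end_ms"] == b["begin_ms"] for a, b in zip(words, words[1:]))
```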

# Polyphone Package

Contains polyphone information; the recommended pronunciation is listed first, followed by the alternatives.

Polyphone package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
}
}
# EOF

An EOF packet indicates that all results for the current Task have been sent.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}

# Practical Process Example Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
  "type": "TTS",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Complete Configuration Process

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS3",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "qid": "8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "phone": true,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true,
    "cache_url": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Phoneme

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 2,
    "type": "phone",
    "phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
  }
}

Response: 4 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 3, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 5 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 6 Subtitles

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 5,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 7 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}

# LLM (Large Language Model) Capability Integration Description

Instructions for calling LLM capabilities via the central control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and control messages are UTF-8 encoded JSON text. The WS interface requires placing the token in the Authorization header field or concatenating it into the URL; a token passed in the header takes higher priority. (Example: See FAQ).

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
  2. Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Query text packets, containing the user's question text;
  5. Receive the corresponding NLP results, formatted as JSON text; steps 4 and 5 can be repeated;
  6. If there are no more tasks currently, you can disconnect directly (there is no design for a disconnection message);

# Request Message Format

# Starter

The first packet sent after establishing a connection, indicating the purpose of this connection and how subsequent data packets should be parsed. The format is JSON text, containing the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Fill in the service engine number corresponding to the capability: "NLP7" (sensechat), "NLP10" (SenseTime humanoid model) |
| device | Device ID | string | Empty string | Device ID, recommended to fill in for traceability and troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates a Session ID for traceability and troubleshooting |
| nlp | NLP Config | object | Required | NLP-specific configuration, see details below |

NLP Config details are as follows:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages. By default, error messages are returned |
| know_ids | Knowledge IDs | string list | Empty list | Optional field, list of knowledge base IDs, supported only by certain large language model engines |
| prompt_header | System Role of Prompt | string | Empty string | Optional field, background description of the prompt. If empty, the preset value in the configuration is used. Supported only by certain large language model engines |
| max_reply_token | Max Token in Reply | int | 500 | Optional field, maximum number of tokens in the reply. The actual maximum value depends on the model; supported only by certain large language model engines |
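The Starter packet above is plain JSON, so it can be assembled with a small helper. The function below is a hypothetical client-side sketch (not part of any platform SDK) that fills in the documented defaults; the field names come from the two tables above:

```python
import json
import uuid

def build_nlp_starter(workflow="NLP7", device="", session=None,
                      omit_error=False, know_ids=None,
                      prompt_header="", max_reply_token=500):
    """Assemble an NLP Starter packet with the documented defaults."""
    return json.dumps({
        "type": workflow,                         # service engine number, e.g. "NLP7"
        "device": device,                         # recommended for troubleshooting
        "session": session or str(uuid.uuid4()),  # caller-generated Session ID
        "nlp": {
            "omit_error": omit_error,
            "know_ids": know_ids or [],
            "prompt_header": prompt_header,
            "max_reply_token": max_reply_token,
        },
    }, ensure_ascii=False)
```

The resulting text is sent as the first WebSocket message after the connection is established.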

# Query

After the Starter packet is sent and the connection is successfully established, multiple Query text packets can be sent to submit user questions. The format is JSON text, containing the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| id | Trace ID | string | Random UUIDv4 | Optional field; it is recommended that the caller generates and fills in the Trace ID to distinguish between concurrent requests |
| query | Query | string | Required | The text content of the user's question |
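A Query packet is equally small. The helper below is illustrative only; it generates the random UUIDv4 Trace ID that the table above recommends when the caller does not supply one:

```python
import json
import uuid

def build_query(query, trace_id=None):
    """Build a Query packet; the Trace ID defaults to a random UUIDv4."""
    return json.dumps({
        "id": trace_id or str(uuid.uuid4()),  # distinguishes concurrent requests
        "query": query,                       # required: the user's question text
    }, ensure_ascii=False)
```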

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |

# NLP Result Data

For each successfully processed Query, a message will be returned. When omit_error is false, error messages will also be returned. The basic format of the return message is as follows:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., nlp |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current sentence |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |
| nlp | NLP Content | object | No | The answer result returned on success; see the field meanings below |

The specific recognition results are located in NLP Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | The sequence number of the return packet |
| query | Query | string | Yes | The submitted question text |
| answer | Answer | string | Yes | The returned broadcast text; the digital human front end calls TTS to broadcast it |
| text | Text | string | No | The returned display text, shown as text on the digital human front end |
| finish_reason | Finish Reason | string | Yes | The reason generation stopped, enumerated values: stop (end token reached), length (maximum generation length reached), sensitive (sensitive word triggered), context (model context length limit reached) |
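Putting the result fields together, a client can parse each NLP message, surface errors, and check finish_reason for a terminal condition. This is a minimal sketch under the field definitions above; how a client chooses to treat each stop reason is application-specific:

```python
import json

def handle_nlp_message(raw):
    """Parse one NLP result message and return (answer, finished).

    `finished` is True when finish_reason is one of the documented
    terminal values. Raises on a fail status (unless omit_error was
    set in the Starter, in which case error messages are not sent).
    """
    msg = json.loads(raw)
    if msg.get("status") != "ok":
        raise RuntimeError(msg.get("error", "unknown NLP error"))
    nlp = msg["nlp"]
    finished = nlp.get("finish_reason") in ("stop", "length", "sensitive", "context")
    return nlp["answer"], finished
```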

# TTS (Text-to-Speech) Integration Guide (Old)

Note: This interface is currently in maintenance mode and will not be updated with new features.

This document explains how to use the TTS capability via the Central Control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all messages are UTF-8 encoded JSON texts.

# Invocation Process

  1. Establish a WebSocket connection, usually at ws://aigc.softsugar.com/api/voice/stream/v1?Authorization={Token};
  2. Send a Starter packet, containing the general configuration information for subsequent TTS requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be terminated;
  3. Receive a response indicating whether authentication was successful or failed;
  4. Send a Task packet, containing specific text and format information to be synthesized;
  5. Receive data packets corresponding to the Task;
  6. If there are no more voice synthesis tasks, you can disconnect directly (there is no design for disconnect message);

# Request Message Format

# Starter

The first packet sent after establishing a connection, indicating the purpose of this connection and the parsing method for subsequent data packets. The format is a JSON text, including the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Fill in the service engine number corresponding to the capability, e.g., "TTS3" |
| device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting and tracking issues |
| session | Session ID | string | Random UUIDv4 | It is recommended to generate and fill in the Session ID for troubleshooting and tracking issues |
| tts | TTS Config | object | Required | TTS-specific configuration, see below for details |

TTS Config details:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional field, the language to be synthesized; must be supported by the voice actor |
| voice | Voice ID | string | Engine-specific default | Optional field, selectable voice actor |
| pitch_offset | Pitch Offset | float | 0.0 | Optional field, pitch; the higher the value, the sharper the voice, and vice versa. Range [-10, 10] |
| style | Style | string | Empty string | Optional field, the emotion of the voice actor |
| speed_ratio | Speed Ratio | float | 1.0 | Optional field, speech speed; the higher the value, the slower the speech. Range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional field, sampling rate; supports 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
| volume | Volume | int | 100 | Optional field, volume; the higher the value, the louder the sound. Range [1, 400] |
| format | File Format | string | pcm | Optional field, audio file and content format; may support pcm, wav, mp3, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages; returned by default |
| audio | Return Audio Data | bool | true | Optional field, whether to return audio; returned by default |
| phone | Return Phonetic Symbols | bool | false | Optional field, whether to return phonetic symbols; not returned by default |
| polyphone | Return Polyphone | bool | false | Optional field, whether to return polyphones in the query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional field, the format of subtitles to return; empty means none. Supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional field, maximum number of characters per subtitle line/time-stamped sentence; 0 means unlimited. Only effective when returning subtitles or sentence-level timestamps |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional field, whether to break lines and strip punctuation for subtitles/sentence-level timestamps based on punctuation; only effective when returning subtitles or sentence-level timestamps. See Quick Reference for the range of punctuation |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional field, whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional field, whether to return word-level timestamps |
| cache_url | Return Cache URL for Data | bool | false | Optional field, whether to upload audio, phonetic symbol, and subtitle files to the object store and return cache URLs. Deprecated; no longer maintained, please stop using it as soon as possible |
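Several of these fields carry numeric ranges that the server will reject if exceeded. A client-side pre-check is a convenient guard; the validator below is purely illustrative (the server performs its own validation) and encodes only the ranges stated in the table:

```python
def validate_tts_config(cfg):
    """Check a TTS Config dict against the documented ranges.

    Returns a list of problems (empty when the config looks valid).
    Missing fields are checked against their documented defaults.
    """
    problems = []
    rate = cfg.get("sample_rate", 16000)
    if rate not in (8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000):
        problems.append(f"unsupported sample_rate: {rate}")
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        problems.append("pitch_offset out of range [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        problems.append("speed_ratio out of range [0.5, 2]")
    if not 1 <= cfg.get("volume", 100) <= 400:
        problems.append("volume out of range [1, 400]")
    return problems
```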

# Task

After sending the Starter packet and successfully establishing a connection, multiple Task packets can be sent to submit synthesis tasks. The Task packet format is a JSON text, including the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional field; it is recommended to generate and fill in to distinguish requests in concurrent scenarios |
| query | Query | string | Required | The text content to be synthesized |
| ssml | Use SSML | bool | false | Optional field, whether the synthesis text is marked up with SSML; refer to the ONES documentation for the syntax |
| no_cache | Disable Cache | bool | false | Optional field, whether to disable result caching for the current request; if enabled, cached results are neither used nor stored |
| override | TTS Config | object | Empty | Optional field, independent configuration for a single TTS request; completely replaces the TTS configuration in the Starter message for the current task (Note: a direct replacement, not a merge) |
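The replacement semantics of `override` are easy to get wrong: fields absent from the override do not fall back to the Starter values, but to the engine defaults. A client-side sketch of the difference (the replacement itself happens on the server):

```python
def effective_config(starter_tts, task_override=None):
    """Configuration in effect for one Task: `override`, when present,
    completely replaces the Starter's TTS config (no merging)."""
    return dict(task_override) if task_override else dict(starter_tts)

starter = {"voice": "xiaoling", "speed_ratio": 1.05, "volume": 200}
override = {"speed_ratio": 0.8}

# With the override, voice and volume revert to engine defaults,
# NOT to the Starter values.
cfg = effective_config(starter, override)
```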

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is a JSON text, including the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |

# TTS Result Data

Each successful Task continuously returns multiple data packets: audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packets. Packets of the same type are returned in logical order, but the relative order of different packet types is not guaranteed. If phonetic symbols, subtitles, and cache URLs are not requested in the Starter request, only audio packets are returned.

The format of the return message is a JSON text, including the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., tts |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
| status | Status Name | enum | Yes | The status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |
| tts | TTS Content | object | No | The synthesis result returned on success; see below for the field meanings |

Specific synthesis results are located within TTS Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | The ID corresponding to the current Task |
| index | Index No. | int | Yes | Sequence number of the packet within the current Task |
| type | Package Type | enum | Yes | audio for audio packets, audio_url for audio address packets, phone for phonetic symbol packets, phone_url for phonetic symbol address packets, subtitle for subtitle packets, subtitle_url for subtitle address packets, polyphone for polyphone packets, timestamp for timestamp packets, eof indicates all packets have been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data, only present in audio packets |
| phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data, only present in phonetic symbol packets |
| polyphones | Polyphone Data | object | No | Polyphone data, only present in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data, only present in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp, only present in timestamp packets |
| word_times | Word-Level Timestamp | object | No | Word-level timestamps, only present in timestamp packets |
| resource | Phonetic Symbols Info | object | No | Phonetic symbol information, only present in phonetic symbol packets (always returned when using the TTS2 single-return interface) |

# Audio Packet

Contains Base64 encoded synthesized audio data results.

When the audio format is pcm, it is returned in multiple streaming packets; other formats will be returned in a single packet after audio synthesis.
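Because streamed pcm arrives as several Base64-encoded packets, the client must decode each `audio_data` and concatenate the chunks; sorting by `index` guards against any client-side reordering. A minimal sketch:

```python
import base64

def assemble_pcm(audio_packets):
    """Concatenate streamed pcm audio packets into raw PCM bytes.

    `audio_packets` is an iterable of the `tts` payload objects; only
    those with type == "audio" are used, ordered by their index field.
    """
    chunks = sorted((p for p in audio_packets if p.get("type") == "audio"),
                    key=lambda p: p["index"])
    return b"".join(base64.b64decode(p["audio_data"]) for p in chunks)
```

The result is headerless PCM at the sample rate requested in the Starter; writing a playable file additionally requires a WAV header or similar container.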

Audio packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
# Phoneme Packet

Contains the Base64-encoded synthesized phoneme data.

Phoneme packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 2,
    "type": "phone",
    "phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
  }
}
# Subtitle Packet

Contains the Base64-encoded synthesized subtitle data.

Subtitle packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
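`subtitle_data` is simply the Base64 encoding of the subtitle file contents (SRT in this configuration). Decoding the sample above recovers a one-cue SRT file:

```python
import base64

# subtitle_data taken from the example packet above
subtitle_data = "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
srt_text = base64.b64decode(subtitle_data).decode("utf-8")
print(srt_text)
# 1
# 00:00:00,000 --> 00:00:00,528
# 你好。
```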
# Timestamp Packet

Contains sentence-level and character-level timestamp information.

Timestamp packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
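In the sample, the word-level spans tile the sentence-level span: each word begins where the previous one ends, and the first and last words match the sentence boundaries. A quick consistency check over a timestamp payload (illustrative only; the platform does not document this invariant as a guarantee):

```python
def timestamps_consistent(tts):
    """True when word_times are contiguous and exactly cover sentence_time."""
    sent, words = tts["sentence_time"], tts["word_times"]
    if not words:
        return False
    edges_ok = all(a["end_ms"] == b["begin_ms"] for a, b in zip(words, words[1:]))
    return (edges_ok
            and words[0]["begin_ms"] == sent["begin_ms"]
            and words[-1]["end_ms"] == sent["end_ms"])
```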

# Polyphone Packet

Contains polyphone information, with the recommended pronunciation listed first, followed by the other pronunciations.

Polyphone packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}
# EOF

EOF result packet, indicating that all results have been sent.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}
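Since the relative order of packet types is not guaranteed, a practical receive loop buckets packets by `type` until the Task's eof packet arrives. A sketch over already-parsed response messages (the WebSocket receive itself is omitted):

```python
from collections import defaultdict

def collect_until_eof(messages, task_id):
    """Group the tts payloads of one Task by packet type, stopping at eof.

    `messages` is an iterable of parsed response dicts; only packets
    whose tts.id matches `task_id` are collected.
    """
    buckets = defaultdict(list)
    for msg in messages:
        tts = msg.get("tts") or {}
        if tts.get("id") != task_id:
            continue          # another concurrent Task's packet
        if tts["type"] == "eof":
            break             # all packets for this Task have been sent
        buckets[tts["type"]].append(tts)
    return buckets
```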

# Practical Process Sample Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
  "type": "TTS3",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Complete Configuration Process

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS3",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "language": "zh-CN",
    "voice": "xiaoling",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "phone": true,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Phoneme

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 2,
    "type": "phone",
    "phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
  }
}

Response: 4 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 3, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 5 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 6 Subtitles

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 5,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 7 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}
Last Updated: 9/4/2024, 9:32:01 PM