# Streaming Voice Processing

In addition to the HTTP voice processing interfaces, the platform offers streaming capabilities, including ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).

# Invocation Method

The interface uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. Pass the token either in the Authorization header field or concatenated in the URL; the token in the header takes precedence. (Example: see the FAQ.)

# Calling Process

  1. Establish a WebSocket connection;
  2. Send a Starter package, which contains generic configuration information for subsequent requests;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Data packets;
  5. Receive the corresponding results; steps 4 and 5 can be repeated;
  6. If there are no more tasks, simply disconnect (no dedicated disconnect message is defined);
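The control messages in steps 2 and 3 are plain JSON, so they can be built and checked independently of the transport. A minimal Python sketch (field names follow the tables in this document; the helper names are our own, and no WebSocket I/O is shown):

```python
import json
import uuid

def make_starter(workflow, device="", session=None, asr=None, tts=None):
    """Build the Starter package (step 2) as UTF-8 JSON text."""
    msg = {"type": workflow, "device": device,
           "session": session or str(uuid.uuid4())}
    if asr is not None:
        msg["asr"] = asr          # required when the workflow includes ASR
    if tts is not None:
        msg["tts"] = tts          # required when the workflow includes TTS
    return json.dumps(msg, ensure_ascii=False)

def auth_ok(reply_text):
    """Check the authentication result (step 3)."""
    reply = json.loads(reply_text)
    return reply.get("service") == "auth" and reply.get("status") == "ok"
```

A caller would send `make_starter(...)` as the first text frame and feed the first reply to `auth_ok` before streaming any Data packets.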

# Calling Restrictions

  1. If a Starter package is not sent within 10 seconds after establishing a WebSocket connection, the WebSocket connection will be disconnected;
  2. If no request is received within 60 seconds, the server will actively disconnect. It is recommended to send Ping packets at regular intervals to keep the connection alive;
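The two timeouts above can be handled with a simple client-side timer; a sketch (the 30-second ping interval is our own choice, anything comfortably under 60 seconds works):

```python
STARTER_DEADLINE_S = 10  # Starter must arrive within 10 s of connecting
IDLE_DISCONNECT_S = 60   # server disconnects after 60 s without a request
PING_INTERVAL_S = 30     # our choice: ping well inside the 60 s idle window

def ping_due(last_request_ts, now):
    """True when a keepalive Ping should be sent to avoid the idle timeout."""
    return now - last_request_ts >= PING_INTERVAL_S
```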

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packets. The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | The service engine number or combination corresponding to the capability, e.g. "TTS3", "ASR5", "TTS3+ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| asr | ASR Config | object | Required for ASR use | ASR-specific configuration, see below |
| tts | TTS Config | object | Required for TTS use | TTS-specific configuration, see below |

# Starter Message Example

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "ASR5",
  "device": "device-weye",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "asr": {
    "mic_volume": 0.67
  }
}

# Data

After sending the Starter package and successfully establishing a connection, multiple Data packets can be sent repeatedly. The format of the Data package is described in the corresponding capability document.

# Return Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
# Authentication Result Example
{
  "service": "auth",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}
# Result Data

Depending on the capability, each request will return one or more result data packets. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID of the current request |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| asr | ASR Content | object | No | The recognition result, returned on success |
| nlp | NLP Content | object | No | The reply result, returned on success |
| tts | TTS Content | object | No | The synthesis result, returned on success |

# ASR (Automatic Speech Recognition) Capability Integration Instructions

Instructions for calling ASR through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol; control messages are UTF-8 encoded JSON text. Pass the token either in the Authorization header field or concatenated in the URL; the token in the header takes precedence. (Example: see the FAQ.)

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
  2. Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Data binary data packets, containing PCM audio;
  5. After completing audio transmission, send an EOF package (optional; if it is not sent, an additional 500 milliseconds or more of trailing silence is needed for VAD to detect the end of speech);
  6. Receive the corresponding ASR results, formatted as JSON text;
  7. If an EOF request package was sent, receive an EOF result package, indicating that all recognition results have been sent;
  8. If there are no more speech recognition tasks, simply disconnect (no dedicated disconnect message is defined);

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | The service engine number; for Chinese recognition choose "ASR5" |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| asr | ASR Config | object | Required | ASR-specific configuration, see below |

ASR Config configuration is as follows:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional; the language to be recognized |
| mic_volume | Microphone Volume | float | 1.0 | Optional; microphone volume used by ASR for auto gain; supported range 0 to 1 |
| subtitle | Subtitle Format | string | Empty string | Optional; format of returned subtitles; empty means none; supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum number of characters per subtitle line; 0 means no limit |
| intermediate | Return Intermediate Results | bool | false | Optional; whether to return intermediate results |
| sentence_time | Return Sentence-Level Timestamps | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamps | bool | false | Optional; whether to return word-level timestamps |
| cache_url | Return Cache URL | bool | false | Optional; whether to upload subtitle files to Object Store and return a cache URL |
| pause_time_msec | Speech Pause Time (ms) | int | 500 | Optional; pause length used to determine speech boundaries and segmentation; default 500 milliseconds |

# Data

After sending the Starter package and successfully establishing the connection, multiple binary Data packages can be sent repeatedly to stream audio.

The audio stream format is PCM: 16 kHz sample rate, 16-bit sample width, single channel, little endian. For example, convert with `sox -t raw -r 16000 -e signed -b 16 -c 1`, or with `ffmpeg -acodec pcm_s16le -ac 1 -ar 16000 -f s16le`.

The audio sending rate requirements differ between streaming and non-streaming ASR:

  • Streaming ASR: Send audio at the rate it is read from the microphone; the recommendation is 1280 bytes every 40 milliseconds, or 5120 bytes every 160 milliseconds.
  • Non-streaming ASR: Audio data can be sent all at once; each Data package is limited to 1920 KB (1 minute of audio).
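The recommended chunk sizes follow directly from the PCM format: 16000 samples/s × 2 bytes × 1 channel = 32000 bytes per second, so 40 ms is 1280 bytes, 160 ms is 5120 bytes, and one minute is 1,920,000 bytes (the 1920 KB limit). A sketch for slicing a PCM buffer into Data packets (helper names are illustrative; pacing and sending are left to the caller):

```python
SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 2     # 16-bit
CHANNELS = 1
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 32000

def chunk_size(ms):
    """Bytes of PCM covering `ms` milliseconds of audio."""
    return BYTES_PER_SECOND * ms // 1000

def chunks(pcm, ms=40):
    """Slice a PCM buffer into Data packets of `ms` milliseconds each."""
    size = chunk_size(ms)
    for off in range(0, len(pcm), size):
        yield pcm[off:off + size]
```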

# EOF

After completing the audio package transmission, send an EOF package to indicate the end of recognition.

The requirement for EOF package usage is different between streaming and non-streaming ASR:

  • Streaming ASR: After completing the audio Data package transmission, send an EOF package. Optional, if not sent, additional 500 milliseconds or more of ambient silence is needed to help VAD end.
  • Non-streaming ASR: After completing the audio Data package transmission, must send an EOF package.

If subtitles are needed, regardless of whether it is streaming or not, an EOF package must be sent to notify the ASR server to generate subtitles.

The format is JSON text and contains the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| signal | End Mark | string | Required | Fixed as eof |
| trace | Trace ID | string | Random UUIDv4 | Optional; it is recommended that the caller generate and fill in the Trace ID for troubleshooting |

# Return Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text and contains the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |

# ASR Result Data

ASR continuously returns text recognition result packets; after receiving the EOF request it also returns subtitle and subtitle file address packets. If subtitles and the subtitle address are not requested in the Starter, only text recognition results are returned.

The return message format is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. asr |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID of the current sentence |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| asr | ASR Content | object | No | The recognition result, returned on success; see below for specific fields |

Specific recognition results are located within the ASR Content:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | Sequence number of the returned package |
| type | Package Type | enum | Yes | text: text result; intermediate: intermediate result; subtitle: subtitle package; subtitle_url: subtitle address package; eof: all transmissions are complete |
| text | Text | string | Yes | Text recognition result; also present in subtitle packages, but empty there |
| subtitle | Subtitle | string | No | Subtitle content; only in subtitle packages |
| subtitle_url | Subtitle URL | string | No | Subtitle download address; only in subtitle_url packages |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp |
| word_times | Word-Level Timestamps | object | No | Word-level timestamps |
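A receiver typically branches on asr.type; a minimal dispatcher matching the table above (the function name and return shape are our own):

```python
import json

def handle_asr_packet(text):
    """Classify one ASR result packet; returns a (kind, payload) pair."""
    pkt = json.loads(text)
    if pkt.get("status") == "fail":
        return "error", pkt.get("error")
    asr = pkt.get("asr", {})
    kind = asr.get("type")  # text / intermediate / subtitle / subtitle_url / eof
    if kind in ("text", "intermediate"):
        return kind, asr.get("text")
    if kind == "subtitle":
        return kind, asr.get("subtitle")
    if kind == "subtitle_url":
        return kind, asr.get("subtitle_url")
    return kind, None       # eof, or an unknown package type
```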
# Text Result

Each completely recognized sentence returns one message. Intermediate recognition results are not returned, nor is anything returned when the audio cannot be recognized or the result is only blank characters.

Text result example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "c9cb36d8-3ca9-4e2b-9034-29f2c4edc3de",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "Hello."
    }
}
# Subtitle Result

After requesting subtitles and sending EOF, the generated subtitle result is returned.

Subtitle result example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "9a971f17-f871-4b73-9084-b856b67537d5",
    "asr": {
        "index": 3,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:01,280\n你好。\n\n2\n00:00:02,960 --> 00:00:04,240\n再见。\n\n"
    }
}
# Subtitle Address

If both subtitles and the Cache URL are requested, then after EOF is sent the cache URL of the subtitle file uploaded to Object Store is returned.

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "25b30001-0b5a-4e9e-892c-4cc4d5bf134d",
    "asr": {
        "index": 4,
        "type": "subtitle_url",
        "subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/eab708a8-7aca-4237-a0a3-a6422ade8a23_25b30001-0b5a-4e9e-892c-4cc4d5bf134d.srt"
    }
}
# EOF

When an EOF request is received, send an EOF result packet, indicating that all results have been sent.

EOF example:

{
    "service": "asr",
    "status": "ok",
    "session": "eab708a8-7aca-4237-a0a3-a6422ade8a23",
    "trace": "16ff049a-41fb-4c7a-ac5e-b26dbc3218e5",
    "asr": {
        "index": 5,
        "type": "eof"
    }
}

# Practical Process Example Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
    "type": "ASR5",
    "asr": {}
}

Request: Binary Data

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721"
}

Response: 2

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "e1c44bdc-4f9a-487c-806e-005679db7d0d",
    "asr": {
        "index": 1,
        "type": "text",
        "text": "早知道你喜欢十里春光"
    }
}

Response: 3

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "f7551818-5025-4d83-b41b-136bb19b5b5f",
    "asr": {
        "index": 2,
        "type": "text",
        "text": "我一定会在麦田里种满玫瑰和山茶"
    }
}

Response: 4

{
    "service": "asr",
    "session": "4ea613f2-b1d4-47cf-8033-db59aae66721",
    "trace": "89d3a8b4-a291-4cdc-9b78-d3f912d06223",
    "asr": {
        "index": 3,
        "type": "text",
        "text": "你路过这片土地才算浪漫"
    }
}

# Case 2: Complete Configuration Process

Request: Starter

{
    "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
    "type": "ASR5",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "asr": {
        "subtitle": "srt",
        "cache_url": true,
        "intermediate": true,
        "mic_volume": 0.67
    }
}

Response: 1

{
    "service": "auth",
    "status": "ok",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa"
}

Request: Binary Data

Request: EOF

{
    "signal": "eof",
    "trace": "52517513-875a-47b6-bd30-f11a75e26745"
}

Response: 2

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "dcf88fbe-6cda-452d-8f51-e316cb4a0943",
  "asr": {
    "index": 1,
    "type": "intermediate",
    "text": "介"
  }
}

Response: 3

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "879d4700-746f-411b-954c-f83a2c6cd300",
  "asr": {
    "index": 2,
    "type": "intermediate",
    "text": "介绍下长"
  }
}

Response: 4

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "20e6a5e5-6d5d-4284-9013-4d410f1a5d37",
  "asr": {
    "index": 3,
    "type": "intermediate",
    "text": "介绍下长宁图书"
  }
}

Response: 5

{
  "service": "asr",
  "status": "ok",
  "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
  "trace": "cf7733e2-da28-442d-b5bb-282fc8f352f3",
  "asr": {
    "index": 4,
    "type": "text",
    "text": "介绍一下长宁图书馆。",
    "sentence_time": {
      "begin_ms": 2080,
      "end_ms": 4640
    },
    "word_times": [
      {
        "begin_ms": 2080,
        "end_ms": 2560,
        "text": "介"
      },
      {
        "begin_ms": 2560,
        "end_ms": 2800,
        "text": "绍"
      },
      {
        "begin_ms": 2800,
        "end_ms": 2920,
        "text": "一"
      },
      {
        "begin_ms": 2920,
        "end_ms": 3040,
        "text": "下"
      },
      {
        "begin_ms": 3040,
        "end_ms": 3280,
        "text": "长"
      },
      {
        "begin_ms": 3280,
        "end_ms": 3480,
        "text": "宁"
      },
      {
        "begin_ms": 3480,
        "end_ms": 3640,
        "text": "图"
      },
      {
        "begin_ms": 3640,
        "end_ms": 3880,
        "text": "书"
      },
      {
        "begin_ms": 3880,
        "end_ms": 4640,
        "text": "馆"
      }
    ]
  }
}

Response: 6

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
    "asr": {
        "index": 5,
        "type": "subtitle",
        "subtitle": "1\n00:00:00,000 --> 00:00:02,280\n介绍一下长宁图书馆\n\n"
    }
}

Response: 7

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "3dcafe20-d6e0-4bce-a2ba-932b442e9e92",
    "asr": {
        "index": 6,
        "type": "subtitle_url",
        "subtitle_url": "https://aigc.blob.core.chinacloudapi.cn/audio/asr-srt/8f97055c-bd29-41c7-92d1-3933fed566fa_3dcafe20-d6e0-4bce-a2ba-932b442e9e92.srt"
    }
}

Response: 8

{
    "service": "asr",
    "session": "8f97055c-bd29-41c7-92d1-3933fed566fa",
    "trace": "2bd4cbce-0f72-402c-8e88-0f2704a22868",
    "asr": {
        "index": 7,
        "type": "eof"
    }
}

# TTS (Text-to-Speech) Integration Guide (Qid)

Instructions for calling TTS through the central-control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all control messages are UTF-8 encoded JSON text.

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v3?Authorization=Bearer {token};
  2. Send a Starter package, which contains the general configuration information for subsequent TTS requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send a Task package, which contains specific text and format information that needs to be synthesized;
  5. Receive data packages corresponding to the Task;
  6. If there are no more speech synthesis tasks, simply disconnect (no dedicated disconnect message is defined);

# Request Message Format

# Starter

The first package sent after establishing a connection, indicating the purpose of this connection and the parsing method of subsequent data packages. The format is JSON text, containing the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Only supports TTS |
| device | Device ID | string | Empty string | Device ID; recommended for troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generate and fill in the Session ID for troubleshooting |
| tts | TTS Config | object | Required | TTS-specific configuration, see below |

TTS Config configuration details:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| qid | Qid | string | 8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg | Required field |
| pitch_offset | Pitch Offset | float | 0.0 | Optional; pitch; higher values make the voice sharper, lower values deeper; supported range [-10, 10] |
| speed_ratio | Speed Ratio | float | 1.0 | Optional; speech speed; the higher the value, the slower the speech; supported range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional; sampling rate; supports 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
| volume | Volume | int | 100 | Optional; the higher the value, the louder the sound; supported range [1, 400] |
| format | File Format | string | pcm | Optional; audio file format; pcm, wav, and mp3 may be supported, but only pcm supports streaming return |
| omit_error | Omit Error Message | bool | false | Optional; whether to omit error messages; defaults to returning them |
| audio | Return Audio Data | bool | true | Optional; whether to return audio; defaults to returning it |
| phone | Return Phonetic Symbols | bool | false | Optional; whether to return phonetic symbols; defaults to not returning them |
| polyphone | Return Polyphones | bool | false | Optional; whether to return polyphones found in the query; defaults to not returning them |
| subtitle | Subtitle Format | string | Empty string | Optional; format of returned subtitles; empty means none; supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional; maximum characters per subtitle line / sentence-level timestamp; 0 means no limit; only valid when returning subtitles or sentence-level timestamps |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional; whether to break subtitle / sentence-level timestamp lines at punctuation and strip the punctuation; only valid when returning subtitles or sentence-level timestamps. See "Subtitle Wrapping Punctuation Range" in the Quick Reference section for the punctuation set |
| sentence_time | Return Sentence-Level Timestamps | bool | false | Optional; whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamps | bool | false | Optional; whether to return word-level timestamps |
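The numeric ranges in the table can be checked client-side before sending the Starter; a sketch (ranges copied from the table above; the server remains authoritative):

```python
SUPPORTED_SAMPLE_RATES = {8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000}

def validate_tts_config(cfg):
    """Return a list of problems found in a TTS Config dict (empty if none)."""
    problems = []
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        problems.append("pitch_offset out of [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        problems.append("speed_ratio out of [0.5, 2]")
    if cfg.get("sample_rate", 16000) not in SUPPORTED_SAMPLE_RATES:
        problems.append("unsupported sample_rate")
    if not 1 <= cfg.get("volume", 100) <= 400:
        problems.append("volume out of [1, 400]")
    return problems
```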

# Task

After sending the Starter package and successfully establishing a connection, multiple Task packages can be repeatedly sent to submit synthesis tasks. The Task package format is JSON text, containing the following fields:

| Field | Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional; recommended that the caller generate and fill in, used to distinguish requests when sending concurrently |
| query | Query | string | Required | Text content to be synthesized |
| ssml | Use SSML | bool | false | Optional; whether the synthesis text is marked up with SSML; see the ONES documentation for the syntax |
| no_cache | Disable Cache | bool | false | Optional; whether to disable result caching for the current request; if enabled, results neither use the cache nor are stored in it |
| override | TTS Config | object | Empty | Optional; an independent configuration for a single TTS request that completely replaces the TTS configuration from the Starter message for the current Task (note: a direct replacement, not a merge) |
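The note on override deserves emphasis: it replaces the Starter's TTS configuration wholesale rather than merging with it key-by-key. A sketch of the resulting semantics (helper name is our own):

```python
def effective_tts_config(starter_cfg, override=None):
    """Config in effect for one Task: `override`, when present, replaces
    the Starter's TTS configuration entirely (it is NOT merged)."""
    return dict(override) if override is not None else dict(starter_cfg)
```

So a Task carrying only {"speed_ratio": 1.2} in override would also drop any volume or subtitle settings made in the Starter for that Task.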

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |

# TTS Result Data

Each successful Task continuously returns multiple data packages, including audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packages. Packages of the same type are returned in logical order, but ordering across different package types is not guaranteed. If phonetic symbols, subtitles, and the Cache URL are not requested in the Starter, only audio is returned.

The format of the return message is JSON text, containing the following fields:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e. tts |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
| status | Status Name | enum | Yes | The status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | The error message, returned on failure |
| tts | TTS Content | object | No | The synthesis result, returned on success; see below for specific fields |

Specific synthesized results are located within TTS Content:

| Field | Name | Type | Mandatory | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | The ID of the current Task |
| index | Index No. | int | Yes | Sequence number of the audio / phonetic symbol package |
| type | Package Type | enum | Yes | audio: audio package; audio_url: audio file URL package; phone: phonetic symbol package; phone_url: phonetic symbol file URL package; subtitle: subtitle package; subtitle_url: subtitle file URL package; polyphone: polyphone package; timestamp: timestamp package; eof: all data has been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data; only in audio packages |
| phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data; only in phone packages |
| polyphones | Polyphone Data | object | No | Polyphone data; only in polyphone packages |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data; only in subtitle packages |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamps; only in timestamp packages |
| word_times | Word-Level Timestamps | object | No | Word-level timestamps; only in timestamp packages |
| facefeature_data | Base64-encoded Face Feature | string | No | Face feature data; only in face feature packages |
| audio_url | Audio File URL | string | No | Audio file URL; only in audio_url packages (marked as deprecated or unused in the original table) |
| phone_url | Phonetic Symbol File URL | string | No | Phonetic symbol file URL; only in phone_url packages |
| subtitle_url | Subtitle File URL | string | No | Subtitle file URL; only in subtitle_url packages |
| resource | Phonetic Symbols Info | object | No | Phonetic symbol information; returned when the TTS2 single-return interface is used |
# Audio Package

Contains Base64-encoded synthesized audio data.

When the requested audio format is pcm, the audio is streamed back across multiple packages; other formats are returned in a single package once synthesis completes.

Audio package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
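For pcm output, the audio arrives Base64-encoded across several audio packages; a sketch that reassembles them in index order and wraps the result in a WAV container using only the standard library (helper names are our own):

```python
import base64
import io
import wave

def collect_audio(packets):
    """Concatenate Base64 audio_data from audio packages in index order."""
    audio = [p["tts"] for p in packets
             if p.get("tts", {}).get("type") == "audio"]
    audio.sort(key=lambda a: a["index"])
    return b"".join(base64.b64decode(a["audio_data"]) for a in audio)

def pcm_to_wav(pcm, sample_rate=16000):
    """Wrap raw 16-bit mono little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```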
# Phoneme Package

Contains the Base64-encoded phoneme (phonetic symbol) data for the synthesized speech.

Example of a phoneme package:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 2,
    "type": "phone",
    "phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
  }
}
# Subtitle Package

Contains Base64-encoded synthesized subtitle data.

Subtitle package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
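subtitle_data is simply the SRT text Base64-encoded; decoding it recovers the same format the ASR subtitle packages carry inline. A one-line sketch (helper name is our own):

```python
import base64

def decode_subtitle(packet):
    """Decode the Base64 subtitle_data of a subtitle package into SRT text."""
    return base64.b64decode(packet["tts"]["subtitle_data"]).decode("utf-8")
```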
# Timestamp Package

Contains sentence-level and word-level timestamp information.

Example of a timestamp:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
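In the examples in this document the word-level timestamps tile the sentence span exactly, end to end; a sketch of that consistency check (real output may differ, e.g. with gaps for pauses between words, so treat this as illustrative):

```python
def word_times_consistent(sentence, words):
    """Check word timestamps are ordered, contiguous, and span the sentence."""
    if not words:
        return False
    if words[0]["begin_ms"] != sentence["begin_ms"]:
        return False
    if words[-1]["end_ms"] != sentence["end_ms"]:
        return False
    # each word must start exactly where the previous one ended
    return all(a["end_ms"] == b["begin_ms"] for a, b in zip(words, words[1:]))
```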

# Polyphone Package

Contains polyphone information; the recommended pronunciation is listed first, followed by the alternatives.

Polyphone package example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
}
}
# EOF

An EOF packet indicates that all results for the current Task have been sent.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}

# Practical Process Example Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
  "type": "TTS",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Complete Configuration Process

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS3",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "qid": "8wfZav:AEA_Z10Mqp9GCwDGMrz8xIzi3VScxNzUtLCg",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "phone": true,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true,
    "cache_url": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Phoneme

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 2,
    "type": "phone",
    "phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
  }
}

Response: 4 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 3, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 5 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 6 Subtitles

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 5,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 7 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}

# LLM (Large Language Model) Capability Integration Description

Instructions for calling LLM capabilities via the central control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and control messages are UTF-8 encoded JSON text. The WS interface requires placing the token in the Authorization header field or concatenating it into the URL; a token passed in the header takes higher priority. (Example: See FAQ).

# Calling Process

  1. Establish a WebSocket connection; the address is usually ws://aigc.softsugar.com/api/voice/stream/v1?Authorization=Bearer {token};
  2. Send a Starter package, containing the generic configuration information for subsequent requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be disconnected;
  3. Receive a response, indicating whether authentication was successful or failed;
  4. Send Query text packets, containing the user's question text;
  5. Receive the corresponding NLP results, formatted as JSON text; steps 4 and 5 can be repeated;
  6. If there are no more tasks currently, you can disconnect directly (there is no design for a disconnection message);

# Request Message Format

# Starter

The first packet sent after establishing a connection, indicating the purpose of this connection and how subsequent data packets should be parsed. The format is JSON text, containing the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Fill in the service engine number corresponding to the capability: "NLP7" (sensechat), "NLP10" (SenseTime humanoid model) |
| device | Device ID | string | Empty string | Device ID, recommended to fill in for traceability and troubleshooting |
| session | Session ID | string | Random UUIDv4 | It is recommended that the caller generates a Session ID for traceability and troubleshooting |
| nlp | NLP Config | object | Required | NLP-specific configuration, see details below |

NLP Config details are as follows:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages. By default, error messages are returned |
| know_ids | Knowledge IDs | string list | Empty list | Optional field, list of knowledge base IDs, supported only by certain large language model engines |
| prompt_header | System Role of Prompt | string | Empty string | Optional field, background description of the prompt. If empty, the preset value in the configuration is used. Supported only by certain large language model engines |
| max_reply_token | Max Token in Reply | int | 500 | Optional field, maximum number of tokens in the reply. The actual maximum value depends on the model; supported only by certain large language model engines |
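The Starter packet above is plain JSON, so it can be assembled with a small helper. The function below is a hypothetical client-side sketch (not part of any platform SDK) that fills in the documented defaults; the field names come from the two tables above:

```python
import json
import uuid

def build_nlp_starter(workflow="NLP7", device="", session=None,
                      omit_error=False, know_ids=None,
                      prompt_header="", max_reply_token=500):
    """Assemble an NLP Starter packet with the documented defaults."""
    return json.dumps({
        "type": workflow,                         # service engine number, e.g. "NLP7"
        "device": device,                         # recommended for troubleshooting
        "session": session or str(uuid.uuid4()),  # caller-generated Session ID
        "nlp": {
            "omit_error": omit_error,
            "know_ids": know_ids or [],
            "prompt_header": prompt_header,
            "max_reply_token": max_reply_token,
        },
    }, ensure_ascii=False)
```

The resulting text is sent as the first WebSocket message after the connection is established.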

# Query

After the Starter packet is sent and the connection is successfully established, multiple Query text packets can be sent to submit user questions. The format is JSON text, containing the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| id | Trace ID | string | Random UUIDv4 | Optional field; it is recommended that the caller generates and fills in the Trace ID to distinguish between concurrent requests |
| query | Query | string | Required | The text content of the user's question |
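A Query packet is equally small. The helper below is illustrative only; it generates the random UUIDv4 Trace ID that the table above recommends when the caller does not supply one:

```python
import json
import uuid

def build_query(query, trace_id=None):
    """Build a Query packet; the Trace ID defaults to a random UUIDv4."""
    return json.dumps({
        "id": trace_id or str(uuid.uuid4()),  # distinguishes concurrent requests
        "query": query,                       # required: the user's question text
    }, ensure_ascii=False)
```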

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is JSON text, containing the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |

# NLP Result Data

For each successfully processed Query, a message will be returned. When omit_error is false, error messages will also be returned. The basic format of the return message is as follows:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., nlp |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current sentence |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |
| nlp | NLP Content | object | No | The answer result returned on success; see the field meanings below |

The specific recognition results are located in NLP Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| index | Index No. | int | Yes | The sequence number of the return packet |
| query | Query | string | Yes | The submitted question text |
| answer | Answer | string | Yes | The returned broadcast text; the digital human front end calls TTS to broadcast it |
| text | Text | string | No | The returned display text, shown as text on the digital human front end |
| finish_reason | Finish Reason | string | Yes | The reason generation stopped, enumerated values: stop (end token reached), length (maximum generation length reached), sensitive (sensitive word triggered), context (model context length limit reached) |
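Putting the result fields together, a client can parse each NLP message, surface errors, and check finish_reason for a terminal condition. This is a minimal sketch under the field definitions above; how a client chooses to treat each stop reason is application-specific:

```python
import json

def handle_nlp_message(raw):
    """Parse one NLP result message and return (answer, finished).

    `finished` is True when finish_reason is one of the documented
    terminal values. Raises on a fail status (unless omit_error was
    set in the Starter, in which case error messages are not sent).
    """
    msg = json.loads(raw)
    if msg.get("status") != "ok":
        raise RuntimeError(msg.get("error", "unknown NLP error"))
    nlp = msg["nlp"]
    finished = nlp.get("finish_reason") in ("stop", "length", "sensitive", "context")
    return nlp["answer"], finished
```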

# TTS (Text-to-Speech) Integration Guide (Old)

Note: This interface is currently in maintenance mode and will not be updated with new features.

This document explains how to use the TTS capability via the Central Control WebSocket full-duplex interface. The connection uses the WebSocket protocol, and all messages are UTF-8 encoded JSON texts.

# Invocation Process

  1. Establish a WebSocket connection, usually at ws://aigc.softsugar.com/api/voice/stream/v1?Authorization={Token};
  2. Send a Starter packet, containing the general configuration information for subsequent TTS requests. If the format is incorrect or not sent within 10 seconds, the WebSocket connection will be terminated;
  3. Receive a response indicating whether authentication was successful or failed;
  4. Send a Task packet, containing specific text and format information to be synthesized;
  5. Receive data packets corresponding to the Task;
  6. If there are no more voice synthesis tasks, you can disconnect directly (there is no design for disconnect message);

# Request Message Format

# Starter

The first packet sent after establishing a connection, indicating the purpose of this connection and the parsing method for subsequent data packets. The format is a JSON text, including the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| type | Workflow Type | string | Required | Fill in the service engine number corresponding to the capability, e.g., "TTS3" |
| device | Device ID | string | Empty string | Device ID, recommended to fill in for troubleshooting and tracking issues |
| session | Session ID | string | Random UUIDv4 | It is recommended to generate and fill in the Session ID for troubleshooting and tracking issues |
| tts | TTS Config | object | Required | TTS-specific configuration, see below for details |

TTS Config details:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| language | Language Code | string | zh-CN | Optional field, the language to be synthesized; must be supported by the voice actor |
| voice | Voice ID | string | Engine-specific default | Optional field, selectable voice actor |
| pitch_offset | Pitch Offset | float | 0.0 | Optional field, pitch; the higher the value, the sharper the voice, and vice versa. Range [-10, 10] |
| style | Style | string | Empty string | Optional field, the emotion of the voice actor |
| speed_ratio | Speed Ratio | float | 1.0 | Optional field, speech speed; the higher the value, the slower the speech. Range [0.5, 2] |
| sample_rate | Sample Rate | int | 16000 | Optional field, sampling rate; supports 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000 |
| volume | Volume | int | 100 | Optional field, volume; the higher the value, the louder the sound. Range [1, 400] |
| format | File Format | string | pcm | Optional field, audio file and content format; may support pcm, wav, mp3, but only pcm supports streaming return |
| omit_error | Omit Error Message in Response | bool | false | Optional field, whether to omit error messages; returned by default |
| audio | Return Audio Data | bool | true | Optional field, whether to return audio; returned by default |
| phone | Return Phonetic Symbols | bool | false | Optional field, whether to return phonetic symbols; not returned by default |
| polyphone | Return Polyphone | bool | false | Optional field, whether to return polyphones in the query; not returned by default |
| subtitle | Subtitle Format | string | Empty string | Optional field, the format of subtitles to return; empty means none. Supports: srt |
| subtitle_max_length | Subtitle Max Length | int | 0 | Optional field, maximum number of characters per subtitle line/time-stamped sentence; 0 means unlimited. Only effective when returning subtitles or sentence-level timestamps |
| subtitle_cut_by_punc | Subtitle Cut by Punctuation | bool | false | Optional field, whether to break lines and strip punctuation for subtitles/sentence-level timestamps based on punctuation; only effective when returning subtitles or sentence-level timestamps. See Quick Reference for the range of punctuation |
| sentence_time | Return Sentence-Level Timestamp | bool | false | Optional field, whether to return sentence-level timestamps |
| word_time | Return Word-Level Timestamp | bool | false | Optional field, whether to return word-level timestamps |
| cache_url | Return Cache URL for Data | bool | false | Optional field, whether to upload audio, phonetic symbol, and subtitle files to the object store and return cache URLs. Deprecated; no longer maintained, please stop using it as soon as possible |
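Several of these fields carry numeric ranges that the server will reject if exceeded. A client-side pre-check is a convenient guard; the validator below is purely illustrative (the server performs its own validation) and encodes only the ranges stated in the table:

```python
def validate_tts_config(cfg):
    """Check a TTS Config dict against the documented ranges.

    Returns a list of problems (empty when the config looks valid).
    Missing fields are checked against their documented defaults.
    """
    problems = []
    rate = cfg.get("sample_rate", 16000)
    if rate not in (8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000):
        problems.append(f"unsupported sample_rate: {rate}")
    if not -10 <= cfg.get("pitch_offset", 0.0) <= 10:
        problems.append("pitch_offset out of range [-10, 10]")
    if not 0.5 <= cfg.get("speed_ratio", 1.0) <= 2:
        problems.append("speed_ratio out of range [0.5, 2]")
    if not 1 <= cfg.get("volume", 100) <= 400:
        problems.append("volume out of range [1, 400]")
    return problems
```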

# Task

After sending the Starter packet and successfully establishing a connection, multiple Task packets can be sent to submit synthesis tasks. The Task packet format is a JSON text, including the following fields:

| Field | Name | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Random UUIDv4 | Optional field; it is recommended to generate and fill in to distinguish requests in concurrent scenarios |
| query | Query | string | Required | The text content to be synthesized |
| ssml | Use SSML | bool | false | Optional field, whether the synthesis text is marked up with SSML; refer to the ONES documentation for the syntax |
| no_cache | Disable Cache | bool | false | Optional field, whether to disable result caching for the current request; if enabled, cached results are neither used nor stored |
| override | TTS Config | object | Empty | Optional field, independent configuration for a single TTS request; completely replaces the TTS configuration in the Starter message for the current task (Note: a direct replacement, not a merge) |
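The replacement semantics of `override` are easy to get wrong: fields absent from the override do not fall back to the Starter values, but to the engine defaults. A client-side sketch of the difference (the replacement itself happens on the server):

```python
def effective_config(starter_tts, task_override=None):
    """Configuration in effect for one Task: `override`, when present,
    completely replaces the Starter's TTS config (no merging)."""
    return dict(task_override) if task_override else dict(starter_tts)

starter = {"voice": "xiaoling", "speed_ratio": 1.05, "volume": 200}
override = {"speed_ratio": 0.8}

# With the override, voice and volume revert to engine defaults,
# NOT to the Starter values.
cfg = effective_config(starter, override)
```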

# Response Message Format

# Authentication Result

After sending the Starter request, a message containing the authentication result will be returned. The format is a JSON text, including the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., auth |
| session | Session ID | string | Yes | The Session ID of the current connection |
| status | Status Name | enum | Yes | The status of the current session: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |

# TTS Result Data

Each successful Task continuously returns multiple data packets: audio, audio file address, phonetic symbol, phonetic symbol file address, subtitle, and subtitle file address packets. Packets of the same type are returned in logical order, but the relative order of different packet types is not guaranteed. If phonetic symbols, subtitles, and cache URLs are not requested in the Starter request, only audio packets are returned.

The format of the return message is a JSON text, including the following fields:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| service | Service Name | string | Yes | The service module corresponding to the current request, i.e., tts |
| session | Session ID | string | Yes | The Session ID of the current connection |
| trace | Trace ID | string | Yes | The Trace ID corresponding to the current Task |
| status | Status Name | enum | Yes | The status of the current Task: ok on success, fail on failure |
| error | Error Message | string | No | The error message returned on failure |
| tts | TTS Content | object | No | The synthesis result returned on success; see below for the field meanings |

Specific synthesis results are located within TTS Content:

| Field | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| id | Task ID | string | Yes | The ID corresponding to the current Task |
| index | Index No. | int | Yes | Sequence number of the packet within the current Task |
| type | Package Type | enum | Yes | audio for audio packets, audio_url for audio address packets, phone for phonetic symbol packets, phone_url for phonetic symbol address packets, subtitle for subtitle packets, subtitle_url for subtitle address packets, polyphone for polyphone packets, timestamp for timestamp packets, eof indicates all packets have been sent |
| audio_data | Base64-encoded Audio Data | string | No | Audio data, only present in audio packets |
| phone_data | Base64-encoded Phonetic Symbols | string | No | Phonetic symbol data, only present in phonetic symbol packets |
| polyphones | Polyphone Data | object | No | Polyphone data, only present in polyphone packets |
| subtitle_data | Base64-encoded Subtitles | string | No | Subtitle data, only present in subtitle packets |
| sentence_time | Sentence-Level Timestamp | object | No | Sentence-level timestamp, only present in timestamp packets |
| word_times | Word-Level Timestamp | object | No | Word-level timestamps, only present in timestamp packets |
| resource | Phonetic Symbols Info | object | No | Phonetic symbol information, only present in phonetic symbol packets (always returned when using the TTS2 single-return interface) |

# Audio Packet

Contains Base64 encoded synthesized audio data results.

When the audio format is pcm, it is returned in multiple streaming packets; other formats will be returned in a single packet after audio synthesis.
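Because streamed pcm arrives as several Base64-encoded packets, the client must decode each `audio_data` and concatenate the chunks; sorting by `index` guards against any client-side reordering. A minimal sketch:

```python
import base64

def assemble_pcm(audio_packets):
    """Concatenate streamed pcm audio packets into raw PCM bytes.

    `audio_packets` is an iterable of the `tts` payload objects; only
    those with type == "audio" are used, ordered by their index field.
    """
    chunks = sorted((p for p in audio_packets if p.get("type") == "audio"),
                    key=lambda p: p["index"])
    return b"".join(base64.b64decode(p["audio_data"]) for p in chunks)
```

The result is headerless PCM at the sample rate requested in the Starter; writing a playable file additionally requires a WAV header or similar container.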

Audio packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "851ab562-ec51-4ad7-bd21-4f4af19875cb",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAA...DYBfQHDAeMBEwI8AlQCRAIaAu0BpAFCAckARQCv...AAAAAAAAAAAAAAAAAAAAAAAAAAA=="
  }
}
# Phoneme Packet

Contains the Base64-encoded synthesized phoneme data.

Phoneme packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 2,
    "type": "phone",
    "phone_data": "biBuIGkgaSBpIGkgaSBpIDIgMiAjMSAjMSAjMSAjMSAjMSAjMSBoIGggaCBoIGggaCBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBhbyBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5kIGVuZCBlbmQgZW5k"
  }
}
# Subtitle Packet

Contains the Base64-encoded synthesized subtitle data.

Subtitle packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 4,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}
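`subtitle_data` is simply the Base64 encoding of the subtitle file contents (SRT in this configuration). Decoding the sample above recovers a one-cue SRT file:

```python
import base64

# subtitle_data taken from the example packet above
subtitle_data = "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
srt_text = base64.b64decode(subtitle_data).decode("utf-8")
print(srt_text)
# 1
# 00:00:00,000 --> 00:00:00,528
# 你好。
```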
# Timestamp Packet

Contains sentence-level and character-level timestamp information.

Timestamp packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "4d5ed90f-2188-42a0-b4d7-db00f1a2a944", 
    "index": 7, 
    "type": "timestamp", 
    "sentence_time": {
        "begin_ms": 7770,
        "end_ms": 9140, 
        "text": "新人起步很不容易"
    }, 
    "word_times": [
        {"begin_ms": 7770, "end_ms": 7960, "text": "新"}, 
        {"begin_ms": 7960, "end_ms": 8120, "text": "人"}, 
        {"begin_ms": 8120, "end_ms": 8310, "text": "起"}, 
        {"begin_ms": 8310, "end_ms": 8430, "text": "步"}, 
        {"begin_ms": 8430, "end_ms": 8630, "text": "很"}, 
        {"begin_ms": 8630, "end_ms": 8720, "text": "不"}, 
        {"begin_ms": 8720, "end_ms": 8920, "text": "容"}, 
        {"begin_ms": 8920, "end_ms": 9140, "text": "易"}
    ]
  }
}
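In the sample, the word-level spans tile the sentence-level span: each word begins where the previous one ends, and the first and last words match the sentence boundaries. A quick consistency check over a timestamp payload (illustrative only; the platform does not document this invariant as a guarantee):

```python
def timestamps_consistent(tts):
    """True when word_times are contiguous and exactly cover sentence_time."""
    sent, words = tts["sentence_time"], tts["word_times"]
    if not words:
        return False
    edges_ok = all(a["end_ms"] == b["begin_ms"] for a, b in zip(words, words[1:]))
    return (edges_ok
            and words[0]["begin_ms"] == sent["begin_ms"]
            and words[-1]["end_ms"] == sent["end_ms"])
```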

# Polyphone Packet

Contains polyphone information, with the recommended pronunciation listed first, followed by the other pronunciations.

Polyphone packet example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "b2b2b2b2-b2b2-b2b2-b2b2-b2b2b2b2b2b2",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}
# EOF

EOF result packet, indicating that all results have been sent.

EOF example:

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "02261a7e-7df8-4554-b2a1-ad2fa8bf2cbb",
    "index": 8,
    "type": "eof"
  }
}
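Since the relative order of packet types is not guaranteed, a practical receive loop buckets packets by `type` until the Task's eof packet arrives. A sketch over already-parsed response messages (the WebSocket receive itself is omitted):

```python
from collections import defaultdict

def collect_until_eof(messages, task_id):
    """Group the tts payloads of one Task by packet type, stopping at eof.

    `messages` is an iterable of parsed response dicts; only packets
    whose tts.id matches `task_id` are collected.
    """
    buckets = defaultdict(list)
    for msg in messages:
        tts = msg.get("tts") or {}
        if tts.get("id") != task_id:
            continue          # another concurrent Task's packet
        if tts["type"] == "eof":
            break             # all packets for this Task have been sent
        buckets[tts["type"]].append(tts)
    return buckets
```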

# Practical Process Sample Analysis

# Case 1: Minimum Configuration Process

Request: Starter

{
  "type": "TTS3",
  "tts": {}
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260"
}

Request: Task

{
  "query": "大家好!"
}

Response: 2

{
  "service": "tts",
  "status": "ok",
  "session": "49d3af81-f344-4ccf-8231-574ceac1a260",
  "trace": "f2e13c02-c629-4db8-a942-4393583a5182",
  "tts": {
    "id": "4b69geebj4septyxh72qy885f",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

# Case 2: Complete Configuration Process

Request: Starter

{
  "auth": "XSMLTGKQVVCPJCQHJZ4VEDMGIY",
  "type": "TTS3",
  "device": "device-wei",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "tts": {
    "language": "zh-CN",
    "voice": "xiaoling",
    "speed_ratio": 1.05,
    "sample_rate": 16000,
    "volume": 200,
    "phone": true,
    "polyphone": true,
    "subtitle": "srt",
    "sentence_time": true,
    "word_time": true
  }
}

Response: 1

{
  "service": "auth",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a"
}

Request: Task

{
  "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
  "query": "你好。",
  "ssml": false
}

Response: 2 Audio

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "335d508e-688f-4fc1-b057-4a4aa78b9ee7",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 1,
    "type": "audio",
    "audio_data": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAA...//wkA/P8CAP//AAD4/wMA0v8oAL3/...AAAAAAAAAAAAA=="
  }
}

Response: 3 Phoneme

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 2,
    "type": "phone",
    "phone_data": "aiBqIGluIGluIGluIGluIGluIG...ZCBlbmQgZW5k"
  }
}

Response: 4 Timestamp

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "de9a1066-d968-475a-ac38-b2da017b2a27", 
    "index": 3, 
    "type": "timestamp", 
    "sentence_time": {
      "begin_ms": 500, 
      "end_ms": 1010, 
      "text": "你好。"
    }, 
    "word_times": [
      {"begin_ms": 500, "end_ms": 590, "text": "你"}, 
      {"begin_ms": 590, "end_ms": 1010, "text": "好"}
    ]
  }
}

Response: 5 Polyphone

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "08a5785a-a6a2-4140-b587-a6cead592531",
  "tts": {
    "id": "a5e8b592-f0b1-46ad-bc97-836cbb010310",
    "index": 4,
    "type": "polyphone",
    "polyphones": [
        {
            "word": "好",
            "phones": ["hao3", "hao4"]
        }
    ]
  }
}

Response: 6 Subtitles

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 5,
    "type": "subtitle",
    "subtitle_data": "MQowMDowMDowMCwwMDAgLS0+IDAwOjAwOjAwLDUyOArkvaDlpb3jgIIKCg=="
  }
}

Response: 7 EOF

{
  "service": "tts",
  "status": "ok",
  "session": "5ef8b534-3b54-47e2-94d9-ff165864ad4a",
  "trace": "d923181d-9d9b-4be1-9370-40f456be3771",
  "tts": {
    "id": "bf3qmpuuk18ktv7cv4b6kzhs9",
    "index": 9,
    "type": "eof"
  }
}
Last Updated: 9/4/2024, 9:32:01 PM