diff --git a/docs/voice/pipelines/index.md b/docs/voice/pipelines/index.md
index cdd5fcb8..ef7d90f6 100644
--- a/docs/voice/pipelines/index.md
+++ b/docs/voice/pipelines/index.md
@@ -4,9 +4,10 @@ title: "Assist Pipelines"
 
 The [Assist pipeline](https://www.home-assistant.io/integrations/assist_pipeline) integration runs the common steps of a voice assistant:
 
-1. Speech to text
-2. Intent recognition
-3. Text to speech
+1. Wake word detection
+2. Speech to text
+3. Intent recognition
+4. Text to speech
 
 Pipelines are run via a WebSocket API:
 
@@ -23,31 +24,53 @@
 The following input fields are available:
 
-| Name              | Type   | Description |
-|-------------------|--------|-------------|
-| `start_stage`     | enum   | Required. The first stage to run. One of `stt`, `intent`, `tts`. |
-| `end_stage`       | enum   | Required. The last stage to run. One of `stt`, `intent`, `tts`. |
-| `input`           | dict   | Depends on `start_stage`. For STT, the dictionary should contain a key `sample_rate` with an integer value. For intent and TTS, the key `text` should contain the input text. |
-| `pipeline`        | string | Optional. ID of the pipeline (use `assist_pipeline/pipeline/list` to get names). |
-| `conversation_id` | string | Optional. [Unique id for conversation](/docs/intent_conversation_api#conversation-id). |
-| `timeout`         | number | Optional. Number of seconds before pipeline times out (default: 30). |
+| Name              | Type   | Description |
+|-------------------|--------|-------------|
+| `start_stage`     | enum   | Required. The first stage to run. One of `wake`, `stt`, `intent`, `tts`. |
+| `end_stage`       | enum   | Required. The last stage to run. One of `stt`, `intent`, `tts`. |
+| `input`           | dict   | Depends on `start_stage`. For `wake`, the dictionary should contain a `timeout` float value ([see below](#wake-word-detection)). For `stt`, it should contain a key `sample_rate` with an integer value. For `intent` and `tts`, the key `text` should contain the input text. |
+| `pipeline`        | string | Optional. ID of the pipeline (use `assist_pipeline/pipeline/list` to get names). |
+| `conversation_id` | string | Optional. [Unique id for conversation](/docs/intent_conversation_api#conversation-id). |
+| `timeout`         | number | Optional. Number of seconds before pipeline times out (default: 300). |
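+
+For illustration, a text-only run that starts at intent recognition and ends after text to speech could be requested as follows. This is a sketch: the input values are made up, and the message `id` and the authentication handshake that precedes any command are part of the regular Home Assistant WebSocket API:
+
+```python
+import json
+
+# Sketch of an `assist_pipeline/run` request (illustrative values).
+run_request = json.dumps({
+    "id": 1,  # standard WebSocket API message id
+    "type": "assist_pipeline/run",
+    "start_stage": "intent",
+    "end_stage": "tts",
+    "input": {"text": "turn on the kitchen lights"},
+})
+```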
 
 ## Events
 
 As the pipeline runs, it emits events back over the WebSocket connection. The following events can be emitted:
 
-| Name           | Description                 | Emitted    | Attributes |
-|----------------|-----------------------------|------------|------------|
-| `run-start`    | Start of pipeline run       | always     | `pipeline` - ID of the pipeline <br> `language` - Language used for pipeline <br> `runner_data` - Extra WebSocket data: |
-| `run-end`      | End of pipeline run         | always     | |
-| `stt-start`    | Start of speech to text     | audio only | `engine`: STT engine used <br> `metadata`: incoming audio metadata |
-| `stt-end`      | End of speech to text       | audio only | `stt_output` - Object with `text`, the detected text. |
-| `intent-start` | Start of intent recognition | always     | `engine` - [Agent](/docs/intent_conversation_api) engine used <br> `language`: Processing language. <br> `intent_input` - Input text to agent |
-| `intent-end`   | End of intent recognition   | always     | `intent_output` - [conversation response](/docs/intent_conversation_api#conversation-response) |
-| `tts-start`    | Start of text to speech     | audio only | `engine` - TTS engine used <br> `language`: Output language. <br> `voice`: Output voice. <br> `tts_input`: Text to speak. |
-| `tts-end`      | End of text to speech       | audio only | `media_id` - Media Source ID of the generated audio <br> `url` - URL to the generated audio <br> `mime_type` - MIME type of the generated audio |
-| `error`        | Error in pipeline           | On error   | `code` - Error code <br> `message` - Error message |
+| Name              | Description                  | Emitted    | Attributes |
+|-------------------|------------------------------|------------|------------|
+| `run-start`       | Start of pipeline run        | always     | `pipeline` - ID of the pipeline <br> `language` - Language used for pipeline <br> `runner_data` - Extra WebSocket data: <br> `stt_binary_handler_id` - id used to prefix binary audio messages ([see below](#sending-speech-data)) |
+| `run-end`         | End of pipeline run          | always     | |
+| `wake_word-start` | Start of wake word detection | audio only | `engine`: wake engine used <br> `metadata`: incoming audio metadata <br> `timeout`: seconds before wake word timeout |
+| `wake_word-end`   | End of wake word detection   | audio only | `wake_word_output` - Detection result data |
+| `stt-start`       | Start of speech to text      | audio only | `engine`: STT engine used <br> `metadata`: incoming audio metadata |
+| `stt-vad-start`   | Start of voice command       | audio only | `timestamp`: time relative to start of audio stream (milliseconds) |
+| `stt-vad-end`     | End of voice command         | audio only | `timestamp`: time relative to start of audio stream (milliseconds) |
+| `stt-end`         | End of speech to text        | audio only | `stt_output` - Object with `text`, the detected text. |
+| `intent-start`    | Start of intent recognition  | always     | `engine` - [Agent](/docs/intent_conversation_api) engine used <br> `language`: Processing language. <br> `intent_input` - Input text to agent |
+| `intent-end`      | End of intent recognition    | always     | `intent_output` - [conversation response](/docs/intent_conversation_api#conversation-response) |
+| `tts-start`       | Start of text to speech      | audio only | `engine` - TTS engine used <br> `language`: Output language. <br> `voice`: Output voice. <br> `tts_input`: Text to speak. |
+| `tts-end`         | End of text to speech        | audio only | `media_id` - Media Source ID of the generated audio <br> `url` - URL to the generated audio <br> `mime_type` - MIME type of the generated audio |
+| `error`           | Error in pipeline            | on error   | `code` - Error code ([see below](#error-codes)) <br> `message` - Error message |
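+
+A minimal sketch of consuming these events, assuming `ws` is an already authenticated connection from the Python `websockets` package and that pipeline events arrive in the usual Home Assistant event envelope (`{"type": "event", "event": {...}}`):
+
+```python
+import json
+
+async def wait_for_run(ws) -> None:
+    """Print notable pipeline events until the run finishes."""
+    while True:
+        message = json.loads(await ws.recv())
+        if message.get("type") != "event":
+            continue  # skip command results and unrelated traffic
+        event = message["event"]
+        data = event.get("data") or {}
+        if event["type"] == "stt-end":
+            print("Recognized:", data["stt_output"]["text"])
+        elif event["type"] == "tts-end":
+            print("Response audio:", data["url"])
+        elif event["type"] == "error":
+            print("Pipeline error:", data["code"], "-", data["message"])
+        elif event["type"] == "run-end":
+            break  # the table above says `run-end` is always emitted
+```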
+
+## Error codes
+
+The following codes are returned from the pipeline `error` event:
+
+* `wake-engine-missing` - No wake word engine is installed
+* `wake-provider-missing` - Configured wake word provider is not available
+* `wake-stream-failed` - Unexpected error during wake word detection
+* `wake-word-timeout` - Wake word was not detected within the timeout
+* `stt-provider-missing` - Configured speech-to-text provider is not available
+* `stt-provider-unsupported-metadata` - Speech-to-text provider does not support the audio format (sample rate, etc.)
+* `stt-stream-failed` - Unexpected error during speech-to-text
+* `stt-no-text-recognized` - Speech-to-text did not return a transcript
+* `intent-not-supported` - Configured conversation agent is not available
+* `intent-failed` - Unexpected error during intent recognition
+* `tts-not-supported` - Configured text-to-speech provider is not available or options are not supported
+* `tts-failed` - Unexpected error during text-to-speech
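+
+For example, an `error` event for a run where no speech was recognized might look like this (a hypothetical payload; envelope fields other than `type` and `data` are omitted):
+
+```python
+# Hypothetical `error` event as it would appear inside the event envelope.
+error_event = {
+    "type": "error",
+    "data": {
+        "code": "stt-no-text-recognized",
+        "message": "No text recognized",  # illustrative message text
+    },
+}
+```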
+
 ## Sending speech data
 
@@ -56,7 +79,34 @@ After starting a pipeline with `stt` as the first stage of the run and receiving
 For example, if `stt_binary_handler_id` is `1` and the audio chunk is `a1b2c3`, the message would be (in hex):
 
 ```
+stt_binary_handler_id
+||
 01a1b2c3
+  ||||||
+  audio
 ```
 
 To indicate the end of sending speech data, send a binary message containing a single byte with the `stt_binary_handler_id`.
+
+## Wake word detection
+
+When `start_stage` is set to `wake`, the pipeline will not run until a wake word has been detected. Clients should avoid unnecessary audio streaming by using a local voice activity detector (VAD) to start streaming only when human speech is detected.
+
+For `wake`, the `input` object should contain a `timeout` float value. This is the number of seconds of silence before wake word detection times out (error code `wake-word-timeout`). If enough speech is detected by Home Assistant's internal VAD, the timeout is continually reset.
+
+### Audio enhancements
+
+The following settings are available as part of the `input` object when `start_stage` is set to `wake`:
+
+* `noise_suppression_level` - level of noise suppression (0 = disabled, 4 = max)
+* `auto_gain_dbfs` - automatic gain control (0 = disabled, 31 = max)
+* `volume_multiplier` - audio samples are multiplied by a constant (1.0 = no change, 2.0 = twice as loud)
+
+If your device's microphone is fairly quiet, the recommended settings are:
+
+* `noise_suppression_level` - 2
+* `auto_gain_dbfs` - 31
+* `volume_multiplier` - 2.0
+
+Increasing `noise_suppression_level` or `volume_multiplier` may cause audio distortion.
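+
+Putting the wake word options together, a sketch of a run request that starts at wake word detection and applies the recommended quiet-microphone settings (values are illustrative; the `sample_rate` key is assumed to be accepted here just as it is for `stt` input):
+
+```python
+import json
+
+run_request = json.dumps({
+    "id": 2,  # standard WebSocket API message id
+    "type": "assist_pipeline/run",
+    "start_stage": "wake",
+    "end_stage": "tts",
+    "input": {
+        "sample_rate": 16000,   # assumed, as for `stt` input
+        "timeout": 3,           # seconds of silence before `wake-word-timeout`
+        "noise_suppression_level": 2,
+        "auto_gain_dbfs": 31,
+        "volume_multiplier": 2.0,
+    },
+})
+```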
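+
+Once a run with an audio stage has started, speech is streamed as binary messages as described under [Sending speech data](#sending-speech-data). A sketch, assuming `ws` is an authenticated `websockets` connection and `stt_binary_handler_id` was read from the `run-start` event:
+
+```python
+async def stream_audio(ws, stt_binary_handler_id: int, chunks) -> None:
+    """Send prefixed binary audio messages, then the end-of-audio marker."""
+    prefix = bytes([stt_binary_handler_id])
+    for chunk in chunks:       # e.g. raw PCM buffers from a microphone
+        await ws.send(prefix + chunk)
+    await ws.send(prefix)      # a lone handler-id byte ends the stream
+```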