Add wake word to Assist pipeline docs (#1855)

2025-07-16 13:56:29 +00:00 · 2023-10-07 07:29:16 -05:00 · 2023-10-07 07:29:16 -05:00 · 49f4c2164a
commit 49f4c2164a
parent 12f987b28a
1 changed files with 72 additions and 22 deletions
--- a/docs/voice/pipelines/index.md
+++ b/docs/voice/pipelines/index.md
@@ -4,9 +4,10 @@ title: "Assist Pipelines"

 The [Assist pipeline](https://www.home-assistant.io/integrations/assist_pipeline) integration runs the common steps of a voice assistant:

-1. Speech to text
-2. Intent recognition
-3. Text to speech
+1. Wake word detection
+2. Speech to text
+3. Intent recognition
+4. Text to speech

 Pipelines are run via a WebSocket API:

@@ -23,31 +24,53 @@ Pipelines are run via a WebSocket API:

 The following input fields are available:

-| Name              | Type   | Description                                                                                 |
-|-------------------|--------|---------------------------------------------------------------------------------------------|
-| `start_stage` | enum   | Required. The first stage to run. One of `stt`, `intent`, `tts`. |
-| `end_stage`   | enum   | Required. The last stage to run. One of `stt`, `intent`, `tts`. |
-| `input`       | dict   | Depends on `start_stage`. For STT, the dictionary should contain a key `sample_rate` with an integer value. For intent and TTS, the key `text` should contain the input text.
-| `pipeline` | string | Optional. ID of the pipeline (use `assist_pipeline/pipeline/list` to get names). |
-| `conversation_id` | string | Optional. [Unique id for conversation](/docs/intent_conversation_api#conversation-id).      |
-| `timeout`         | number | Optional. Number of seconds before pipeline times out (default: 30).                        |
+| Name              | Type   | Description                                                                                                                                                                                                                                                                                    |
+|-------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `start_stage`     | enum   | Required. The first stage to run. One of `wake` `stt`, `intent`, `tts`.                                                                                                                                                                                                                        |
+| `end_stage`       | enum   | Required. The last stage to run. One of `stt`, `intent`, `tts`.                                                                                                                                                                                                                                |
+| `input`           | dict   | Depends on `start_stage`: <ul><li>`wake` only:<ul><li>`timeout` - seconds before wake word detection times out (int, default: 3)</li><li>`noise_suppression_level` - amount of noise suppression (int, 0 = disabled, 4 = max)</li><li>`auto_gain_dbfs` - automatic gain (int, 0 = disabled, 31 = max)</li><li>`volume_multiplier` - fixed volume amplification (float, 1.0 = no change, 2.0 = twice as loud)</li></ul></li><li>`wake` and `stt`:<ul><li>`sample_rate` - sample rate of incoming audio (int, hertz)</li></ul></li><li>`intent` and `tts`:<ul><li>`text` - input text (string)</li></ul></li></ul> |
+| `pipeline`        | string | Optional. ID of the pipeline (use `assist_pipeline/pipeline/list` to get names).                                                                                                                                                                                                               |
+| `conversation_id` | string | Optional. [Unique id for conversation](/docs/intent_conversation_api#conversation-id).                                                                                                                                                                                                         |
+| `timeout`         | number | Optional. Number of seconds before pipeline times out (default: 300).                                                                                                                                                                                                                           |

 ## Events

 As the pipeline runs, it emits events back over the WebSocket connection.
 The following events can be emitted:

-| Name            | Description                 | Emitted    | Attributes                                                                                              |
-|-----------------|-----------------------------|------------|---------------------------------------------------------------------------------------------------------|
-| `run-start`     | Start of pipeline run       | always     | `pipeline` - ID of the pipeline<br />`language` - Language used for pipeline<br />`runner_data` - Extra WebSocket data: <ul><li>`stt_binary_handler_id` is the prefix to send speech data over.</li><li>`timeout` is the max run time for the whole pipeline.</li></ul>                         |
-| `run-end`       | End of pipeline run         | always     |                                                                                                         |
-| `stt-start`     | Start of speech to text     | audio only | `engine`: STT engine used<br />`metadata`: incoming audio metadata
-| `stt-end`       | End of speech to text       | audio only | `stt_output` - Object with `text`, the detected text.
-| `intent-start`  | Start of intent recognition | always     | `engine` - [Agent](/docs/intent_conversation_api) engine used<br />`language`: Processing language. <br /> `intent_input` - Input text to agent |
-| `intent-end`    | End of intent recognition   | always     | `intent_output` - [conversation response](/docs/intent_conversation_api#conversation-response)          |
-| `tts-start`     | Start of text to speech     | audio only | `engine` - TTS engine used<br />`language`: Output language.<br />`voice`: Output voice. <br />`tts_input`: Text to speak. |
-| `tts-end`       | End of text to speech       | audio only | `media_id` - Media Source ID of the generated audio<br />`url` - URL to the generated audio<br />`mime_type` - MIME type of the generated audio<br /> |
-| `error`         | Error in pipeline           | On error     | `code` - Error code<br />`message` - Error message |
+| Name           | Description                  | Emitted    | Attributes                                                                                                                                                                                                                                                              |
+|----------------|------------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `run-start`    | Start of pipeline run        | always     | `pipeline` - ID of the pipeline<br />`language` - Language used for pipeline<br />`runner_data` - Extra WebSocket data: <ul><li>`stt_binary_handler_id` is the prefix to send speech data over.</li><li>`timeout` is the max run time for the whole pipeline.</li></ul> |
+| `run-end`      | End of pipeline run          | always     |                                                                                                                                                                                                                                                                         |
+| `wake_word-start`   | Start of wake word detection | audio only | `engine`: wake engine used<br />`metadata`: incoming audio<br />`timeout`: seconds before wake word timeout metadata                                                                                                                                                                                                     |
+| `wake_word-end`     | End of wake word detection   | audio only | `wake_word_output` - Detection result data: <ul><li>`wake_word_id` is the id of detected wake word</li><li>`timestamp` is the detection time relative to start of audio stream (milliseconds, optional)</li></ul>                                                                             |
+| `stt-start`    | Start of speech to text      | audio only | `engine`: STT engine used<br />`metadata`: incoming audio metadata                                                                                                                                                                                                      |
+| `stt-vad-start`    | Start of voice command      | audio only | `timestamp`: time relative to start of audio stream (milliseconds) |
+| `stt-vad-end`    | End of voice command      | audio only | `timestamp`: time relative to start of audio stream (milliseconds) |
+| `stt-end`      | End of speech to text        | audio only | `stt_output` - Object with `text`, the detected text.                                                                                                                                                                                                                   |
+| `intent-start` | Start of intent recognition  | always     | `engine` - [Agent](/docs/intent_conversation_api) engine used<br />`language`: Processing language. <br /> `intent_input` - Input text to agent                                                                                                                         |
+| `intent-end`   | End of intent recognition    | always     | `intent_output` - [conversation response](/docs/intent_conversation_api#conversation-response)                                                                                                                                                                          |
+| `tts-start`    | Start of text to speech      | audio only | `engine` - TTS engine used<br />`language`: Output language.<br />`voice`: Output voice. <br />`tts_input`: Text to speak.                                                                                                                                              |
+| `tts-end`      | End of text to speech        | audio only | `media_id` - Media Source ID of the generated audio<br />`url` - URL to the generated audio<br />`mime_type` - MIME type of the generated audio<br />                                                                                                                   |
+| `error`        | Error in pipeline            | on error   | `code` - Error code ([see below](#error-codes))<br />`message` - Error message                                                                                                                                                                                                                      |
+
+## Error codes
+
+The following codes are returned from the pipeline `error` event:
+
+* `wake-engine-missing` - No wake word engine is installed
+* `wake-provider-missing` - Configured wake word provider is not available
+* `wake-stream-failed` - Unexpected error during wake word detection
+* `wake-word-timeout` - Wake word was not detected within timeout
+* `stt-provider-missing` - Configured speech-to-text provider is not available
+* `stt-provider-unsupported-metadata` - Speech-to-text provider does not support audio format (sample rate, etc.)
+* `stt-stream-failed` - Unexpected error during speech-to-text
+* `stt-no-text-recognized` - Speech-to-text did not return a transcript
+* `intent-not-supported` - Configured conversation agent is not available
+* `intent-failed` - Unexpected error during intent recognition
+* `tts-not-supported` - Configured text-to-speech provider is not available or options are not supported
+* `tts-failed` - Unexpected error during text-to-speech
+

 ## Sending speech data

@@ -56,7 +79,34 @@ After starting a pipeline with `stt` as the first stage of the run and receiving
 For example, if `stt_binary_handler_id` is `1` and the audio chunk is `a1b2c3`, the message would be (in hex):

 ```
+stt_binary_handler_id
+||
 01a1b2c3
+  ||||||
+   audio
 ```

 To indicate the end of sending speech data, send a binary message containing a single byte with the `stt_binary_handler_id`.
+
+## Wake word detection
+
+When `start_stage` is set to `wake`, the pipeline will not run until a wake word has been detected. Clients should avoid unnecessary audio streaming by using a local voice activity detector (VAD) to only start streaming when human speech is detected.
+
+For `wake`, the `input` object should contain a `timeout` float value. This is the number of seconds of silence before the pipeline will time out during wake word detection (error code `wake-word-timeout`).
+If enough speech is detected by Home Assistant's internal VAD, the timeout will be continually reset.
+
+### Audio enhancements
+
+The following settings are available as part of the `input` object when `start_stage` is set to `wake`:
+
+* `noise_suppression_level` - level of noise suppression (0 = disabled, 4 = max)
+* `auto_gain_dbfs` - automatic gain control (0 = disabled, 31 = max)
+* `volume_multiplier` - audio samples multiplied by constant (1.0 = no change, 2.0 = twice as loud)
+
+If your device's microphone is fairly quiet, the recommended settings are:
+
+* `noise_suppression_level` - 2
+* `auto_gain_dbfs` - 31
+* `volume_multiplier` - 2.0
+
+Increasing `noise_suppression_level` or `volume_multiplier` may cause audio distortion.
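
The client side of the added docs can be sketched in Python. This is only an illustration, not part of the commit: the helper names (`wake_input`, `frame_audio`, `end_of_audio`) are hypothetical and do not exist in Home Assistant. Only the field names, the quiet-microphone values, and the one-byte handler prefix come from the diff above. The sketch assumes `stt_binary_handler_id` fits in a single byte, as in the `01a1b2c3` example.

```python
def wake_input(sample_rate: int, timeout: int = 3) -> dict:
    """Build the `input` object for a run whose `start_stage` is `wake`.

    Hypothetical helper. The audio settings are the values the docs
    recommend for a fairly quiet microphone.
    """
    return {
        "sample_rate": sample_rate,        # sample rate of incoming audio, hertz
        "timeout": timeout,                # seconds before wake word detection times out
        "noise_suppression_level": 2,      # 0 = disabled, 4 = max
        "auto_gain_dbfs": 31,              # 0 = disabled, 31 = max
        "volume_multiplier": 2.0,          # 1.0 = no change, 2.0 = twice as loud
    }


def frame_audio(handler_id: int, chunk: bytes) -> bytes:
    """Prefix an audio chunk with the handler id byte.

    Assumption: the id is in 0..255; bytes([...]) raises ValueError otherwise.
    """
    return bytes([handler_id]) + chunk


def end_of_audio(handler_id: int) -> bytes:
    """Build the end-of-stream message: a single byte holding the handler id."""
    return bytes([handler_id])


if __name__ == "__main__":
    # Reproduces the hex example from the docs: handler id 1, chunk a1b2c3.
    print(frame_audio(1, bytes.fromhex("a1b2c3")).hex())
```

Each framed chunk would be sent as one binary WebSocket message, followed by the `end_of_audio` message once the client stops streaming.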