81 changes: 31 additions & 50 deletions docs/configure-rails/yaml-schema/streaming/global-streaming.md
@@ -1,50 +1,33 @@
---
-title: Global Streaming
-description: Enable streaming mode for LLM token generation in config.yml.
+title: Streaming
+description: Using streaming mode for LLM token generation in NeMo Guardrails.
---

-# Global Streaming
+# Streaming

-Enable streaming mode for the main LLM generation at the top level of `config.yml`.
+NeMo Guardrails supports streaming LLM responses via the `stream_async()` method. No configuration is required to enable streaming; simply use `stream_async()` instead of `generate_async()`.

-## Configuration
+## Basic Usage

-```yaml
-streaming: True
-```
-
-## What It Does
-
-When enabled, global streaming:
-
-- Sets `streaming = True` on the underlying LLM model
-- Enables `stream_usage = True` for token usage tracking
-- Allows using the `stream_async()` method on `LLMRails`
-- Makes the LLM produce tokens incrementally instead of all at once
-
-## Default
-
-`False`
+```python
+from nemoguardrails import LLMRails, RailsConfig
+
+config = RailsConfig.from_path("./config")
+rails = LLMRails(config)
+
+messages = [{"role": "user", "content": "Hello!"}]
+
+async for chunk in rails.stream_async(messages=messages):
+    print(chunk, end="", flush=True)
+```

---

-## When to Use
-
-### Streaming Without Output Rails
-
-If you do not have output rails configured, only global streaming is needed:
-
-```yaml
-streaming: True
-```
-
-### Streaming With Output Rails
+## Streaming With Output Rails

-When using output rails with streaming, you must also configure [output rail streaming](output-rail-streaming.md):
+When using output rails with streaming, you must configure [output rail streaming](output-rail-streaming.md):

```yaml
streaming: True

rails:
  output:
    flows:
@@ -53,27 +36,15 @@ rails:
      enabled: True
```
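Putting the pieces together, a minimal `config.yml` for streaming with an output rail might look like the following sketch. The `chunk_size` value mirrors the example in the streaming configuration guide, and the `self check output` flow assumes its prompt is defined elsewhere in your configuration:

```yaml
rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
      chunk_size: 200
```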

----
+If output rails are configured but `rails.output.streaming.enabled` is not set to `True`, calling `stream_async()` will raise a `StreamingNotSupportedError`.

-## Python API Usage
+---

-### Simple Streaming
+## Streaming With Handler (Deprecated)

-```python
-from nemoguardrails import LLMRails, RailsConfig
-
-config = RailsConfig.from_path("./config")
-rails = LLMRails(config)
-
-messages = [{"role": "user", "content": "Hello!"}]
-
-async for chunk in rails.stream_async(messages=messages):
-    print(chunk, end="", flush=True)
-```
-
-### Streaming With Handler
+> **Warning:** Using `StreamingHandler` directly is deprecated and will be removed in a future release. Use `stream_async()` instead.

-For more control, use a `StreamingHandler`:
+For advanced use cases requiring more control over token processing, you can use a `StreamingHandler` with `generate_async()`:

```python
from nemoguardrails import LLMRails, RailsConfig
@@ -113,9 +84,19 @@ Enable streaming in the request body by setting `stream` to `true`:

---

## CLI Usage

Use the `--streaming` flag with the chat command:

```bash
nemoguardrails chat path/to/config --streaming
```

---

## Token Usage Tracking

-When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` for the underlying LLM model.
+When using `stream_async()`, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` on the underlying LLM model.

Access token usage through the `log` generation option:

31 changes: 5 additions & 26 deletions docs/configure-rails/yaml-schema/streaming/index.md
@@ -1,37 +1,23 @@
---
title: Streaming Configuration
-description: Configure streaming for LLM token generation and output rail processing in config.yml.
+description: Configure streaming for output rail processing in config.yml.
---

# Streaming Configuration

-NeMo Guardrails supports two levels of streaming configuration:
+NeMo Guardrails supports streaming out of the box when using the `stream_async()` method. No configuration is required to enable basic streaming.

-1. **Global streaming** - Controls LLM token generation
-2. **Output rail streaming** - Controls how output rails process streamed tokens
-
-## Configuration Comparison
-
-| Aspect | Global `streaming` | Output Rail `streaming.enabled` |
-|--------|-------------------|--------------------------------|
-| **Scope** | LLM token generation | Output rail processing |
-| **Required for** | Any streaming | Streaming with output rails |
-| **Affects** | How LLM produces tokens | How rails process token chunks |
-| **Default** | `False` | `False` |
+When you have **output rails** configured, you need to explicitly enable streaming for them to process tokens in chunked mode.
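Conceptually, chunked processing buffers streamed tokens and runs a check on each chunk as it fills. The following stand-alone sketch simulates that buffering logic with a toy check function; it is an illustration of the concept, not the library's actual implementation:

```python
from typing import Callable, Iterable, List

def process_stream_in_chunks(
    tokens: Iterable[str],
    check: Callable[[str], bool],
    chunk_size: int = 200,
) -> List[str]:
    """Buffer streamed tokens into chunks of roughly `chunk_size`
    characters and run `check` on each chunk before releasing it."""
    released = []
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= chunk_size:
            if not check(buffer):
                raise ValueError("output rail blocked a chunk")
            released.append(buffer)
            buffer = ""
    if buffer:  # flush the final partial chunk
        if not check(buffer):
            raise ValueError("output rail blocked a chunk")
        released.append(buffer)
    return released

# Toy check: block any chunk containing a forbidden word.
chunks = process_stream_in_chunks(
    ["Hello ", "world, ", "this ", "is ", "fine."],
    check=lambda text: "forbidden" not in text,
    chunk_size=12,
)
print(chunks)
```

A larger `chunk_size` gives each check more context but delays delivery of tokens to the caller; that trade-off is why the chunk size is configurable.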

## Quick Example

-When using streaming with output rails, both configurations are required:
+When using streaming with output rails:

```yaml
-# Global: Enable LLM streaming
-streaming: True
-
rails:
  output:
    flows:
      - self check output
-    # Output rail streaming: Enable chunked processing
    streaming:
      enabled: True
      chunk_size: 200
@@ -40,18 +26,11 @@ rails:

## Streaming Configuration Details

-The following guides provide detailed documentation for each streaming configuration area.
+The following guides provide detailed documentation for streaming configuration.

::::{grid} 1 1 2 2
:gutter: 3

-:::{grid-item-card} Global Streaming
-:link: global-streaming
-:link-type: doc
-
-Enable streaming mode for LLM token generation in config.yml.
-:::

:::{grid-item-card} Output Rail Streaming
:link: output-rail-streaming
:link-type: doc
18 changes: 2 additions & 16 deletions docs/run-rails/streaming.md
@@ -1,20 +1,12 @@
# Streaming

-If the application LLM supports streaming, you can configure NeMo Guardrails to stream tokens as well.
+If the application LLM supports streaming, NeMo Guardrails can stream tokens as well. Streaming is automatically enabled when you use the `stream_async()` method; no configuration is required.

For information about configuring streaming with output guardrails, refer to the following:

- For configuration, refer to [streaming output configuration](../user-guides/configuration-guide.md#streaming-output-configuration).
- For sample Python client code, refer to [streaming output](../getting-started/5-output-rails/README.md#streaming-output).
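Because `stream_async()` returns an async generator, consuming it follows the standard `async for` pattern. The sketch below uses a stand-in generator so it runs without a model; with a real configuration you would iterate over `rails.stream_async(messages=messages)` instead:

```python
import asyncio

async def fake_stream():
    # Stand-in for rails.stream_async(messages=messages):
    # yields tokens one at a time, like a streaming LLM.
    for token in ["Streamed ", "tokens ", "arrive ", "incrementally."]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield token

async def main() -> str:
    pieces = []
    async for chunk in fake_stream():
        print(chunk, end="", flush=True)  # display tokens as they arrive
        pieces.append(chunk)
    print()
    return "".join(pieces)

full_response = asyncio.run(main())
```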

-## Configuration
-
-To activate streaming on a guardrails configuration, add the following to your `config.yml`:
-
-```yaml
-streaming: True
-```

## Usage

### Chat CLI
@@ -215,13 +207,7 @@ POST /v1/chat/completions
We also support streaming for LLMs deployed using `HuggingFacePipeline`.
One example is provided in the [HF Pipeline Dolly](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/llm/hf_pipeline_dolly/README.md) configuration.

-To use streaming for HF Pipeline LLMs, you first need to set the streaming flag in your `config.yml`.
-
-```yaml
-streaming: True
-```
-
-Then you need to create an `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
+To use streaming for HF Pipeline LLMs, you need to create a `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
add it to the `kwargs` of the pipeline and to the `model_kwargs` of the `HuggingFacePipelineCompatible` object.

```python