diff --git a/docs/add-a-model.md b/docs/add-a-model.md
index f3d162fa9..1358c4dae 100644
--- a/docs/add-a-model.md
+++ b/docs/add-a-model.md
@@ -2,7 +2,7 @@
> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
-This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com.
+This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.
## Architecture Overview
@@ -12,96 +12,49 @@ Below is a diagram showing Ollama's inference engine architecture layers and how
graph TB
subgraph Models["Model Layer: LLM Implementations"]
direction TB
- llama["model/models/llama/model.go"]
- mllama["model/models/mllama/model.go"]
- qwen["model/models/qwen2/model.go"]
- qwen_vl["model/models/qwen2vl/model.go"]
+ llama["model/models/llama"]
+ mllama["model/models/mllama"]
+ qwen["model/models/qwen2"]
+ etc["...etc"]
- note1["Each model implements a specific architecture
- - Defines model parameters
- - Implements forward pass"]
+ note1[" Each model implements a
+        specific architecture:
+        - Defines model parameters
+        - Implements forward pass"]
end
subgraph ML_Ops["Neural Network Operations"]
direction TB
- nn_ops["nn/
- linear.go - Matrix operations
- embedding.go - Token embeddings
- normalization.go - Layer normalization
- convolution.go - Conv operations"]
+ nn_ops[" nn/
+        linear.go: Matrix multiplication
+        embedding.go: Token embedding lookups
+        normalization.go: Layer norm operations
+        convolution.go: Convolutional operations "]
- backend["ml/backend.go
- Hardware Abstraction Layer
- - Defines tensor operations
- - Manages computation graphs
- - Handles memory allocation"]
+ backend[" ml/backend.go
+        Hardware Abstraction Layer:
+        - Defines tensor operations
+        - Manages computation graphs
+        - Handles memory allocation "]
- note2["Common neural net operations
- used across different models
- - Abstracts hardware details
- - Provides unified API
- - Manages computation flow"]
+ note2[" Common neural net operations:
+        - Abstracts hardware details
+        - Provides unified API
+        - Manages computation flow "]
end
- subgraph GGML["Hardware Execution Layer"]
+ subgraph Hardware["Backend Execution Layer"]
direction TB
- ggml["ggml.go
- CGO Interface
- - Bridges Go and C++
- - Handles type conversion
- - Manages memory between languages"]
-
- subgraph Hardware_Specific["Hardware-Specific Implementations"]
- direction LR
- cpu["ggml-cpu.h
- CPU optimized ops"]
- cuda["ggml-cuda.h
- NVIDIA GPU ops"]
- metal["ggml-metal.h
- Apple GPU ops"]
- vulkan["ggml-vulkan.h
- Cross-platform GPU"]
- opencl["ggml-opencl.h
- OpenCL acceleration"]
- end
-
- note3["GGML provides optimized
- implementations for each hardware:
- - Automatic dispatch
- - Hardware-specific optimizations
- - Memory management
- - Parallel execution"]
+ backend_impl[" The backend package provides:
+        - Unified computation interface
+        - Automatic hardware selection
+        - Optimized kernels
+        - Efficient memory management "]
end
- %% Connections with explanations
- Models --> |"Makes high-level calls
- (e.g., self-attention)"| ML_Ops
- ML_Ops --> |"Translates to tensor operations
- (e.g., matmul, softmax)"| GGML
- GGML --> |"Executes optimized code
- on target hardware"| Hardware_Specific
-
- %% Styling
- classDef model fill:#fff,stroke:#01579b,stroke-width:2px
- classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
- classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
- classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5
-
- class llama,mllama,qwen,qwen_vl,pixtral model
- class nn_ops,backend ml
- class ggml,cpu,cuda,metal,vulkan,opencl hw
- class note1,note2,note3 note
-
- %% Style subgraphs
- style Models fill:#fff,stroke:#01579b,stroke-width:2px
- style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
- style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
- style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
+ Models --> |" Makes high-level calls
+    (e.g., self-attention) "| ML_Ops
+ ML_Ops --> |" Translates to tensor operations
+    (e.g., matmul, softmax) "| Hardware
```
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
-## Implementation Steps
+## Implementation Process Overview
+
+Here's the high-level process for implementing a new model in Ollama:
+
+1. **Environment Setup**: Clone the repository and set up your development environment
+2. **Research Implementation**: Understand the original model architecture
+3. **Project Structure Setup**: Set up the necessary file structure
+4. **Create Basic Modelfile**: Create a simple Modelfile for testing
+5. **Implement Weight Conversion**: Map from original format to GGUF
+6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
+7. **Implement Model Logic**: Create the model architecture and forward pass
+8. **Quality Check and Final Steps**: Create the final Modelfile, add tests, and verify functionality
+9. **Finalize PR and Publish**: Complete the PR and publish to ollama.com
+
+## Implementation Steps in Detail
### 1. Environment Setup
@@ -121,11 +74,11 @@ Get the original model implementation running. This typically involves:
Create the necessary file structure by referencing previous model implementations. You'll need:
```
+convert/
+└── convert_your-model.go # Weight conversion logic (PyTorch/SafeTensors to GGUF)
model/
└── your-model/
- ├── model.go # Architecture and forward pass implementation
- ├── convert.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
- └── convert_test.go # Conversion logic tests
+ └── model.go # Architecture and forward pass implementation
```
Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
@@ -140,45 +93,194 @@ import (
)
```
-### 4. Development Process
+### 4. Create a Basic Modelfile
-1. **Open a Draft PR**
- - Create a draft pull request in the `ollama/ollama` repository
- - Use this as a communication channel with Ollama maintainers
+Create a simple Modelfile early in the process to facilitate testing:
-2. **Implement Weight Conversion**
- - Work on `convert.go`
- - Reference existing conversion implementations
- - Create a basic Modelfile:
- ```
- FROM /path/to/model
- ```
- - Test conversion:
- ```bash
- go run . create -f /path/to/Modelfile
+```
+FROM /path/to/model
+TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
+```
+
+This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
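+
+Once weight conversion and the forward pass (later steps) are in place, this Modelfile lets you exercise the model with fixed inputs. The model name below is just a placeholder:
+
+```bash
+go run . create my-test-model -f /path/to/Modelfile
+go run . run my-test-model "The capital of France is"
+```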
+
+### 5. Implement Weight Conversion
+
+- Work on `convert/convert_your-model.go`
+- Reference existing conversion implementations
+- Conversion involves mapping tensor and metadata names from their PyTorch/SafeTensors form to the GGUF conventions shown below (a name-mapping sketch appears at the end of this section)
+- Understand typical GGUF layout and structure:
+
+ **Typical GGUF Layout:**
+ ```
+ GGUF
+ ├── Metadata Section
+ │ ├── Model Parameters
+ │ │ ├── General architecture parameters
+ │ │ │ ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
+ │ │ │ ├── "{arch}.context_length" (e.g., "llama.context_length")
+ │ │ │ ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
+ │ │ │ └── "{arch}.block_count" (e.g., "llama.block_count")
+ │ │ │
+ │ │ └── Architecture-specific parameters
+ │ │ ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
+ │ │ ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
+ │ │ ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
+ │ │ └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
+ │ │
+ │ ├── Tokenizer parameters
+ │ │ ├── "tokenizer.ggml.model" (e.g., "llama")
+ │ │ ├── "tokenizer.ggml.tokens" (vocabulary tokens)
+  │  │  ├── "tokenizer.ggml.bos_token_id" (beginning of sequence token ID)
+  │  │  └── "tokenizer.ggml.eos_token_id" (end of sequence token ID)
+ │ │
+ │ └── General metadata
+ │ └── "general.architecture" (e.g., "llama", "qwen2", "phi")
+ │
+ └── Tensor Data Section
+ ├── Common tensors:
+ │ ├── "token_embd.weight" (token embedding matrix)
+ │ ├── "rope_freqs.weight" (RoPE frequency weights)
+ │ ├── "output_norm.weight" (final layer normalization)
+ │ └── "output.weight" (output projection)
+ │
+ └── Layer-specific tensors:
+ ├── "blk.{i}.attn_q.weight" (query projection)
+ ├── "blk.{i}.attn_k.weight" (key projection)
+ ├── "blk.{i}.attn_v.weight" (value projection)
+ ├── "blk.{i}.attn_output.weight" (attention output)
+ ├── "blk.{i}.attn_norm.weight" (attention normalization)
+ ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
+ ├── "blk.{i}.ffn_up.weight" (FFN up projection)
+ ├── "blk.{i}.ffn_down.weight" (FFN down projection)
+ └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
+ ```
+
+ - Key conversion details include:
+ - Linear weight matrices (sometimes need transposition)
+ - Layer normalization weights (might need reshaping)
+     - **Note**: In GGML, the FFN tensors correspond to the MLP (multi-layer perceptron) block of the architecture
+
+- Test conversion:
+ ```bash
+  go run . create <my-model> -f /path/to/Modelfile
+ ```
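+
+As a standalone illustration of the name-mapping idea, here is a minimal sketch. The patterns and helper below are illustrative only, not the convert package's actual API, so reference the existing converters in `convert/` for the real mechanism:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+)
+
+// rule rewrites one PyTorch/SafeTensors tensor-name pattern into its
+// GGUF equivalent. These patterns are examples, not an exhaustive table.
+type rule struct {
+	from *regexp.Regexp
+	to   string
+}
+
+var rules = []rule{
+	{regexp.MustCompile(`^model\.embed_tokens\.weight$`), "token_embd.weight"},
+	{regexp.MustCompile(`^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$`), "blk.$1.attn_q.weight"},
+	{regexp.MustCompile(`^model\.layers\.(\d+)\.mlp\.gate_proj\.weight$`), "blk.$1.ffn_gate.weight"},
+	{regexp.MustCompile(`^model\.norm\.weight$`), "output_norm.weight"},
+}
+
+// mapName returns the GGUF name for a source tensor,
+// or false if no rule matches.
+func mapName(src string) (string, bool) {
+	for _, r := range rules {
+		if r.from.MatchString(src) {
+			return r.from.ReplaceAllString(src, r.to), true
+		}
+	}
+	return "", false
+}
+
+func main() {
+	name, ok := mapName("model.layers.12.self_attn.q_proj.weight")
+	fmt.Println(name, ok) // blk.12.attn_q.weight true
+}
+```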
+
+### 6. Open a Draft PR
+
+After implementing the initial weight conversion, creating a draft pull request is recommended as it:
+- Establishes a communication channel with Ollama maintainers
+- Allows for early feedback on your approach
+- Makes it easier to track progress and changes
+
+To open a draft PR:
+1. Fork the repository
+2. Create a new branch for your model implementation
+3. Make initial commits with your weight conversion implementation
+4. Open a PR in the `ollama/ollama` repository and mark it as draft
+5. Include a clear description of the model you're implementing
+
+### 7. Implement Model Logic
+
+- Reference existing model implementations
+- Implement `New()` and `Forward()` functions in `model.go`:
+
+ **The `New()` function:**
+ - Creates and initializes your model structure
+ - Loads configuration parameters (embedding size, attention heads, etc.)
+ - Sets up the tokenizer with vocabulary and special tokens
+ - Initializes all model layers and weights
+ - **Important**: Sets up the KV cache for efficient inference
+ - Example:
+ ```go
+ func New(c ml.Config) (model.Model, error) {
+ m := &Model{
+ // Initialize tokenizer
+ BytePairEncoding: model.NewBytePairEncoding(...),
+ // Create layer arrays
+ Layers: make([]Layer, c.Uint("block_count")),
+ // Set model parameters
+ Options: &Options{...},
+ }
+ // Initialize KV cache for efficient inference
+ m.Cache = kvcache.NewCausalCache(m.Shift)
+ return m, nil
+ }
+ ```
+
+ **The `Forward()` function:**
+ - **What it does**: Defines the computational graph of your model
+  - **Important**: The graph is NOT executed immediately; it is built first, then executed later when predictions are needed
+ - Takes input tokens and converts them to embeddings
+ - Processes inputs through transformer layers (attention and feed-forward networks)
+ - Creates the path for data flow through your model's components
+ - Example:
+ ```go
+ func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
+ // Convert inputs to tensors
+ inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
+ positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))
+
+ // Initial token embedding
+ hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)
+
+ // Process through transformer layers
+ for i, layer := range m.Layers {
+ m.Cache.SetLayer(i)
+ hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
+ }
+
+ // Final processing and output
+ normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
+ logits := m.Output.Forward(ctx, normalizedOutput)
+
+ // Return logits for requested positions
+ outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
+ return logits.Rows(ctx, outputsTensor), nil
+ }
+ ```
+
+ **Key Components to Implement:**
+
+ 1. **KV Cache**:
+ - Improves inference performance for text generation
+ - How it works: Stores previously computed key and value tensors from self-attention, avoiding redundant computations
+ - Implementation: Use the `kvcache.NewCausalCache()` for autoregressive models
+     - Important: Must implement the `Shift()` function to handle rotary position embeddings with the cache (see the sketch after this list)
+
+ 2. **Self-Attention**:
+ - Core component that learns contextual relationships between tokens
+ - Implements query, key, value projections and their interactions
+ - Must handle positional encoding (usually Rotary Position Embeddings)
+     - Uses the KV cache to make generation efficient (sketched after this list)
+
+ 3. **Normalization Layers**:
+ - Purpose: Stabilizes training and maintains consistent activation distributions
+ - Types: RMSNorm, LayerNorm, etc. depending on model architecture
+ - Implementation: Apply before attention and feed-forward networks
+ - Example: `normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)`
+
+ 4. **Activation Functions**:
+ - Purpose: Introduces non-linearity into the model
+ - Common types: SILU (Sigmoid Linear Unit), GELU, ReLU
+ - Found in feed-forward/MLP blocks
+ - Example:
+ ```go
+ // SwiGLU activation in MLP
+ gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
+ upProjection := mlp.Up.Forward(ctx, hiddenState)
+ intermediateStates := gateActivation.Mul(ctx, upProjection)
+     ```
+- Run your forward pass:
+  ```bash
+  # in the root of the ollama directory
+  go build .
+  OLLAMA_DEBUG=1 ./ollama serve
+
+  # in another terminal, once the server is up (model name is a placeholder)
+  OLLAMA_DEBUG=1 ./ollama run <my-model>
+  ```
+- Compare output with research implementation
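+
+Below are minimal sketches of the KV cache `Shift()` hook and a self-attention block. They are illustrative only: the tensor methods (`RoPE`, `Mulmat`, `Scale`, `Softmax`, `Add`) and the `Options` fields used here are assumptions modeled on existing implementations, so mirror a current model under `model/models/` for the real signatures.
+
+```go
+// Shift re-rotates cached keys when cache positions move, keeping rotary
+// position embeddings consistent (sketch; exact signature assumed):
+func (m *Model) Shift(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, error) {
+	return key.RoPE(ctx, shift, m.Options.ropeDim, m.Options.ropeBase, m.Options.ropeScale), nil
+}
+
+// Forward computes scaled dot-product self-attention for one layer
+// (sketch; head reshapes and permutes omitted for brevity):
+func (sa *SelfAttention) Forward(ctx ml.Context, hiddenStates, positions ml.Tensor, cache kvcache.Cache, opts *Options) ml.Tensor {
+	// Project the input into queries, keys, and values
+	query := sa.Query.Forward(ctx, hiddenStates)
+	key := sa.Key.Forward(ctx, hiddenStates)
+	value := sa.Value.Forward(ctx, hiddenStates)
+
+	// Rotate queries and keys by position
+	query = query.RoPE(ctx, positions, opts.ropeDim, opts.ropeBase, opts.ropeScale)
+	key = key.RoPE(ctx, positions, opts.ropeDim, opts.ropeBase, opts.ropeScale)
+
+	// Store this step's keys/values, then read back the accumulated
+	// history plus the causal mask maintained by the cache
+	cache.Put(ctx, key, value)
+	key, value, mask := cache.Get(ctx)
+
+	// Attention weights: softmax(K^T·Q / sqrt(headDim) + mask)
+	headDim := opts.hiddenSize / opts.numHeads
+	scores := key.Mulmat(ctx, query)
+	scores = scores.Scale(ctx, 1.0/math.Sqrt(float64(headDim)))
+	scores = scores.Add(ctx, mask).Softmax(ctx)
+
+	// Weighted sum of values, followed by the output projection
+	return sa.Output.Forward(ctx, value.Mulmat(ctx, scores))
+}
+```
+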
-3. **Implement Model Logic**
- - Implement `New()` and `Forward()` functions in `model.go`
- - Reference existing model implementations
- - Debug forward pass:
- ```bash
- OLLAMA_DEBUG=1 go run . run
- ```
- - Compare output with research implementation
-
-4. **Tokenizer Implementation**
- - Implement a new tokenizer if required
- - Ensure compatibility with model architecture
-
-5. **Text Generation Testing**
- - Implement proper prompt formatting
- - Test basic generation:
- ```bash
- go run . run "hello"
- ```
-
-### 5. Testing
+### 8. Quality Check and Final Steps
1. Add comprehensive tests to:
- `model_test.go`
@@ -189,28 +291,36 @@ import (
- Model initialization
- Text generation
-### 6. Model Deployment
+3. **Create Final Modelfile**
+   - Replace the static prompt with the proper Go template for your model (a complete example follows this list):
+ ```
+     FROM <path-to-your-converted-model>
+     TEMPLATE <template>  # Add the proper Go template for your model, including tools if needed
+     LICENSE <license>  # Add appropriate license information
+ # Add additional parameters if needed
+ ```
+
+4. **End-to-end Testing**
+ - Run your model with your local Ollama build to ensure that it functions as expected
+
+5. **Benchmark**
+   - Run performance benchmarks on your model implementation:
+   ```bash
+   # from the root of the ollama directory
+   go build .
+   OLLAMA_DEBUG=1 ./ollama serve
+
+   # in another terminal, run the benchmarks against the local server
+   go test -bench=. -m <your-model-name> ./...
+   ```
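+
+For reference, a finished Modelfile might look like the following; the control tokens and template are illustrative, so use the prompt format documented for your model:
+
+```
+FROM /path/to/converted/model
+TEMPLATE """<|user|>
+{{ .Prompt }}<|end|>
+<|assistant|>
+"""
+PARAMETER stop "<|end|>"
+LICENSE """<license text>"""
+```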
+
+### 9. Finalize PR and Publish to ollama.com
1. **Finalize Pull Request**
- Move PR out of draft state
- Address reviewer feedback
-2. **Deploy to ollama.com**
- - Determine model prompt format
- - Convert prompt format to Go template
- - Create final Modelfile:
- ```
- FROM
- TEMPLATE
- LICENSE
- # Add additional parameters if needed
- ```
+2. **Publish to ollama.com**
- Push to ollama.com:
```bash
   ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
   ollama push <your-namespace>/<your-model>
- ```
-
-3. **Integration Testing**
- - Run end-to-end tests
- - Verify model behavior in production environment
+ ```
\ No newline at end of file