diff --git a/docs/implement.md b/docs/implement.md
new file mode 100644
index 000000000..17b8d7b04
--- /dev/null
+++ b/docs/implement.md
@@ -0,0 +1,216 @@
# Guide: Implementing Models in Ollama's Go Inference Engine

> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's Go GGML inference engine. It covers everything from initial setup to deploying your model to ollama.com.

## Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama/model.go"]
        mllama["model/models/mllama/model.go"]
        qwen["model/models/qwen2/model.go"]
        qwen_vl["model/models/qwen2vl/model.go"]

        note1["Each model implements a specific architecture
        - Defines model parameters
        - Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops["nn/
        linear.go - Matrix operations
        embedding.go - Token embeddings
        normalization.go - Layer normalization
        convolution.go - Conv operations"]

        backend["ml/backend.go
        Hardware Abstraction Layer
        - Defines tensor operations
        - Manages computation graphs
        - Handles memory allocation"]

        note2["Common neural net operations
        used across different models
        - Abstracts hardware details
        - Provides unified API
        - Manages computation flow"]
    end

    subgraph GGML["Hardware Execution Layer"]
        direction TB
        ggml["ggml.go
        CGO Interface
        - Bridges Go and C++
        - Handles type conversion
        - Manages memory between languages"]

        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h
            CPU optimized ops"]
            cuda["ggml-cuda.h
            NVIDIA GPU ops"]
            metal["ggml-metal.h
            Apple GPU ops"]
            vulkan["ggml-vulkan.h
            Cross-platform GPU"]
            opencl["ggml-opencl.h
            OpenCL acceleration"]
        end

        note3["GGML provides optimized
        implementations for each hardware:
        - Automatic dispatch
        - Hardware-specific optimizations
        - Memory management
        - Parallel execution"]
    end

    %% Connections with explanations
    Models --> |"Makes high-level calls
    (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations
    (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code
    on target hardware"| Hardware_Specific

    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5

    class llama,mllama,qwen,qwen_vl model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note

    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```

When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.

## Implementation Steps

### 1. Environment Setup

First, clone the Ollama repository and get it running locally. Follow the development setup guide at:
https://github.com/ollama/ollama/blob/main/docs/development.md
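As a rough sketch (assuming a working Go toolchain and the platform prerequisites from the development guide are installed), getting a local build running looks something like:

```bash
# Fetch the source
git clone https://github.com/ollama/ollama.git
cd ollama

# Build and start a local server from source
go run . serve
```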
### 2. Research Implementation

Get the original model implementation running. This typically involves:
- Cloning the research code repository (usually Python-based)
- Setting up the required environment
- Running inference with sample inputs
- Understanding the model architecture and forward pass

### 3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

```
model/models/
└── your-model/
    ├── model.go         # Architecture and forward pass implementation
    ├── convert.go       # Weight conversion logic (PyTorch/SafeTensors to GGML)
    └── convert_test.go  # Conversion logic tests
```

Register your model by adding it to the imports in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):

```go
package models

import (
    _ "github.com/ollama/ollama/model/models/llama"
    _ "github.com/ollama/ollama/model/models/mllama"
    _ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```

### 4. Development Process

1. **Open a Draft PR**
   - Create a draft pull request in the `ollama/ollama` repository
   - Use this as a communication channel with Ollama maintainers

2. **Implement Weight Conversion**
   - Work on `convert.go`
   - Reference existing conversion implementations
   - Create a basic Modelfile:
     ```
     FROM /path/to/model
     ```
   - Test conversion:
     ```bash
     go run . create <my-model> -f /path/to/Modelfile
     ```

3. **Implement Model Logic**
   - Implement the `New()` and `Forward()` functions in `model.go` (a skeletal example appears in the appendix at the end of this guide)
   - Reference existing model implementations
   - Debug the forward pass:
     ```bash
     OLLAMA_DEBUG=1 go run . run <my-model>
     ```
   - Compare output with the research implementation

4. **Tokenizer Implementation**
   - Implement a new tokenizer if required
   - Ensure compatibility with the model architecture

5. **Text Generation Testing**
   - Implement proper prompt formatting
   - Test basic generation:
     ```bash
     go run . run <my-model> "hello"
     ```

### 5. Testing

1. Add comprehensive tests to:
   - `model_test.go`
   - `convert_test.go`

2. Ensure tests cover:
   - Weight conversion
   - Model initialization
   - Text generation

### 6. Model Deployment

1. **Finalize Pull Request**
   - Move the PR out of draft state
   - Address reviewer feedback

2. **Deploy to ollama.com**
   - Determine the model's prompt format
   - Convert the prompt format to a Go template (see the template sketch in the appendix)
   - Create the final Modelfile:
     ```
     FROM <converted model>
     TEMPLATE <prompt template>
     LICENSE <license information>
     # Add additional parameters if needed
     ```
   - Push to ollama.com:
     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
     ```

3. **Integration Testing**
   - Run end-to-end tests
   - Verify model behavior in the production environment
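## Appendix: Example Sketches

To make the shape of a model implementation (step 4.3) concrete, here is a minimal, hypothetical skeleton for `model.go`. The registration name, config keys, struct tags, and method signatures below are illustrative assumptions based on existing implementations; mirror a current model such as [model/models/llama/model.go](https://github.com/ollama/ollama/blob/main/model/models/llama/model.go) for the interfaces the engine actually expects.

```go
package yourmodel

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
	"github.com/ollama/ollama/model"
)

// Options holds hyperparameters read from the converted model's metadata.
// The field set shown here is illustrative; read whatever your
// architecture needs.
type Options struct {
	hiddenSize, numHeads, numLayers int
	eps                             float32
}

// Model declares one struct field per weight tensor; the loader populates
// them by matching the gguf tags against tensor names in the model file.
type Model struct {
	model.Base

	TokenEmbedding *nn.Embedding `gguf:"token_embd"`
	OutputNorm     *nn.RMSNorm   `gguf:"output_norm"`
	Output         *nn.Linear    `gguf:"output"`

	*Options
}

func init() {
	// The name must match the architecture string written by convert.go.
	model.Register("your-model", New)
}

// New reads hyperparameters from the model config; the tensor fields are
// filled in by the loader after New returns.
func New(c ml.Config) (model.Model, error) {
	return &Model{
		Options: &Options{
			hiddenSize: int(c.Uint("embedding_length")),
			numHeads:   int(c.Uint("attention.head_count")),
			numLayers:  int(c.Uint("block_count")),
			eps:        c.Float("attention.layer_norm_rms_epsilon"),
		},
	}, nil
}

// Forward runs one pass over the input and returns the output logits.
func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
	// 1. Embed the input tokens
	// 2. Run each transformer block (attention + feed-forward)
	// 3. Apply the final norm and output projection
	// (omitted here; see existing implementations)
	return nil, nil
}
```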
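For the prompt format step (6.2), templates use Go's text/template syntax with Ollama's standard template variables. Below is a hedged example of a chat-style Modelfile: `{{ if .System }}`, `{{ .System }}`, and `{{ .Prompt }}` are standard Ollama template constructs, while the `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>` control tokens are placeholders for whatever special tokens your model's prompt format actually defines.

```
FROM <converted model>

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""
```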