Compare commits


1 Commit
v0.1.17 ... cgo

| Author | SHA1 | Message | Date |
| -------------- | ---------- | ------- | -------------------------- |
| Jeffrey Morgan | 131413ddff | wip | 2023-10-22 09:54:59 -04:00 |
129 changed files with 59496 additions and 7418 deletions

View File

@@ -6,4 +6,3 @@ scripts
llm/llama.cpp/ggml
llm/llama.cpp/gguf
.env
.cache

3
.gitignore vendored
View File

@@ -6,6 +6,3 @@
dist
ollama
ggml-metal.metal
.cache
*.exe
.idea

View File

@@ -19,11 +19,5 @@ RUN apt-get update && apt-get install -y ca-certificates
COPY --from=0 /go/src/github.com/jmorganca/ollama/ollama /bin/ollama
EXPOSE 11434
ENV OLLAMA_HOST 0.0.0.0
# set some environment variable for better NVIDIA compatibility
ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENTRYPOINT ["/bin/ollama"]
CMD ["serve"]

106
README.md
View File

@@ -29,7 +29,7 @@ curl https://ollama.ai/install.sh | sh
### Docker
The official [Ollama Docker image](https://hub.docker.com/r/ollama/ollama) `ollama/ollama` is available on Docker Hub.
See the official [Docker image](https://hub.docker.com/r/ollama/ollama).
## Quickstart
@@ -47,8 +47,6 @@ Here are some example open-source models that can be downloaded:
| Model | Parameters | Size | Download |
| ------------------ | ---------- | ----- | ------------------------------ |
| Neural Chat | 7B | 4.1GB | `ollama run neural-chat` |
| Starling | 7B | 4.1GB | `ollama run starling-lm` |
| Mistral | 7B | 4.1GB | `ollama run mistral` |
| Llama 2 | 7B | 3.8GB | `ollama run llama2` |
| Code Llama | 7B | 3.8GB | `ollama run codellama` |
@@ -57,7 +55,6 @@ Here are some example open-source models that can be downloaded:
| Llama 2 70B | 70B | 39GB | `ollama run llama2:70b` |
| Orca Mini | 3B | 1.9GB | `ollama run orca-mini` |
| Vicuna | 7B | 3.8GB | `ollama run vicuna` |
| LLaVA | 7B | 4.5GB | `ollama run llava` |
> Note: You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
@@ -91,7 +88,7 @@ See the [guide](docs/import.md) on importing models for more information.
### Customize a prompt
Models from the Ollama library can be customized with a prompt. For example, to customize the `llama2` model:
Models from the Ollama library can be customized with a prompt. The example
```
ollama pull llama2
@@ -105,7 +102,7 @@ FROM llama2
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# set the system message
# set the system prompt
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""
@@ -159,17 +156,10 @@ For multiline input, you can wrap text with `"""`:
I'm a basic program that prints the famous "Hello, world!" message to the console.
```
### Multimodal models
```
>>> What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.
```
### Pass in prompt as arguments
```
$ ollama run llama2 "Summarize this file: $(cat README.md)"
$ ollama run llama2 "summarize this file:" "$(cat README.md)"
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
```
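The same one-shot pattern can be reproduced against the REST API described below. A minimal sketch in Python using the `requests` library (the file path and model name are placeholders, and a local server on the default port is assumed):

```python
import requests

# Hypothetical example: send a file's contents as a one-shot prompt, mirroring
# `ollama run llama2 "Summarize this file: $(cat README.md)"` from the CLI.
with open("README.md") as f:
    prompt = "Summarize this file: " + f.read()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": prompt, "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])
```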
@@ -188,7 +178,8 @@ ollama list
Install `cmake` and `go`:
```
brew install cmake go
brew install cmake
brew install go
```
Then generate dependencies and build:
@@ -212,94 +203,33 @@ Finally, in a separate shell, run a model:
## REST API
Ollama has a REST API for running and managing models.
See the [API documentation](docs/api.md) for all endpoints.
### Generate a response
Ollama has an API for running and managing models. For example, to generate text from a model:
```
curl http://localhost:11434/api/generate -d '{
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt":"Why is the sky blue?"
}'
```
### Chat with a model
```
curl http://localhost:11434/api/chat -d '{
"model": "mistral",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
}'
```
See the [API documentation](./docs/api.md) for all endpoints.
## Community Integrations
### Web & Desktop
- [Bionic GPT](https://github.com/bionic-gpt/bionic-gpt)
- [HTML UI](https://github.com/rtcfirefly/ollama-ui)
- [Chatbot UI](https://github.com/ivanfioravanti/chatbot-ollama)
- [Typescript UI](https://github.com/ollama-interface/Ollama-Gui?tab=readme-ov-file)
- [Minimalistic React UI for Ollama Models](https://github.com/richawo/minimal-llm-ui)
- [Web UI](https://github.com/ollama-webui/ollama-webui)
- [Ollamac](https://github.com/kevinhermawan/Ollamac)
- [big-AGI](https://github.com/enricoros/big-agi/blob/main/docs/config-ollama.md)
- [Cheshire Cat assistant framework](https://github.com/cheshire-cat-ai/core)
- [Amica](https://github.com/semperai/amica)
- [chatd](https://github.com/BruceMacD/chatd)
### Terminal
- [oterm](https://github.com/ggozad/oterm)
- [Ellama Emacs client](https://github.com/s-kostyaev/ellama)
- [Emacs client](https://github.com/zweifisch/ollama)
- [gen.nvim](https://github.com/David-Kunz/gen.nvim)
- [ollama.nvim](https://github.com/nomnivore/ollama.nvim)
- [ogpt.nvim](https://github.com/huynle/ogpt.nvim)
- [gptel Emacs client](https://github.com/karthink/gptel)
- [Oatmeal](https://github.com/dustinblackman/oatmeal)
### Database
- [MindsDB](https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/ollama_handler/README.md)
### Package managers
- [Pacman](https://archlinux.org/packages/extra/x86_64/ollama/)
### Libraries
- [LangChain](https://python.langchain.com/docs/integrations/llms/ollama) and [LangChain.js](https://js.langchain.com/docs/modules/model_io/models/llms/integrations/ollama) with [example](https://js.langchain.com/docs/use_cases/question_answering/local_retrieval_qa)
- [LangChainGo](https://github.com/tmc/langchaingo/) with [example](https://github.com/tmc/langchaingo/tree/main/examples/ollama-completion-example)
- [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/llm/ollama.html)
- [LiteLLM](https://github.com/BerriAI/litellm)
- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
- [Ollama-rs for Rust](https://github.com/pepperoni21/ollama-rs)
- [Ollama4j for Java](https://github.com/amithkoujalgi/ollama4j)
- [ModelFusion Typescript Library](https://modelfusion.dev/integration/model-provider/ollama)
- [OllamaKit for Swift](https://github.com/kevinhermawan/OllamaKit)
- [Ollama for Dart](https://github.com/breitburg/dart-ollama)
- [Ollama for Laravel](https://github.com/cloudstudio/ollama-laravel)
### Mobile
- [Enchanted](https://github.com/AugustDev/enchanted)
- [Maid](https://github.com/danemadsen/Maid)
### Extensions & Plugins
- [Raycast extension](https://github.com/MassimilianoPasquini97/raycast_ollama)
- [Discollama](https://github.com/mxyng/discollama) (Discord bot inside the Ollama discord channel)
- [Continue](https://github.com/continuedev/continue)
- [Obsidian Ollama plugin](https://github.com/hinterdupfinger/obsidian-ollama)
- [Logseq Ollama plugin](https://github.com/omagdy7/ollama-logseq)
- [Dagger Chatbot](https://github.com/samalba/dagger-chatbot)
- [LiteLLM](https://github.com/BerriAI/litellm)
- [Discord AI Bot](https://github.com/mekb-turtle/discord-ai-bot)
- [Ollama Telegram Bot](https://github.com/ruecat/ollama-telegram)
- [Hass Ollama Conversation](https://github.com/ej52/hass-ollama-conversation)
- [Rivet plugin](https://github.com/abrenneke/rivet-plugin-ollama)
- [Llama Coder](https://github.com/ex3ndr/llama-coder) (Copilot alternative using Ollama)
- [Obsidian BMO Chatbot plugin](https://github.com/longy2k/obsidian-bmo-chatbot)
- [Chatbot UI](https://github.com/ivanfioravanti/chatbot-ollama)
- [HTML UI](https://github.com/rtcfirefly/ollama-ui)
- [Typescript UI](https://github.com/ollama-interface/Ollama-Gui?tab=readme-ov-file)
- [Dumbar](https://github.com/JerrySievert/Dumbar)
- [Emacs client](https://github.com/zweifisch/ollama)
- [oterm](https://github.com/ggozad/oterm)
- [Ellama Emacs client](https://github.com/s-kostyaev/ellama)
- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)

View File

@@ -5,7 +5,6 @@ import (
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net"
@@ -19,6 +18,10 @@ import (
"github.com/jmorganca/ollama/version"
)
const DefaultHost = "127.0.0.1:11434"
var envHost = os.Getenv("OLLAMA_HOST")
type Client struct {
base *url.URL
http http.Client
@@ -41,24 +44,14 @@ func checkError(resp *http.Response, body []byte) error {
}
func ClientFromEnvironment() (*Client, error) {
defaultPort := "11434"
scheme, hostport, ok := strings.Cut(os.Getenv("OLLAMA_HOST"), "://")
switch {
case !ok:
if !ok {
scheme, hostport = "http", os.Getenv("OLLAMA_HOST")
case scheme == "http":
defaultPort = "80"
case scheme == "https":
defaultPort = "443"
}
// trim trailing slashes
hostport = strings.TrimRight(hostport, "/")
host, port, err := net.SplitHostPort(hostport)
if err != nil {
host, port = "127.0.0.1", defaultPort
host, port = "127.0.0.1", "11434"
if ip := net.ParseIP(strings.Trim(hostport, "[]")); ip != nil {
host = ip.String()
} else if hostport != "" {
@@ -73,7 +66,7 @@ func ClientFromEnvironment() (*Client, error) {
},
}
mockRequest, err := http.NewRequest(http.MethodHead, client.base.String(), nil)
mockRequest, err := http.NewRequest("HEAD", client.base.String(), nil)
if err != nil {
return nil, err
}
@@ -96,19 +89,11 @@ func (c *Client) do(ctx context.Context, method, path string, reqData, respData
var reqBody io.Reader
var data []byte
var err error
switch reqData := reqData.(type) {
case io.Reader:
// reqData is already an io.Reader
reqBody = reqData
case nil:
// noop
default:
if reqData != nil {
data, err = json.Marshal(reqData)
if err != nil {
return err
}
reqBody = bytes.NewReader(data)
}
@@ -221,19 +206,6 @@ func (c *Client) Generate(ctx context.Context, req *GenerateRequest, fn Generate
})
}
type ChatResponseFunc func(ChatResponse) error
func (c *Client) Chat(ctx context.Context, req *ChatRequest, fn ChatResponseFunc) error {
return c.stream(ctx, http.MethodPost, "/api/chat", req, func(bts []byte) error {
var resp ChatResponse
if err := json.Unmarshal(bts, &resp); err != nil {
return err
}
return fn(resp)
})
}
type PullProgressFunc func(ProgressResponse) error
func (c *Client) Pull(ctx context.Context, req *PullRequest, fn PullProgressFunc) error {
@@ -309,30 +281,3 @@ func (c *Client) Heartbeat(ctx context.Context) error {
}
return nil
}
func (c *Client) CreateBlob(ctx context.Context, digest string, r io.Reader) error {
if err := c.do(ctx, http.MethodHead, fmt.Sprintf("/api/blobs/%s", digest), nil, nil); err != nil {
var statusError StatusError
if !errors.As(err, &statusError) || statusError.StatusCode != http.StatusNotFound {
return err
}
if err := c.do(ctx, http.MethodPost, fmt.Sprintf("/api/blobs/%s", digest), r, nil); err != nil {
return err
}
}
return nil
}
func (c *Client) Version(ctx context.Context) (string, error) {
var version struct {
Version string `json:"version"`
}
if err := c.do(ctx, http.MethodGet, "/api/version", nil, &version); err != nil {
return "", err
}
return version.Version, nil
}

View File

@@ -1,17 +1,13 @@
import os
import json
import requests
import os
import hashlib
import json
from pathlib import Path
BASE_URL = os.environ.get('OLLAMA_HOST', 'http://localhost:11434')
# Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses.
# The final response object will include statistics and additional data from the request. Use the callback function to override
# the default handler.
def generate(model_name, prompt, system=None, template=None, format="", context=None, options=None, callback=None):
def generate(model_name, prompt, system=None, template=None, context=None, options=None, callback=None):
try:
url = f"{BASE_URL}/api/generate"
payload = {
@@ -20,8 +16,7 @@ def generate(model_name, prompt, system=None, template=None, format="", context=
"system": system,
"template": template,
"context": context,
"options": options,
"format": format,
"options": options
}
# Remove keys with None values
@@ -61,86 +56,30 @@ def generate(model_name, prompt, system=None, template=None, format="", context=
except requests.exceptions.RequestException as e:
print(f"An error occurred: {e}")
return None, None
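A brief, hypothetical usage sketch for the `generate` helper above; the model name is a placeholder, and it assumes the callback receives each streamed JSON chunk, as the other helpers in this file do:

```python
# Hypothetical usage: print tokens as they stream in by reading each chunk's
# partial "response" field inside the callback.
def print_chunk(chunk):
    print(chunk.get("response", ""), end="", flush=True)

generate("llama2", "Why is the sky blue?", callback=print_chunk)
```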
# Create a blob file on the server if it doesn't exist.
def create_blob(digest, file_path):
url = f"{BASE_URL}/api/blobs/{digest}"
# Check if the blob exists
response = requests.head(url)
if response.status_code != 404:
return # Blob already exists, no need to upload
response.raise_for_status()
# Upload the blob
with open(file_path, 'rb') as file_data:
requests.post(url, data=file_data)
# Create a model from a Modelfile. Use the callback function to override the default handler.
def create(model_name, filename, callback=None):
def create(model_name, model_path, callback=None):
try:
file_path = Path(filename).expanduser().resolve()
processed_lines = []
# Read and process the modelfile
with open(file_path, 'r') as f:
for line in f:
# Skip empty or whitespace-only lines
if not line.strip():
continue
command, args = line.split(maxsplit=1)
if command.upper() in ["FROM", "ADAPTER"]:
path = Path(args.strip()).expanduser()
# Check if path is relative and resolve it
if not path.is_absolute():
path = (file_path.parent / path)
# Skip if file does not exist for "model", this is handled by the server
if not path.exists():
processed_lines.append(line)
continue
# Calculate SHA-256 hash
with open(path, 'rb') as bin_file:
hash = hashlib.sha256()
hash.update(bin_file.read())
blob = f"sha256:{hash.hexdigest()}"
# Add the file to the remote server
create_blob(blob, path)
# Replace path with digest in the line
line = f"{command} @{blob}\n"
processed_lines.append(line)
# Combine processed lines back into a single string
modelfile_content = '\n'.join(processed_lines)
url = f"{BASE_URL}/api/create"
payload = {"name": model_name, "modelfile": modelfile_content}
payload = {"name": model_name, "path": model_path}
# Making a POST request with the stream parameter set to True to handle streaming responses
with requests.post(url, json=payload, stream=True) as response:
response.raise_for_status()
# Iterating over the response line by line and displaying the status
for line in response.iter_lines():
if line:
# Parsing each line (JSON chunk) and extracting the status
chunk = json.loads(line)
if callback:
callback(chunk)
else:
print(f"Status: {chunk.get('status')}")
except Exception as e:
except requests.exceptions.RequestException as e:
print(f"An error occurred: {e}")
# Pull a model from the model registry. Cancelled pulls are resumed from where they left off, and multiple
# calls will share the same download progress. Use the callback function to override the default handler.
def pull(model_name, insecure=False, callback=None):
@@ -282,3 +221,5 @@ def heartbeat():
except requests.exceptions.RequestException as e:
print(f"An error occurred: {e}")
return "Ollama is not running"

View File

@@ -1,43 +0,0 @@
package api
import "testing"
func TestClientFromEnvironment(t *testing.T) {
type testCase struct {
value string
expect string
err error
}
testCases := map[string]*testCase{
"empty": {value: "", expect: "http://127.0.0.1:11434"},
"only address": {value: "1.2.3.4", expect: "http://1.2.3.4:11434"},
"only port": {value: ":1234", expect: "http://:1234"},
"address and port": {value: "1.2.3.4:1234", expect: "http://1.2.3.4:1234"},
"scheme http and address": {value: "http://1.2.3.4", expect: "http://1.2.3.4:80"},
"scheme https and address": {value: "https://1.2.3.4", expect: "https://1.2.3.4:443"},
"scheme, address, and port": {value: "https://1.2.3.4:1234", expect: "https://1.2.3.4:1234"},
"hostname": {value: "example.com", expect: "http://example.com:11434"},
"hostname and port": {value: "example.com:1234", expect: "http://example.com:1234"},
"scheme http and hostname": {value: "http://example.com", expect: "http://example.com:80"},
"scheme https and hostname": {value: "https://example.com", expect: "https://example.com:443"},
"scheme, hostname, and port": {value: "https://example.com:1234", expect: "https://example.com:1234"},
"trailing slash": {value: "example.com/", expect: "http://example.com:11434"},
"trailing slash port": {value: "example.com:1234/", expect: "http://example.com:1234"},
}
for k, v := range testCases {
t.Run(k, func(t *testing.T) {
t.Setenv("OLLAMA_HOST", v.value)
client, err := ClientFromEnvironment()
if err != v.err {
t.Fatalf("expected %s, got %s", v.err, err)
}
if client.base.String() != v.expect {
t.Fatalf("expected %s, got %s", v.expect, client.base.String())
}
})
}
}

View File

@@ -6,7 +6,6 @@ import (
"math"
"os"
"reflect"
"strconv"
"strings"
"time"
)
@@ -31,100 +30,17 @@ func (e StatusError) Error() string {
}
}
type ImageData []byte
type GenerateRequest struct {
Model string `json:"model"`
Prompt string `json:"prompt"`
System string `json:"system"`
Template string `json:"template"`
Context []int `json:"context,omitempty"`
Stream *bool `json:"stream,omitempty"`
Raw bool `json:"raw,omitempty"`
Format string `json:"format"`
Images []ImageData `json:"images,omitempty"`
Model string `json:"model"`
Prompt string `json:"prompt"`
System string `json:"system"`
Template string `json:"template"`
Context []int `json:"context,omitempty"`
Stream *bool `json:"stream,omitempty"`
Options map[string]interface{} `json:"options"`
}
type ChatRequest struct {
Model string `json:"model"`
Messages []Message `json:"messages"`
Stream *bool `json:"stream,omitempty"`
Format string `json:"format"`
Options map[string]interface{} `json:"options"`
}
type Message struct {
Role string `json:"role"` // one of ["system", "user", "assistant"]
Content string `json:"content"`
Images []ImageData `json:"images,omitempty"`
}
type ChatResponse struct {
Model string `json:"model"`
CreatedAt time.Time `json:"created_at"`
Message Message `json:"message"`
Done bool `json:"done"`
Metrics
}
type Metrics struct {
TotalDuration time.Duration `json:"total_duration,omitempty"`
LoadDuration time.Duration `json:"load_duration,omitempty"`
PromptEvalCount int `json:"prompt_eval_count,omitempty"`
PromptEvalDuration time.Duration `json:"prompt_eval_duration,omitempty"`
EvalCount int `json:"eval_count,omitempty"`
EvalDuration time.Duration `json:"eval_duration,omitempty"`
}
// Options specified in GenerateRequest. If you add a new option here, add it to the API docs also.
type Options struct {
Runner
// Predict options used at runtime
NumKeep int `json:"num_keep,omitempty"`
Seed int `json:"seed,omitempty"`
NumPredict int `json:"num_predict,omitempty"`
TopK int `json:"top_k,omitempty"`
TopP float32 `json:"top_p,omitempty"`
TFSZ float32 `json:"tfs_z,omitempty"`
TypicalP float32 `json:"typical_p,omitempty"`
RepeatLastN int `json:"repeat_last_n,omitempty"`
Temperature float32 `json:"temperature,omitempty"`
RepeatPenalty float32 `json:"repeat_penalty,omitempty"`
PresencePenalty float32 `json:"presence_penalty,omitempty"`
FrequencyPenalty float32 `json:"frequency_penalty,omitempty"`
Mirostat int `json:"mirostat,omitempty"`
MirostatTau float32 `json:"mirostat_tau,omitempty"`
MirostatEta float32 `json:"mirostat_eta,omitempty"`
PenalizeNewline bool `json:"penalize_newline,omitempty"`
Stop []string `json:"stop,omitempty"`
}
// Runner options which must be set when the model is loaded into memory
type Runner struct {
UseNUMA bool `json:"numa,omitempty"`
NumCtx int `json:"num_ctx,omitempty"`
NumBatch int `json:"num_batch,omitempty"`
NumGQA int `json:"num_gqa,omitempty"`
NumGPU int `json:"num_gpu,omitempty"`
MainGPU int `json:"main_gpu,omitempty"`
LowVRAM bool `json:"low_vram,omitempty"`
F16KV bool `json:"f16_kv,omitempty"`
LogitsAll bool `json:"logits_all,omitempty"`
VocabOnly bool `json:"vocab_only,omitempty"`
UseMMap bool `json:"use_mmap,omitempty"`
UseMLock bool `json:"use_mlock,omitempty"`
EmbeddingOnly bool `json:"embedding_only,omitempty"`
RopeFrequencyBase float32 `json:"rope_frequency_base,omitempty"`
RopeFrequencyScale float32 `json:"rope_frequency_scale,omitempty"`
NumThread int `json:"num_thread,omitempty"`
}
type EmbeddingRequest struct {
Model string `json:"model"`
Prompt string `json:"prompt"`
@@ -137,10 +53,9 @@ type EmbeddingResponse struct {
}
type CreateRequest struct {
Name string `json:"name"`
Path string `json:"path"`
Modelfile string `json:"modelfile"`
Stream *bool `json:"stream,omitempty"`
Name string `json:"name"`
Path string `json:"path"`
Stream *bool `json:"stream,omitempty"`
}
type DeleteRequest struct {
@@ -152,12 +67,11 @@ type ShowRequest struct {
}
type ShowResponse struct {
License string `json:"license,omitempty"`
Modelfile string `json:"modelfile,omitempty"`
Parameters string `json:"parameters,omitempty"`
Template string `json:"template,omitempty"`
System string `json:"system,omitempty"`
Details ModelDetails `json:"details,omitempty"`
License string `json:"license,omitempty"`
Modelfile string `json:"modelfile,omitempty"`
Parameters string `json:"parameters,omitempty"`
Template string `json:"template,omitempty"`
System string `json:"system,omitempty"`
}
type CopyRequest struct {
@@ -193,11 +107,10 @@ type ListResponse struct {
}
type ModelResponse struct {
Name string `json:"name"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details ModelDetails `json:"details,omitempty"`
Name string `json:"name"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
}
type TokenResponse struct {
@@ -212,43 +125,83 @@ type GenerateResponse struct {
Done bool `json:"done"`
Context []int `json:"context,omitempty"`
Metrics
TotalDuration time.Duration `json:"total_duration,omitempty"`
LoadDuration time.Duration `json:"load_duration,omitempty"`
PromptEvalCount int `json:"prompt_eval_count,omitempty"`
PromptEvalDuration time.Duration `json:"prompt_eval_duration,omitempty"`
EvalCount int `json:"eval_count,omitempty"`
EvalDuration time.Duration `json:"eval_duration,omitempty"`
}
type ModelDetails struct {
Format string `json:"format"`
Family string `json:"family"`
Families []string `json:"families"`
ParameterSize string `json:"parameter_size"`
QuantizationLevel string `json:"quantization_level"`
func (r *GenerateResponse) Summary() {
if r.TotalDuration > 0 {
fmt.Fprintf(os.Stderr, "total duration: %v\n", r.TotalDuration)
}
if r.LoadDuration > 0 {
fmt.Fprintf(os.Stderr, "load duration: %v\n", r.LoadDuration)
}
if r.PromptEvalCount > 0 {
fmt.Fprintf(os.Stderr, "prompt eval count: %d token(s)\n", r.PromptEvalCount)
}
if r.PromptEvalDuration > 0 {
fmt.Fprintf(os.Stderr, "prompt eval duration: %s\n", r.PromptEvalDuration)
fmt.Fprintf(os.Stderr, "prompt eval rate: %.2f tokens/s\n", float64(r.PromptEvalCount)/r.PromptEvalDuration.Seconds())
}
if r.EvalCount > 0 {
fmt.Fprintf(os.Stderr, "eval count: %d token(s)\n", r.EvalCount)
}
if r.EvalDuration > 0 {
fmt.Fprintf(os.Stderr, "eval duration: %s\n", r.EvalDuration)
fmt.Fprintf(os.Stderr, "eval rate: %.2f tokens/s\n", float64(r.EvalCount)/r.EvalDuration.Seconds())
}
}
func (m *Metrics) Summary() {
if m.TotalDuration > 0 {
fmt.Fprintf(os.Stderr, "total duration: %v\n", m.TotalDuration)
}
// Runner options which must be set when the model is loaded into memory
type Runner struct {
UseNUMA bool `json:"numa,omitempty"`
NumCtx int `json:"num_ctx,omitempty"`
NumBatch int `json:"num_batch,omitempty"`
NumGQA int `json:"num_gqa,omitempty"`
NumGPU int `json:"num_gpu,omitempty"`
MainGPU int `json:"main_gpu,omitempty"`
LowVRAM bool `json:"low_vram,omitempty"`
F16KV bool `json:"f16_kv,omitempty"`
LogitsAll bool `json:"logits_all,omitempty"`
VocabOnly bool `json:"vocab_only,omitempty"`
UseMMap bool `json:"use_mmap,omitempty"`
UseMLock bool `json:"use_mlock,omitempty"`
EmbeddingOnly bool `json:"embedding_only,omitempty"`
RopeFrequencyBase float32 `json:"rope_frequency_base,omitempty"`
RopeFrequencyScale float32 `json:"rope_frequency_scale,omitempty"`
NumThread int `json:"num_thread,omitempty"`
}
if m.LoadDuration > 0 {
fmt.Fprintf(os.Stderr, "load duration: %v\n", m.LoadDuration)
}
type Options struct {
Runner
if m.PromptEvalCount > 0 {
fmt.Fprintf(os.Stderr, "prompt eval count: %d token(s)\n", m.PromptEvalCount)
}
if m.PromptEvalDuration > 0 {
fmt.Fprintf(os.Stderr, "prompt eval duration: %s\n", m.PromptEvalDuration)
fmt.Fprintf(os.Stderr, "prompt eval rate: %.2f tokens/s\n", float64(m.PromptEvalCount)/m.PromptEvalDuration.Seconds())
}
if m.EvalCount > 0 {
fmt.Fprintf(os.Stderr, "eval count: %d token(s)\n", m.EvalCount)
}
if m.EvalDuration > 0 {
fmt.Fprintf(os.Stderr, "eval duration: %s\n", m.EvalDuration)
fmt.Fprintf(os.Stderr, "eval rate: %.2f tokens/s\n", float64(m.EvalCount)/m.EvalDuration.Seconds())
}
// Predict options used at runtime
NumKeep int `json:"num_keep,omitempty"`
Seed int `json:"seed,omitempty"`
NumPredict int `json:"num_predict,omitempty"`
TopK int `json:"top_k,omitempty"`
TopP float32 `json:"top_p,omitempty"`
TFSZ float32 `json:"tfs_z,omitempty"`
TypicalP float32 `json:"typical_p,omitempty"`
RepeatLastN int `json:"repeat_last_n,omitempty"`
Temperature float32 `json:"temperature,omitempty"`
RepeatPenalty float32 `json:"repeat_penalty,omitempty"`
PresencePenalty float32 `json:"presence_penalty,omitempty"`
FrequencyPenalty float32 `json:"frequency_penalty,omitempty"`
Mirostat int `json:"mirostat,omitempty"`
MirostatTau float32 `json:"mirostat_tau,omitempty"`
MirostatEta float32 `json:"mirostat_eta,omitempty"`
PenalizeNewline bool `json:"penalize_newline,omitempty"`
Stop []string `json:"stop,omitempty"`
}
var ErrInvalidOpts = fmt.Errorf("invalid options")
@@ -340,7 +293,7 @@ func DefaultOptions() Options {
return Options{
// options set on request to runner
NumPredict: -1,
NumKeep: 0,
NumKeep: -1,
Temperature: 0.8,
TopK: 40,
TopP: 0.9,
@@ -403,63 +356,3 @@ func (d *Duration) UnmarshalJSON(b []byte) (err error) {
return nil
}
// FormatParams converts specified parameter options to their correct types
func FormatParams(params map[string][]string) (map[string]interface{}, error) {
opts := Options{}
valueOpts := reflect.ValueOf(&opts).Elem() // names of the fields in the options struct
typeOpts := reflect.TypeOf(opts) // types of the fields in the options struct
// build map of json struct tags to their types
jsonOpts := make(map[string]reflect.StructField)
for _, field := range reflect.VisibleFields(typeOpts) {
jsonTag := strings.Split(field.Tag.Get("json"), ",")[0]
if jsonTag != "" {
jsonOpts[jsonTag] = field
}
}
out := make(map[string]interface{})
// iterate params and set values based on json struct tags
for key, vals := range params {
if opt, ok := jsonOpts[key]; !ok {
return nil, fmt.Errorf("unknown parameter '%s'", key)
} else {
field := valueOpts.FieldByName(opt.Name)
if field.IsValid() && field.CanSet() {
switch field.Kind() {
case reflect.Float32:
floatVal, err := strconv.ParseFloat(vals[0], 32)
if err != nil {
return nil, fmt.Errorf("invalid float value %s", vals)
}
out[key] = float32(floatVal)
case reflect.Int:
intVal, err := strconv.ParseInt(vals[0], 10, 64)
if err != nil {
return nil, fmt.Errorf("invalid int value %s", vals)
}
out[key] = intVal
case reflect.Bool:
boolVal, err := strconv.ParseBool(vals[0])
if err != nil {
return nil, fmt.Errorf("invalid bool value %s", vals)
}
out[key] = boolVal
case reflect.String:
out[key] = vals[0]
case reflect.Slice:
// TODO: only string slices are supported right now
out[key] = vals
default:
return nil, fmt.Errorf("unknown type %s for %s", field.Kind(), key)
}
}
}
}
return out, nil
}

File diff suppressed because it is too large.

44
cmd/spinner.go Normal file
View File

@@ -0,0 +1,44 @@
package cmd
import (
"fmt"
"os"
"time"
"github.com/jmorganca/ollama/progressbar"
)
type Spinner struct {
description string
*progressbar.ProgressBar
}
func NewSpinner(description string) *Spinner {
return &Spinner{
description: description,
ProgressBar: progressbar.NewOptions(-1,
progressbar.OptionSetWriter(os.Stderr),
progressbar.OptionThrottle(60*time.Millisecond),
progressbar.OptionSpinnerType(14),
progressbar.OptionSetRenderBlankState(true),
progressbar.OptionSetElapsedTime(false),
progressbar.OptionClearOnFinish(),
progressbar.OptionSetDescription(description),
),
}
}
func (s *Spinner) Spin(tick time.Duration) {
for range time.Tick(tick) {
if s.IsFinished() {
break
}
s.Add(1)
}
}
func (s *Spinner) Stop() {
s.Finish()
fmt.Println(s.description)
}

View File

@@ -3,7 +3,6 @@
## Endpoints
- [Generate a completion](#generate-a-completion)
- [Generate a chat completion](#generate-a-chat-completion)
- [Create a Model](#create-a-model)
- [List Local Models](#list-local-models)
- [Show Model Information](#show-model-information)
@@ -25,7 +24,7 @@ All durations are returned in nanoseconds.
### Streaming responses
Certain endpoints stream responses as JSON objects.
Certain endpoints stream responses as JSON objects delineated with the newline (`\n`) character.
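A minimal sketch of consuming such a stream from Python with the `requests` library (the endpoint and fields are as documented below; the model name is a placeholder):

```python
import json
import requests

# Each line of the streamed body is a complete JSON object; parse them as they arrive.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            obj = json.loads(line)
            print(obj.get("response", ""), end="", flush=True)
            if obj.get("done"):
                break
```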
## Generate a completion
@@ -33,48 +32,37 @@ Certain endpoints stream responses as JSON objects.
POST /api/generate
```
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so will be a series of responses. The final response object will include statistics and additional data from the request.
### Parameters
- `model`: (required) the [model name](#model-names)
- `prompt`: the prompt to generate a response for
- `images`: a list of base64-encoded images (for multimodal models such as `llava`)
Advanced parameters (optional):
- `format`: the format to return a response in. Currently the only accepted value is `json`
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `system`: system message (overrides what is defined in the `Modelfile`)
- `system`: system prompt (overrides what is defined in the `Modelfile`)
- `template`: the full prompt or prompt template (overrides what is defined in the `Modelfile`)
- `context`: the context parameter returned from a previous request to `/generate`; this can be used to keep a short conversational memory
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
- `raw`: if `true` no formatting will be applied to the prompt. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API.
- `stream`: if `false` the response will be be returned as a single response object, rather than a stream of objects
### JSON mode
Enable JSON mode by setting the `format` parameter to `json`. This will structure the response as valid JSON. See the JSON mode [example](#request-json-mode) below.
> Note: it's important to instruct the model to use JSON in the `prompt`. Otherwise, the model may generate large amounts of whitespace.
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama2:7b",
"prompt": "Why is the sky blue?"
}'
```
#### Response
### Response
A stream of JSON objects is returned:
A stream of JSON objects:
```json
{
"model": "llama2",
"model": "llama2:7b",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"response": "The",
"done": false
@@ -85,6 +73,8 @@ The final response in the stream also includes additional data about the generat
- `total_duration`: time spent generating the response
- `load_duration`: time spent in nanoseconds loading the model
- `sample_count`: number of samples generated
- `sample_duration`: time spent generating samples
- `prompt_eval_count`: number of tokens in the prompt
- `prompt_eval_duration`: time spent in nanoseconds evaluating the prompt
- `eval_count`: number of tokens in the response
@@ -96,13 +86,15 @@ To calculate how fast the response is generated in tokens per second (token/s),
```json
{
"model": "llama2",
"model": "llama2:7b",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "",
"context": [1, 2, 3],
"done": true,
"total_duration": 5589157167,
"load_duration": 3013701500,
"sample_count": 114,
"sample_duration": 81442000,
"prompt_eval_count": 46,
"prompt_eval_duration": 1160282000,
"eval_count": 113,
@@ -110,387 +102,30 @@ To calculate how fast the response is generated in tokens per second (token/s),
}
```
#### Request (No streaming)
A response can be received in one reply when streaming is off.
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
```
#### Response
If `stream` is set to `false`, the response will be a single JSON object:
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"context": [1, 2, 3],
"done": true,
"total_duration": 5589157167,
"load_duration": 3013701500,
"prompt_eval_count": 46,
"prompt_eval_duration": 1160282000,
"eval_count": 13,
"eval_duration": 1325948000
}
```
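As a worked example using the figures in this response, `eval_count / eval_duration` gives 13 tokens over roughly 1.33 seconds, or about 9.8 tokens per second. A small sketch of the calculation (durations are in nanoseconds, as noted above):

```python
# Tokens per second from the final response fields; eval_duration is in nanoseconds.
eval_count = 13
eval_duration_ns = 1325948000

print(f"{eval_count / (eval_duration_ns / 1e9):.2f} tokens/s")  # ~9.80
```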
#### Request (with images)
To submit images to multimodal models such as `llava` or `bakllava`, provide a list of base64-encoded `images`:
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llava",
"prompt":"What is in this picture?",
"stream": false,
"images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF
169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"]
}'
```
#### Response
```
{
"model": "llava",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": "A happy cartoon character, which is cute and cheerful.",
"context": [1, 2, 3],
"done": true,
"total_duration": 14648695333,
"load_duration": 3302671417,
"prompt_eval_count": 14,
"prompt_eval_duration": 286243000,
"eval_count": 129,
"eval_duration": 10931424000
}
```
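Rather than pasting base64 by hand, the image can be encoded programmatically. A hedged sketch in Python (the file path is a placeholder) using the `images` parameter documented above:

```python
import base64
import requests

# Hypothetical example: base64-encode a local image and send it to a multimodal model.
with open("smile.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What is in this picture?",
        "stream": False,
        "images": [image_b64],
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```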
#### Request (Raw Mode)
In some cases you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable formatting.
```shell
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "[INST] why is the sky blue? [/INST]",
"raw": true,
"stream": false
}'
```
#### Response
```json
{
"model": "mistral",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": " The sky appears blue because of a phenomenon called Rayleigh scattering.",
"context": [1, 2, 3],
"done": true,
"total_duration": 14648695333,
"load_duration": 3302671417,
"prompt_eval_count": 14,
"prompt_eval_duration": 286243000,
"eval_count": 129,
"eval_duration": 10931424000
}
```
#### Request (JSON mode)
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "What color is the sky at different times of the day? Respond using JSON",
"format": "json",
"stream": false
}'
```
#### Response
```json
{
"model": "llama2",
"created_at": "2023-11-09T21:07:55.186497Z",
"response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n",
"done": true,
"total_duration": 4661289125,
"load_duration": 1714434500,
"prompt_eval_count": 36,
"prompt_eval_duration": 264132000,
"eval_count": 75,
"eval_duration": 2112149000
}
```
The value of `response` will be a string containing JSON similar to:
```json
{
"morning": {
"color": "blue"
},
"noon": {
"color": "blue-gray"
},
"afternoon": {
"color": "warm gray"
},
"evening": {
"color": "orange"
}
}
```
#### Request (With options)
If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override.
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"tfs_z": 0.5,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"mirostat": 1,
"mirostat_tau": 0.8,
"mirostat_eta": 0.6,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gqa": 1,
"num_gpu": 1,
"main_gpu": 0,
"low_vram": false,
"f16_kv": true,
"logits_all": false,
"vocab_only": false,
"use_mmap": true,
"use_mlock": false,
"embedding_only": false,
"rope_frequency_base": 1.1,
"rope_frequency_scale": 0.8,
"num_thread": 8
}
}'
```
#### Response
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"done": true,
"total_duration": 5589157167,
"load_duration": 3013701500,
"prompt_eval_count": 46,
"prompt_eval_duration": 1160282000,
"eval_count": 13,
"eval_duration": 1325948000
}
```
## Generate a chat completion
```shell
POST /api/chat
```
Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
### Parameters
- `model`: (required) the [model name](#model-names)
- `messages`: the messages of the chat; this can be used to keep a chat memory
The `message` object has the following fields:
- `role`: the role of the message, either `system`, `user` or `assistant`
- `content`: the content of the message
- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
Advanced parameters (optional):
- `format`: the format to return a response in. Currently the only accepted value is `json`
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `template`: the full prompt or prompt template (overrides what is defined in the `Modelfile`)
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
### Examples
#### Request
Send a chat message with a streaming response.
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
]
}'
```
#### Response
A stream of JSON objects is returned:
```json
{
"model": "llama2",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"message": {
"role": "assisant",
"content": "The"
},
"done": false
}
```
Final response:
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"done": true,
"total_duration": 5589157167,
"load_duration": 3013701500,
"prompt_eval_count": 46,
"prompt_eval_duration": 1160282000,
"eval_count": 113,
"eval_duration": 1325948000
}
```
#### Request (With History)
Send a chat message with a conversation history.
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "due to rayleigh scattering."
},
{
"role": "user",
"content": "how is that different than mie scattering?"
}
]
}'
```
#### Response
A stream of JSON objects is returned:
```json
{
"model": "llama2",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"message": {
"role": "assisant",
"content": "The"
},
"done": false
}
```
Final response:
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"done": true,
"total_duration": 5589157167,
"load_duration": 3013701500,
"prompt_eval_count": 46,
"prompt_eval_duration": 1160282000,
"eval_count": 113,
"eval_duration": 1325948000
}
```
#### Request (with images)
Send a chat message that includes an image in the conversation.
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "what is in this image?",
"images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF
169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"]
}
]
}'
```
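Because the `messages` list is the chat memory, a client keeps a conversation going by appending each assistant reply and re-sending the whole list. A minimal Python sketch of that loop (model name and prompts are placeholders; the reply is accumulated from the streamed chunks documented above):

```python
import json
import requests

# Hypothetical multi-turn example: collect the streamed reply, append it to `messages`,
# then send the next user turn with the full history attached.
messages = [{"role": "user", "content": "why is the sky blue?"}]

def chat(messages):
    content = ""
    with requests.post("http://localhost:11434/api/chat",
                       json={"model": "llama2", "messages": messages},
                       stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            content += chunk["message"]["content"]
    return content

reply = chat(messages)
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "how is that different than mie scattering?"})
print(chat(messages))
```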
## Create a Model
```shell
POST /api/create
```
Create a model from a [`Modelfile`](./modelfile.md). It is recommended to set `modelfile` to the content of the Modelfile rather than just setting `path`. This is a requirement for remote create. Remote model creation should also create any file blobs referenced by fields such as `FROM` and `ADAPTER` explicitly with the server using [Create a Blob](#create-a-blob), and set those fields to the path indicated in the response.
Create a model from a [`Modelfile`](./modelfile.md)
### Parameters
- `name`: name of the model to create
- `modelfile` (optional): contents of the Modelfile
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
- `path` (optional): path to the Modelfile
- `path`: path to the Modelfile
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/create -d '{
curl -X POST http://localhost:11434/api/create -d '{
"name": "mario",
"modelfile": "FROM llama2\nSYSTEM You are mario from Super Mario Bros."
"path": "~/Modelfile"
}'
```
#### Response
### Response
A stream of JSON objects. When finished, `status` is `success`.
@@ -500,54 +135,6 @@ A stream of JSON objects. When finished, `status` is `success`.
}
```
### Check if a Blob Exists
```shell
HEAD /api/blobs/:digest
```
Check if a blob is known to the server.
#### Query Parameters
- `digest`: the SHA256 digest of the blob
#### Examples
##### Request
```shell
curl -I http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2
```
##### Response
Return 200 OK if the blob exists, 404 Not Found if it does not.
### Create a Blob
```shell
POST /api/blobs/:digest
```
Create a blob from a file. Returns the server file path.
#### Query Parameters
- `digest`: the expected SHA256 digest of the file
#### Examples
##### Request
```shell
curl -T model.bin -X POST http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2
```
##### Response
Return 201 Created if the blob was successfully created.
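Putting the blob endpoints together with `/api/create`, a remote create can be sketched as follows: hash the local weights, upload them as a blob only if the server does not already have that digest, then reference the digest from the `FROM` line of the `modelfile`. The file and model names below are placeholders; this mirrors the flow described under [Create a Model](#create-a-model):

```python
import hashlib
import requests

BASE_URL = "http://localhost:11434"

# Compute the SHA-256 digest of the local weights file.
with open("model.bin", "rb") as f:
    digest = "sha256:" + hashlib.sha256(f.read()).hexdigest()

# Upload the blob only if the server does not already know this digest.
if requests.head(f"{BASE_URL}/api/blobs/{digest}").status_code == 404:
    with open("model.bin", "rb") as f:
        requests.post(f"{BASE_URL}/api/blobs/{digest}", data=f).raise_for_status()

# Create the model from a modelfile that references the uploaded digest.
modelfile = f"FROM @{digest}\nSYSTEM You are a helpful assistant."
requests.post(
    f"{BASE_URL}/api/create",
    json={"name": "my-model", "modelfile": modelfile, "stream": False},
).raise_for_status()
```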
## List Local Models
```shell
@@ -556,23 +143,19 @@ GET /api/tags
List models that are available locally.
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/tags
```
#### Response
A single JSON object will be returned.
### Response
```json
{
"models": [
{
"name": "llama2",
"name": "llama2:7b",
"modified_at": "2023-08-02T17:02:23.713454393-07:00",
"size": 3791730596
},
@@ -591,36 +174,28 @@ A single JSON object will be returned.
POST /api/show
```
Show information about a model including details, modelfile, template, parameters, license, and system prompt.
Show details about a model including modelfile, template, parameters, license, and system prompt.
### Parameters
- `name`: name of the model to show
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/show -d '{
"name": "llama2"
"name": "llama2:7b"
}'
```
#### Response
### Response
```json
{
"modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM mike/llava:latest\nTEMPLATE \"\"\"\nUSER:{{ .Prompt }}\nASSISTANT:\n\"\"\"\nPARAMETER num_ctx 4096",
"parameters": "num_ctx 4096",
"template": "\nUSER:{{ .Prompt }}\nASSISTANT:\n",
"license:": "<license>",
"details": {
"format": "gguf",
"families": ["llama", "clip"],
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
"license": "<contents of license block>",
"modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llama2:latest\n\nFROM /Users/username/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8\nTEMPLATE \"\"\"[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] \"\"\"\nSYSTEM \"\"\"\"\"\"\nPARAMETER stop [INST]\nPARAMETER stop [/INST]\nPARAMETER stop <<SYS>>\nPARAMETER stop <</SYS>>\n",
"parameters": "stop [INST]\nstop [/INST]\nstop <<SYS>>\nstop <</SYS>>",
"template": "[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] "
}
```
@@ -632,21 +207,15 @@ POST /api/copy
Copy a model. Creates a model with another name from an existing model.
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/copy -d '{
"source": "llama2",
"source": "llama2:7b",
"destination": "llama2-backup"
}'
```
#### Response
The only response is a 200 OK if successful.
## Delete a Model
```shell
@@ -657,11 +226,9 @@ Delete a model and its data.
### Parameters
- `name`: model name to delete
- `model`: model name to delete
### Examples
#### Request
### Request
```shell
curl -X DELETE http://localhost:11434/api/delete -d '{
@@ -669,10 +236,6 @@ curl -X DELETE http://localhost:11434/api/delete -d '{
}'
```
#### Response
If successful, the only response is a 200 OK.
## Pull a Model
```shell
@@ -685,63 +248,23 @@ Download a model from the ollama library. Cancelled pulls are resumed from where
- `name`: name of the model to pull
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
- `stream`: (optional) if `false` the response will be be returned as a single response object, rather than a stream of objects
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/pull -d '{
"name": "llama2"
curl -X POST http://localhost:11434/api/pull -d '{
"name": "llama2:7b"
}'
```
#### Response
If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
The first object is the manifest:
```json
{
"status": "pulling manifest"
}
```
Then there is a series of downloading responses. Until any of the downloads is completed, the `completed` key may not be included. The number of files to be downloaded depends on the number of layers specified in the manifest.
### Response
```json
{
"status": "downloading digestname",
"digest": "digestname",
"total": 2142590208,
"completed": 241970
}
```
After all the files are downloaded, the final responses are:
```json
{
"status": "verifying sha256 digest"
}
{
"status": "writing manifest"
}
{
"status": "removing any unused layers"
}
{
"status": "success"
}
```
If `stream` is set to `false`, then the response is a single JSON object:
```json
{
"status": "success"
"total": 2142590208
}
```
@@ -757,21 +280,19 @@ Upload a model to a model library. Requires registering for ollama.ai and adding
- `name`: name of the model to push in the form of `<namespace>/<model>:<tag>`
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/push -d '{
curl -X POST http://localhost:11434/api/push -d '{
"name": "mattw/pygmalion:latest"
}'
```
#### Response
### Response
If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
Streaming response that starts with:
```json
{ "status": "retrieving manifest" }
@@ -804,12 +325,6 @@ Finally, when the upload is complete:
{"status":"success"}
```
If `stream` is set to `false`, then the response is a single JSON object:
```json
{ "status": "success" }
```
## Generate Embeddings
```shell
@@ -827,18 +342,16 @@ Advanced parameters:
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
### Examples
#### Request
### Request
```shell
curl http://localhost:11434/api/embeddings -d '{
"model": "llama2",
curl -X POST http://localhost:11434/api/embeddings -d '{
"model": "llama2:7b",
"prompt": "Here is an article about llamas..."
}'
```
#### Response
### Response
```json
{

View File

@@ -16,141 +16,19 @@ journalctl -u ollama
If you're running `ollama serve` directly, the logs will be printed to the console.
## How can I expose Ollama on my network?
Ollama binds to 127.0.0.1 port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.
On macOS:
## How can I expose the Ollama server?
```bash
OLLAMA_HOST=0.0.0.0:11434 ollama serve
OLLAMA_HOST=0.0.0.0:11435 ollama serve
```
On Linux:
Create a `systemd` drop-in directory and set `Environment=OLLAMA_HOST`
```bash
mkdir -p /etc/systemd/system/ollama.service.d
echo '[Service]' >>/etc/systemd/system/ollama.service.d/environment.conf
```
```bash
echo 'Environment="OLLAMA_HOST=0.0.0.0:11434"' >>/etc/systemd/system/ollama.service.d/environment.conf
```
Reload `systemd` and restart Ollama:
```bash
systemctl daemon-reload
systemctl restart ollama
```
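To confirm the server is now reachable from another machine on your network, you can query the API; the address below is only a placeholder for your server's IP:

```bash
# list the models on the remote instance to verify connectivity
curl http://192.168.1.10:11434/api/tags
```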
## How can I allow additional web origins to access Ollama?
Ollama allows cross origin requests from `127.0.0.1` and `0.0.0.0` by default. Add additional origins with the `OLLAMA_ORIGINS` environment variable:
On macOS:
By default, Ollama allows cross origin requests from `127.0.0.1` and `0.0.0.0`. To support more origins, you can use the `OLLAMA_ORIGINS` environment variable:
```bash
OLLAMA_ORIGINS=http://192.168.1.1:*,https://example.com ollama serve
```
On Linux:
```bash
echo 'Environment="OLLAMA_ORIGINS=http://192.168.1.1:*,https://example.com"' >>/etc/systemd/system/ollama.service.d/environment.conf
```
Reload `systemd` and restart Ollama:
```bash
systemctl daemon-reload
systemctl restart ollama
```
## Where are models stored?
- macOS: Raw model data is stored under `~/.ollama/models`.
- Linux: Raw model data is stored under `/usr/share/ollama/.ollama/models`
Below the models directory you will find a structure similar to the following:
```shell
.
├── blobs
└── manifests
└── registry.ollama.ai
├── f0rodo
├── library
├── mattw
└── saikatkumardey
```
There is a `manifests/registry.ollama.ai/namespace` path. In the example above, the user has downloaded models from the official `library` namespace as well as from the `f0rodo`, `mattw`, and `saikatkumardey` namespaces. Within each of those directories you will find a directory for each downloaded model, and inside that a file named for each tag. Each tag file is the manifest for the model.
The manifest lists all of the layers used in the model. For each layer you will see a `media type` along with a digest; that digest corresponds to a file in the `models/blobs` directory.
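For example, to inspect one of these manifests (an illustrative macOS path; substitute your own namespace, model, and tag):

```bash
# pretty-print the manifest for an example model and tag
cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama2/latest | python3 -m json.tool
```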
### How can I change where Ollama stores models?
To modify where models are stored, you can use the `OLLAMA_MODELS` environment variable. Note that on Linux this means defining `OLLAMA_MODELS` in a drop-in `/etc/systemd/system/ollama.service.d` service file, reloading systemd, and restarting the ollama service.
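On Linux, for example, this follows the same drop-in pattern shown above; the `/data/ollama/models` path is only a placeholder:

```bash
# create a drop-in directory and point OLLAMA_MODELS at the new location
mkdir -p /etc/systemd/system/ollama.service.d
echo '[Service]' >>/etc/systemd/system/ollama.service.d/environment.conf
echo 'Environment="OLLAMA_MODELS=/data/ollama/models"' >>/etc/systemd/system/ollama.service.d/environment.conf
# pick up the new environment and restart the service
systemctl daemon-reload
systemctl restart ollama
```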
## Does Ollama send my prompts and answers back to Ollama.ai to use in any way?
No. Anything you do with Ollama, such as generating a response from the model, stays with you. We don't collect any data about how you use the model. You are always in control of your own data.
## How can I use Ollama in Visual Studio Code?
There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. You can see the list of [extensions & plugins](https://github.com/jmorganca/ollama#extensions--plugins) at the bottom of the main repository readme.
## How do I use Ollama behind a proxy?
Ollama is compatible with proxy servers if `HTTP_PROXY` or `HTTPS_PROXY` is configured. When using either variable, ensure it is set where `ollama serve` can access the value.
When using `HTTPS_PROXY`, ensure the proxy certificate is installed as a system certificate.
On macOS:
```bash
HTTPS_PROXY=http://proxy.example.com ollama serve
```
On Linux:
```bash
echo 'Environment="HTTPS_PROXY=https://proxy.example.com"' >>/etc/systemd/system/ollama.service.d/environment.conf
```
Reload `systemd` and restart Ollama:
```bash
systemctl daemon-reload
systemctl restart ollama
```
### How do I use Ollama behind a proxy in Docker?
The Ollama Docker container image can be configured to use a proxy by passing `-e HTTPS_PROXY=https://proxy.example.com` when starting the container.
Alternatively, Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on [macOS](https://docs.docker.com/desktop/settings/mac/#proxies), [Windows](https://docs.docker.com/desktop/settings/windows/#proxies), and [Linux](https://docs.docker.com/desktop/settings/linux/#proxies), and Docker [daemon with systemd](https://docs.docker.com/config/daemon/systemd/#httphttps-proxy).
Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.
```dockerfile
FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates
```
Build and run this image:
```shell
docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```
## How do I use Ollama with GPU acceleration in Docker?
The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). See [ollama/ollama](https://hub.docker.com/r/ollama/ollama) for more details.
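Once the toolkit is installed, a typical invocation looks something like the following sketch; adjust the volume and port mappings for your setup:

```bash
# expose all GPUs to the container and persist models in a named volume
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```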
GPU acceleration is not available for Docker Desktop on macOS due to the lack of GPU passthrough and emulation.

View File

@@ -1,52 +1,20 @@
# Import a model
This guide walks through importing a GGUF, PyTorch or Safetensors model.
This guide walks through importing a PyTorch, Safetensors or GGUF model.
## Importing (GGUF)
### Step 1: Write a `Modelfile`
Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
```
FROM ./mistral-7b-v0.1.Q4_0.gguf
```
(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
```
FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
### Step 2: Create the Ollama model
Finally, create a model from your `Modelfile`:
```
ollama create example -f Modelfile
```
### Step 3: Run your model
Next, test the model with `ollama run`:
```
ollama run example "What is your favourite condiment?"
```
## Importing (PyTorch & Safetensors)
### Supported models
## Supported models
Ollama supports a set of model architectures, with support for more coming soon:
- Llama & Mistral
- Falcon & RW
- GPT-NeoX
- BigCode
To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
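For example, with a local copy of `config.json` you can print that entry from the command line; this is just an illustrative one-liner:

```bash
# print the declared architecture(s) from the model's config.json
python3 -c 'import json; print(json.load(open("config.json"))["architectures"])'
```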
## Importing
### Step 1: Clone the HuggingFace repository (optional)
If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
@@ -76,7 +44,7 @@ This will output two files into the directory:
### Step 3: Write a `Modelfile`
Next, create a `Modelfile` for your model:
Next, create a `Modelfile` for your model. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
```
FROM ./q4_0.bin
@@ -97,15 +65,13 @@ Finally, create a model from your `Modelfile`:
ollama create example -f Modelfile
```
### Step 5: Run your model
Next, test the model with `ollama run`:
```
ollama run example "What is your favourite condiment?"
```
## Publishing your model (optional early alpha)
### Step 5: Publish your model (optional early alpha)
Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:
@@ -183,6 +149,9 @@ python convert.py <path to model directory>
# FalconForCausalLM
python convert-falcon-hf-to-gguf.py <path to model directory>
# GPTNeoXForCausalLM
python convert-gptneox-hf-to-gguf.py <path to model directory>
# GPTBigCodeForCausalLM
python convert-starcoder-hf-to-gguf.py <path to model directory>
```

View File

@@ -1,16 +1,12 @@
# Ollama on Linux
# Installing Ollama on Linux
## Install
Install Ollama by running this one-liner:
> Note: A one line installer for Ollama is available by running:
>
```bash
curl https://ollama.ai/install.sh | sh
```
> ```bash
> curl https://ollama.ai/install.sh | sh
> ```
## Manual install
### Download the `ollama` binary
## Download the `ollama` binary
Ollama is distributed as a self-contained binary. Download it to a directory in your PATH:
@@ -19,7 +15,31 @@ sudo curl -L https://ollama.ai/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
```
### Adding Ollama as a startup service (recommended)
## Start Ollama
Start Ollama by running `ollama serve`:
```bash
ollama serve
```
Once Ollama is running, run a model in another terminal session:
```bash
ollama run llama2
```
## Install CUDA drivers (optional for Nvidia GPUs)
[Download and install](https://developer.nvidia.com/cuda-downloads) CUDA.
Verify that the drivers are installed by running the following command, which should print details about your GPU:
```bash
nvidia-smi
```
## Adding Ollama as a startup service (optional)
Create a user for Ollama:
@@ -40,6 +60,7 @@ User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HOME=/usr/share/ollama"
[Install]
WantedBy=default.target
@@ -52,65 +73,10 @@ sudo systemctl daemon-reload
sudo systemctl enable ollama
```
### Install CUDA drivers (optional for Nvidia GPUs)
[Download and install](https://developer.nvidia.com/cuda-downloads) CUDA.
Verify that the drivers are installed by running the following command, which should print details about your GPU:
```bash
nvidia-smi
```
### Start Ollama
Start Ollama using `systemd`:
```bash
sudo systemctl start ollama
```
## Update
Update Ollama by running the install script again:
```bash
curl https://ollama.ai/install.sh | sh
```
Or by downloading the ollama binary:
```bash
sudo curl -L https://ollama.ai/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
```
## Viewing logs
### Viewing logs
To view logs of Ollama running as a startup service, run:
```bash
journalctl -u ollama
```
## Uninstall
Remove the ollama service:
```bash
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
```
Remove the ollama binary from your bin directory (either `/usr/local/bin`, `/usr/bin`, or `/bin`):
```bash
sudo rm $(which ollama)
```
Remove the downloaded models and Ollama service user:
```bash
sudo rm -r /usr/share/ollama
sudo userdel ollama
```

View File

@@ -30,19 +30,17 @@ The format of the `Modelfile`:
INSTRUCTION arguments
```
| Instruction | Description |
| ----------------------------------- | -------------------------------------------------------------- |
| [`FROM`](#from-required) (required) | Defines the base model to use. |
| [`PARAMETER`](#parameter) | Sets the parameters for how Ollama will run the model. |
| [`TEMPLATE`](#template) | The full prompt template to be sent to the model. |
| [`SYSTEM`](#system) | Specifies the system message that will be set in the template. |
| [`ADAPTER`](#adapter) | Defines the (Q)LoRA adapters to apply to the model. |
| [`LICENSE`](#license) | Specifies the legal license. |
| Instruction | Description |
| ----------------------------------- | ------------------------------------------------------------- |
| [`FROM`](#from-required) (required) | Defines the base model to use. |
| [`PARAMETER`](#parameter) | Sets the parameters for how Ollama will run the model. |
| [`TEMPLATE`](#template) | The full prompt template to be sent to the model. |
| [`SYSTEM`](#system) | Specifies the system prompt that will be set in the template. |
| [`ADAPTER`](#adapter) | Defines the (Q)LoRA adapters to apply to the model. |
| [`LICENSE`](#license) | Specifies the legal license. |
## Examples
### Basic `Modelfile`
An example of a `Modelfile` creating a mario blueprint:
```modelfile
@@ -52,7 +50,7 @@ PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096
# sets a custom system message to specify the behavior of the chat assistant
# sets a custom system prompt to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.
```
@@ -65,35 +63,6 @@ To use this:
More examples are available in the [examples directory](../examples).
### `Modelfile`s in [ollama.ai/library][1]
There are two ways to view `Modelfile`s underlying the models in [ollama.ai/library][1]:
- Option 1: view a details page from a model's tags page:
1. Go to a particular model's tags (e.g. https://ollama.ai/library/llama2/tags)
2. Click on a tag (e.g. https://ollama.ai/library/llama2:13b)
3. Scroll down to "Layers"
- Note: if the [`FROM` instruction](#from-required) is not present,
it means the model was created from a local file
- Option 2: use `ollama show` to print the `Modelfile` like so:
```bash
> ollama show --modelfile llama2:13b
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama2:13b
FROM /root/.ollama/models/blobs/sha256:123abc
TEMPLATE """[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>
{{ end }}{{ .Prompt }} [/INST] """
SYSTEM """"""
PARAMETER stop [INST]
PARAMETER stop [/INST]
PARAMETER stop <<SYS>>
PARAMETER stop <</SYS>>
```
## Instructions
### FROM (Required)
@@ -143,8 +112,8 @@ PARAMETER <parameter> <parametervalue>
| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 |
| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
| stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" |
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
| stop | Sets the stop sequences to use. | string | stop "AI assistant:" |
| tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 |
| num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 |
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
@@ -152,15 +121,15 @@ PARAMETER <parameter> <parametervalue>
### TEMPLATE
`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system message and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the readme for that model.
`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the readme for that model.
#### Template Variables
| Variable | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------- |
| `{{ .System }}` | The system message used to specify custom behavior, this must also be set in the Modelfile as an instruction. |
| `{{ .Prompt }}` | The incoming prompt, this is not specified in the model file and will be set based on input. |
| `{{ .First }}` | A boolean value used to render specific template information for the first generation of a session. |
| Variable | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------ |
| `{{ .System }}` | The system prompt used to specify custom behavior, this must also be set in the Modelfile as an instruction. |
| `{{ .Prompt }}` | The incoming prompt, this is not specified in the model file and will be set based on input. |
| `{{ .First }}` | A boolean value used to render specific template information for the first generation of a session. |
```modelfile
TEMPLATE """
@@ -180,7 +149,7 @@ SYSTEM """<system message>"""
### SYSTEM
The `SYSTEM` instruction specifies the system message to be used in the template, if applicable.
The `SYSTEM` instruction specifies the system prompt to be used in the template, if applicable.
```modelfile
SYSTEM """<system message>"""
@@ -208,5 +177,3 @@ LICENSE """
- the **`Modelfile` is not case sensitive**. In the examples, we use uppercase for instructions to make it easier to distinguish them from arguments.
- Instructions can be in any order. In the examples, we start with the FROM instruction to keep it easily readable.
[1]: https://ollama.ai/library

View File

@@ -4,6 +4,5 @@ Here is a list of ways you can use Ollama with other tools to build interesting
- [Using LangChain with Ollama in JavaScript](./tutorials/langchainjs.md)
- [Using LangChain with Ollama in Python](./tutorials/langchainpy.md)
- [Running Ollama on NVIDIA Jetson Devices](./tutorials/nvidia-jetson.md)
Also be sure to check out the [examples](../examples) directory for more ways to use Ollama.

View File

@@ -1,83 +0,0 @@
# Running Ollama on Fly.io GPU Instances
Ollama runs with little to no configuration on [Fly.io GPU instances](https://fly.io/docs/gpus/gpu-quickstart/). If you don't have access to GPUs yet, you'll need to [apply for access](https://fly.io/gpu/) on the waitlist. Once you're accepted, you'll get an email with instructions on how to get started.
Create a new app with `fly apps create`:
```bash
fly apps create
```
Then create a `fly.toml` file in a new folder that looks like this:
```toml
app = "sparkling-violet-709"
primary_region = "ord"
vm.size = "a100-40gb" # see https://fly.io/docs/gpus/gpu-quickstart/ for more info
[build]
image = "ollama/ollama"
[http_service]
internal_port = 11434
force_https = false
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
processes = ["app"]
[mounts]
source = "models"
destination = "/root/.ollama"
initial_size = "100gb"
```
Then create a [new private IPv6 address](https://fly.io/docs/reference/private-networking/#flycast-private-load-balancing) for your app:
```bash
fly ips allocate-v6 --private
```
Then deploy your app:
```bash
fly deploy
```
And finally you can access it interactively with a new Fly.io Machine:
```
fly machine run -e OLLAMA_HOST=http://your-app-name.flycast --shell ollama/ollama
```
```bash
$ ollama run openchat:7b-v3.5-fp16
>>> How do I bake chocolate chip cookies?
To bake chocolate chip cookies, follow these steps:
1. Preheat the oven to 375°F (190°C) and line a baking sheet with parchment paper or silicone baking mat.
2. In a large bowl, mix together 1 cup of unsalted butter (softened), 3/4 cup granulated sugar, and 3/4
cup packed brown sugar until light and fluffy.
3. Add 2 large eggs, one at a time, to the butter mixture, beating well after each addition. Stir in 1
teaspoon of pure vanilla extract.
4. In a separate bowl, whisk together 2 cups all-purpose flour, 1/2 teaspoon baking soda, and 1/2 teaspoon
salt. Gradually add the dry ingredients to the wet ingredients, stirring until just combined.
5. Fold in 2 cups of chocolate chips (or chunks) into the dough.
6. Drop rounded tablespoons of dough onto the prepared baking sheet, spacing them about 2 inches apart.
7. Bake for 10-12 minutes, or until the edges are golden brown. The centers should still be slightly soft.
8. Allow the cookies to cool on the baking sheet for a few minutes before transferring them to a wire rack
to cool completely.
Enjoy your homemade chocolate chip cookies!
```
When you set it up like this, it will automatically turn off when you're done using it. Then when you access it again, it will automatically turn back on. This is a great way to save money on GPU instances when you're not using them. If you want a persistent wake-on-use connection to your Ollama instance, you can set up a [connection to your Fly network using WireGuard](https://fly.io/docs/reference/private-networking/#discovering-apps-through-dns-on-a-wireguard-connection). Then you can access your Ollama instance at `http://your-app-name.flycast`.
And that's it!

View File

@@ -23,17 +23,13 @@ const answer = await ollama.call(`why is the sky blue?`);
console.log(answer);
```
That will get us the same thing as if we ran `ollama run llama2 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's install **Cheerio** and build that part of the app.
```bash
npm install cheerio
```
That will get us the same thing as if we ran `ollama run llama2 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's build that part of the app.
```javascript
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/2023_Hawaii_wildfires");
const data = await loader.load();
const data = loader.load();
```
That will load the document. Although this page is smaller than the Odyssey, it is certainly bigger than the context size for most LLMs. So we are going to need to split into smaller pieces, and then select just the pieces relevant to our question. This is a great use for a vector datastore. In this example, we will use the **MemoryVectorStore** that is part of **LangChain**. But there is one more thing we need to get the content into the datastore. We have to run an embeddings process that converts the tokens in the text into a series of vectors. And for that, we are going to use **Tensorflow**. There is a lot of stuff going on in this one. First, install the **Tensorflow** components that we need.

View File

@@ -42,13 +42,12 @@ text_splitter=RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
```
It's split up, but we have to find the relevant splits and then submit those to the model. We can do this by creating embeddings and storing them in a vector database. We can use Ollama directly to instantiate an embedding model. We will use ChromaDB in this example for a vector database. `pip install GPT4All chromadb`
It's split up, but we have to find the relevant splits and then submit those to the model. We can do this by creating embeddings and storing them in a vector database. For now, we don't have embeddings built in to Ollama, though we will be adding that soon, so for now, we can use the GPT4All library for that. We will use ChromaDB in this example for a vector database. `pip install GPT4All chromadb`
```python
from langchain.embeddings import OllamaEmbeddings
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=oembed)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
```
Now let's ask a question from the document. **Who was Neleus, and who is in his family?** Neleus is a character in the Odyssey, and the answer can be found in our text.

View File

@@ -1,38 +0,0 @@
# Running Ollama on NVIDIA Jetson Devices
With some minor configuration, Ollama runs well on [NVIDIA Jetson Devices](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/). The following has been tested on [JetPack 5.1.2](https://developer.nvidia.com/embedded/jetpack).
NVIDIA Jetson devices are Linux-based embedded AI computers that are purpose-built for AI applications.
Jetsons have an integrated GPU that is wired directly to the memory controller of the machine. For this reason, the `nvidia-smi` command is unrecognized, and Ollama proceeds to operate in "CPU only"
mode. This can be verified by using a monitoring tool like jtop.
In order to address this, we simply pass the path to the Jetson's pre-installed CUDA libraries into `ollama serve` (while in a tmux session). We then hardcode the `num_gpu` parameter into a cloned version of our target model.
Prerequisites:
- curl
- tmux
Here are the steps:
- Install Ollama via standard Linux command (ignore the 404 error): `curl https://ollama.ai/install.sh | sh`
- Stop the Ollama service: `sudo systemctl stop ollama`
- Start Ollama serve in a tmux session called ollama_jetson and reference the CUDA libraries path: `tmux has-session -t ollama_jetson 2>/dev/null || tmux new-session -d -s ollama_jetson
'LD_LIBRARY_PATH=/usr/local/cuda/lib64 ollama serve'`
- Pull the model you want to use (e.g. mistral): `ollama pull mistral`
- Create a new Modelfile specifically for enabling GPU support on the Jetson: `touch ModelfileMistralJetson`
- In the ModelfileMistralJetson file, specify the FROM model and the num_gpu PARAMETER as shown below:
```
FROM mistral
PARAMETER num_gpu 999
```
- Create a new model from your Modelfile: `ollama create mistral-jetson -f ./ModelfileMistralJetson`
- Run the new model: `ollama run mistral-jetson`
If you run a monitoring tool like jtop you should now see that Ollama is using the Jetson's integrated GPU.
And that's it!

View File

@@ -1,10 +0,0 @@
# Bash Shell examples
When calling `ollama`, you can pass it a file to run all the prompts in the file, one after the other:
`ollama run llama2 < sourcequestions.txt`
This concept is used in the following example.
## Compare Models
`comparemodels.sh` is a script that runs all the questions in `sourcequestions.txt` using any 4 models you choose that you have already pulled from the Ollama library or have created locally.
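To try it, make the script executable and run it from this directory (it expects `sourcequestions.txt` alongside it):

```bash
# the script prompts you to select 4 models from `ollama list`
chmod +x comparemodels.sh
./comparemodels.sh
```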

View File

@@ -1,64 +0,0 @@
#! /usr/bin/env bash
# Compare multiple models by running them with the same questions
NUMBEROFCHOICES=4
SELECTIONS=()
declare -a SUMS=()
# Get the list of models
CHOICES=$(ollama list | awk '{print $1}')
# Select which models to run as a comparison
echo "Select $NUMBEROFCHOICES models to compare:"
select ITEM in $CHOICES; do
if [[ -n $ITEM ]]; then
echo "You have selected $ITEM"
SELECTIONS+=("$ITEM")
((COUNT++))
if [[ $COUNT -eq $NUMBEROFCHOICES ]]; then
break
fi
else
echo "Invalid selection"
fi
done
# Loop through each of the selected models
for ITEM in "${SELECTIONS[@]}"; do
echo "--------------------------------------------------------------"
echo "Loading the model $ITEM into memory"
ollama run "$ITEM" ""
echo "--------------------------------------------------------------"
echo "Running the questions through the model $ITEM"
COMMAND_OUTPUT=$(ollama run "$ITEM" --verbose < sourcequestions.txt 2>&1| tee /dev/stderr)
# eval duration is sometimes listed in seconds and sometimes in milliseconds.
# Add up the values for each model
SUM=$(echo "$COMMAND_OUTPUT" | awk '
/eval duration:/ {
value = $3
if (index(value, "ms") > 0) {
gsub("ms", "", value)
value /= 1000
} else {
gsub("s", "", value)
}
sum += value
}
END { print sum }')
SUMS+=("All questions for $ITEM completed in $SUM seconds")
done
echo ""
echo "--------------------------------------------------------------"
echo -e "Sums of eval durations for each run:"
for val in "${SUMS[@]}"; do
echo "$val"
done
echo "--------------------------------------------------------------"
echo "Comparison complete. Now you can decide"
echo "which model is best."
echo "--------------------------------------------------------------"

View File

@@ -1,7 +0,0 @@
Why is the sky blue
What is a black hole
Explain the big bang theory like I am 5?
What is the quickest way to win a game of Monopoly with 3 others?
Why does a vacuum bottle keep my coffee hot and my milkshake cold?
What is the difference between a meteor, a meteorite, and a meteoroid?
Create an array with 5 items and print to the console. Do this in Python, C#, Typescript, and Rust.

View File

@@ -1,5 +0,0 @@
# Ollama Jupyter Notebook
This example downloads and installs Ollama in a Jupyter instance such as Google Colab. It will start the Ollama service and expose an endpoint using `ngrok` which can be used to communicate with the Ollama instance remotely.
For best results, use an instance with GPU accelerator.

View File

@@ -1,102 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "93f59dcb-c588-41b8-a792-55d88ade739c",
"metadata": {},
"outputs": [],
"source": [
"# Download and run the Ollama Linux install script\n",
"!curl https://ollama.ai/install.sh | sh\n",
"!command -v systemctl >/dev/null && sudo systemctl stop ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "658c147e-c7f8-490e-910e-62b80f577dda",
"metadata": {},
"outputs": [],
"source": [
"!pip install aiohttp pyngrok\n",
"\n",
"import os\n",
"import asyncio\n",
"from aiohttp import ClientSession\n",
"\n",
"# Set LD_LIBRARY_PATH so the system NVIDIA library becomes preferred\n",
"# over the built-in library. This is particularly important for \n",
"# Google Colab which installs older drivers\n",
"os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})\n",
"\n",
"async def run(cmd):\n",
" '''\n",
" run is a helper function to run subcommands asynchronously.\n",
" '''\n",
" print('>>> starting', *cmd)\n",
" p = await asyncio.subprocess.create_subprocess_exec(\n",
" *cmd,\n",
" stdout=asyncio.subprocess.PIPE,\n",
" stderr=asyncio.subprocess.PIPE,\n",
" )\n",
"\n",
" async def pipe(lines):\n",
" async for line in lines:\n",
" print(line.strip().decode('utf-8'))\n",
"\n",
" await asyncio.gather(\n",
" pipe(p.stdout),\n",
" pipe(p.stderr),\n",
" )\n",
"\n",
"\n",
"await asyncio.gather(\n",
" run(['ollama', 'serve']),\n",
" run(['ngrok', 'http', '--log', 'stderr', '11434']),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e7735a55-9aad-4caf-8683-52e2163ba53b",
"metadata": {},
"source": [
"The previous cell starts two processes, `ollama` and `ngrok`. The log output will show a line like the following which describes the external address.\n",
"\n",
"```\n",
"t=2023-11-12T22:55:56+0000 lvl=info msg=\"started tunnel\" obj=tunnels name=command_line addr=http://localhost:11434 url=https://8249-34-125-179-11.ngrok.io\n",
"```\n",
"\n",
"The external address in this case is `https://8249-34-125-179-11.ngrok.io` which can be passed into `OLLAMA_HOST` to access this instance.\n",
"\n",
"```bash\n",
"export OLLAMA_HOST=https://8249-34-125-179-11.ngrok.io\n",
"ollama list\n",
"ollama run mistral\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,36 +0,0 @@
# Deploy Ollama to Kubernetes
## Prerequisites
- Ollama: https://ollama.ai/download
- Kubernetes cluster. This example will use Google Kubernetes Engine.
## Steps
1. Create the Ollama namespace, daemon set, and service
```bash
kubectl apply -f cpu.yaml
```
1. Port forward the Ollama service to connect and use it locally
```bash
kubectl -n ollama port-forward service/ollama 11434:80
```
1. Pull and run a model, for example `orca-mini:3b`
```bash
ollama run orca-mini:3b
```
## (Optional) Hardware Acceleration
Hardware acceleration in Kubernetes requires NVIDIA's [`k8s-device-plugin`](https://github.com/NVIDIA/k8s-device-plugin). Follow the link for more details.
Once configured, create a GPU enabled Ollama deployment.
```bash
kubectl apply -f gpu.yaml
```
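To confirm the deployment came up, an illustrative check is:

```bash
# check that the GPU-enabled pod is running and view its logs
kubectl -n ollama get pods
kubectl -n ollama logs deploy/ollama
```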

View File

@@ -1,42 +0,0 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: ollama
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: ollama
spec:
selector:
matchLabels:
name: ollama
template:
metadata:
labels:
name: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- name: http
containerPort: 11434
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: ollama
spec:
type: ClusterIP
selector:
name: ollama
ports:
- port: 80
name: http
targetPort: http
protocol: TCP

View File

@@ -1,58 +0,0 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: ollama
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: ollama
spec:
strategy:
type: Recreate
selector:
matchLabels:
name: ollama
template:
metadata:
labels:
name: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
env:
- name: PATH
value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
ports:
- name: http
containerPort: 11434
protocol: TCP
resources:
limits:
nvidia.com/gpu: 1
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: ollama
spec:
type: ClusterIP
selector:
name: ollama
ports:
- port: 80
name: http
targetPort: http
protocol: TCP

View File

@@ -6,6 +6,7 @@ PERSIST_DIRECTORY = os.environ.get('PERSIST_DIRECTORY', 'db')
# Define the Chroma settings
CHROMA_SETTINGS = Settings(
chroma_db_impl='duckdb+parquet',
persist_directory=PERSIST_DIRECTORY,
anonymized_telemetry=False
)

View File

@@ -150,7 +150,7 @@ def main():
print("Creating new vectorstore")
texts = process_documents()
print(f"Creating embeddings. May take some minutes...")
db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
db.persist()
db = None

View File

@@ -4,7 +4,6 @@ from langchain.embeddings import HuggingFaceEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.vectorstores import Chroma
from langchain.llms import Ollama
import chromadb
import os
import argparse
import time
@@ -23,9 +22,7 @@ def main():
# Parse the command line arguments
args = parse_arguments()
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
db = Chroma(persist_directory=persist_directory, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
retriever = db.as_retriever(search_kwargs={"k": target_source_chunks})
# activate/deactivate the streaming StdOut callback for LLMs
callbacks = [] if args.mute_stream else [StreamingStdOutCallbackHandler()]

File diff suppressed because it is too large Load Diff

View File

@@ -1,31 +0,0 @@
import requests
import json
import random
model = "llama2"
template = {
"firstName": "",
"lastName": "",
"address": {
"street": "",
"city": "",
"state": "",
"zipCode": ""
},
"phoneNumber": ""
}
prompt = f"generate one realistically believable sample data set of a persons first name, last name, address in the US, and phone number. \nUse the following template: {json.dumps(template)}."
data = {
"prompt": prompt,
"model": model,
"format": "json",
"stream": False,
"options": {"temperature": 2.5, "top_p": 0.99, "top_k": 100},
}
print(f"Generating a sample user")
response = requests.post("http://localhost:11434/api/generate", json=data, stream=False)
json_data = json.loads(response.text)
print(json.dumps(json.loads(json_data["response"]), indent=2))

View File

@@ -1,31 +0,0 @@
import requests
import json
import random
countries = [
"United States",
"United Kingdom",
"the Netherlands",
"Germany",
"Mexico",
"Canada",
"France",
]
country = random.choice(countries)
model = "llama2"
prompt = f"generate one realistically believable sample data set of a persons first name, last name, address in {country}, and phone number. Do not use common names. Respond using JSON. Key names should have no backslashes, values should use plain ascii with no special characters."
data = {
"prompt": prompt,
"model": model,
"format": "json",
"stream": False,
"options": {"temperature": 2.5, "top_p": 0.99, "top_k": 100},
}
print(f"Generating a sample user in {country}")
response = requests.post("http://localhost:11434/api/generate", json=data, stream=False)
json_data = json.loads(response.text)
print(json.dumps(json.loads(json_data["response"]), indent=2))

View File

@@ -1,34 +0,0 @@
# JSON Output Example
![llmjson 2023-11-10 15_31_31](https://github.com/jmorganca/ollama/assets/633681/e599d986-9b4a-4118-81a4-4cfe7e22da25)
There are two python scripts in this example. `randomaddresses.py` generates random addresses from different countries. `predefinedschema.py` sets a template for the model to fill in.
## Review the Code
Both programs are basically the same, with a different prompt for each, demonstrating two different ideas. The key part of getting JSON out of a model is to state in the prompt or system prompt that it should respond using JSON, and to specify the `format` as `json` in the data body.
```python
prompt = f"generate one realistically believable sample data set of a persons first name, last name, address in {country}, and phone number. Do not use common names. Respond using JSON. Key names should with no backslashes, values should use plain ascii with no special characters."
data = {
"prompt": prompt,
"model": model,
"format": "json",
"stream": False,
"options": {"temperature": 2.5, "top_p": 0.99, "top_k": 100},
}
```
When running `randomaddresses.py` you will see that the schema changes and adapts to the chosen country.
In `predefinedschema.py`, a template has been specified in the prompt as well. It's been defined as JSON and then dumped into the prompt string to make it easier to work with.
Both examples turn streaming off so that we end up with the completed JSON all at once. We need to convert the `response.text` to JSON so that when we output it as a string we can set the indent spacing to make the output easy to read.
```python
response = requests.post("http://localhost:11434/api/generate", json=data, stream=False)
json_data = json.loads(response.text)
print(json.dumps(json.loads(json_data["response"]), indent=2))
```
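To run either script, assuming Python 3 and the `requirements.txt` in this directory, and with Ollama running locally:

```bash
# install the only dependency and run the two examples
pip install -r requirements.txt
python randomaddresses.py
python predefinedschema.py
```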

View File

@@ -1 +0,0 @@
Requests==2.31.0

View File

@@ -1,8 +0,0 @@
FROM codebooga:latest
SYSTEM """
You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer.
"""
PARAMETER TEMPERATURE 0.3

View File

@@ -1,42 +0,0 @@
import sys
import re
import requests
import json
# prelines and postlines represent the number of lines of context to include in the output around the error
prelines = 10
postlines = 10
def find_errors_in_log_file():
if len(sys.argv) < 2:
print("Usage: python loganalysis.py <filename>")
return
log_file_path = sys.argv[1]
with open(log_file_path, 'r') as log_file:
log_lines = log_file.readlines()
error_logs = []
for i, line in enumerate(log_lines):
if "error" in line.lower():
start_index = max(0, i - prelines)
end_index = min(len(log_lines), i + postlines + 1)
error_logs.extend(log_lines[start_index:end_index])
return error_logs
error_logs = find_errors_in_log_file()
data = {
"prompt": "\n".join(error_logs),
"model": "mattw/loganalyzer"
}
response = requests.post("http://localhost:11434/api/generate", json=data, stream=True)
for line in response.iter_lines():
if line:
json_data = json.loads(line)
if json_data['done'] == False:
print(json_data['response'], end='', flush=True)

View File

@@ -1,32 +0,0 @@
2023-11-10 07:17:40 /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
2023-11-10 07:17:40 /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
2023-11-10 07:17:40 /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
2023-11-10 07:17:40 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
2023-11-10 07:17:40 10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
2023-11-10 07:17:40 /docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
2023-11-10 07:17:40 /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
2023-11-10 07:17:40 /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
2023-11-10 07:17:40 /docker-entrypoint.sh: Configuration complete; ready for start up
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: using the "epoll" event method
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14)
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: OS: Linux 6.4.16-linuxkit
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker processes
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 29
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 30
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 31
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 32
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 33
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 34
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 35
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 36
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 37
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: start worker process 38
2023-11-10 07:17:44 192.168.65.1 - - [10/Nov/2023:13:17:43 +0000] "GET / HTTP/1.1" 200 615 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "-"
2023-11-10 07:17:44 2023/11/10 13:17:44 [error] 29#29: *1 open() "/usr/share/nginx/html/favicon.ico" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "localhost:8080", referrer: "http://localhost:8080/"
2023-11-10 07:17:44 192.168.65.1 - - [10/Nov/2023:13:17:44 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "-"
2023-11-10 07:17:50 2023/11/10 13:17:50 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080"
2023-11-10 07:17:50 192.168.65.1 - - [10/Nov/2023:13:17:50 +0000] "GET /ahstat HTTP/1.1" 404 555 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "-"
2023-11-10 07:18:53 2023/11/10 13:18:53 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080"
2023-11-10 07:18:53 192.168.65.1 - - [10/Nov/2023:13:18:53 +0000] "GET /ahstat HTTP/1.1" 404 555 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "-"

View File

@@ -1,48 +0,0 @@
# Log Analysis example
![loganalyzer 2023-11-10 08_53_29](https://github.com/jmorganca/ollama/assets/633681/ad30f1fc-321f-4953-8914-e30e24db9921)
This example shows one possible way to create a log file analyzer. To use it, run:
`python loganalysis.py <logfile>`
You can try this with the `logtest.logfile` file included in this directory.
## Review the code
The first part of this example is a Modelfile that takes `codebooga` and applies a new System Prompt:
```plaintext
SYSTEM """
You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer.
"""
```
This model is available at https://ollama.ai/mattw/loganalyzer. You can customize it and add it to your own namespace using the command `ollama create <namespace/modelname> -f <path-to-modelfile>` then `ollama push <namespace/modelname>`.
Then loganalysis.py scans all the lines in the given log file and searches for the word 'error'. When the word is found, the 10 lines before and after are set as the prompt for a call to the Generate API.
```python
data = {
"prompt": "\n".join(error_logs),
"model": "mattw/loganalyzer"
}
```
Finally, the streamed output is parsed and the `response` field in the output is printed to the terminal.
```python
response = requests.post("http://localhost:11434/api/generate", json=data, stream=True)
for line in response.iter_lines():
if line:
json_data = json.loads(line)
if json_data['done'] == False:
print(json_data['response'], end='')
```
## Next Steps
There is a lot more that can be done here. This is a simple approach to detecting errors: it just looks for the word "error". Perhaps it would be interesting to find anomalous activity in the logs. It could be interesting to create embeddings for each line and compare them, looking for similar lines. Or look into applying Levenshtein distance algorithms to find similar lines to help identify the anomalous ones.
Also try different models and different prompts to analyze the data. You could consider adding retrieval augmented generation (RAG) to this to help understand newer log formats.

View File

@@ -1 +0,0 @@
Requests==2.31.0

View File

@@ -1,46 +0,0 @@
import json
import requests
# NOTE: ollama must be running for this to work, start the ollama app or run `ollama serve`
model = "llama2" # TODO: update this for whatever model you wish to use
def chat(messages):
r = requests.post(
"http://0.0.0.0:11434/api/chat",
json={"model": model, "messages": messages, "stream": True},
)
r.raise_for_status()
output = ""
for line in r.iter_lines():
body = json.loads(line)
if "error" in body:
raise Exception(body["error"])
if body.get("done") is False:
message = body.get("message", "")
content = message.get("content", "")
output += content
# the response streams one token at a time, print that as we receive it
print(content, end="", flush=True)
if body.get("done", False):
message["content"] = output
return message
def main():
messages = []
while True:
user_input = input("Enter a prompt: ")
print()
messages.append({"role": "user", "content": user_input})
message = chat(messages)
messages.append(message)
print("\n\n")
if __name__ == "__main__":
main()

View File

@@ -1,24 +0,0 @@
# Simple Chat Example
The **chat** endpoint is one of two ways to generate text from an LLM with Ollama. At a high level you provide the endpoint an array of objects with a role and content specified. Then with each output and prompt, you add more of those role/content objects, which builds up the history.
## Review the Code
You can see in the **chat** function that actually calling the endpoint is done simply with:
```python
r = requests.post(
"http://0.0.0.0:11434/api/chat",
json={"model": model, "messages": messages, "stream": True},
)
```
With the **generate** endpoint, you need to provide a `prompt`. But with **chat**, you provide `messages`. And the resulting stream of responses includes a `message` object with a `content` field.
The final JSON object doesn't provide the full content, so you will need to build the content yourself.
In the **main** function, we collect `user_input` and add it as a message to our list of messages, which is then passed to the chat function. When the LLM is done responding, the output is added as another message.
## Next Steps
In this example, all generations are kept. You might want to experiment with summarizing everything older than 10 conversations to enable longer history with less context being used.

View File

@@ -17,7 +17,7 @@ def generate(prompt, context):
for line in r.iter_lines():
body = json.loads(line)
response_part = body.get('response', '')
# the response streams one token at a time, print that as we receive it
print(response_part, end='', flush=True)
if 'error' in body:
@@ -35,4 +35,4 @@ def main():
print()
if __name__ == "__main__":
main()

View File

@@ -1,118 +0,0 @@
import { Ollama } from "ollama-node";
import { readFile } from "fs/promises";
// function to be called on events
function reportEvents(name: string, date: string, location: string) {
const nameString = name ? `${name}` : `an event`;
const dateString = date ? ` on ${date}` : ``;
const locationString = location ? ` at ${location}` : ``;
console.log(`You have an event: ${nameString}${dateString}${locationString}`)
}
// function to be called on addresses
function reportAddresses(address) {
for (const field in address) {
if (address[field]) {
if (field === "city") {
const city = address.city;
const state = address.state ? `, ${address.state}` : '';
const zip = address.zip ? ` ${address.zip}` : '';
console.log(`${city}${state}${zip}`);
break;
} else {
console.log(`${address[field]}`);
}
}
}
console.log(``);
}
async function main() {
const ollama = new Ollama();
const systemprompt = `You will be given a text along with a prompt and a schema. You will have to extract the information requested in the prompt from the text and generate output in JSON observing the schema provided. If the schema shows a type of integer or number, you must only show a integer for that field. A string should always be a valid string. If a value is unknown, leave it empty. Output the JSON with extra spaces to ensure that it pretty prints.`
const schema = {
"eventsQuantity": {
"type": "integer",
"description": "The number of events in the source text"
},
"addressesQuantity": {
"type": "integer",
"description": "The number of addresses in the source text"
},
"events": [{
name: {
"type": "string",
description: "Name of the event"
},
"date": {
"type": "string",
"description": "Date of the event"
},
"location": {
"type": "string",
"description": "Location of the event"
},
"extraInfo": {
"type": "string",
"description": "Any extra information that is provided about the event."
}
}],
"people": [{
"name": {
"type": "string",
"description": "Name of the person"
},
"company": {
"type": "string",
"description": "Name of the company where they work"
},
"street": {
"type": "string",
"description": "Street address of the person or company. This is only the street name and the numerical address. Do not include city, state, or zip of the address in this field."
},
"city": {
"type": "string",
"description": "City portion of the address of the person or company"
},
"state": {
"type": "string",
"description": "State portion of the address of the person or company"
},
"zip": {
"type": "string",
"description": "Zip code of the person or company"
},
"extraInfo": {
"type": "string",
"description": "Any extra information that is provided about the location."
}
}]
}
const textcontent = await readFile("./info.txt", "utf-8").then((text) => text.split(" ").slice(0, 2000).join(" "));
const prompt = `The source text is a series of emails that have been put into a single file. They are separated by three dashes. Review the source text and determine the full address of the person sending each of the emails as well as any events that we need to track. If they provide a company address use that. If any extra info is provided, such as a description of the place, or a floor, add it to extraInfo. The first field in the address JSON is quantity of events and should be set to the number of events tracked and the second field should be set to the number of addresses tracked in the file. Don't stuff an event into the output that isn't an event. Only add data to the mostly appropriate field. Don't make up fields that aren't in the schema. If there isn't a value for a field, use null. Output should be in JSON.\n\nSchema: \n${JSON.stringify(schema, null, 2)}\n\nSource Text:\n${textcontent}`
await ollama.setModel("neural-chat");
ollama.setSystemPrompt(systemprompt);
ollama.setJSONFormat(true);
const data = await ollama.generate(prompt);
const output = JSON.parse(data.output);
const events = output.events;
const addresses = output.people;
console.log(`Here are your ${output.eventsQuantity} events:`);
for (const event of events) {
reportEvents(event.name, event.date, event.location);
}
console.log(`\n\nHere are your ${output.addressesQuantity} addresses:`);
for (const address of addresses) {
reportAddresses(address);
}
}
main();

View File

@@ -1,38 +0,0 @@
import { Ollama } from "ollama-node";
import { readFile } from "fs/promises";
async function main() {
const ollama = new Ollama();
// Set the system prompt to prepare the model to receive a prompt and a schema and set some rules for the output.
const systemprompt = `You will be given a text along with a prompt and a schema. You will have to extract the information requested in the prompt from the text and generate output in JSON observing the schema provided. If the schema shows a type of integer or number, you must only show a integer for that field. A string should always be a valid string. If a value is unknown, leave it empty. Output the JSON with extra spaces to ensure that it pretty prints.`
const schema = {
"people": [{
"name": {
"type": "string",
"description": "Name of the person"
},
"title": {
"type": "string",
"description": "Title of the person"
}
}],
}
// Depending on the model chosen, you may be limited by the size of the context window, so limit the context to 2000 words.
const textcontent = await readFile("./wp.txt", "utf-8").then((text) => text.split(" ").slice(0, 2000).join(" "));
// Specific instructions for this task
const prompt = `Review the source text and determine the 10 most important people to focus on. Then extract the name and title for those people. Output should be in JSON.\n\nSchema: \n${JSON.stringify(schema, null, 2)}\n\nSource Text:\n${textcontent}`
await ollama.setModel("neural-chat");
ollama.setSystemPrompt(systemprompt);
// setJSONFormat is the equivalent of setting 'format: json' in the API
ollama.setJSONFormat(true);
await ollama.streamingGenerate(prompt, (word) => { process.stdout.write(word) })
}
main();

View File

@@ -1,17 +0,0 @@
---
Hi matt,
thanks for letting me know that you are going to come today, November 16, for my tea party. My address is 123 Falk St on Bainbridge Island. I live in the house with the red door. I will be home all day so just come by whenever you want.
Fred
---
Great, send the check to our office at 1917 1st St, Seattle, WA 98101. I will let you know when we receive it.
Mark Richardson
Big Corp
---
We are looking forward to seeing you at our Local AI Meetup. It will be held on December 3. It will be at the offices of Enormous Co. Our address is 344 1st Ave, Seattle, WA 98101. We will be meeting in the conference room on the 3rd floor.
Barbara Reilly
Enormous Co.

View File

@@ -1,519 +0,0 @@
{
"name": "typescript-functioncalling",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"dependencies": {
"ollama-node": "^0.1.27"
},
"devDependencies": {
"tsx": "^4.1.2",
"typescript": "^5.2.2"
}
},
"node_modules/@esbuild/android-arm": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/android-arm/-/android-arm-0.18.20.tgz",
"integrity": "sha512-fyi7TDI/ijKKNZTUJAQqiG5T7YjJXgnzkURqmGj13C6dCqckZBLdl4h7bkhHt/t0WP+zO9/zwroDvANaOqO5Sw==",
"cpu": [
"arm"
],
"dev": true,
"optional": true,
"os": [
"android"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/android-arm64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/android-arm64/-/android-arm64-0.18.20.tgz",
"integrity": "sha512-Nz4rJcchGDtENV0eMKUNa6L12zz2zBDXuhj/Vjh18zGqB44Bi7MBMSXjgunJgjRhCmKOjnPuZp4Mb6OKqtMHLQ==",
"cpu": [
"arm64"
],
"dev": true,
"optional": true,
"os": [
"android"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/android-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/android-x64/-/android-x64-0.18.20.tgz",
"integrity": "sha512-8GDdlePJA8D6zlZYJV/jnrRAi6rOiNaCC/JclcXpB+KIuvfBN4owLtgzY2bsxnx666XjJx2kDPUmnTtR8qKQUg==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"android"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/darwin-arm64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/darwin-arm64/-/darwin-arm64-0.18.20.tgz",
"integrity": "sha512-bxRHW5kHU38zS2lPTPOyuyTm+S+eobPUnTNkdJEfAddYgEcll4xkT8DB9d2008DtTbl7uJag2HuE5NZAZgnNEA==",
"cpu": [
"arm64"
],
"dev": true,
"optional": true,
"os": [
"darwin"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/darwin-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/darwin-x64/-/darwin-x64-0.18.20.tgz",
"integrity": "sha512-pc5gxlMDxzm513qPGbCbDukOdsGtKhfxD1zJKXjCCcU7ju50O7MeAZ8c4krSJcOIJGFR+qx21yMMVYwiQvyTyQ==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"darwin"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/freebsd-arm64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/freebsd-arm64/-/freebsd-arm64-0.18.20.tgz",
"integrity": "sha512-yqDQHy4QHevpMAaxhhIwYPMv1NECwOvIpGCZkECn8w2WFHXjEwrBn3CeNIYsibZ/iZEUemj++M26W3cNR5h+Tw==",
"cpu": [
"arm64"
],
"dev": true,
"optional": true,
"os": [
"freebsd"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/freebsd-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/freebsd-x64/-/freebsd-x64-0.18.20.tgz",
"integrity": "sha512-tgWRPPuQsd3RmBZwarGVHZQvtzfEBOreNuxEMKFcd5DaDn2PbBxfwLcj4+aenoh7ctXcbXmOQIn8HI6mCSw5MQ==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"freebsd"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-arm": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-arm/-/linux-arm-0.18.20.tgz",
"integrity": "sha512-/5bHkMWnq1EgKr1V+Ybz3s1hWXok7mDFUMQ4cG10AfW3wL02PSZi5kFpYKrptDsgb2WAJIvRcDm+qIvXf/apvg==",
"cpu": [
"arm"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-arm64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-arm64/-/linux-arm64-0.18.20.tgz",
"integrity": "sha512-2YbscF+UL7SQAVIpnWvYwM+3LskyDmPhe31pE7/aoTMFKKzIc9lLbyGUpmmb8a8AixOL61sQ/mFh3jEjHYFvdA==",
"cpu": [
"arm64"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-ia32": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-ia32/-/linux-ia32-0.18.20.tgz",
"integrity": "sha512-P4etWwq6IsReT0E1KHU40bOnzMHoH73aXp96Fs8TIT6z9Hu8G6+0SHSw9i2isWrD2nbx2qo5yUqACgdfVGx7TA==",
"cpu": [
"ia32"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-loong64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-loong64/-/linux-loong64-0.18.20.tgz",
"integrity": "sha512-nXW8nqBTrOpDLPgPY9uV+/1DjxoQ7DoB2N8eocyq8I9XuqJ7BiAMDMf9n1xZM9TgW0J8zrquIb/A7s3BJv7rjg==",
"cpu": [
"loong64"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-mips64el": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-mips64el/-/linux-mips64el-0.18.20.tgz",
"integrity": "sha512-d5NeaXZcHp8PzYy5VnXV3VSd2D328Zb+9dEq5HE6bw6+N86JVPExrA6O68OPwobntbNJ0pzCpUFZTo3w0GyetQ==",
"cpu": [
"mips64el"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-ppc64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-ppc64/-/linux-ppc64-0.18.20.tgz",
"integrity": "sha512-WHPyeScRNcmANnLQkq6AfyXRFr5D6N2sKgkFo2FqguP44Nw2eyDlbTdZwd9GYk98DZG9QItIiTlFLHJHjxP3FA==",
"cpu": [
"ppc64"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-riscv64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-riscv64/-/linux-riscv64-0.18.20.tgz",
"integrity": "sha512-WSxo6h5ecI5XH34KC7w5veNnKkju3zBRLEQNY7mv5mtBmrP/MjNBCAlsM2u5hDBlS3NGcTQpoBvRzqBcRtpq1A==",
"cpu": [
"riscv64"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-s390x": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-s390x/-/linux-s390x-0.18.20.tgz",
"integrity": "sha512-+8231GMs3mAEth6Ja1iK0a1sQ3ohfcpzpRLH8uuc5/KVDFneH6jtAJLFGafpzpMRO6DzJ6AvXKze9LfFMrIHVQ==",
"cpu": [
"s390x"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/linux-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/linux-x64/-/linux-x64-0.18.20.tgz",
"integrity": "sha512-UYqiqemphJcNsFEskc73jQ7B9jgwjWrSayxawS6UVFZGWrAAtkzjxSqnoclCXxWtfwLdzU+vTpcNYhpn43uP1w==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"linux"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/netbsd-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/netbsd-x64/-/netbsd-x64-0.18.20.tgz",
"integrity": "sha512-iO1c++VP6xUBUmltHZoMtCUdPlnPGdBom6IrO4gyKPFFVBKioIImVooR5I83nTew5UOYrk3gIJhbZh8X44y06A==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"netbsd"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/openbsd-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/openbsd-x64/-/openbsd-x64-0.18.20.tgz",
"integrity": "sha512-e5e4YSsuQfX4cxcygw/UCPIEP6wbIL+se3sxPdCiMbFLBWu0eiZOJ7WoD+ptCLrmjZBK1Wk7I6D/I3NglUGOxg==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"openbsd"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/sunos-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/sunos-x64/-/sunos-x64-0.18.20.tgz",
"integrity": "sha512-kDbFRFp0YpTQVVrqUd5FTYmWo45zGaXe0X8E1G/LKFC0v8x0vWrhOWSLITcCn63lmZIxfOMXtCfti/RxN/0wnQ==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"sunos"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/win32-arm64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/win32-arm64/-/win32-arm64-0.18.20.tgz",
"integrity": "sha512-ddYFR6ItYgoaq4v4JmQQaAI5s7npztfV4Ag6NrhiaW0RrnOXqBkgwZLofVTlq1daVTQNhtI5oieTvkRPfZrePg==",
"cpu": [
"arm64"
],
"dev": true,
"optional": true,
"os": [
"win32"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/win32-ia32": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/win32-ia32/-/win32-ia32-0.18.20.tgz",
"integrity": "sha512-Wv7QBi3ID/rROT08SABTS7eV4hX26sVduqDOTe1MvGMjNd3EjOz4b7zeexIR62GTIEKrfJXKL9LFxTYgkyeu7g==",
"cpu": [
"ia32"
],
"dev": true,
"optional": true,
"os": [
"win32"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@esbuild/win32-x64": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/@esbuild/win32-x64/-/win32-x64-0.18.20.tgz",
"integrity": "sha512-kTdfRcSiDfQca/y9QIkng02avJ+NCaQvrMejlsB3RRv5sE9rRoeBPISaZpKxHELzRxZyLvNts1P27W3wV+8geQ==",
"cpu": [
"x64"
],
"dev": true,
"optional": true,
"os": [
"win32"
],
"engines": {
"node": ">=12"
}
},
"node_modules/@types/node": {
"version": "20.9.0",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.9.0.tgz",
"integrity": "sha512-nekiGu2NDb1BcVofVcEKMIwzlx4NjHlcjhoxxKBNLtz15Y1z7MYf549DFvkHSId02Ax6kGwWntIBPC3l/JZcmw==",
"dependencies": {
"undici-types": "~5.26.4"
}
},
"node_modules/buffer-from": {
"version": "1.1.2",
"resolved": "https://registry.npmjs.org/buffer-from/-/buffer-from-1.1.2.tgz",
"integrity": "sha512-E+XQCRwSbaaiChtv6k6Dwgc+bx+Bs6vuKJHHl5kox/BaKbhiXzqQOwK4cO22yElGp2OCmjwVhT3HmxgyPGnJfQ==",
"dev": true
},
"node_modules/esbuild": {
"version": "0.18.20",
"resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.18.20.tgz",
"integrity": "sha512-ceqxoedUrcayh7Y7ZX6NdbbDzGROiyVBgC4PriJThBKSVPWnnFHZAkfI1lJT8QFkOwH4qOS2SJkS4wvpGl8BpA==",
"dev": true,
"hasInstallScript": true,
"bin": {
"esbuild": "bin/esbuild"
},
"engines": {
"node": ">=12"
},
"optionalDependencies": {
"@esbuild/android-arm": "0.18.20",
"@esbuild/android-arm64": "0.18.20",
"@esbuild/android-x64": "0.18.20",
"@esbuild/darwin-arm64": "0.18.20",
"@esbuild/darwin-x64": "0.18.20",
"@esbuild/freebsd-arm64": "0.18.20",
"@esbuild/freebsd-x64": "0.18.20",
"@esbuild/linux-arm": "0.18.20",
"@esbuild/linux-arm64": "0.18.20",
"@esbuild/linux-ia32": "0.18.20",
"@esbuild/linux-loong64": "0.18.20",
"@esbuild/linux-mips64el": "0.18.20",
"@esbuild/linux-ppc64": "0.18.20",
"@esbuild/linux-riscv64": "0.18.20",
"@esbuild/linux-s390x": "0.18.20",
"@esbuild/linux-x64": "0.18.20",
"@esbuild/netbsd-x64": "0.18.20",
"@esbuild/openbsd-x64": "0.18.20",
"@esbuild/sunos-x64": "0.18.20",
"@esbuild/win32-arm64": "0.18.20",
"@esbuild/win32-ia32": "0.18.20",
"@esbuild/win32-x64": "0.18.20"
}
},
"node_modules/fsevents": {
"version": "2.3.3",
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-2.3.3.tgz",
"integrity": "sha512-5xoDfX+fL7faATnagmWPpbFtwh/R77WmMMqqHGS65C3vvB0YHrgF+B1YmZ3441tMj5n63k0212XNoJwzlhffQw==",
"dev": true,
"hasInstallScript": true,
"optional": true,
"os": [
"darwin"
],
"engines": {
"node": "^8.16.0 || ^10.6.0 || >=11.0.0"
}
},
"node_modules/get-tsconfig": {
"version": "4.7.2",
"resolved": "https://registry.npmjs.org/get-tsconfig/-/get-tsconfig-4.7.2.tgz",
"integrity": "sha512-wuMsz4leaj5hbGgg4IvDU0bqJagpftG5l5cXIAvo8uZrqn0NJqwtfupTN00VnkQJPcIRrxYrm1Ue24btpCha2A==",
"dev": true,
"dependencies": {
"resolve-pkg-maps": "^1.0.0"
},
"funding": {
"url": "https://github.com/privatenumber/get-tsconfig?sponsor=1"
}
},
"node_modules/ollama-node": {
"version": "0.1.27",
"resolved": "https://registry.npmjs.org/ollama-node/-/ollama-node-0.1.27.tgz",
"integrity": "sha512-tFABPf5P0sXCR5USA31E3tqbge5h/4uf/t5j8/rPvHDo0SDwXeN0kah2J7hIqqkYlO1vLRs0uLC1/Mprgv9t2g==",
"dependencies": {
"@types/node": "^20.8.4"
}
},
"node_modules/resolve-pkg-maps": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/resolve-pkg-maps/-/resolve-pkg-maps-1.0.0.tgz",
"integrity": "sha512-seS2Tj26TBVOC2NIc2rOe2y2ZO7efxITtLZcGSOnHHNOQ7CkiUBfw0Iw2ck6xkIhPwLhKNLS8BO+hEpngQlqzw==",
"dev": true,
"funding": {
"url": "https://github.com/privatenumber/resolve-pkg-maps?sponsor=1"
}
},
"node_modules/source-map": {
"version": "0.6.1",
"resolved": "https://registry.npmjs.org/source-map/-/source-map-0.6.1.tgz",
"integrity": "sha512-UjgapumWlbMhkBgzT7Ykc5YXUT46F0iKu8SGXq0bcwP5dz/h0Plj6enJqjz1Zbq2l5WaqYnrVbwWOWMyF3F47g==",
"dev": true,
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/source-map-support": {
"version": "0.5.21",
"resolved": "https://registry.npmjs.org/source-map-support/-/source-map-support-0.5.21.tgz",
"integrity": "sha512-uBHU3L3czsIyYXKX88fdrGovxdSCoTGDRZ6SYXtSRxLZUzHg5P/66Ht6uoUlHu9EZod+inXhKo3qQgwXUT/y1w==",
"dev": true,
"dependencies": {
"buffer-from": "^1.0.0",
"source-map": "^0.6.0"
}
},
"node_modules/tsx": {
"version": "4.1.2",
"resolved": "https://registry.npmjs.org/tsx/-/tsx-4.1.2.tgz",
"integrity": "sha512-1spM1bFV6MP2s4tO4tDC7g52fsaFdtEWdO4GfGdqi20qUgPbnAJqixOyIAvCSx1DDj3YIUB4CD06owTWUsOAuQ==",
"dev": true,
"dependencies": {
"esbuild": "~0.18.20",
"get-tsconfig": "^4.7.2",
"source-map-support": "^0.5.21"
},
"bin": {
"tsx": "dist/cli.mjs"
},
"engines": {
"node": ">=18.0.0"
},
"optionalDependencies": {
"fsevents": "~2.3.3"
}
},
"node_modules/typescript": {
"version": "5.2.2",
"resolved": "https://registry.npmjs.org/typescript/-/typescript-5.2.2.tgz",
"integrity": "sha512-mI4WrpHsbCIcwT9cF4FZvr80QUeKvsUsUvKDoR+X/7XHQH98xYD8YHZg7ANtz2GtZt/CBq2QJ0thkGJMHfqc1w==",
"dev": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
},
"engines": {
"node": ">=14.17"
}
},
"node_modules/undici-types": {
"version": "5.26.5",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-5.26.5.tgz",
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA=="
}
}
}

View File

@@ -1,9 +0,0 @@
{
"dependencies": {
"ollama-node": "^0.1.27"
},
"devDependencies": {
"tsx": "^4.1.2",
"typescript": "^5.2.2"
}
}

View File

@@ -1,28 +0,0 @@
# Function calling
![function calling 2023-11-16 16_12_58](https://github.com/jmorganca/ollama/assets/633681/a0acc247-9746-45ab-b325-b65dfbbee4fb)
One of the features added to some models is 'function calling'. The name is a little confusing: it does not mean the model can call functions itself. Function calling simply means that the model's output is formatted in JSON, following a preconfigured schema and using the expected types, so that your code can take that output and call functions with it. Using the JSON format in Ollama, you can use any model for function calling.
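To make that concrete, here is a minimal sketch of the pattern the examples in this directory follow, using the same `ollama-node` calls they rely on (`setModel`, `setSystemPrompt`, `setJSONFormat`, `generate`); the schema and prompt are purely illustrative.
```typescript
import { Ollama } from "ollama-node";

async function extract() {
  const ollama = new Ollama();
  await ollama.setModel("neural-chat");
  ollama.setSystemPrompt("You will be given a prompt and a schema. Respond only with JSON that follows the schema.");
  // the equivalent of setting 'format: json' in the API
  ollama.setJSONFormat(true);

  // an illustrative schema; the model fills in these fields
  const schema = { name: { type: "string" }, title: { type: "string" } };
  const prompt = `Extract the name and title of the main character.\n\nSchema:\n${JSON.stringify(schema, null, 2)}\n\nSource Text:\nCaptain Jean-Luc Picard commands the Enterprise.`;

  const data = await ollama.generate(prompt);
  // the JSON text comes back in data.output, ready to hand to your own functions
  console.log(JSON.parse(data.output));
}

extract();
```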
The two examples provided can extract information out of the provided texts. The first example uses the first couple of chapters from War and Peace by Lev Nikolayevich Tolstoy, and extracts the names and titles of the characters introduced in the story. The second example uses a more complicated schema to pull out addresses and event information from a series of emails.
## Running the examples
1. Clone this repo and navigate to the `examples/typescript-functioncalling` directory.
2. Install the dependencies with `npm install`.
3. Review the `wp.txt` file.
4. Run `tsx extractwp.ts`.
5. Review the `info.txt` file.
6. Run `tsx extractemail.ts`.
## Review the Code
Both examples do roughly the same thing with different source material. They both use the same system prompt, which tells the model to expect some instructions and a schema. Then we inject the schema into the prompt and generate an answer.
The first example, `extractwp.ts`, outputs the resulting JSON to the console, listing the characters introduced at the start of War and Peace. The second example, `extractemail.ts`, is a bit more complicated, extracting two different types of information: addresses and events. It outputs the results to a JSON blob, then the addresses are handed off to one function called `reportAddresses` and the events are handed off to another function called `reportEvents`.
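The `reportAddresses` and `reportEvents` helpers themselves aren't central to the example; a hypothetical version might just print whatever fields the model filled in, for instance:
```typescript
// hypothetical reporting helpers matching how extractemail.ts calls them
function reportEvents(name: string, date: string, location: string) {
  console.log(`${date}: ${name} at ${location}`);
}

function reportAddresses(address: Record<string, string | null>) {
  // print whichever of the schema's address fields the model filled in
  for (const [field, value] of Object.entries(address)) {
    if (value) console.log(`${field}: ${value}`);
  }
}
```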
Notice that both examples are using the model from Intel called `neural-chat`. This is not a model tuned for function calling, yet it performs very well at this task.
## Next Steps
Try exporting some of your real emails to the input file and seeing how well the model does. Try pointing the first example at other books. You could even have it cycle through all the sections, add up how many times each character is mentioned throughout the book, and use that to determine the most important characters. You can also try out different models.
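For example, a rough sketch of the character-counting idea might look like this (it assumes you already have the extracted names and the full book text in memory):
```typescript
// tally how often each extracted character name appears in the source text
function countMentions(text: string, names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of names) {
    counts[name] = text.split(name).length - 1;
  }
  return counts;
}
```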

View File

@@ -1,183 +0,0 @@
"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don't tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist - I really believe he is Antichrist - I will have nothing more to do with you and you are no longer my friend, no longer my 'faithful slave,' as you call yourself! But how do you do? I see I have frightened you - sit down and tell me all the news."
It was in July, 1805, and the speaker was the well-known Anna Pavlovna Scherer, maid of honor and favorite of the Empress Marya Fedorovna. With these words she greeted Prince Vasili Kuragin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pavlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.
All her invitations without exception, written in French, and delivered by a scarlet-liveried footman that morning, ran as follows:
"If you have nothing better to do, Count (or Prince), and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10 - Annette Scherer."
"Heavens! what a virulent attack!" replied the prince, not in the least disconcerted by this reception. He had just entered, wearing an embroidered court uniform, knee breeches, and shoes, and had stars on his breast and a serene expression on his flat face. He spoke in that refined French in which our grandfathers not only spoke but thought, and with the gentle, patronizing intonation natural to a man of importance who had grown old in society and at court. He went up to Anna Pavlovna, kissed her hand, presenting to her his bald, scented, and shining head, and complacently seated himself on the sofa.
"First of all, dear friend, tell me how you are. Set your friend's mind at rest," said he without altering his tone, beneath the politeness and affected sympathy of which indifference and even irony could be discerned.
"Can one be well while suffering morally? Can one be calm in times like these if one has any feeling?" said Anna Pavlovna. "You are staying the whole evening, I hope?"
"And the fete at the English ambassador's? Today is Wednesday. I must put in an appearance there," said the prince. "My daughter is coming for me to take me there."
"I thought today's fete had been canceled. I confess all these festivities and fireworks are becoming wearisome."
"If they had known that you wished it, the entertainment would have been put off," said the prince, who, like a wound-up clock, by force of habit said things he did not even wish to be believed.
"Don't tease! Well, and what has been decided about Novosiltsev's dispatch? You know everything."
"What can one say about it?" replied the prince in a cold, listless tone. "What has been decided? They have decided that Buonaparte has burnt his boats, and I believe that we are ready to burn ours."
Prince Vasili always spoke languidly, like an actor repeating a stale part. Anna Pavlovna Scherer on the contrary, despite her forty years, overflowed with animation and impulsiveness. To be an enthusiast had become her social vocation and, sometimes even when she did not feel like it, she became enthusiastic in order not to disappoint the expectations of those who knew her. The subdued smile which, though it did not suit her faded features, always played round her lips expressed, as in a spoiled child, a continual consciousness of her charming defect, which she neither wished, nor could, nor considered it necessary, to correct.
In the midst of a conversation on political matters Anna Pavlovna burst out:
"Oh, don't speak to me of Austria. Perhaps I don't understand things, but Austria never has wished, and does not wish, for war. She is betraying us! Russia alone must save Europe. Our gracious sovereign recognizes his high vocation and will be true to it. That is the one thing I have faith in! Our good and wonderful sovereign has to perform the noblest role on earth, and he is so virtuous and noble that God will not forsake him. He will fulfill his vocation and crush the hydra of revolution, which has become more terrible than ever in the person of this murderer and villain! We alone must avenge the blood of the just one.... Whom, I ask you, can we rely on?... England with her commercial spirit will not and cannot understand the Emperor Alexander's loftiness of soul. She has refused to evacuate Malta. She wanted to find, and still seeks, some secret motive in our actions. What answer did Novosiltsev get? None. The English have not understood and cannot understand the self-abnegation of our Emperor who wants nothing for himself, but only desires the good of mankind. And what have they promised? Nothing! And what little they have promised they will not perform! Prussia has always declared that Buonaparte is invincible, and that all Europe is powerless before him.... And I don't believe a word that Hardenburg says, or Haugwitz either. This famous Prussian neutrality is just a trap. I have faith only in God and the lofty destiny of our adored monarch. He will save Europe!"
She suddenly paused, smiling at her own impetuosity.
"I think," said the prince with a smile, "that if you had been sent instead of our dear Wintzingerode you would have captured the King of Prussia's consent by assault. You are so eloquent. Will you give me a cup of tea?"
"In a moment. A propos," she added, becoming calm again, "I am expecting two very interesting men tonight, le Vicomte de Mortemart, who is connected with the Montmorencys through the Rohans, one of the best French families. He is one of the genuine emigres, the good ones. And also the Abbe Morio. Do you know that profound thinker? He has been received by the Emperor. Had you heard?"
"I shall be delighted to meet them," said the prince. "But tell me," he added with studied carelessness as if it had only just occurred to him, though the question he was about to ask was the chief motive of his visit, "is it true that the Dowager Empress wants Baron Funke to be appointed first secretary at Vienna? The baron by all accounts is a poor creature."
Prince Vasili wished to obtain this post for his son, but others were trying through the Dowager Empress Marya Fedorovna to secure it for the baron.
Anna Pavlovna almost closed her eyes to indicate that neither she nor anyone else had a right to criticize what the Empress desired or was pleased with.
"Baron Funke has been recommended to the Dowager Empress by her sister," was all she said, in a dry and mournful tone.
As she named the Empress, Anna Pavlovna's face suddenly assumed an expression of profound and sincere devotion and respect mingled with sadness, and this occurred every time she mentioned her illustrious patroness. She added that Her Majesty had deigned to show Baron Funke beaucoup d'estime, and again her face clouded over with sadness.
The prince was silent and looked indifferent. But, with the womanly and courtierlike quickness and tact habitual to her, Anna Pavlovna wished both to rebuke him (for daring to speak as he had done of a man recommended to the Empress) and at the same time to console him, so she said:
"Now about your family. Do you know that since your daughter came out everyone has been enraptured by her? They say she is amazingly beautiful."
The prince bowed to signify his respect and gratitude.
"I often think," she continued after a short pause, drawing nearer to the prince and smiling amiably at him as if to show that political and social topics were ended and the time had come for intimate conversation - "I often think how unfairly sometimes the joys of life are distributed. Why has fate given you two such splendid children? I don't speak of Anatole, your youngest. I don't like him," she added in a tone admitting of no rejoinder and raising her eyebrows. "Two such charming children. And really you appreciate them less than anyone, and so you don't deserve to have them."
And she smiled her ecstatic smile.
"I can't help it," said the prince. "Lavater would have said I lack the bump of paternity."
"Don't joke; I mean to have a serious talk with you. Do you know I am dissatisfied with your younger son? Between ourselves" (and her face assumed its melancholy expression), "he was mentioned at Her Majesty's and you were pitied...."
The prince answered nothing, but she looked at him significantly, awaiting a reply. He frowned.
"What would you have me do?" he said at last. "You know I did all a father could for their education, and they have both turned out fools. Hippolyte is at least a quiet fool, but Anatole is an active one. That is the only difference between them." He said this smiling in a way more natural and animated than usual, so that the wrinkles round his mouth very clearly revealed something unexpectedly coarse and unpleasant.
"And why are children born to such men as you? If you were not a father there would be nothing I could reproach you with," said Anna Pavlovna, looking up pensively.
"I am your faithful slave and to you alone I can confess that my children are the bane of my life. It is the cross I have to bear. That is how I explain it to myself. It can't be helped!"
He said no more, but expressed his resignation to cruel fate by a gesture. Anna Pavlovna meditated.
"Have you never thought of marrying your prodigal son Anatole?" she asked. "They say old maids have a mania for matchmaking, and though I don't feel that weakness in myself as yet, I know a little person who is very unhappy with her father. She is a relation of yours, Princess Mary Bolkonskaya."
Prince Vasili did not reply, though, with the quickness of memory and perception befitting a man of the world, he indicated by a movement of the head that he was considering this information.
"Do you know," he said at last, evidently unable to check the sad current of his thoughts, "that Anatole is costing me forty thousand rubles a year? And," he went on after a pause, "what will it be in five years, if he goes on like this?" Presently he added: "That's what we fathers have to put up with.... Is this princess of yours rich?"
"Her father is very rich and stingy. He lives in the country. He is the well-known Prince Bolkonski who had to retire from the army under the late Emperor, and was nicknamed 'the King of Prussia.' He is very clever but eccentric, and a bore. The poor girl is very unhappy. She has a brother; I think you know him, he married Lise Meinen lately. He is an aide-de-camp of Kutuzov's and will be here tonight."
"Listen, dear Annette," said the prince, suddenly taking Anna Pavlovna's hand and for some reason drawing it downwards. "Arrange that affair for me and I shall always be your most devoted slave-slafe with an f, as a village elder of mine writes in his reports. She is rich and of good family and that's all I want."
And with the familiarity and easy grace peculiar to him, he raised the maid of honor's hand to his lips, kissed it, and swung it to and fro as he lay back in his armchair, looking in another direction.
"Attendez," said Anna Pavlovna, reflecting, "I'll speak to Lise, young Bolkonski's wife, this very evening, and perhaps the thing can be arranged. It shall be on your family's behalf that I'll start my apprenticeship as old maid."
Anna Pavlovna's drawing room was gradually filling. The highest Petersburg society was assembled there: people differing widely in age and character but alike in the social circle to which they belonged. Prince Vasili's daughter, the beautiful Helene, came to take her father to the ambassador's entertainment; she wore a ball dress and her badge as maid of honor. The youthful little Princess Bolkonskaya, known as la femme la plus seduisante de Petersbourg, * was also there. She had been married during the previous winter, and being pregnant did not go to any large gatherings, but only to small receptions. Prince Vasili's son, Hippolyte, had come with Mortemart, whom he introduced. The Abbe Morio and many others had also come.
* The most fascinating woman in Petersburg.
To each new arrival Anna Pavlovna said, "You have not yet seen my aunt," or "You do not know my aunt?" and very gravely conducted him or her to a little old lady, wearing large bows of ribbon in her cap, who had come sailing in from another room as soon as the guests began to arrive; and slowly turning her eyes from the visitor to her aunt, Anna Pavlovna mentioned each one's name and then left them.
Each visitor performed the ceremony of greeting this old aunt whom not one of them knew, not one of them wanted to know, and not one of them cared about; Anna Pavlovna observed these greetings with mournful and solemn interest and silent approval. The aunt spoke to each of them in the same words, about their health and her own, and the health of Her Majesty, "who, thank God, was better today." And each visitor, though politeness prevented his showing impatience, left the old woman with a sense of relief at having performed a vexatious duty and did not return to her the whole evening.
The young Princess Bolkonskaya had brought some work in a gold-embroidered velvet bag. Her pretty little upper lip, on which a delicate dark down was just perceptible, was too short for her teeth, but it lifted all the more sweetly, and was especially charming when she occasionally drew it down to meet the lower lip. As is always the case with a thoroughly attractive woman, her defect - the shortness of her upper lip and her half-open mouth - seemed to be her own special and peculiar form of beauty. Everyone brightened at the sight of this pretty young woman, so soon to become a mother, so full of life and health, and carrying her burden so lightly. Old men and dull dispirited young ones who looked at her, after being in her company and talking to her a little while, felt as if they too were becoming, like her, full of life and health. All who talked to her, and at each word saw her bright smile and the constant gleam of her white teeth, thought that they were in a specially amiable mood that day.
The little princess went round the table with quick, short, swaying steps, her workbag on her arm, and gaily spreading out her dress sat down on a sofa near the silver samovar, as if all she was doing was a pleasure to herself and to all around her. "I have brought my work," said she in French, displaying her bag and addressing all present. "Mind, Annette, I hope you have not played a wicked trick on me," she added, turning to her hostess. "You wrote that it was to be quite a small reception, and just see how badly I am dressed." And she spread out her arms to show her short-waisted, lace-trimmed, dainty gray dress, girdled with a broad ribbon just below the breast.
"Soyez tranquille, Lise, you will always be prettier than anyone else," replied Anna Pavlovna.
"You know," said the princess in the same tone of voice and still in French, turning to a general, "my husband is deserting me? He is going to get himself killed. Tell me what this wretched war is for?" she added, addressing Prince Vasili, and without waiting for an answer she turned to speak to his daughter, the beautiful Helene.
"What a delightful woman this little princess is!" said Prince Vasili to Anna Pavlovna.
One of the next arrivals was a stout, heavily built young man with close-cropped hair, spectacles, the light-colored breeches fashionable at that time, a very high ruffle, and a brown dress coat. This stout young man was an illegitimate son of Count Bezukhov, a well-known grandee of Catherine's time who now lay dying in Moscow. The young man had not yet entered either the military or civil service, as he had only just returned from abroad where he had been educated, and this was his first appearance in society. Anna Pavlovna greeted him with the nod she accorded to the lowest hierarchy in her drawing room. But in spite of this lowest-grade greeting, a look of anxiety and fear, as at the sight of something too large and unsuited to the place, came over her face when she saw Pierre enter. Though he was certainly rather bigger than the other men in the room, her anxiety could only have reference to the clever though shy, but observant and natural, expression which distinguished him from everyone else in that drawing room.
"It is very good of you, Monsieur Pierre, to come and visit a poor invalid," said Anna Pavlovna, exchanging an alarmed glance with her aunt as she conducted him to her.
Pierre murmured something unintelligible, and continued to look round as if in search of something. On his way to the aunt he bowed to the little princess with a pleased smile, as to an intimate acquaintance.
Anna Pavlovna's alarm was justified, for Pierre turned away from the aunt without waiting to hear her speech about Her Majesty's health. Anna Pavlovna in dismay detained him with the words: "Do you know the Abbe Morio? He is a most interesting man."
"Yes, I have heard of his scheme for perpetual peace, and it is very interesting but hardly feasible."
"You think so?" rejoined Anna Pavlovna in order to say something and get away to attend to her duties as hostess. But Pierre now committed a reverse act of impoliteness. First he had left a lady before she had finished speaking to him, and now he continued to speak to another who wished to get away. With his head bent, and his big feet spread apart, he began explaining his reasons for thinking the abbe's plan chimerical.
"We will talk of it later," said Anna Pavlovna with a smile.
And having got rid of this young man who did not know how to behave, she resumed her duties as hostess and continued to listen and watch, ready to help at any point where the conversation might happen to flag. As the foreman of a spinning mill, when he has set the hands to work, goes round and notices here a spindle that has stopped or there one that creaks or makes more noise than it should, and hastens to check the machine or set it in proper motion, so Anna Pavlovna moved about her drawing room, approaching now a silent, now a too-noisy group, and by a word or slight rearrangement kept the conversational machine in steady, proper, and regular motion. But amid these cares her anxiety about Pierre was evident. She kept an anxious watch on him when he approached the group round Mortemart to listen to what was being said there, and again when he passed to another group whose center was the abbe.
Pierre had been educated abroad, and this reception at Anna Pavlovna's was the first he had attended in Russia. He knew that all the intellectual lights of Petersburg were gathered there and, like a child in a toyshop, did not know which way to look, afraid of missing any clever conversation that was to be heard. Seeing the self-confident and refined expression on the faces of those present he was always expecting to hear something very profound. At last he came up to Morio. Here the conversation seemed interesting and he stood waiting for an opportunity to express his own views, as young people are fond of doing.
CHAPTER III
Anna Pavlovna's reception was in full swing. The spindles hummed steadily and ceaselessly on all sides. With the exception of the aunt, beside whom sat only one elderly lady, who with her thin careworn face was rather out of place in this brilliant society, the whole company had settled into three groups. One, chiefly masculine, had formed round the abbe. Another, of young people, was grouped round the beautiful Princess Helene, Prince Vasili's daughter, and the little Princess Bolkonskaya, very pretty and rosy, though rather too plump for her age. The third group was gathered round Mortemart and Anna Pavlovna.
The vicomte was a nice-looking young man with soft features and polished manners, who evidently considered himself a celebrity but out of politeness modestly placed himself at the disposal of the circle in which he found himself. Anna Pavlovna was obviously serving him up as a treat to her guests. As a clever maitre d'hotel serves up as a specially choice delicacy a piece of meat that no one who had seen it in the kitchen would have cared to eat, so Anna Pavlovna served up to her guests, first the vicomte and then the abbe, as peculiarly choice morsels. The group about Mortemart immediately began discussing the murder of the Duc d'Enghien. The vicomte said that the Duc d'Enghien had perished by his own magnanimity, and that there were particular reasons for Buonaparte's hatred of him.
"Ah, yes! Do tell us all about it, Vicomte," said Anna Pavlovna, with a pleasant feeling that there was something A la Louis XV in the sound of that sentence: "Contez nous cela, Vicomte."
The vicomte bowed and smiled courteously in token of his willingness to comply. Anna Pavlovna arranged a group round him, inviting everyone to listen to his tale.
"The vicomte knew the duc personally," whispered Anna Pavlovna to one of the guests. "The vicomte is a wonderful raconteur," said she to another. "How evidently he belongs to the best society," said she to a third; and the vicomte was served up to the company in the choicest and most advantageous style, like a well-garnished joint of roast beef on a hot dish.
The vicomte wished to begin his story and gave a subtle smile.
"Come over here, Helene, dear," said Anna Pavlovna to the beautiful young princess who was sitting some way off, the center of another group.
The princess smiled. She rose with the same unchanging smile with which she had first entered the room - the smile of a perfectly beautiful woman. With a slight rustle of her white dress trimmed with moss and ivy, with a gleam of white shoulders, glossy hair, and sparkling diamonds, she passed between the men who made way for her, not looking at any of them but smiling on all, as if graciously allowing each the privilege of admiring her beautiful figure and shapely shoulders, back, and bosom - which in the fashion of those days were very much exposed - and she seemed to bring the glamour of a ballroom with her as she moved toward Anna Pavlovna. Helene was so lovely that not only did she not show any trace of coquetry, but on the contrary she even appeared shy of her unquestionable and all too victorious beauty. She seemed to wish, but to be unable, to diminish its effect.
"How lovely!" said everyone who saw her; and the vicomte lifted his shoulders and dropped his eyes as if startled by something extraordinary when she took her seat opposite and beamed upon him also with her unchanging smile.
"Madame, I doubt my ability before such an audience," said he, smilingly inclining his head.
The princess rested her bare round arm on a little table and considered a reply unnecessary. She smilingly waited. All the time the story was being told she sat upright, glancing now at her beautiful round arm, altered in shape by its pressure on the table, now at her still more beautiful bosom, on which she readjusted a diamond necklace. From time to time she smoothed the folds of her dress, and whenever the story produced an effect she glanced at Anna Pavlovna, at once adopted just the expression she saw on the maid of honor's face, and again relapsed into her radiant smile.
The little princess had also left the tea table and followed Helene.
"Wait a moment, I'll get my work.... Now then, what are you thinking of?" she went on, turning to Prince Hippolyte. "Fetch me my workbag."
There was a general movement as the princess, smiling and talking merrily to everyone at once, sat down and gaily arranged herself in her seat.
"Now I am all right," she said, and asking the vicomte to begin, she took up her work.
Prince Hippolyte, having brought the workbag, joined the circle and moving a chair close to hers seated himself beside her.
Le charmant Hippolyte was surprising by his extraordinary resemblance to his beautiful sister, but yet more by the fact that in spite of this resemblance he was exceedingly ugly. His features were like his sister's, but while in her case everything was lit up by a joyous, self-satisfied, youthful, and constant smile of animation, and by the wonderful classic beauty of her figure, his face on the contrary was dulled by imbecility and a constant expression of sullen self-confidence, while his body was thin and weak. His eyes, nose, and mouth all seemed puckered into a vacant, wearied grimace, and his arms and legs always fell into unnatural positions.
"It's not going to be a ghost story?" said he, sitting down beside the princess and hastily adjusting his lorgnette, as if without this instrument he could not begin to speak.
"Why no, my dear fellow," said the astonished narrator, shrugging his shoulders.
"Because I hate ghost stories," said Prince Hippolyte in a tone which showed that he only understood the meaning of his words after he had uttered them.
He spoke with such self-confidence that his hearers could not be sure whether what he said was very witty or very stupid. He was dressed in a dark-green dress coat, knee breeches of the color of cuisse de nymphe effrayee, as he called it, shoes, and silk stockings.
The vicomte told his tale very neatly. It was an anecdote, then current, to the effect that the Duc d'Enghien had gone secretly to Paris to visit Mademoiselle George; that at her house he came upon Bonaparte, who also enjoyed the famous actress' favors, and that in his presence Napoleon happened to fall into one of the fainting fits to which he was subject, and was thus at the duc's mercy. The latter spared him, and this magnanimity Bonaparte subsequently repaid by death.
The story was very pretty and interesting, especially at the point where the rivals suddenly recognized one another; and the ladies looked agitated.
"Charming!" said Anna Pavlovna with an inquiring glance at the little princess.
"Charming!" whispered the little princess, sticking the needle into her work as if to testify that the interest and fascination of the story prevented her from going on with it.
The vicomte appreciated this silent praise and smiling gratefully prepared to continue, but just then Anna Pavlovna, who had kept a watchful eye on the young man who so alarmed her, noticed that he was talking too loudly and vehemently with the abbe, so she hurried to the rescue. Pierre had managed to start a conversation with the abbe about the balance of power, and the latter, evidently interested by the young man's simple-minded eagerness, was explaining his pet theory. Both were talking and listening too eagerly and too naturally, which was why Anna Pavlovna disapproved.
"The means are ... the balance of power in Europe and the rights of the people," the abbe was saying. "It is only necessary for one powerful nation like Russia - barbaric as she is said to be - to place herself disinterestedly at the head of an alliance having for its object the maintenance of the balance of power of Europe, and it would save the world!"
"But how are you to get that balance?" Pierre was beginning.
At that moment Anna Pavlovna came up and, looking severely at Pierre, asked the Italian how he stood Russian climate. The Italian's face instantly changed and assumed an offensively affected, sugary expression, evidently habitual to him when conversing with women.
"I am so enchanted by the brilliancy of the wit and culture of the society, more especially of the feminine society, in which I have had the honor of being received, that I have not yet had time to think of the climate," said he.
Not letting the abbe and Pierre escape, Anna Pavlovna, the more conveniently to keep them under observation, brought them into the larger circle.

View File

@@ -1,77 +0,0 @@
import * as readline from "readline";
const model = "llama2";
type Message = {
role: "assistant" | "user" | "system";
content: string;
}
const messages: Message[] = [{
role: "system",
content: "You are a helpful AI agent."
}]
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout
})
async function chat(messages: Message[]): Promise<Message> {
const body = {
model: model,
messages: messages
}
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
body: JSON.stringify(body)
})
const reader = response.body?.getReader()
if (!reader) {
throw new Error("Failed to read response body")
}
let content = ""
while (true) {
const { done, value } = await reader.read()
if (done) {
break;
}
const rawjson = new TextDecoder().decode(value);
// a chunk can contain more than one newline-delimited JSON object
for (const line of rawjson.trim().split("\n")) {
const json = JSON.parse(line)
if (json.done === false) {
process.stdout.write(json.message.content);
content += json.message.content
}
}
}
return { role: "assistant", content: content };
}
async function askQuestion(): Promise<void> {
return new Promise<void>((resolve) => {
rl.question("\n\nAsk a question: (press enter alone to quit)\n\n", async (user_input) => {
if (user_input.trim() === "") {
rl.close();
console.log("Thankyou. Goodbye.\n")
console.log("=======\nHere is the message history that was used in this conversation.\n=======\n")
messages.forEach(message => {
console.log(message)
})
resolve();
} else {
console.log();
messages.push({ role: "user", content: user_input });
messages.push(await chat(messages));
await askQuestion(); // Ask the next question
}
});
});
}
async function main() {
await askQuestion();
}
main();

View File

@@ -1 +0,0 @@
{ "dependencies": { "@types/node": "^20.10.4", "prompt-sync": "^4.2.0", "readline": "^1.3.0" } }

View File

@@ -1,39 +0,0 @@
# Simple Chat Example
The **chat** endpoint is one of two ways to generate text from an LLM with Ollama. At a high level, you provide the endpoint with an array of message objects, each with a role and content. Then, with each prompt and response, you append more messages, building up the conversation history.
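For reference, the body you send to the endpoint is just the model name plus the running message history; the later user and assistant entries below are illustrative:
```typescript
// an illustrative /api/chat request body
const body = {
  model: "llama2",
  messages: [
    { role: "system", content: "You are a helpful AI agent." },
    { role: "user", content: "Why is the sky blue?" },
    { role: "assistant", content: "Because shorter wavelengths scatter more in the atmosphere." },
    { role: "user", content: "Does that change at sunset?" }
  ]
};
```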
## Run the Example
There are a few ways to run this, just like any TypeScript code:
1. Compile with `tsc` and then run it with `node client.js`.
2. Install `tsx` and run it with `tsx client.ts`.
3. Install `bun` and run it with `bun client.ts`.
## Review the Code
You can see in the **chat** function that actually calling the endpoint is done simply with:
```typescript
const body = {
model: model,
messages: messages
}
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
body: JSON.stringify(body)
})
```
With the **generate** endpoint, you need to provide a `prompt`. But with **chat**, you provide `messages`. And the resulting stream of responses includes a `message` object with a `content` field.
The final JSON object doesn't provide the full content, so you will need to build the content yourself. In this example, **chat** takes the full array of messages and returns the resulting message from this call to the chat endpoint.
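Each streamed chunk is a newline-delimited JSON object; the fields the client reads look roughly like this (the content value is illustrative):
```typescript
// one streamed chunk as client.ts sees it
const chunk = {
  model: "llama2",
  message: { role: "assistant", content: "Hello" },
  done: false
};
```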
In the **askQuestion** function, we collect `user_input`, add it to the messages array as a user message, and pass the whole array to the **chat** function. When the LLM finishes responding, its output is appended to the messages array as another message.
At the end, you will see a printout of all the messages.
## Next Steps
In this example, all generations are kept. You might want to experiment with summarizing everything older than the last ten exchanges so that longer conversations use less context.
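One possible sketch, reusing the `Message` type and `chat` function from `client.ts` (the ten-message cutoff and the summary prompt are arbitrary choices):
```typescript
// condense everything older than the last ten messages into a single system note
async function compactHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length <= 10) return messages;
  const older = messages.slice(0, messages.length - 10);
  const summary = await chat([
    { role: "system", content: "Summarize the following conversation in a few sentences." },
    { role: "user", content: older.map((m) => `${m.role}: ${m.content}`).join("\n") }
  ]);
  return [
    { role: "system", content: `Earlier conversation summary: ${summary.content}` },
    ...messages.slice(-10)
  ];
}
```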

View File

@@ -1,47 +1,23 @@
package format
import (
"fmt"
"math"
)
import "fmt"
const (
Byte = 1
KiloByte = Byte * 1000
MegaByte = KiloByte * 1000
GigaByte = MegaByte * 1000
TeraByte = GigaByte * 1000
)
func HumanBytes(b int64) string {
var value float64
var unit string
switch {
case b >= TeraByte:
value = float64(b) / TeraByte
unit = "TB"
case b >= GigaByte:
value = float64(b) / GigaByte
unit = "GB"
case b >= MegaByte:
value = float64(b) / MegaByte
unit = "MB"
case b >= KiloByte:
value = float64(b) / KiloByte
unit = "KB"
case b > GigaByte:
return fmt.Sprintf("%d GB", b/GigaByte)
case b > MegaByte:
return fmt.Sprintf("%d MB", b/MegaByte)
case b > KiloByte:
return fmt.Sprintf("%d KB", b/KiloByte)
default:
return fmt.Sprintf("%d B", b)
}
switch {
case value >= 100:
return fmt.Sprintf("%d %s", int(value), unit)
case value >= 10:
return fmt.Sprintf("%d %s", int(value), unit)
case value != math.Trunc(value):
return fmt.Sprintf("%.1f %s", value, unit)
default:
return fmt.Sprintf("%d %s", int(value), unit)
}
}

View File

@@ -1,25 +0,0 @@
package format
import (
"fmt"
"math"
)
const (
Thousand = 1000
Million = Thousand * 1000
Billion = Million * 1000
)
func HumanNumber(b uint64) string {
switch {
case b > Billion:
return fmt.Sprintf("%.0fB", math.Round(float64(b)/Billion))
case b > Million:
return fmt.Sprintf("%.0fM", math.Round(float64(b)/Million))
case b > Thousand:
return fmt.Sprintf("%.0fK", math.Round(float64(b)/Thousand))
default:
return fmt.Sprintf("%d", b)
}
}

23
go.mod
View File

@@ -3,20 +3,17 @@ module github.com/jmorganca/ollama
go 1.20
require (
github.com/emirpasic/gods v1.18.1
github.com/dustin/go-humanize v1.0.1
github.com/gin-gonic/gin v1.9.1
github.com/mattn/go-runewidth v0.0.14
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db
github.com/olekukonko/tablewriter v0.0.5
github.com/pdevine/readline v1.5.2
github.com/spf13/cobra v1.7.0
github.com/stretchr/testify v1.8.3
golang.org/x/sync v0.3.0
)
require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/mattn/go-runewidth v0.0.14 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rivo/uniseg v0.2.0 // indirect
)
require github.com/rivo/uniseg v0.2.0 // indirect
require (
github.com/bytedance/sonic v1.9.1 // indirect
@@ -42,12 +39,12 @@ require (
github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
github.com/ugorji/go/codec v1.2.11 // indirect
golang.org/x/arch v0.3.0 // indirect
golang.org/x/crypto v0.14.0
golang.org/x/crypto v0.10.0
golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.13.0 // indirect
golang.org/x/term v0.13.0
golang.org/x/text v0.13.0 // indirect
golang.org/x/net v0.10.0 // indirect
golang.org/x/sys v0.11.0 // indirect
golang.org/x/term v0.10.0
golang.org/x/text v0.10.0 // indirect
google.golang.org/protobuf v1.30.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)

33
go.sum
View File

@@ -4,13 +4,17 @@ github.com/bytedance/sonic v1.9.1/go.mod h1:i736AoUSYt75HyZLoJW9ERYxcy6eaN6h4BZX
github.com/chenzhuoyu/base64x v0.0.0-20211019084208-fb5309c8db06/go.mod h1:DH46F32mSOjUmXrMHnKwZdA8wcEefY7UVqBKYGjpdQY=
github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311 h1:qSGYFH7+jGhDF8vLC+iwCD4WpbV1EBDSzWkJODFLams=
github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311/go.mod h1:b583jCggY9gE99b6G5LEC39OIiVsWj+R97kbl5odCEk=
github.com/chzyer/logex v1.2.1 h1:XHDu3E6q+gdHgsdTPH6ImJMIp436vR6MPtH8gP05QzM=
github.com/chzyer/logex v1.2.1/go.mod h1:JLbx6lG2kDbNRFnfkgvh4eRJRPX1QCoOIWomwysCBrQ=
github.com/chzyer/test v1.0.0 h1:p3BQDXSxOhOG0P9z6/hGnII4LGiEPOYBhs8asl/fC04=
github.com/chzyer/test v1.0.0/go.mod h1:2JlltgoNkt4TW/z9V/IzDdFaMTM2JPIi26O1pF38GC8=
github.com/cpuguy83/go-md2man/v2 v2.0.2/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o=
github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/emirpasic/gods v1.18.1 h1:FXtiHYKDGKCW2KzwZKx0iC0PQmdlorYgdFG9jPXJ1Bc=
github.com/emirpasic/gods v1.18.1/go.mod h1:8tpGGwCnJ5H4r6BWwaV6OrWmMoPhUl5jm/FMNAnJvWQ=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/gabriel-vasile/mimetype v1.4.2 h1:w5qFW6JKBz9Y393Y4q372O9A7cUSequkh1Q7OhCmWKU=
github.com/gabriel-vasile/mimetype v1.4.2/go.mod h1:zApsH/mKG4w07erKIaJPFiX0Tsq9BFQgN3qGY5GnNgA=
github.com/gin-contrib/cors v1.4.0 h1:oJ6gwtUl3lqV0WEIwM/LxPF1QZ5qe2lGWdY2+bz7y0g=
@@ -63,6 +67,8 @@ github.com/mattn/go-isatty v0.0.19/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D
github.com/mattn/go-runewidth v0.0.9/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m2gUSrubnMI=
github.com/mattn/go-runewidth v0.0.14 h1:+xnbZSEeDbOIg5/mE6JF0w6n9duR1l3/WmbinWVwUuU=
github.com/mattn/go-runewidth v0.0.14/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw=
github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg=
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
@@ -72,6 +78,8 @@ github.com/olekukonko/tablewriter v0.0.5 h1:P2Ga83D34wi1o9J6Wh1mRuqd4mF/x/lgBS7N
github.com/olekukonko/tablewriter v0.0.5/go.mod h1:hPp6KlRPjbx+hW8ykQs1w3UBbZlj6HuIJcUGPhkA7kY=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58 h1:onHthvaw9LFnH4t2DcNVpwGmV9E1BkGknEliJkfwQj0=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58/go.mod h1:DXv8WO4yhMYhSNPKjeNKa5WY9YCIEBRbNzFFPJbWO6Y=
github.com/pdevine/readline v1.5.2 h1:oz6Y5GdTmhPG+08hhxcAvtHitSANWuA2100Sppb38xI=
github.com/pdevine/readline v1.5.2/go.mod h1:na/LbuE5PYwxI7GyopWdIs3U8HVe89lYlNTFTXH3wOw=
github.com/pelletier/go-toml/v2 v2.0.1/go.mod h1:r9LEWfGN8R5k0VXJ+0BkIe7MYkRdwZOjgMj2KwnJFUo=
github.com/pelletier/go-toml/v2 v2.0.8 h1:0ctb6s9mE31h0/lhu+J6OPmVeDxJn+kYnJc2jZR9tGQ=
github.com/pelletier/go-toml/v2 v2.0.8/go.mod h1:vuYfssBdrU2XDZ9bYydBu6t+6a6PYNcZljzZR9VXg+4=
@@ -110,30 +118,31 @@ golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUu
golang.org/x/arch v0.3.0 h1:02VY4/ZcO/gBOH6PUaoiptASxtXU10jazRCP865E97k=
golang.org/x/arch v0.3.0/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
golang.org/x/crypto v0.0.0-20210711020723-a769d52b0f97/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/crypto v0.14.0 h1:wBqGXzWJW6m1XrIKlAH0Hs1JJ7+9KBwnIO8v66Q9cHc=
golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4=
golang.org/x/crypto v0.10.0 h1:LKqV2xt9+kDzSTfOhx4FrkEBcMrAgHSYgzywV9zcGmM=
golang.org/x/crypto v0.10.0/go.mod h1:o4eNf7Ede1fv+hwOwZsTHl9EsPFO6q6ZvYR8vYfY45I=
golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63 h1:m64FZMko/V45gv0bNmrNYoDEq8U5YUhetc9cBWKS1TQ=
golang.org/x/exp v0.0.0-20230817173708-d852ddb80c63/go.mod h1:0v4NqG35kSWCMzLaMeX+IQrlSnVE/bqGSyC2cz/9Le8=
golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM=
golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE=
golang.org/x/net v0.10.0 h1:X2//UzNDwYmtCLn7To6G58Wr6f5ahEAQgKNzv9Y951M=
golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
golang.org/x/sync v0.3.0 h1:ftCYgMx6zT/asHUrPw8BLLscYtGznsLAnjq5RH9P66E=
golang.org/x/sync v0.3.0/go.mod h1:FU7BRWz2tNW+3quACPkgCx/L+uEAv1htQ0V83Z9Rj+Y=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210806184541-e5e7981a1069/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220310020820-b874c991c1a5/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220704084225-05e143d24a9e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.13.0 h1:Af8nKPmuFypiUBjVoU9V20FiaFXOcuZI21p0ycVYYGE=
golang.org/x/sys v0.13.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.11.0 h1:eG7RXZHdqOJ1i+0lgLgCpSXAp6M3LYlAo6osgSi0xOM=
golang.org/x/sys v0.11.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.13.0 h1:bb+I9cTfFazGW51MZqBVmZy7+JEJMouUHTUSKVQLBek=
golang.org/x/term v0.13.0/go.mod h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U=
golang.org/x/term v0.10.0 h1:3R7pNqamzBraeqj/Tj8qt1aQ2HpmlC+Cx/qL/7hn4/c=
golang.org/x/term v0.10.0/go.mod h1:lpqdcUyK/oCiQxvxVrppt5ggO2KCZ5QblwqPnfZ6d5o=
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k=
golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
golang.org/x/text v0.10.0 h1:UpjohKhiEgNc0CSauXmwYftY1+LlaC75SJwh0SgCX58=
golang.org/x/text v0.10.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=

20
llm/falcon.go Normal file
View File

@@ -0,0 +1,20 @@
package llm
const (
falconModelType7B = 32
falconModelType40B = 60
falconModelType180B = 80
)
func falconModelType(numLayer uint32) string {
switch numLayer {
case 32:
return "7B"
case 60:
return "40B"
case 80:
return "180B"
default:
return "unknown"
}
}

View File

@@ -7,10 +7,9 @@ import (
)
type GGML struct {
magic uint32
container
model
Size int64
}
const (
@@ -83,7 +82,7 @@ type model interface {
type container interface {
Name() string
Decode(*readSeekOffset) (model, error)
Decode(io.Reader) (model, error)
}
type containerGGML struct{}
@@ -92,9 +91,7 @@ func (c *containerGGML) Name() string {
return "ggml"
}
func (c *containerGGML) Decode(ro *readSeekOffset) (model, error) {
// file contents aren't decoded
ro.Seek(0, io.SeekEnd)
func (c *containerGGML) Decode(r io.Reader) (model, error) {
return nil, nil
}
@@ -106,9 +103,9 @@ func (c *containerGGMF) Name() string {
return "ggmf"
}
func (c *containerGGMF) Decode(ro *readSeekOffset) (model, error) {
func (c *containerGGMF) Decode(r io.Reader) (model, error) {
var version uint32
binary.Read(ro, binary.LittleEndian, &version)
binary.Read(r, binary.LittleEndian, &version)
switch version {
case 1:
@@ -117,10 +114,6 @@ func (c *containerGGMF) Decode(ro *readSeekOffset) (model, error) {
}
c.version = version
// remaining file contents aren't decoded
ro.Seek(0, io.SeekEnd)
return nil, nil
}
@@ -132,9 +125,9 @@ func (c *containerGGJT) Name() string {
return "ggjt"
}
func (c *containerGGJT) Decode(ro *readSeekOffset) (model, error) {
func (c *containerGGJT) Decode(r io.Reader) (model, error) {
var version uint32
binary.Read(ro, binary.LittleEndian, &version)
binary.Read(r, binary.LittleEndian, &version)
switch version {
case 1, 2, 3:
@@ -146,11 +139,7 @@ func (c *containerGGJT) Decode(ro *readSeekOffset) (model, error) {
// different model types may have different layouts for hyperparameters
var llama llamaModel
binary.Read(ro, binary.LittleEndian, &llama.hyperparameters)
// remaining file contents aren't decoded
ro.Seek(0, io.SeekEnd)
binary.Read(r, binary.LittleEndian, &llama.hyperparameters)
return &llama, nil
}
@@ -162,9 +151,9 @@ func (c *containerLORA) Name() string {
return "ggla"
}
func (c *containerLORA) Decode(ro *readSeekOffset) (model, error) {
func (c *containerLORA) Decode(r io.Reader) (model, error) {
var version uint32
binary.Read(ro, binary.LittleEndian, &version)
binary.Read(r, binary.LittleEndian, &version)
switch version {
case 1:
@@ -173,10 +162,6 @@ func (c *containerLORA) Decode(ro *readSeekOffset) (model, error) {
}
c.version = version
// remaining file contents aren't decoded
ro.Seek(0, io.SeekEnd)
return nil, nil
}
@@ -190,66 +175,35 @@ const (
// Magic constant for `ggla` files (LoRA adapter).
FILE_MAGIC_GGLA = 0x67676C61
// Magic constant for `gguf` files (versioned, gguf)
FILE_MAGIC_GGUF_LE = 0x46554747
FILE_MAGIC_GGUF_BE = 0x47475546
FILE_MAGIC_GGUF = 0x46554747
)
func DecodeGGML(r io.ReadSeeker) (*GGML, error) {
ro := readSeekOffset{ReadSeeker: r}
var ggml GGML
binary.Read(r, binary.LittleEndian, &ggml.magic)
var magic uint32
if err := binary.Read(&ro, binary.LittleEndian, &magic); err != nil {
return nil, err
}
var c container
switch magic {
switch ggml.magic {
case FILE_MAGIC_GGML:
c = &containerGGML{}
ggml.container = &containerGGML{}
case FILE_MAGIC_GGMF:
c = &containerGGMF{}
ggml.container = &containerGGMF{}
case FILE_MAGIC_GGJT:
c = &containerGGJT{}
ggml.container = &containerGGJT{}
case FILE_MAGIC_GGLA:
c = &containerLORA{}
case FILE_MAGIC_GGUF_LE:
c = &containerGGUF{bo: binary.LittleEndian}
case FILE_MAGIC_GGUF_BE:
c = &containerGGUF{bo: binary.BigEndian}
ggml.container = &containerLORA{}
case FILE_MAGIC_GGUF:
ggml.container = &containerGGUF{}
default:
return nil, errors.New("invalid file magic")
}
model, err := c.Decode(&ro)
model, err := ggml.Decode(r)
if err != nil {
return nil, err
}
ggml.model = model
// final model type
return &GGML{
container: c,
model: model,
Size: ro.offset,
}, nil
}
type readSeekOffset struct {
io.ReadSeeker
offset int64
}
func (rso *readSeekOffset) Seek(offset int64, whence int) (int64, error) {
offset, err := rso.ReadSeeker.Seek(offset, whence)
if err != nil {
return 0, err
}
rso.offset = offset
return offset, nil
}
func (rso *readSeekOffset) Read(p []byte) (int, error) {
n, err := rso.ReadSeeker.Read(p)
rso.offset += int64(n)
return n, err
return &ggml, nil
}
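DecodeGGML above selects a container decoder from the file's magic number. A minimal sketch of how a caller might drive it, assuming the import path used elsewhere in this repository; the model path is illustrative, and an *os.File satisfies either of the reader signatures shown in this hunk:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/jmorganca/ollama/llm"
)

func main() {
	// Path is illustrative; any GGUF model file works for this sketch.
	f, err := os.Open("model.gguf")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// DecodeGGML reads the magic number and dispatches to the matching
	// container decoder (ggml, ggmf, ggjt, ggla, or gguf).
	ggml, err := llm.DecodeGGML(f)
	if err != nil {
		log.Fatal(err)
	}

	// These accessors are promoted from the embedded container and model;
	// they assume a GGUF file, which carries full metadata.
	fmt.Println("container:", ggml.Name())
	fmt.Println("family:   ", ggml.ModelFamily())
	fmt.Println("file type:", ggml.FileType())
}
```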

View File

@@ -3,15 +3,12 @@ package llm
import (
"bytes"
"encoding/binary"
"errors"
"fmt"
"io"
"github.com/jmorganca/ollama/format"
)
type containerGGUF struct {
bo binary.ByteOrder
Version uint32
V1 struct {
@@ -29,18 +26,20 @@ func (c *containerGGUF) Name() string {
return "gguf"
}
func (c *containerGGUF) Decode(rso *readSeekOffset) (model, error) {
binary.Read(rso, c.bo, &c.Version)
func (c *containerGGUF) Decode(r io.Reader) (model, error) {
binary.Read(r, binary.LittleEndian, &c.Version)
switch c.Version {
case 1:
binary.Read(rso, c.bo, &c.V1)
binary.Read(r, binary.LittleEndian, &c.V1)
case 2:
binary.Read(r, binary.LittleEndian, &c.V2)
default:
binary.Read(rso, c.bo, &c.V2)
return nil, errors.New("invalid version")
}
model := newGGUFModel(c)
if err := model.Decode(rso); err != nil {
if err := model.Decode(r); err != nil {
return nil, err
}
@@ -65,23 +64,9 @@ const (
type kv map[string]any
type tensor struct {
name string
kind uint32
offset uint64
size uint64
// shape is the number of elements in each dimension
shape [4]uint64
}
type ggufModel struct {
*containerGGUF
kv
tensors []tensor
parameters uint64
}
func newGGUFModel(container *containerGGUF) *ggufModel {
@@ -91,14 +76,6 @@ func newGGUFModel(container *containerGGUF) *ggufModel {
}
}
func (llm *ggufModel) NumTensor() uint64 {
if llm.Version == 1 {
return uint64(llm.V1.NumTensor)
}
return llm.V2.NumTensor
}
func (llm *ggufModel) NumKV() uint64 {
if llm.Version == 1 {
return uint64(llm.V1.NumKV)
@@ -108,7 +85,8 @@ func (llm *ggufModel) NumKV() uint64 {
}
func (llm *ggufModel) ModelFamily() string {
if t, ok := llm.kv["general.architecture"].(string); ok {
t, ok := llm.kv["general.architecture"].(string)
if ok {
return t
}
@@ -116,64 +94,96 @@ func (llm *ggufModel) ModelFamily() string {
}
func (llm *ggufModel) ModelType() string {
if llm.parameters > 0 {
return format.HumanNumber(llm.parameters)
switch llm.ModelFamily() {
case "llama":
if blocks, ok := llm.kv["llama.block_count"].(uint32); ok {
heads, headsOK := llm.kv["llama.head_count"].(uint32)
headKVs, headsKVsOK := llm.kv["llama.head_count_kv"].(uint32)
if headsOK && headsKVsOK && heads/headKVs == 8 {
return "70B"
}
return llamaModelType(blocks)
}
case "falcon":
if blocks, ok := llm.kv["falcon.block_count"].(uint32); ok {
return falconModelType(blocks)
}
case "starcoder":
if blocks, ok := llm.kv["starcoder.block_count"].(uint32); ok {
return starCoderModelType(blocks)
}
}
return "unknown"
}
func (llm *ggufModel) FileType() string {
if t, ok := llm.kv["general.file_type"].(uint32); ok {
t, ok := llm.kv["general.file_type"].(uint32)
if ok {
return fileType(t)
}
return "unknown"
}
func (llm *ggufModel) Decode(rso *readSeekOffset) error {
// decode key-values
func (llm *ggufModel) Decode(r io.Reader) error {
read := llm.readString
if llm.Version == 1 {
read = llm.readStringV1
}
for i := 0; uint64(i) < llm.NumKV(); i++ {
k, err := llm.readString(rso)
k, err := read(r)
if err != nil {
return err
}
vtype := llm.readU32(rso)
vtype := llm.readU32(r)
var v any
switch vtype {
case ggufTypeUint8:
v = llm.readU8(rso)
v = llm.readU8(r)
case ggufTypeInt8:
v = llm.readI8(rso)
v = llm.readI8(r)
case ggufTypeUint16:
v = llm.readU16(rso)
v = llm.readU16(r)
case ggufTypeInt16:
v = llm.readI16(rso)
v = llm.readI16(r)
case ggufTypeUint32:
v = llm.readU32(rso)
v = llm.readU32(r)
case ggufTypeInt32:
v = llm.readI32(rso)
v = llm.readI32(r)
case ggufTypeUint64:
v = llm.readU64(rso)
v = llm.readU64(r)
case ggufTypeInt64:
v = llm.readI64(rso)
v = llm.readI64(r)
case ggufTypeFloat32:
v = llm.readF32(rso)
v = llm.readF32(r)
case ggufTypeFloat64:
v = llm.readF64(rso)
v = llm.readF64(r)
case ggufTypeBool:
v = llm.readBool(rso)
v = llm.readBool(r)
case ggufTypeString:
s, err := llm.readString(rso)
fn := llm.readString
if llm.Version == 1 {
fn = llm.readStringV1
}
s, err := fn(r)
if err != nil {
return err
}
v = s
case ggufTypeArray:
a, err := llm.readArray(rso)
fn := llm.readArray
if llm.Version == 1 {
fn = llm.readArrayV1
}
a, err := fn(r)
if err != nil {
return err
}
@@ -186,89 +196,6 @@ func (llm *ggufModel) Decode(rso *readSeekOffset) error {
llm.kv[k] = v
}
// decode tensors
for i := 0; uint64(i) < llm.NumTensor(); i++ {
name, err := llm.readString(rso)
if err != nil {
return err
}
// dims is the number of dimensions in the tensor
dims := llm.readU32(rso)
shape := [4]uint64{1, 1, 1, 1}
for i := 0; uint32(i) < dims; i++ {
shape[i] = llm.readU64(rso)
}
kind := llm.readU32(rso)
offset := llm.readU64(rso)
var blockSize uint64
switch {
case kind < 2:
blockSize = 1
case kind < 10:
blockSize = 32
default:
blockSize = 256
}
var typeSize uint64
switch kind {
case 0: // FP32
typeSize = 4
case 1: // FP16
typeSize = 2
case 2: // Q4_0
typeSize = 2 + blockSize/2
case 3: // Q4_1
typeSize = 2 + 2 + blockSize/2
case 6: // Q5_0
typeSize = 2 + 4 + blockSize/2
case 7: // Q5_1
typeSize = 2 + 2 + 4 + blockSize/2
case 8: // Q8_0
typeSize = 2 + blockSize
case 9: // Q8_1
typeSize = 4 + 4 + blockSize
case 10: // Q2_K
typeSize = blockSize/16 + blockSize/4 + 2 + 2
case 11: // Q3_K
typeSize = blockSize/8 + blockSize/4 + 12 + 2
case 12: // Q4_K
typeSize = 2 + 2 + 12 + blockSize/2
case 13: // Q5_K
typeSize = 2 + 2 + 12 + blockSize/8 + blockSize/2
case 14: // Q6_K
typeSize = blockSize/2 + blockSize/4 + blockSize/16 + 2
}
parameters := shape[0] * shape[1] * shape[2] * shape[3]
size := parameters * typeSize / blockSize
llm.tensors = append(llm.tensors, tensor{
name: name,
kind: kind,
offset: offset,
size: size,
shape: shape,
})
llm.parameters += parameters
}
alignment, ok := llm.kv["general.alignment"].(uint32)
if !ok {
alignment = 32
}
rso.Seek(int64(alignment)-rso.offset%int64(alignment), io.SeekCurrent)
for _, tensor := range llm.tensors {
padded := (int64(tensor.size) + int64(alignment) - 1) & ^(int64(alignment) - 1)
rso.Seek(padded, io.SeekCurrent)
}
return nil
}
@@ -282,75 +209,75 @@ func (llm *ggufModel) NumLayers() int64 {
return int64(v)
}
func (llm ggufModel) readU8(r io.Reader) uint8 {
func (ggufModel) readU8(r io.Reader) uint8 {
var u8 uint8
binary.Read(r, llm.bo, &u8)
binary.Read(r, binary.LittleEndian, &u8)
return u8
}
func (llm ggufModel) readI8(r io.Reader) int8 {
func (ggufModel) readI8(r io.Reader) int8 {
var i8 int8
binary.Read(r, llm.bo, &i8)
binary.Read(r, binary.LittleEndian, &i8)
return i8
}
func (llm ggufModel) readU16(r io.Reader) uint16 {
func (ggufModel) readU16(r io.Reader) uint16 {
var u16 uint16
binary.Read(r, llm.bo, &u16)
binary.Read(r, binary.LittleEndian, &u16)
return u16
}
func (llm ggufModel) readI16(r io.Reader) int16 {
func (ggufModel) readI16(r io.Reader) int16 {
var i16 int16
binary.Read(r, llm.bo, &i16)
binary.Read(r, binary.LittleEndian, &i16)
return i16
}
func (llm ggufModel) readU32(r io.Reader) uint32 {
func (ggufModel) readU32(r io.Reader) uint32 {
var u32 uint32
binary.Read(r, llm.bo, &u32)
binary.Read(r, binary.LittleEndian, &u32)
return u32
}
func (llm ggufModel) readI32(r io.Reader) int32 {
func (ggufModel) readI32(r io.Reader) int32 {
var i32 int32
binary.Read(r, llm.bo, &i32)
binary.Read(r, binary.LittleEndian, &i32)
return i32
}
func (llm ggufModel) readU64(r io.Reader) uint64 {
func (ggufModel) readU64(r io.Reader) uint64 {
var u64 uint64
binary.Read(r, llm.bo, &u64)
binary.Read(r, binary.LittleEndian, &u64)
return u64
}
func (llm ggufModel) readI64(r io.Reader) int64 {
func (ggufModel) readI64(r io.Reader) int64 {
var i64 int64
binary.Read(r, llm.bo, &i64)
binary.Read(r, binary.LittleEndian, &i64)
return i64
}
func (llm ggufModel) readF32(r io.Reader) float32 {
func (ggufModel) readF32(r io.Reader) float32 {
var f32 float32
binary.Read(r, llm.bo, &f32)
binary.Read(r, binary.LittleEndian, &f32)
return f32
}
func (llm ggufModel) readF64(r io.Reader) float64 {
func (ggufModel) readF64(r io.Reader) float64 {
var f64 float64
binary.Read(r, llm.bo, &f64)
binary.Read(r, binary.LittleEndian, &f64)
return f64
}
func (llm ggufModel) readBool(r io.Reader) bool {
func (ggufModel) readBool(r io.Reader) bool {
var b bool
binary.Read(r, llm.bo, &b)
binary.Read(r, binary.LittleEndian, &b)
return b
}
func (llm ggufModel) readStringV1(r io.Reader) (string, error) {
func (ggufModel) readStringV1(r io.Reader) (string, error) {
var nameLength uint32
binary.Read(r, llm.bo, &nameLength)
binary.Read(r, binary.LittleEndian, &nameLength)
var b bytes.Buffer
if _, err := io.CopyN(&b, r, int64(nameLength)); err != nil {
@@ -364,12 +291,8 @@ func (llm ggufModel) readStringV1(r io.Reader) (string, error) {
}
func (llm ggufModel) readString(r io.Reader) (string, error) {
if llm.Version == 1 {
return llm.readStringV1(r)
}
var nameLength uint64
binary.Read(r, llm.bo, &nameLength)
binary.Read(r, binary.LittleEndian, &nameLength)
var b bytes.Buffer
if _, err := io.CopyN(&b, r, int64(nameLength)); err != nil {
@@ -388,7 +311,7 @@ func (llm *ggufModel) readArrayV1(r io.Reader) (arr []any, err error) {
case ggufTypeUint8:
arr = append(arr, llm.readU8(r))
case ggufTypeInt8:
arr = append(arr, llm.readI8(r))
arr = append(arr, llm.readU8(r))
case ggufTypeUint16:
arr = append(arr, llm.readU16(r))
case ggufTypeInt16:
@@ -417,10 +340,6 @@ func (llm *ggufModel) readArrayV1(r io.Reader) (arr []any, err error) {
}
func (llm *ggufModel) readArray(r io.Reader) (arr []any, err error) {
if llm.Version == 1 {
return llm.readArrayV1(r)
}
atype := llm.readU32(r)
n := llm.readU64(r)
@@ -429,7 +348,7 @@ func (llm *ggufModel) readArray(r io.Reader) (arr []any, err error) {
case ggufTypeUint8:
arr = append(arr, llm.readU8(r))
case ggufTypeInt8:
arr = append(arr, llm.readI8(r))
arr = append(arr, llm.readU8(r))
case ggufTypeUint16:
arr = append(arr, llm.readU16(r))
case ggufTypeInt16:

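For reference, the readString/readStringV1 pair above reads GGUF strings as a little-endian length prefix (uint64 in v2, uint32 in v1) followed by the raw bytes. A standalone sketch of the v2 layout, independent of the package internals:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readGGUFString reads a GGUF v2 string: a little-endian uint64 length
// followed by that many bytes.
func readGGUFString(r io.Reader) (string, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	var b bytes.Buffer
	if _, err := io.CopyN(&b, r, int64(n)); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	// Build a buffer containing one encoded string to exercise the reader.
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, uint64(len("general.architecture")))
	buf.WriteString("general.architecture")

	s, err := readGGUFString(&buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // general.architecture
}
```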
View File

@@ -7,12 +7,12 @@ package llm
//go:generate git -C ggml apply ../patches/0002-34B-model-support.patch
//go:generate git -C ggml apply ../patches/0003-metal-fix-synchronization-in-new-matrix-multiplicati.patch
//go:generate git -C ggml apply ../patches/0004-metal-add-missing-barriers-for-mul-mat-2699.patch
//go:generate cmake -S ggml -B ggml/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
//go:generate cmake -S ggml -B ggml/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
//go:generate cmake --build ggml/build/cpu --target server --config Release
//go:generate mv ggml/build/cpu/bin/server ggml/build/cpu/bin/ollama-runner
//go:generate git submodule update --force gguf
//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_METAL=off -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=on
//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
//go:generate cmake --build gguf/build/cpu --target server --config Release
//go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner

View File

@@ -12,7 +12,7 @@ package llm
//go:generate mv ggml/build/metal/bin/server ggml/build/metal/bin/ollama-runner
//go:generate git submodule update --force gguf
//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
//go:generate cmake -S gguf -B gguf/build/metal -DLLAMA_METAL=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0
//go:generate cmake --build gguf/build/metal --target server --config Release
//go:generate mv gguf/build/metal/bin/server gguf/build/metal/bin/ollama-runner

View File

@@ -13,14 +13,14 @@ package llm
//go:generate git submodule update --force gguf
//go:generate git -C gguf apply ../patches/0001-copy-cuda-runtime-libraries.patch
//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
//go:generate cmake --build gguf/build/cpu --target server --config Release
//go:generate mv gguf/build/cpu/bin/server gguf/build/cpu/bin/ollama-runner
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner

View File

@@ -10,15 +10,7 @@ package llm
//go:generate cmd /c move ggml\build\cpu\bin\Release\server.exe ggml\build\cpu\bin\Release\ollama-runner.exe
//go:generate git submodule update --force gguf
//go:generate git -C gguf apply ../patches/0001-update-default-log-target.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
//go:generate git -C gguf apply ../patches/0001-remove-warm-up-logging.patch
//go:generate cmake -S gguf -B gguf/build/cpu -DLLAMA_K_QUANTS=on
//go:generate cmake --build gguf/build/cpu --target server --config Release
//go:generate cmd /c move gguf\build\cpu\bin\Release\server.exe gguf\build\cpu\bin\Release\ollama-runner.exe
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate cmd /c move ggml\build\cuda\bin\Release\server.exe ggml\build\cuda\bin\Release\ollama-runner.exe
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate cmd /c move gguf\build\cuda\bin\Release\server.exe gguf\build\cuda\bin\Release\ollama-runner.exe

View File

@@ -0,0 +1,25 @@
From 8dbb5449db259a9c24796e7927d89bee98b6c8f5 Mon Sep 17 00:00:00 2001
From: Bruce MacDonald <brucewmacdonald@gmail.com>
Date: Thu, 5 Oct 2023 11:21:12 -0400
Subject: [PATCH] remove warm up logging
---
common/common.cpp | 2 --
1 file changed, 2 deletions(-)
diff --git a/common/common.cpp b/common/common.cpp
index 7370017..c4433fe 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -839,8 +839,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
}
{
- LOG("warming up the model with an empty run\n");
-
std::vector<llama_token> tmp = { llama_token_bos(lctx), llama_token_eos(lctx), };
llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
llama_kv_cache_tokens_rm(lctx, -1, -1);
--
2.39.2 (Apple Git-143)

View File

@@ -1,25 +0,0 @@
From 6465fec6290f0a7f5d4d0fbe6bcf634e4810dde6 Mon Sep 17 00:00:00 2001
From: Michael Yang <mxyng@pm.me>
Date: Mon, 23 Oct 2023 10:39:34 -0700
Subject: [PATCH] default log stderr
---
common/log.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/common/log.h b/common/log.h
index b8953fd..25522cd 100644
--- a/common/log.h
+++ b/common/log.h
@@ -90,7 +90,7 @@
// }
//
#ifndef LOG_TARGET
- #define LOG_TARGET log_handler()
+ #define LOG_TARGET nullptr
#endif
#ifndef LOG_TEE_TARGET
--
2.42.0

View File

@@ -27,39 +27,10 @@ import (
"github.com/jmorganca/ollama/format"
)
const jsonGrammar = `
root ::= object
value ::= object | array | string | number | ("true" | "false" | "null") ws
object ::=
"{" ws (
string ":" ws value
("," ws string ":" ws value)*
)? "}" ws
array ::=
"[" ws (
value
("," ws value)*
)? "]" ws
string ::=
"\"" (
[^"\\] |
"\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
)* "\"" ws
number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= ([ \t\n] ws)?
`
//go:embed llama.cpp/*/build/*/bin/*
var llamaCppEmbed embed.FS
type ModelRunner struct {
Type string // "gguf" or "ggml"
Path string // path to the model runner executable
Accelerated bool
}
@@ -72,26 +43,24 @@ func chooseRunners(workDir, runnerType string) []ModelRunner {
// IMPORTANT: the order of the runners in the array is the priority order
switch runtime.GOOS {
case "darwin":
if runtime.GOARCH == "arm64" {
runners = []ModelRunner{{Type: runnerType, Path: path.Join(buildPath, "metal", "bin", "ollama-runner")}}
} else {
runners = []ModelRunner{{Type: runnerType, Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")}}
runners = []ModelRunner{
{Path: path.Join(buildPath, "metal", "bin", "ollama-runner")},
{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
}
case "linux":
runners = []ModelRunner{
{Type: runnerType, Path: path.Join(buildPath, "cuda", "bin", "ollama-runner"), Accelerated: true},
{Type: runnerType, Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
{Path: path.Join(buildPath, "cuda", "bin", "ollama-runner"), Accelerated: true},
{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
}
case "windows":
// TODO: select windows GPU runner here when available
runners = []ModelRunner{
{Type: runnerType, Path: path.Join(buildPath, "cuda", "bin", "Release", "ollama-runner.exe"), Accelerated: true},
{Type: runnerType, Path: path.Join(buildPath, "cpu", "bin", "Release", "ollama-runner.exe")},
{Path: path.Join(buildPath, "cpu", "bin", "Release", "ollama-runner.exe")},
}
default:
log.Printf("unknown OS, running on CPU: %s", runtime.GOOS)
runners = []ModelRunner{
{Type: runnerType, Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
{Path: path.Join(buildPath, "cpu", "bin", "ollama-runner")},
}
}
@@ -149,7 +118,6 @@ func chooseRunners(workDir, runnerType string) []ModelRunner {
for _, r := range runners {
// clean the ModelRunner paths so that they match the OS we are running on
localRunnersByPriority = append(localRunnersByPriority, ModelRunner{
Type: r.Type,
Path: filepath.Clean(path.Join(workDir, r.Path)),
Accelerated: r.Accelerated,
})
@@ -223,21 +191,12 @@ type Running struct {
*StatusWriter // captures error messages from the llama runner process
}
type ImageData struct {
Data []byte `json:"data"`
ID int `json:"id"`
}
type llama struct {
api.Options
ImageData []ImageData
Running
}
var (
errNvidiaSMI = errors.New("warning: gpu support may not be enabled, check that you have installed GPU drivers: nvidia-smi command failed")
errAvailableVRAM = errors.New("not enough VRAM available, falling back to CPU only")
)
var errNoGPU = errors.New("nvidia-smi command failed")
// CheckVRAM returns the free VRAM in bytes on Linux machines with NVIDIA GPUs
func CheckVRAM() (int64, error) {
@@ -246,17 +205,13 @@ func CheckVRAM() (int64, error) {
cmd.Stdout = &stdout
err := cmd.Run()
if err != nil {
return 0, errNvidiaSMI
return 0, errNoGPU
}
var freeMiB int64
scanner := bufio.NewScanner(&stdout)
for scanner.Scan() {
line := scanner.Text()
if strings.Contains(line, "[Insufficient Permissions]") {
return 0, fmt.Errorf("GPU support may not enabled, check you have installed GPU drivers and have the necessary permissions to run nvidia-smi")
}
vram, err := strconv.ParseInt(strings.TrimSpace(line), 10, 64)
if err != nil {
return 0, fmt.Errorf("failed to parse available VRAM: %v", err)
@@ -267,8 +222,8 @@ func CheckVRAM() (int64, error) {
freeBytes := freeMiB * 1024 * 1024
if freeBytes < 2*format.GigaByte {
log.Printf("less than 2 GB VRAM available")
return 0, errAvailableVRAM
log.Printf("less than 2 GB VRAM available, falling back to CPU only")
freeMiB = 0
}
return freeBytes, nil
@@ -278,25 +233,22 @@ func NumGPU(numLayer, fileSizeBytes int64, opts api.Options) int {
if opts.NumGPU != -1 {
return opts.NumGPU
}
if runtime.GOOS == "linux" || runtime.GOOS == "windows" {
if runtime.GOOS == "linux" {
freeBytes, err := CheckVRAM()
if err != nil {
if !errors.Is(err, errNvidiaSMI) {
if err.Error() != "nvidia-smi command failed" {
log.Print(err.Error())
}
// nvidia driver not installed or no nvidia GPU found
return 0
}
/*
Calculate bytes per layer; this will roughly be the size of the model file divided by the number of layers.
We can store the model weights and the kv cache in VRAM;
to enable kv cache VRAM storage, add two additional layers to the number of layers retrieved from the model file.
*/
// Calculate bytes per layer
// TODO: this is a rough heuristic, better would be to calculate this based on number of layers and context size
bytesPerLayer := fileSizeBytes / numLayer
// 75% of the absolute max number of layers we can fit in available VRAM; off-loading too many layers to the GPU can cause OOM errors
layers := int(freeBytes/bytesPerLayer) * 3 / 4
// max number of layers we can fit in VRAM, subtract 8% to prevent consuming all available VRAM and running out of memory
layers := int(freeBytes/bytesPerLayer) * 92 / 100
log.Printf("%d MB VRAM available, loading up to %d GPU layers", freeBytes/(1024*1024), layers)
return layers
@@ -333,7 +285,7 @@ func (w *StatusWriter) Write(b []byte) (int, error) {
return os.Stderr.Write(b)
}
func newLlama(model string, adapters, projectors []string, runners []ModelRunner, numLayers int64, opts api.Options) (*llama, error) {
func newLlama(model string, adapters []string, runners []ModelRunner, numLayers int64, opts api.Options) (*llama, error) {
fileInfo, err := os.Stat(model)
if err != nil {
return nil, err
@@ -347,23 +299,13 @@ func newLlama(model string, adapters, projectors []string, runners []ModelRunner
params := []string{
"--model", model,
"--ctx-size", fmt.Sprintf("%d", opts.NumCtx),
"--rope-freq-base", fmt.Sprintf("%f", opts.RopeFrequencyBase),
"--rope-freq-scale", fmt.Sprintf("%f", opts.RopeFrequencyScale),
"--batch-size", fmt.Sprintf("%d", opts.NumBatch),
"--n-gpu-layers", fmt.Sprintf("%d", numGPU),
"--embedding",
}
if opts.MainGPU > 0 {
params = append(params, "--main-gpu", fmt.Sprintf("%d", opts.MainGPU))
}
if opts.RopeFrequencyBase > 0 {
params = append(params, "--rope-freq-base", fmt.Sprintf("%f", opts.RopeFrequencyBase))
}
if opts.RopeFrequencyScale > 0 {
params = append(params, "--rope-freq-scale", fmt.Sprintf("%f", opts.RopeFrequencyScale))
}
if opts.NumGQA > 0 {
params = append(params, "--gqa", fmt.Sprintf("%d", opts.NumGQA))
}
@@ -373,11 +315,6 @@ func newLlama(model string, adapters, projectors []string, runners []ModelRunner
params = append(params, "--lora", adapters[0])
}
if len(projectors) > 0 {
// TODO: applying multiple projectors is not supported by the llama.cpp server yet
params = append(params, "--mmproj", projectors[0])
}
if opts.NumThread > 0 {
params = append(params, "--threads", fmt.Sprintf("%d", opts.NumThread))
}
@@ -410,23 +347,13 @@ func newLlama(model string, adapters, projectors []string, runners []ModelRunner
}
port := rand.Intn(65535-49152) + 49152 // get a random port in the ephemeral range
params := append(params, "--port", strconv.Itoa(port))
ctx, cancel := context.WithCancel(context.Background())
cmd := exec.CommandContext(
ctx,
runner.Path,
params...,
append(params, "--port", strconv.Itoa(port))...,
)
var libraryPaths []string
if libraryPath, ok := os.LookupEnv("LD_LIBRARY_PATH"); ok {
libraryPaths = append(libraryPaths, libraryPath)
}
libraryPaths = append(libraryPaths, filepath.Dir(runner.Path))
cmd.Env = append(os.Environ(), fmt.Sprintf("LD_LIBRARY_PATH=%s", strings.Join(libraryPaths, ":")))
cmd.Env = append(os.Environ(), fmt.Sprintf("LD_LIBRARY_PATH=%s", filepath.Dir(runner.Path)))
cmd.Stdout = os.Stderr
statusWriter := NewStatusWriter()
cmd.Stderr = statusWriter
@@ -545,43 +472,25 @@ type prediction struct {
}
const maxBufferSize = 512 * format.KiloByte
const maxRetries = 6
type PredictOpts struct {
Prompt string
Format string
Images []api.ImageData
}
type PredictResult struct {
Content string
Done bool
PromptEvalCount int
PromptEvalDuration time.Duration
EvalCount int
EvalDuration time.Duration
}
// isRetryable checks if the line matches a condition that can be retried
func isRetryable(line []byte) bool {
return bytes.Contains(line, []byte("slot unavailable"))
}
func (llm *llama) Predict(ctx context.Context, predict PredictOpts, fn func(PredictResult)) error {
imageData := llm.ImageData
if len(predict.Images) > 0 {
for cnt, i := range predict.Images {
imageData = append(imageData, ImageData{Data: i, ID: cnt})
}
func (llm *llama) Predict(ctx context.Context, prevContext []int, prompt string, fn func(api.GenerateResponse)) error {
prevConvo, err := llm.Decode(ctx, prevContext)
if err != nil {
return err
}
log.Printf("loaded %d images", len(imageData))
// Remove leading spaces from prevConvo if present
prevConvo = strings.TrimPrefix(prevConvo, " ")
var nextContext strings.Builder
nextContext.WriteString(prevConvo)
nextContext.WriteString(prompt)
request := map[string]any{
"prompt": predict.Prompt,
"prompt": nextContext.String(),
"stream": true,
"n_predict": llm.NumPredict,
"n_keep": llm.NumKeep,
"main_gpu": llm.MainGPU,
"temperature": llm.Temperature,
"top_k": llm.TopK,
"top_p": llm.TopP,
@@ -597,121 +506,99 @@ func (llm *llama) Predict(ctx context.Context, predict PredictOpts, fn func(Pred
"penalize_nl": llm.PenalizeNewline,
"seed": llm.Seed,
"stop": llm.Stop,
"image_data": imageData,
}
if predict.Format == "json" {
request["grammar"] = jsonGrammar
// Handling JSON marshaling with special characters unescaped.
buffer := &bytes.Buffer{}
enc := json.NewEncoder(buffer)
enc.SetEscapeHTML(false)
if err := enc.Encode(request); err != nil {
return fmt.Errorf("failed to marshal data: %v", err)
}
retryDelay := 100 * time.Microsecond
for retries := 0; retries < maxRetries; retries++ {
if retries > 0 {
time.Sleep(retryDelay) // wait before retrying
retryDelay *= 2 // exponential backoff
}
endpoint := fmt.Sprintf("http://127.0.0.1:%d/completion", llm.Port)
req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, buffer)
if err != nil {
return fmt.Errorf("error creating POST request: %v", err)
}
req.Header.Set("Content-Type", "application/json")
// Handling JSON marshaling with special characters unescaped.
buffer := &bytes.Buffer{}
enc := json.NewEncoder(buffer)
enc.SetEscapeHTML(false)
resp, err := http.DefaultClient.Do(req)
if err != nil {
return fmt.Errorf("POST predict: %v", err)
}
defer resp.Body.Close()
if err := enc.Encode(request); err != nil {
return fmt.Errorf("failed to marshal data: %v", err)
}
endpoint := fmt.Sprintf("http://127.0.0.1:%d/completion", llm.Port)
req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, buffer)
if resp.StatusCode >= 400 {
bodyBytes, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("error creating POST request: %v", err)
return fmt.Errorf("failed reading llm error response: %w", err)
}
req.Header.Set("Content-Type", "application/json")
log.Printf("llm predict error: %s", bodyBytes)
return fmt.Errorf("%s", bodyBytes)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
return fmt.Errorf("POST predict: %v", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
bodyBytes, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("failed reading llm error response: %w", err)
scanner := bufio.NewScanner(resp.Body)
// increase the buffer size to avoid running out of space
buf := make([]byte, 0, maxBufferSize)
scanner.Buffer(buf, maxBufferSize)
for scanner.Scan() {
select {
case <-ctx.Done():
// This handles the request cancellation
return ctx.Err()
default:
line := scanner.Bytes()
if len(line) == 0 {
continue
}
log.Printf("llm predict error: %s", bodyBytes)
return fmt.Errorf("%s", bodyBytes)
}
scanner := bufio.NewScanner(resp.Body)
// increase the buffer size to avoid running out of space
buf := make([]byte, 0, maxBufferSize)
scanner.Buffer(buf, maxBufferSize)
retryNeeded := false
for scanner.Scan() {
select {
case <-ctx.Done():
// This handles the request cancellation
return ctx.Err()
default:
line := scanner.Bytes()
if len(line) == 0 {
continue
}
if isRetryable(line) {
retryNeeded = true
break
}
evt, ok := bytes.CutPrefix(line, []byte("data: "))
if !ok {
return fmt.Errorf("error parsing llm response stream: %s", line)
}
if evt, ok := bytes.CutPrefix(line, []byte("data: ")); ok {
var p prediction
if err := json.Unmarshal(evt, &p); err != nil {
return fmt.Errorf("error unmarshaling llm prediction response: %v", err)
}
if p.Content != "" {
fn(PredictResult{
Content: p.Content,
})
fn(api.GenerateResponse{Response: p.Content})
nextContext.WriteString(p.Content)
}
if p.Stop {
fn(PredictResult{
embd, err := llm.Encode(ctx, nextContext.String())
if err != nil {
return fmt.Errorf("encoding context: %v", err)
}
fn(api.GenerateResponse{
Done: true,
Context: embd,
PromptEvalCount: p.Timings.PromptN,
PromptEvalDuration: parseDurationMs(p.Timings.PromptMS),
EvalCount: p.Timings.PredictedN,
EvalDuration: parseDurationMs(p.Timings.PredictedMS),
})
return nil
}
}
}
if err := scanner.Err(); err != nil {
if strings.Contains(err.Error(), "unexpected EOF") {
// this means the llama runner subprocess crashed
llm.Close()
if llm.StatusWriter != nil && llm.StatusWriter.LastErrMsg != "" {
return fmt.Errorf("llama runner exited: %v", llm.StatusWriter.LastErrMsg)
}
return fmt.Errorf("llama runner exited, you may not have enough available memory to run this model")
}
return fmt.Errorf("error reading llm response: %v", err)
}
if !retryNeeded {
return nil // success
}
}
// should never reach here ideally
return fmt.Errorf("max retries exceeded")
if err := scanner.Err(); err != nil {
if strings.Contains(err.Error(), "unexpected EOF") {
// this means the llama runner subprocess crashed
llm.Close()
if llm.StatusWriter != nil && llm.StatusWriter.LastErrMsg != "" {
return fmt.Errorf("llama runner exited: %v", llm.StatusWriter.LastErrMsg)
}
return fmt.Errorf("llama runner exited, you may not have enough available memory to run this model")
}
return fmt.Errorf("error reading llm response: %v", err)
}
return nil
}
type TokenizeRequest struct {

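The NumGPU hunk above shows two variants of the layer-offload cap: 75% of the layers that fit in free VRAM on one side, 92% on the other. A rough, self-contained sketch of that arithmetic; the model size, layer count, and free VRAM below are made-up numbers:

```go
package main

import "fmt"

func main() {
	// Made-up numbers: a ~3.8 GB model with 32 layers and 4 GB of free VRAM.
	fileSizeBytes := int64(3_800_000_000)
	numLayer := int64(32)
	freeBytes := int64(4_000_000_000)

	// Rough heuristic from the hunk: bytes per layer is the model size
	// divided by the number of layers.
	bytesPerLayer := fileSizeBytes / numLayer

	// One side of the hunk offloads 75% of the layers that fit in free VRAM...
	layers75 := int(freeBytes/bytesPerLayer) * 3 / 4

	// ...the other offloads 92%, keeping roughly 8% of VRAM as headroom.
	layers92 := int(freeBytes/bytesPerLayer) * 92 / 100

	fmt.Printf("bytes/layer=%d, 75%% cap=%d layers, 92%% cap=%d layers\n",
		bytesPerLayer, layers75, layers92)
}
```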
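One side of the Predict hunk wraps the completion request in a retry loop with exponential backoff whenever the runner reports "slot unavailable". A generic standalone sketch of that pattern; doRequest and errSlotUnavailable are placeholders, not part of the package:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSlotUnavailable stands in for the retryable "slot unavailable"
// condition detected by isRetryable in the diff.
var errSlotUnavailable = errors.New("slot unavailable")

// doRequest is a placeholder for the POST to the completion endpoint;
// here it fails twice before succeeding to exercise the loop.
func doRequest(attempt int) error {
	if attempt < 2 {
		return errSlotUnavailable
	}
	return nil
}

func main() {
	const maxRetries = 6
	retryDelay := 100 * time.Microsecond

	for retries := 0; retries < maxRetries; retries++ {
		if retries > 0 {
			time.Sleep(retryDelay) // wait before retrying
			retryDelay *= 2        // exponential backoff
		}

		err := doRequest(retries)
		if err == nil {
			fmt.Println("succeeded after", retries, "retries")
			return
		}
		if !errors.Is(err, errSlotUnavailable) {
			fmt.Println("permanent error:", err)
			return
		}
	}
	fmt.Println("max retries exceeded")
}
```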
View File

@@ -14,7 +14,7 @@ import (
)
type LLM interface {
Predict(context.Context, PredictOpts, func(PredictResult)) error
Predict(context.Context, []int, string, func(api.GenerateResponse)) error
Embedding(context.Context, string) ([]float64, error)
Encode(context.Context, string) ([]int, error)
Decode(context.Context, []int) (string, error)
@@ -23,7 +23,7 @@ type LLM interface {
Ping(context.Context) error
}
func New(workDir, model string, adapters, projectors []string, opts api.Options) (LLM, error) {
func New(workDir, model string, adapters []string, opts api.Options) (LLM, error) {
if _, err := os.Stat(model); err != nil {
return nil, err
}
@@ -41,13 +41,20 @@ func New(workDir, model string, adapters, projectors []string, opts api.Options)
if runtime.GOOS == "darwin" {
switch ggml.FileType() {
case "F32", "Q5_0", "Q5_1", "Q8_0":
case "Q8_0":
if ggml.Name() != "gguf" && opts.NumGPU != 0 {
// GGML Q8_0 does not support the Metal API and will
// cause the runner to segmentation fault so disable GPU
log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
opts.NumGPU = 0
}
case "F32", "Q5_0", "Q5_1":
if opts.NumGPU != 0 {
// F32, Q5_0, Q5_1, and Q8_0 do not support Metal API and will
// cause the runner to segmentation fault so disable GPU
log.Printf("WARNING: GPU disabled for F32, Q5_0, Q5_1, and Q8_0")
opts.NumGPU = 0
}
}
var requiredMemory int64
@@ -78,13 +85,10 @@ func New(workDir, model string, adapters, projectors []string, opts api.Options)
switch ggml.Name() {
case "gguf":
// TODO: gguf will load these options automatically from the model binary
opts.NumGQA = 0
opts.RopeFrequencyBase = 0.0
opts.RopeFrequencyScale = 0.0
return newLlama(model, adapters, projectors, chooseRunners(workDir, "gguf"), ggml.NumLayers(), opts)
opts.NumGQA = 0 // TODO: remove this when llama.cpp runners differ enough to need separate newLlama functions
return newLlama(model, adapters, chooseRunners(workDir, "gguf"), ggml.NumLayers(), opts)
case "ggml", "ggmf", "ggjt", "ggla":
return newLlama(model, adapters, projectors, chooseRunners(workDir, "ggml"), ggml.NumLayers(), opts)
return newLlama(model, adapters, chooseRunners(workDir, "ggml"), ggml.NumLayers(), opts)
default:
return nil, fmt.Errorf("unknown ggml type: %s", ggml.ModelFamily())
}

23
llm/starcoder.go Normal file
View File

@@ -0,0 +1,23 @@
package llm
const (
starCoderModelType1B = 24
starCoderModelType3B = 36
starCoderModelType7B = 42
starCoderModelType15B = 40
)
func starCoderModelType(numLayer uint32) string {
switch numLayer {
case 24:
return "1B"
case 36:
return "3B"
case 42:
return "7B"
case 40:
return "15B"
default:
return "unknown"
}
}

View File

@@ -37,13 +37,10 @@ func Parse(reader io.Reader) ([]Command, error) {
switch string(bytes.ToUpper(fields[0])) {
case "FROM":
command.Name = "model"
command.Args = string(bytes.TrimSpace(fields[1]))
command.Args = string(fields[1])
// copy command for validation
modelCommand = command
case "ADAPTER":
command.Name = string(bytes.ToLower(fields[0]))
command.Args = string(bytes.TrimSpace(fields[1]))
case "LICENSE", "TEMPLATE", "SYSTEM", "PROMPT":
case "LICENSE", "TEMPLATE", "SYSTEM", "PROMPT", "ADAPTER":
command.Name = string(bytes.ToLower(fields[0]))
command.Args = string(fields[1])
case "PARAMETER":
@@ -53,7 +50,7 @@ func Parse(reader io.Reader) ([]Command, error) {
}
command.Name = string(fields[0])
command.Args = string(bytes.TrimSpace(fields[1]))
command.Args = string(fields[1])
case "EMBED":
return nil, fmt.Errorf("deprecated command: EMBED is no longer supported, use the /embed API endpoint instead")
default:

View File

@@ -1,215 +0,0 @@
package progress
import (
"fmt"
"os"
"strings"
"time"
"github.com/jmorganca/ollama/format"
"golang.org/x/term"
)
type Bar struct {
message string
messageWidth int
maxValue int64
initialValue int64
currentValue int64
started time.Time
stopped time.Time
maxBuckets int
buckets []bucket
}
type bucket struct {
updated time.Time
value int64
}
func NewBar(message string, maxValue, initialValue int64) *Bar {
b := Bar{
message: message,
messageWidth: -1,
maxValue: maxValue,
initialValue: initialValue,
currentValue: initialValue,
started: time.Now(),
maxBuckets: 10,
}
if initialValue >= maxValue {
b.stopped = time.Now()
}
return &b
}
// formatDuration limits the rendering of a time.Duration to 2 units
func formatDuration(d time.Duration) string {
switch {
case d >= 100*time.Hour:
return "99h+"
case d >= time.Hour:
return fmt.Sprintf("%dh%dm", int(d.Hours()), int(d.Minutes())%60)
default:
return d.Round(time.Second).String()
}
}
func (b *Bar) String() string {
termWidth, _, err := term.GetSize(int(os.Stderr.Fd()))
if err != nil {
termWidth = 80
}
var pre strings.Builder
if len(b.message) > 0 {
message := strings.TrimSpace(b.message)
if b.messageWidth > 0 && len(message) > b.messageWidth {
message = message[:b.messageWidth]
}
fmt.Fprintf(&pre, "%s", message)
if padding := b.messageWidth - pre.Len(); padding > 0 {
pre.WriteString(repeat(" ", padding))
}
pre.WriteString(" ")
}
fmt.Fprintf(&pre, "%3.0f%%", b.percent())
var suf strings.Builder
// max 13 characters: "999 MB/999 MB"
if b.stopped.IsZero() {
curValue := format.HumanBytes(b.currentValue)
suf.WriteString(repeat(" ", 6-len(curValue)))
suf.WriteString(curValue)
suf.WriteString("/")
maxValue := format.HumanBytes(b.maxValue)
suf.WriteString(repeat(" ", 6-len(maxValue)))
suf.WriteString(maxValue)
} else {
maxValue := format.HumanBytes(b.maxValue)
suf.WriteString(repeat(" ", 6-len(maxValue)))
suf.WriteString(maxValue)
suf.WriteString(repeat(" ", 7))
}
rate := b.rate()
// max 10 characters: " 999 MB/s"
if b.stopped.IsZero() && rate > 0 {
suf.WriteString(" ")
humanRate := format.HumanBytes(int64(rate))
suf.WriteString(repeat(" ", 6-len(humanRate)))
suf.WriteString(humanRate)
suf.WriteString("/s")
} else {
suf.WriteString(repeat(" ", 10))
}
// max 8 characters: " 59m59s"
if b.stopped.IsZero() && rate > 0 {
suf.WriteString(" ")
var remaining time.Duration
if rate > 0 {
remaining = time.Duration(int64(float64(b.maxValue-b.currentValue)/rate)) * time.Second
}
humanRemaining := formatDuration(remaining)
suf.WriteString(repeat(" ", 6-len(humanRemaining)))
suf.WriteString(humanRemaining)
} else {
suf.WriteString(repeat(" ", 8))
}
var mid strings.Builder
// add 5 extra spaces: 2 boundary characters and 1 space at each end
f := termWidth - pre.Len() - suf.Len() - 5
n := int(float64(f) * b.percent() / 100)
mid.WriteString(" ▕")
if n > 0 {
mid.WriteString(repeat("█", n))
}
if f-n > 0 {
mid.WriteString(repeat(" ", f-n))
}
mid.WriteString("▏ ")
return pre.String() + mid.String() + suf.String()
}
func (b *Bar) Set(value int64) {
if value >= b.maxValue {
value = b.maxValue
}
b.currentValue = value
if b.currentValue >= b.maxValue {
b.stopped = time.Now()
}
// throttle bucket updates to 1 per second
if len(b.buckets) == 0 || time.Since(b.buckets[len(b.buckets)-1].updated) > time.Second {
b.buckets = append(b.buckets, bucket{
updated: time.Now(),
value: value,
})
if len(b.buckets) > b.maxBuckets {
b.buckets = b.buckets[1:]
}
}
}
func (b *Bar) percent() float64 {
if b.maxValue > 0 {
return float64(b.currentValue) / float64(b.maxValue) * 100
}
return 0
}
func (b *Bar) rate() float64 {
var numerator, denominator float64
if !b.stopped.IsZero() {
numerator = float64(b.currentValue - b.initialValue)
denominator = b.stopped.Sub(b.started).Round(time.Second).Seconds()
} else {
switch len(b.buckets) {
case 0:
// noop
case 1:
numerator = float64(b.buckets[0].value - b.initialValue)
denominator = b.buckets[0].updated.Sub(b.started).Round(time.Second).Seconds()
default:
first, last := b.buckets[0], b.buckets[len(b.buckets)-1]
numerator = float64(last.value - first.value)
denominator = last.updated.Sub(first.updated).Round(time.Second).Seconds()
}
}
if denominator != 0 {
return numerator / denominator
}
return 0
}
func repeat(s string, n int) string {
if n > 0 {
return strings.Repeat(s, n)
}
return ""
}

View File

@@ -1,113 +0,0 @@
package progress
import (
"fmt"
"io"
"sync"
"time"
)
type State interface {
String() string
}
type Progress struct {
mu sync.Mutex
w io.Writer
pos int
ticker *time.Ticker
states []State
}
func NewProgress(w io.Writer) *Progress {
p := &Progress{w: w}
go p.start()
return p
}
func (p *Progress) stop() bool {
for _, state := range p.states {
if spinner, ok := state.(*Spinner); ok {
spinner.Stop()
}
}
if p.ticker != nil {
p.ticker.Stop()
p.ticker = nil
p.render()
return true
}
return false
}
func (p *Progress) Stop() bool {
stopped := p.stop()
if stopped {
fmt.Fprint(p.w, "\n")
}
return stopped
}
func (p *Progress) StopAndClear() bool {
fmt.Fprint(p.w, "\033[?25l")
defer fmt.Fprint(p.w, "\033[?25h")
stopped := p.stop()
if stopped {
// clear all progress lines
for i := 0; i < p.pos; i++ {
if i > 0 {
fmt.Fprint(p.w, "\033[A")
}
fmt.Fprint(p.w, "\033[2K\033[1G")
}
}
return stopped
}
func (p *Progress) Add(key string, state State) {
p.mu.Lock()
defer p.mu.Unlock()
p.states = append(p.states, state)
}
func (p *Progress) render() error {
p.mu.Lock()
defer p.mu.Unlock()
fmt.Fprint(p.w, "\033[?25l")
defer fmt.Fprint(p.w, "\033[?25h")
// clear already rendered progress lines
for i := 0; i < p.pos; i++ {
if i > 0 {
fmt.Fprint(p.w, "\033[A")
}
fmt.Fprint(p.w, "\033[2K\033[1G")
}
// render progress lines
for i, state := range p.states {
fmt.Fprint(p.w, state.String())
if i < len(p.states)-1 {
fmt.Fprint(p.w, "\n")
}
}
p.pos = len(p.states)
return nil
}
func (p *Progress) start() {
p.ticker = time.NewTicker(100 * time.Millisecond)
for range p.ticker.C {
p.render()
}
}

View File

@@ -1,73 +0,0 @@
package progress
import (
"fmt"
"strings"
"time"
)
type Spinner struct {
message string
messageWidth int
parts []string
value int
ticker *time.Ticker
started time.Time
stopped time.Time
}
func NewSpinner(message string) *Spinner {
s := &Spinner{
message: message,
parts: []string{
"⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏",
},
started: time.Now(),
}
go s.start()
return s
}
func (s *Spinner) String() string {
var sb strings.Builder
if len(s.message) > 0 {
message := strings.TrimSpace(s.message)
if s.messageWidth > 0 && len(message) > s.messageWidth {
message = message[:s.messageWidth]
}
fmt.Fprintf(&sb, "%s", message)
if padding := s.messageWidth - sb.Len(); padding > 0 {
sb.WriteString(strings.Repeat(" ", padding))
}
sb.WriteString(" ")
}
if s.stopped.IsZero() {
spinner := s.parts[s.value]
sb.WriteString(spinner)
sb.WriteString(" ")
}
return sb.String()
}
func (s *Spinner) start() {
s.ticker = time.NewTicker(100 * time.Millisecond)
for range s.ticker.C {
s.value = (s.value + 1) % len(s.parts)
if !s.stopped.IsZero() {
return
}
}
}
func (s *Spinner) Stop() {
if s.stopped.IsZero() {
s.stopped = time.Now()
}
}
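The three files above (progress/bar.go, progress/progress.go, progress/spinner.go) form the terminal progress package that appears on one side of this compare. A minimal usage sketch based on the code shown, assuming the package lives at github.com/jmorganca/ollama/progress; the message, sizes, and sleep interval are illustrative:

```go
package main

import (
	"os"
	"time"

	"github.com/jmorganca/ollama/progress"
)

func main() {
	// Progress periodically re-renders every registered State (bars, spinners).
	p := progress.NewProgress(os.Stderr)
	defer p.Stop()

	// A bar tracking an illustrative 100 MB transfer.
	bar := progress.NewBar("pulling model", 100_000_000, 0)
	p.Add("pull", bar)

	for done := int64(0); done <= 100_000_000; done += 10_000_000 {
		bar.Set(done)
		time.Sleep(100 * time.Millisecond)
	}
}
```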

21
progressbar/LICENSE Normal file
View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2017 Zack
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

121
progressbar/README.md Normal file
View File

@@ -0,0 +1,121 @@
# progressbar
[![CI](https://github.com/schollz/progressbar/actions/workflows/ci.yml/badge.svg?branch=main&event=push)](https://github.com/schollz/progressbar/actions/workflows/ci.yml)
[![go report card](https://goreportcard.com/badge/github.com/schollz/progressbar)](https://goreportcard.com/report/github.com/schollz/progressbar)
[![coverage](https://img.shields.io/badge/coverage-84%25-brightgreen.svg)](https://gocover.io/github.com/schollz/progressbar)
[![godocs](https://godoc.org/github.com/schollz/progressbar?status.svg)](https://godoc.org/github.com/schollz/progressbar/v3)
A very simple thread-safe progress bar which should work on every OS without problems. I needed a progressbar for [croc](https://github.com/schollz/croc) and everything I tried had problems, so I made another one. In order to be OS agnostic I do not plan to support [multi-line outputs](https://github.com/schollz/progressbar/issues/6).
## Install
```
go get -u github.com/schollz/progressbar/v3
```
## Usage
### Basic usage
```golang
bar := progressbar.Default(100)
for i := 0; i < 100; i++ {
bar.Add(1)
time.Sleep(40 * time.Millisecond)
}
```
which looks like:
![Example of basic bar](examples/basic/basic.gif)
### I/O operations
The `progressbar` implements `io.Writer`, so it can automatically detect the number of bytes written to a stream; this lets you use it as a progress bar for an `io.Reader`.
```golang
req, _ := http.NewRequest("GET", "https://dl.google.com/go/go1.14.2.src.tar.gz", nil)
resp, _ := http.DefaultClient.Do(req)
defer resp.Body.Close()
f, _ := os.OpenFile("go1.14.2.src.tar.gz", os.O_CREATE|os.O_WRONLY, 0644)
defer f.Close()
bar := progressbar.DefaultBytes(
resp.ContentLength,
"downloading",
)
io.Copy(io.MultiWriter(f, bar), resp.Body)
```
which looks like:
![Example of download bar](examples/download/download.gif)
### Progress bar with unknown length
A progressbar with unknown length is a spinner. Any bar created with a length of -1 is automatically converted to a spinner with a customizable spinner type. For example, the download code above can be run with `resp.ContentLength` set to `-1`.
which looks like:
![Example of download bar with unknown length](examples/download-unknown/download-unknown.gif)
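A minimal sketch of creating the spinner directly with a `-1` length (the description string is just an example):

```golang
bar := progressbar.Default(-1, "processing")
for i := 0; i < 100; i++ {
	bar.Add(1)
	time.Sleep(10 * time.Millisecond)
}
```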
### Customization
There is a lot of customization that you can do - change the writer, the color, the width, description, theme, etc. See [all the options](https://pkg.go.dev/github.com/schollz/progressbar/v3?tab=doc#Option).
```golang
bar := progressbar.NewOptions(1000,
progressbar.OptionSetWriter(ansi.NewAnsiStdout()),
progressbar.OptionEnableColorCodes(true),
progressbar.OptionShowBytes(true),
progressbar.OptionSetWidth(15),
progressbar.OptionSetDescription("[cyan][1/3][reset] Writing moshable file..."),
progressbar.OptionSetTheme(progressbar.Theme{
Saucer: "[green]=[reset]",
SaucerHead: "[green]>[reset]",
SaucerPadding: " ",
BarStart: "[",
BarEnd: "]",
}))
for i := 0; i < 1000; i++ {
bar.Add(1)
time.Sleep(5 * time.Millisecond)
}
```
which looks like:
![Example of customized bar](examples/customization/customization.gif)
## Contributing
Pull requests are welcome. Feel free to...
- Revise documentation
- Add new features
- Fix bugs
- Suggest improvements
## Thanks
Thanks [@Dynom](https://github.com/dynom) for massive improvements in version 2.0!
Thanks [@CrushedPixel](https://github.com/CrushedPixel) for adding descriptions and color code support!
Thanks [@MrMe42](https://github.com/MrMe42) for adding some minor features!
Thanks [@tehstun](https://github.com/tehstun) for some great PRs!
Thanks [@Benzammour](https://github.com/Benzammour) and [@haseth](https://github.com/haseth) for helping create v3!
Thanks [@briandowns](https://github.com/briandowns) for compiling the list of spinners.
## License
MIT

1098
progressbar/progressbar.go Normal file

File diff suppressed because it is too large

80
progressbar/spinners.go Normal file
View File

@@ -0,0 +1,80 @@
package progressbar
var spinners = map[int][]string{
0: {"←", "↖", "↑", "↗", "→", "↘", "↓", "↙"},
1: {"▁", "▃", "▄", "▅", "▆", "▇", "█", "▇", "▆", "▅", "▄", "▃", "▁"},
2: {"▖", "▘", "▝", "▗"},
3: {"┤", "┘", "┴", "└", "├", "┌", "┬", "┐"},
4: {"◢", "◣", "◤", "◥"},
5: {"◰", "◳", "◲", "◱"},
6: {"◴", "◷", "◶", "◵"},
7: {"◐", "◓", "◑", "◒"},
8: {".", "o", "O", "@", "*"},
9: {"|", "/", "-", "\\"},
10: {"◡◡", "⊙⊙", "◠◠"},
11: {"⣾", "⣽", "⣻", "⢿", "⡿", "⣟", "⣯", "⣷"},
12: {">))'>", " >))'>", " >))'>", " >))'>", " >))'>", " <'((<", " <'((<", " <'((<"},
13: {"⠁", "⠂", "⠄", "⡀", "⢀", "⠠", "⠐", "⠈"},
14: {"⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"},
15: {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"},
16: {"▉", "▊", "▋", "▌", "▍", "▎", "▏", "▎", "▍", "▌", "▋", "▊", "▉"},
17: {"■", "□", "▪", "▫"},
18: {"←", "↑", "→", "↓"},
19: {"╫", "╪"},
20: {"⇐", "⇖", "⇑", "⇗", "⇒", "⇘", "⇓", "⇙"},
21: {"⠁", "⠁", "⠉", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠤", "⠄", "⠄", "⠤", "⠠", "⠠", "⠤", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋", "⠉", "⠈", "⠈"},
22: {"⠈", "⠉", "⠋", "⠓", "⠒", "⠐", "⠐", "⠒", "⠖", "⠦", "⠤", "⠠", "⠠", "⠤", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋", "⠉", "⠈"},
23: {"⠁", "⠉", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠤", "⠄", "⠄", "⠤", "⠴", "⠲", "⠒", "⠂", "⠂", "⠒", "⠚", "⠙", "⠉", "⠁"},
24: {"⠋", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋"},
25: {"ヲ", "ァ", "ィ", "ゥ", "ェ", "ォ", "ャ", "ュ", "ョ", "ッ", "ア", "イ", "ウ", "エ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス", "セ", "ソ", "タ", "チ", "ツ", "テ", "ト", "ナ", "ニ", "ヌ", "ネ", "ノ", "ハ", "ヒ", "フ", "ヘ", "ホ", "マ", "ミ", "ム", "メ", "モ", "ヤ", "ユ", "ヨ", "ラ", "リ", "ル", "レ", "ロ", "ワ", "ン"},
26: {".", "..", "..."},
27: {"▁", "▂", "▃", "▄", "▅", "▆", "▇", "█", "▉", "▊", "▋", "▌", "▍", "▎", "▏", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█", "▇", "▆", "▅", "▄", "▃", "▂", "▁"},
28: {".", "o", "O", "°", "O", "o", "."},
29: {"+", "x"},
30: {"v", "<", "^", ">"},
31: {">>--->", " >>--->", " >>--->", " >>--->", " >>--->", " <---<<", " <---<<", " <---<<", " <---<<", "<---<<"},
32: {"|", "||", "|||", "||||", "|||||", "|||||||", "||||||||", "|||||||", "||||||", "|||||", "||||", "|||", "||", "|"},
33: {"[ ]", "[= ]", "[== ]", "[=== ]", "[==== ]", "[===== ]", "[====== ]", "[======= ]", "[======== ]", "[========= ]", "[==========]"},
34: {"(*---------)", "(-*--------)", "(--*-------)", "(---*------)", "(----*-----)", "(-----*----)", "(------*---)", "(-------*--)", "(--------*-)", "(---------*)"},
35: {"█▒▒▒▒▒▒▒▒▒", "███▒▒▒▒▒▒▒", "█████▒▒▒▒▒", "███████▒▒▒", "██████████"},
36: {"[ ]", "[=> ]", "[===> ]", "[=====> ]", "[======> ]", "[========> ]", "[==========> ]", "[============> ]", "[==============> ]", "[================> ]", "[==================> ]", "[===================>]"},
37: {"", ""},
38: {"▌", "▀", "▐▄"},
39: {"🌍", "🌎", "🌏"},
40: {"◜", "◝", "◞", "◟"},
41: {"⬒", "⬔", "⬓", "⬕"},
42: {"⬖", "⬘", "⬗", "⬙"},
43: {"[>>> >]", "[]>>>> []", "[] >>>> []", "[] >>>> []", "[] >>>> []", "[] >>>>[]", "[>> >>]"},
44: {"♠", "♣", "♥", "♦"},
45: {"➞", "➟", "➠", "➡", "➠", "➟"},
46: {" | ", ` \ `, "_ ", ` \ `, " | ", " / ", " _", " / "},
47: {" . . . .", ". . . .", ". . . .", ". . . .", ". . . . ", ". . . . ."},
48: {" | ", " / ", " _ ", ` \ `, " | ", ` \ `, " _ ", " / "},
49: {"⎺", "⎻", "⎼", "⎽", "⎼", "⎻"},
50: {"▹▹▹▹▹", "▸▹▹▹▹", "▹▸▹▹▹", "▹▹▸▹▹", "▹▹▹▸▹", "▹▹▹▹▸"},
51: {"[ ]", "[ =]", "[ ==]", "[ ===]", "[====]", "[=== ]", "[== ]", "[= ]"},
52: {"( ● )", "( ● )", "( ● )", "( ● )", "( ●)", "( ● )", "( ● )", "( ● )", "( ● )"},
53: {"✶", "✸", "✹", "✺", "✹", "✷"},
54: {"▐|\\____________▌", "▐_|\\___________▌", "▐__|\\__________▌", "▐___|\\_________▌", "▐____|\\________▌", "▐_____|\\_______▌", "▐______|\\______▌", "▐_______|\\_____▌", "▐________|\\____▌", "▐_________|\\___▌", "▐__________|\\__▌", "▐___________|\\_▌", "▐____________|\\▌", "▐____________/|▌", "▐___________/|_▌", "▐__________/|__▌", "▐_________/|___▌", "▐________/|____▌", "▐_______/|_____▌", "▐______/|______▌", "▐_____/|_______▌", "▐____/|________▌", "▐___/|_________▌", "▐__/|__________▌", "▐_/|___________▌", "▐/|____________▌"},
55: {"▐⠂ ▌", "▐⠈ ▌", "▐ ⠂ ▌", "▐ ⠠ ▌", "▐ ⡀ ▌", "▐ ⠠ ▌", "▐ ⠂ ▌", "▐ ⠈ ▌", "▐ ⠂ ▌", "▐ ⠠ ▌", "▐ ⡀ ▌", "▐ ⠠ ▌", "▐ ⠂ ▌", "▐ ⠈ ▌", "▐ ⠂▌", "▐ ⠠▌", "▐ ⡀▌", "▐ ⠠ ▌", "▐ ⠂ ▌", "▐ ⠈ ▌", "▐ ⠂ ▌", "▐ ⠠ ▌", "▐ ⡀ ▌", "▐ ⠠ ▌", "▐ ⠂ ▌", "▐ ⠈ ▌", "▐ ⠂ ▌", "▐ ⠠ ▌", "▐ ⡀ ▌", "▐⠠ ▌"},
56: {"¿", "?"},
57: {"⢹", "⢺", "⢼", "⣸", "⣇", "⡧", "⡗", "⡏"},
58: {"⢄", "⢂", "⢁", "⡁", "⡈", "⡐", "⡠"},
59: {". ", ".. ", "...", " ..", " .", " "},
60: {".", "o", "O", "°", "O", "o", "."},
61: {"▓", "▒", "░"},
62: {"▌", "▀", "▐", "▄"},
63: {"⊶", "⊷"},
64: {"▪", "▫"},
65: {"□", "■"},
66: {"▮", "▯"},
67: {"-", "=", "≡"},
68: {"d", "q", "p", "b"},
69: {"∙∙∙", "●∙∙", "∙●∙", "∙∙●", "∙∙∙"},
70: {"🌑 ", "🌒 ", "🌓 ", "🌔 ", "🌕 ", "🌖 ", "🌗 ", "🌘 "},
71: {"☗", "☖"},
72: {"⧇", "⧆"},
73: {"◉", "◎"},
74: {"㊂", "㊀", "㊁"},
75: {"⦾", "⦿"},
}
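The sets above are alternate spinner frame animations; each entry is one frame of a looping character sequence. A minimal, hypothetical sketch of how such a frame set might be cycled on a fixed interval (the `frames` variable, the 100 ms tick, and the loop bound are illustrative assumptions, not part of this code):

```
package main

import (
	"fmt"
	"time"
)

func main() {
	// assumed example frame set; the real code selects one of the sets above
	frames := []string{"◜", "◝", "◞", "◟"}

	ticker := time.NewTicker(100 * time.Millisecond) // interval is an assumption
	defer ticker.Stop()

	for i := 0; i < 20; i++ {
		<-ticker.C
		// "\r" rewrites the same terminal line, producing the animation
		fmt.Printf("\r%s loading...", frames[i%len(frames)])
	}
	fmt.Println()
}
```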

View File

@@ -1,372 +0,0 @@
package readline
import (
"fmt"
"os"
"github.com/emirpasic/gods/lists/arraylist"
"golang.org/x/term"
)
type Buffer struct {
Pos int
Buf *arraylist.List
Prompt *Prompt
LineWidth int
Width int
Height int
}
func NewBuffer(prompt *Prompt) (*Buffer, error) {
fd := int(os.Stdout.Fd())
width, height, err := term.GetSize(fd)
if err != nil {
fmt.Println("Error getting size:", err)
return nil, err
}
lwidth := width - len(prompt.Prompt)
if prompt.UseAlt {
lwidth = width - len(prompt.AltPrompt)
}
b := &Buffer{
Pos: 0,
Buf: arraylist.New(),
Prompt: prompt,
Width: width,
Height: height,
LineWidth: lwidth,
}
return b, nil
}
func (b *Buffer) MoveLeft() {
if b.Pos > 0 {
if b.Pos%b.LineWidth == 0 {
fmt.Printf(CursorUp + CursorBOL + cursorRightN(b.Width))
} else {
fmt.Print(CursorLeft)
}
b.Pos -= 1
}
}
func (b *Buffer) MoveLeftWord() {
if b.Pos > 0 {
var foundNonspace bool
for {
v, _ := b.Buf.Get(b.Pos - 1)
if v == ' ' {
if foundNonspace {
break
}
} else {
foundNonspace = true
}
b.MoveLeft()
if b.Pos == 0 {
break
}
}
}
}
func (b *Buffer) MoveRight() {
if b.Pos < b.Size() {
b.Pos += 1
if b.Pos%b.LineWidth == 0 {
fmt.Printf(CursorDown + CursorBOL + cursorRightN(b.PromptSize()))
} else {
fmt.Print(CursorRight)
}
}
}
func (b *Buffer) MoveRightWord() {
if b.Pos < b.Size() {
for {
b.MoveRight()
v, _ := b.Buf.Get(b.Pos)
if v == ' ' {
break
}
if b.Pos == b.Size() {
break
}
}
}
}
func (b *Buffer) MoveToStart() {
if b.Pos > 0 {
currLine := b.Pos / b.LineWidth
if currLine > 0 {
for cnt := 0; cnt < currLine; cnt++ {
fmt.Print(CursorUp)
}
}
fmt.Printf(CursorBOL + cursorRightN(b.PromptSize()))
b.Pos = 0
}
}
func (b *Buffer) MoveToEnd() {
if b.Pos < b.Size() {
currLine := b.Pos / b.LineWidth
totalLines := b.Size() / b.LineWidth
if currLine < totalLines {
for cnt := 0; cnt < totalLines-currLine; cnt++ {
fmt.Print(CursorDown)
}
remainder := b.Size() % b.LineWidth
fmt.Printf(CursorBOL + cursorRightN(b.PromptSize()+remainder))
} else {
fmt.Print(cursorRightN(b.Size() - b.Pos))
}
b.Pos = b.Size()
}
}
func (b *Buffer) Size() int {
return b.Buf.Size()
}
func min(n, m int) int {
if n > m {
return m
}
return n
}
func (b *Buffer) PromptSize() int {
if b.Prompt.UseAlt {
return len(b.Prompt.AltPrompt)
}
return len(b.Prompt.Prompt)
}
func (b *Buffer) Add(r rune) {
if b.Pos == b.Buf.Size() {
fmt.Printf("%c", r)
b.Buf.Add(r)
b.Pos += 1
if b.Pos > 0 && b.Pos%b.LineWidth == 0 {
fmt.Printf("\n%s", b.Prompt.AltPrompt)
}
} else {
fmt.Printf("%c", r)
b.Buf.Insert(b.Pos, r)
b.Pos += 1
if b.Pos > 0 && b.Pos%b.LineWidth == 0 {
fmt.Printf("\n%s", b.Prompt.AltPrompt)
}
b.drawRemaining()
}
}
func (b *Buffer) drawRemaining() {
var place int
remainingText := b.StringN(b.Pos)
if b.Pos > 0 {
place = b.Pos % b.LineWidth
}
fmt.Print(CursorHide)
// render the rest of the current line
currLine := remainingText[:min(b.LineWidth-place, len(remainingText))]
if len(currLine) > 0 {
fmt.Printf(ClearToEOL + currLine)
fmt.Print(cursorLeftN(len(currLine)))
} else {
fmt.Print(ClearToEOL)
}
// render the other lines
if len(remainingText) > len(currLine) {
remaining := []rune(remainingText[len(currLine):])
var totalLines int
for i, c := range remaining {
if i%b.LineWidth == 0 {
fmt.Printf("\n%s", b.Prompt.AltPrompt)
totalLines += 1
}
fmt.Printf("%c", c)
}
fmt.Print(ClearToEOL)
fmt.Print(cursorUpN(totalLines))
fmt.Printf(CursorBOL + cursorRightN(b.Width-len(currLine)))
}
fmt.Print(CursorShow)
}
func (b *Buffer) Remove() {
if b.Buf.Size() > 0 && b.Pos > 0 {
if b.Pos%b.LineWidth == 0 {
// if the user backspaces over the word boundary, do this magic to clear the line
// and move to the end of the previous line
fmt.Printf(CursorBOL + ClearToEOL)
fmt.Printf(CursorUp + CursorBOL + cursorRightN(b.Width) + " " + CursorLeft)
} else {
fmt.Printf(CursorLeft + " " + CursorLeft)
}
var eraseExtraLine bool
if (b.Size()-1)%b.LineWidth == 0 {
eraseExtraLine = true
}
b.Pos -= 1
b.Buf.Remove(b.Pos)
if b.Pos < b.Size() {
b.drawRemaining()
// this erases a line which is left over when backspacing in the middle of a line and there
// are trailing characters which go over the line width boundary
if eraseExtraLine {
remainingLines := (b.Size() - b.Pos) / b.LineWidth
fmt.Printf(cursorDownN(remainingLines+1) + CursorBOL + ClearToEOL)
place := b.Pos % b.LineWidth
fmt.Printf(cursorUpN(remainingLines+1) + cursorRightN(place+len(b.Prompt.Prompt)))
}
}
}
}
func (b *Buffer) Delete() {
if b.Size() > 0 && b.Pos < b.Size() {
b.Buf.Remove(b.Pos)
b.drawRemaining()
if b.Size()%b.LineWidth == 0 {
if b.Pos != b.Size() {
remainingLines := (b.Size() - b.Pos) / b.LineWidth
fmt.Printf(cursorDownN(remainingLines) + CursorBOL + ClearToEOL)
place := b.Pos % b.LineWidth
fmt.Printf(cursorUpN(remainingLines) + cursorRightN(place+len(b.Prompt.Prompt)))
}
}
}
}
func (b *Buffer) DeleteBefore() {
if b.Pos > 0 {
for cnt := b.Pos - 1; cnt >= 0; cnt-- {
b.Remove()
}
}
}
func (b *Buffer) DeleteRemaining() {
if b.Size() > 0 && b.Pos < b.Size() {
charsToDel := b.Size() - b.Pos
for cnt := 0; cnt < charsToDel; cnt++ {
b.Delete()
}
}
}
func (b *Buffer) DeleteWord() {
if b.Buf.Size() > 0 && b.Pos > 0 {
var foundNonspace bool
for {
v, _ := b.Buf.Get(b.Pos - 1)
if v == ' ' {
if !foundNonspace {
b.Remove()
} else {
break
}
} else {
foundNonspace = true
b.Remove()
}
if b.Pos == 0 {
break
}
}
}
}
func (b *Buffer) ClearScreen() {
fmt.Printf(ClearScreen + CursorReset + b.Prompt.Prompt)
if b.IsEmpty() {
ph := b.Prompt.Placeholder
fmt.Printf(ColorGrey + ph + cursorLeftN(len(ph)) + ColorDefault)
} else {
currPos := b.Pos
b.Pos = 0
b.drawRemaining()
fmt.Printf(CursorReset + cursorRightN(len(b.Prompt.Prompt)))
if currPos > 0 {
targetLine := currPos / b.LineWidth
if targetLine > 0 {
for cnt := 0; cnt < targetLine; cnt++ {
fmt.Print(CursorDown)
}
}
remainder := currPos % b.LineWidth
if remainder > 0 {
fmt.Print(cursorRightN(remainder))
}
if currPos%b.LineWidth == 0 {
fmt.Printf(CursorBOL + b.Prompt.AltPrompt)
}
}
b.Pos = currPos
}
}
func (b *Buffer) IsEmpty() bool {
return b.Buf.Empty()
}
func (b *Buffer) Replace(r []rune) {
b.Pos = 0
b.Buf.Clear()
fmt.Printf(ClearLine + CursorBOL + b.Prompt.Prompt)
for _, c := range r {
b.Add(c)
}
}
func (b *Buffer) String() string {
return b.StringN(0)
}
func (b *Buffer) StringN(n int) string {
return b.StringNM(n, 0)
}
func (b *Buffer) StringNM(n, m int) string {
var s string
if m == 0 {
m = b.Size()
}
for cnt := n; cnt < m; cnt++ {
c, _ := b.Buf.Get(cnt)
s += string(c.(rune))
}
return s
}
func cursorLeftN(n int) string {
return fmt.Sprintf(CursorLeftN, n)
}
func cursorRightN(n int) string {
return fmt.Sprintf(CursorRightN, n)
}
func cursorUpN(n int) string {
return fmt.Sprintf(CursorUpN, n)
}
func cursorDownN(n int) string {
return fmt.Sprintf(CursorDownN, n)
}
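The Buffer above wraps input longer than the terminal by treating the flat rune position `Pos` as a row and column derived from `LineWidth` (the terminal width minus the prompt length). A small hedged sketch of that arithmetic, with illustrative names that are not part of the package:

```
package main

import "fmt"

// rowCol is a hypothetical helper showing the wrapping arithmetic the Buffer
// relies on: a flat position is split into a terminal row and column using the
// usable line width (terminal width minus the prompt length).
func rowCol(pos, lineWidth int) (row, col int) {
	return pos / lineWidth, pos % lineWidth
}

func main() {
	const width = 80
	const prompt = ">>> "
	lineWidth := width - len(prompt)

	for _, pos := range []int{0, 75, 76, 150} {
		row, col := rowCol(pos, lineWidth)
		fmt.Printf("pos %3d -> row %d, col %d\n", pos, row, col)
	}
}
```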

View File

@@ -1,17 +0,0 @@
package readline
import (
"errors"
)
var (
ErrInterrupt = errors.New("Interrupt")
)
type InterruptError struct {
Line []rune
}
func (*InterruptError) Error() string {
return "Interrupted"
}

View File

@@ -1,155 +0,0 @@
package readline
import (
"bufio"
"errors"
"io"
"os"
"path/filepath"
"strings"
"github.com/emirpasic/gods/lists/arraylist"
)
type History struct {
Buf *arraylist.List
Autosave bool
Pos int
Limit int
Filename string
Enabled bool
}
func NewHistory() (*History, error) {
h := &History{
Buf: arraylist.New(),
Limit: 100, //resizeme
Autosave: true,
Enabled: true,
}
err := h.Init()
if err != nil {
return nil, err
}
return h, nil
}
func (h *History) Init() error {
home, err := os.UserHomeDir()
if err != nil {
return err
}
path := filepath.Join(home, ".ollama", "history")
if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
return err
}
h.Filename = path
f, err := os.OpenFile(path, os.O_CREATE|os.O_RDONLY, 0600)
if err != nil {
if errors.Is(err, os.ErrNotExist) {
return nil
}
return err
}
defer f.Close()
r := bufio.NewReader(f)
for {
line, err := r.ReadString('\n')
if err != nil {
if err == io.EOF {
break
}
return err
}
line = strings.TrimSpace(line)
if len(line) == 0 {
continue
}
h.Add([]rune(line))
}
return nil
}
func (h *History) Add(l []rune) {
h.Buf.Add(l)
h.Compact()
h.Pos = h.Size()
if h.Autosave {
h.Save()
}
}
func (h *History) Compact() {
s := h.Buf.Size()
if s > h.Limit {
for cnt := 0; cnt < s-h.Limit; cnt++ {
h.Buf.Remove(0)
}
}
}
func (h *History) Clear() {
h.Buf.Clear()
}
func (h *History) Prev() []rune {
var line []rune
if h.Pos > 0 {
h.Pos -= 1
}
v, _ := h.Buf.Get(h.Pos)
line, _ = v.([]rune)
return line
}
func (h *History) Next() []rune {
var line []rune
if h.Pos < h.Buf.Size() {
h.Pos += 1
v, _ := h.Buf.Get(h.Pos)
line, _ = v.([]rune)
}
return line
}
func (h *History) Size() int {
return h.Buf.Size()
}
func (h *History) Save() error {
if !h.Enabled {
return nil
}
tmpFile := h.Filename + ".tmp"
f, err := os.OpenFile(tmpFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC|os.O_APPEND, 0666)
if err != nil {
return err
}
defer f.Close()
buf := bufio.NewWriter(f)
for cnt := 0; cnt < h.Size(); cnt++ {
v, _ := h.Buf.Get(cnt)
line, _ := v.([]rune)
buf.WriteString(string(line) + "\n")
}
buf.Flush()
f.Close()
if err = os.Rename(tmpFile, h.Filename); err != nil {
return err
}
return nil
}
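History.Compact keeps only the newest Limit entries by repeatedly removing from the front of the buffer. A minimal sketch of the same trimming on a plain slice, with assumed names rather than the package API:

```
package main

import "fmt"

// compact is a hypothetical slice-based version of History.Compact: when the
// buffer grows past limit, the oldest entries are dropped from the front.
func compact(entries []string, limit int) []string {
	if len(entries) > limit {
		entries = entries[len(entries)-limit:]
	}
	return entries
}

func main() {
	hist := []string{"a", "b", "c", "d", "e"}
	fmt.Println(compact(hist, 3)) // [c d e]
}
```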

View File

@@ -1,255 +0,0 @@
package readline
import (
"bufio"
"fmt"
"io"
"os"
"syscall"
)
type Prompt struct {
Prompt string
AltPrompt string
Placeholder string
AltPlaceholder string
UseAlt bool
}
type Terminal struct {
outchan chan rune
}
type Instance struct {
Prompt *Prompt
Terminal *Terminal
History *History
Pasting bool
}
func New(prompt Prompt) (*Instance, error) {
term, err := NewTerminal()
if err != nil {
return nil, err
}
history, err := NewHistory()
if err != nil {
return nil, err
}
return &Instance{
Prompt: &prompt,
Terminal: term,
History: history,
}, nil
}
func (i *Instance) Readline() (string, error) {
prompt := i.Prompt.Prompt
if i.Prompt.UseAlt || i.Pasting {
prompt = i.Prompt.AltPrompt
}
fmt.Print(prompt)
fd := int(syscall.Stdin)
termios, err := SetRawMode(fd)
if err != nil {
return "", err
}
defer UnsetRawMode(fd, termios)
buf, _ := NewBuffer(i.Prompt)
var esc bool
var escex bool
var metaDel bool
var currentLineBuf []rune
for {
// don't show placeholder when pasting unless we're in multiline mode
showPlaceholder := !i.Pasting || i.Prompt.UseAlt
if buf.IsEmpty() && showPlaceholder {
ph := i.Prompt.Placeholder
if i.Prompt.UseAlt {
ph = i.Prompt.AltPlaceholder
}
fmt.Printf(ColorGrey + ph + fmt.Sprintf(CursorLeftN, len(ph)) + ColorDefault)
}
r, err := i.Terminal.Read()
if buf.IsEmpty() {
fmt.Print(ClearToEOL)
}
if err != nil {
return "", io.EOF
}
if escex {
escex = false
switch r {
case KeyUp:
if i.History.Pos > 0 {
if i.History.Pos == i.History.Size() {
currentLineBuf = []rune(buf.String())
}
buf.Replace(i.History.Prev())
}
case KeyDown:
if i.History.Pos < i.History.Size() {
buf.Replace(i.History.Next())
if i.History.Pos == i.History.Size() {
buf.Replace(currentLineBuf)
}
}
case KeyLeft:
buf.MoveLeft()
case KeyRight:
buf.MoveRight()
case CharBracketedPaste:
var code string
for cnt := 0; cnt < 3; cnt++ {
r, err = i.Terminal.Read()
if err != nil {
return "", io.EOF
}
code += string(r)
}
if code == CharBracketedPasteStart {
i.Pasting = true
} else if code == CharBracketedPasteEnd {
i.Pasting = false
}
case KeyDel:
if buf.Size() > 0 {
buf.Delete()
}
metaDel = true
case MetaStart:
buf.MoveToStart()
case MetaEnd:
buf.MoveToEnd()
default:
// skip any keys we don't know about
continue
}
continue
} else if esc {
esc = false
switch r {
case 'b':
buf.MoveLeftWord()
case 'f':
buf.MoveRightWord()
case CharBackspace:
buf.DeleteWord()
case CharEscapeEx:
escex = true
}
continue
}
switch r {
case CharNull:
continue
case CharEsc:
esc = true
case CharInterrupt:
return "", ErrInterrupt
case CharLineStart:
buf.MoveToStart()
case CharLineEnd:
buf.MoveToEnd()
case CharBackward:
buf.MoveLeft()
case CharForward:
buf.MoveRight()
case CharBackspace, CharCtrlH:
buf.Remove()
case CharTab:
// todo: convert back to real tabs
for cnt := 0; cnt < 8; cnt++ {
buf.Add(' ')
}
case CharDelete:
if buf.Size() > 0 {
buf.Delete()
} else {
return "", io.EOF
}
case CharKill:
buf.DeleteRemaining()
case CharCtrlU:
buf.DeleteBefore()
case CharCtrlL:
buf.ClearScreen()
case CharCtrlW:
buf.DeleteWord()
case CharCtrlZ:
return handleCharCtrlZ(fd, termios)
case CharEnter:
output := buf.String()
if output != "" {
i.History.Add([]rune(output))
}
buf.MoveToEnd()
fmt.Println()
return output, nil
default:
if metaDel {
metaDel = false
continue
}
if r >= CharSpace || r == CharEnter {
buf.Add(r)
}
}
}
}
func (i *Instance) HistoryEnable() {
i.History.Enabled = true
}
func (i *Instance) HistoryDisable() {
i.History.Enabled = false
}
func NewTerminal() (*Terminal, error) {
t := &Terminal{
outchan: make(chan rune),
}
go t.ioloop()
return t, nil
}
func (t *Terminal) ioloop() {
buf := bufio.NewReader(os.Stdin)
for {
r, _, err := buf.ReadRune()
if err != nil {
close(t.outchan)
break
}
t.outchan <- r
}
}
func (t *Terminal) Read() (rune, error) {
r, ok := <-t.outchan
if !ok {
return 0, io.EOF
}
return r, nil
}
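The Readline loop above decodes escape sequences with two flags, `esc` (ESC was read) and `escex` (ESC followed by '[' was read), before dispatching on the final byte. A self-contained, illustrative reimplementation of that state machine for the arrow keys only (names and the simplified key set are assumptions):

```
package main

import "fmt"

// decodeArrow is an illustrative version of the esc/escex state machine:
// ESC (27) followed by '[' (91) followed by 'A'..'D' selects an arrow key;
// anything else falls through as a normal rune.
func decodeArrow(input []rune) []string {
	var out []string
	var esc, escex bool
	for _, r := range input {
		switch {
		case escex:
			escex = false
			switch r {
			case 'A':
				out = append(out, "up")
			case 'B':
				out = append(out, "down")
			case 'C':
				out = append(out, "right")
			case 'D':
				out = append(out, "left")
			}
		case esc:
			esc = false
			if r == '[' {
				escex = true
			}
		case r == 27:
			esc = true
		default:
			out = append(out, string(r))
		}
	}
	return out
}

func main() {
	// "hi" followed by the escape sequence for the up arrow: ESC [ A
	fmt.Println(decodeArrow([]rune{'h', 'i', 27, '[', 'A'}))
}
```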

View File

@@ -1,18 +0,0 @@
//go:build !windows
package readline
import (
"syscall"
)
func handleCharCtrlZ(fd int, termios *Termios) (string, error) {
if err := UnsetRawMode(fd, termios); err != nil {
return "", err
}
syscall.Kill(0, syscall.SIGSTOP)
// on resume...
return "", nil
}

View File

@@ -1,6 +0,0 @@
package readline
func handleCharCtrlZ(fd int, state *State) (string, error) {
// not supported
return "", nil
}

View File

@@ -1,36 +0,0 @@
//go:build aix || darwin || dragonfly || freebsd || (linux && !appengine) || netbsd || openbsd || os400 || solaris
package readline
import (
"syscall"
)
type Termios syscall.Termios
func SetRawMode(fd int) (*Termios, error) {
termios, err := getTermios(fd)
if err != nil {
return nil, err
}
newTermios := *termios
newTermios.Iflag &^= syscall.IGNBRK | syscall.BRKINT | syscall.PARMRK | syscall.ISTRIP | syscall.INLCR | syscall.IGNCR | syscall.ICRNL | syscall.IXON
newTermios.Lflag &^= syscall.ECHO | syscall.ECHONL | syscall.ICANON | syscall.ISIG | syscall.IEXTEN
newTermios.Cflag &^= syscall.CSIZE | syscall.PARENB
newTermios.Cflag |= syscall.CS8
newTermios.Cc[syscall.VMIN] = 1
newTermios.Cc[syscall.VTIME] = 0
return termios, setTermios(fd, &newTermios)
}
func UnsetRawMode(fd int, termios *Termios) error {
return setTermios(fd, termios)
}
// IsTerminal returns true if the given file descriptor is a terminal.
func IsTerminal(fd int) bool {
_, err := getTermios(fd)
return err == nil
}

View File

@@ -1,25 +0,0 @@
//go:build darwin || freebsd || netbsd || openbsd
package readline
import (
"syscall"
"unsafe"
)
func getTermios(fd int) (*Termios, error) {
termios := new(Termios)
_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), syscall.TIOCGETA, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
if err != 0 {
return nil, err
}
return termios, nil
}
func setTermios(fd int, termios *Termios) error {
_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), syscall.TIOCSETA, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
if err != 0 {
return err
}
return nil
}

View File

@@ -1,28 +0,0 @@
//go:build linux || solaris
package readline
import (
"syscall"
"unsafe"
)
const tcgets = 0x5401
const tcsets = 0x5402
func getTermios(fd int) (*Termios, error) {
termios := new(Termios)
_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), tcgets, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
if err != 0 {
return nil, err
}
return termios, nil
}
func setTermios(fd int, termios *Termios) error {
_, _, err := syscall.Syscall6(syscall.SYS_IOCTL, uintptr(fd), tcsets, uintptr(unsafe.Pointer(termios)), 0, 0, 0)
if err != 0 {
return err
}
return nil
}

View File

@@ -1,62 +0,0 @@
package readline
import (
"syscall"
"unsafe"
)
const (
enableLineInput = 2
enableWindowInput = 8
enableMouseInput = 16
enableInsertMode = 32
enableQuickEditMode = 64
enableExtendedFlags = 128
enableProcessedOutput = 1
enableWrapAtEolOutput = 2
enableAutoPosition = 256 // Cursor position is not affected by writing data to the console.
enableEchoInput = 4 // Characters are written to the console as they're read.
enableProcessedInput = 1 // Enables input processing (like recognizing Ctrl+C).
)
var kernel32 = syscall.NewLazyDLL("kernel32.dll")
var (
procGetConsoleMode = kernel32.NewProc("GetConsoleMode")
procSetConsoleMode = kernel32.NewProc("SetConsoleMode")
)
type State struct {
mode uint32
}
// IsTerminal checks if the given file descriptor is associated with a terminal
func IsTerminal(fd int) bool {
var st uint32
r, _, e := syscall.SyscallN(procGetConsoleMode.Addr(), uintptr(fd), uintptr(unsafe.Pointer(&st)), 0)
// if the call succeeds and doesn't produce an error, it's a terminal
return r != 0 && e == 0
}
func SetRawMode(fd int) (*State, error) {
var st uint32
// retrieve the current mode of the terminal
_, _, e := syscall.SyscallN(procGetConsoleMode.Addr(), uintptr(fd), uintptr(unsafe.Pointer(&st)), 0)
if e != 0 {
return nil, error(e)
}
// modify the mode to set it to raw
raw := st &^ (enableEchoInput | enableProcessedInput | enableLineInput | enableProcessedOutput)
// apply the new mode to the terminal
_, _, e = syscall.SyscallN(procSetConsoleMode.Addr(), uintptr(fd), uintptr(raw), 0)
if e != 0 {
return nil, error(e)
}
// return the original state so that it can be restored later
return &State{st}, nil
}
func UnsetRawMode(fd int, state *State) error {
_, _, err := syscall.SyscallN(procSetConsoleMode.Addr(), uintptr(fd), uintptr(state.mode), 0)
return err
}

View File

@@ -1,78 +0,0 @@
package readline
const (
CharNull = 0
CharLineStart = 1
CharBackward = 2
CharInterrupt = 3
CharDelete = 4
CharLineEnd = 5
CharForward = 6
CharBell = 7
CharCtrlH = 8
CharTab = 9
CharCtrlJ = 10
CharKill = 11
CharCtrlL = 12
CharEnter = 13
CharNext = 14
CharPrev = 16
CharBckSearch = 18
CharFwdSearch = 19
CharTranspose = 20
CharCtrlU = 21
CharCtrlW = 23
CharCtrlY = 25
CharCtrlZ = 26
CharEsc = 27
CharSpace = 32
CharEscapeEx = 91
CharBackspace = 127
)
const (
KeyDel = 51
KeyUp = 65
KeyDown = 66
KeyRight = 67
KeyLeft = 68
MetaEnd = 70
MetaStart = 72
)
const (
CursorUp = "\033[1A"
CursorDown = "\033[1B"
CursorRight = "\033[1C"
CursorLeft = "\033[1D"
CursorSave = "\033[s"
CursorRestore = "\033[u"
CursorUpN = "\033[%dA"
CursorDownN = "\033[%dB"
CursorRightN = "\033[%dC"
CursorLeftN = "\033[%dD"
CursorEOL = "\033[E"
CursorBOL = "\033[1G"
CursorHide = "\033[?25l"
CursorShow = "\033[?25h"
ClearToEOL = "\033[K"
ClearLine = "\033[2K"
ClearScreen = "\033[2J"
CursorReset = "\033[0;0f"
ColorGrey = "\033[38;5;245m"
ColorDefault = "\033[0m"
StartBracketedPaste = "\033[?2004h"
EndBracketedPaste = "\033[?2004l"
)
const (
CharBracketedPaste = 50
CharBracketedPasteStart = "00~"
CharBracketedPasteEnd = "01~"
)
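StartBracketedPaste and EndBracketedPaste are the control sequences that turn bracketed-paste mode on and off; CharBracketedPasteStart/End are the `200~`/`201~` suffixes the terminal then wraps around pasted text. A hedged sketch of the usual enable-then-restore pattern around an input loop (the surrounding program structure is assumed):

```
package main

import "fmt"

const (
	startBracketedPaste = "\033[?2004h"
	endBracketedPaste   = "\033[?2004l"
)

func main() {
	// enable bracketed paste so a pasted block arrives wrapped in
	// ESC[200~ ... ESC[201~ markers instead of as individual keystrokes
	fmt.Print(startBracketedPaste)
	defer fmt.Print(endBracketedPaste) // restore the terminal on exit

	// ... read input here ...
	fmt.Println("reading input with bracketed paste enabled")
}
```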

2
runner/.gitignore vendored Normal file
View File

@@ -0,0 +1,2 @@
model.bin
runner

39
runner/darwin.go Normal file
View File

@@ -0,0 +1,39 @@
package main
import (
"embed"
"io"
"os"
"path/filepath"
)
//go:embed ggml-metal.metal
var fs embed.FS
func init() {
exec, err := os.Executable()
if err != nil {
return
}
exec, err = filepath.EvalSymlinks(exec)
if err != nil {
return
}
dst, err := os.Create(filepath.Join(filepath.Dir(exec), "ggml-metal.metal"))
if err != nil {
return
}
defer dst.Close()
src, err := fs.Open("ggml-metal.metal")
if err != nil {
return
}
defer src.Close()
if _, err := io.Copy(dst, src); err != nil {
return
}
}

620
runner/ggml-alloc.c Normal file
View File

@@ -0,0 +1,620 @@
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml.h"
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define UNUSED(x) (void)(x)
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define GGML_MAX_CONCUR (2*GGML_MAX_NODES)
//#define GGML_ALLOCATOR_DEBUG
//#define AT_PRINTF printf
#define AT_PRINTF(...) ((void)0)
struct hash_node {
struct ggml_tensor * t;
int n_children;
int n_views;
};
static size_t hash(void * p) {
return (size_t)p % GGML_GRAPH_HASHTABLE_SIZE;
}
static struct hash_node * hash_get(struct hash_node hash_table[], struct ggml_tensor * t) {
size_t h = hash(t);
// linear probing
size_t i = h;
while (hash_table[i].t != NULL) {
if (hash_table[i].t == t) {
return &hash_table[i];
}
i = (i + 1) % GGML_GRAPH_HASHTABLE_SIZE;
if (i == h) {
// hash table is full
GGML_ASSERT(false);
}
}
hash_table[i].t = t;
return &hash_table[i];
}
// TODO: GGML_PAD ?
static size_t aligned_offset(const void * buffer, size_t offset, size_t alignment) {
assert(alignment && !(alignment & (alignment - 1))); // power of 2
size_t align = (alignment - (((uintptr_t)buffer + offset) % alignment)) % alignment;
return offset + align;
}
struct free_block {
void * addr;
size_t size;
};
#define MAX_FREE_BLOCKS 256
struct ggml_allocr {
struct ggml_backend_buffer * buffer;
bool buffer_owned;
void * data;
size_t alignment;
int n_free_blocks;
struct free_block free_blocks[MAX_FREE_BLOCKS];
struct hash_node hash_table[GGML_GRAPH_HASHTABLE_SIZE];
size_t max_size;
bool measure;
int parse_seq[GGML_MAX_CONCUR];
int parse_seq_len;
#ifdef GGML_ALLOCATOR_DEBUG
struct ggml_tensor * allocated_tensors[1024];
#endif
};
#ifdef GGML_ALLOCATOR_DEBUG
static void add_allocated_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
for (int i = 0; i < 1024; i++) {
if (alloc->allocated_tensors[i] == NULL) {
alloc->allocated_tensors[i] = tensor;
return;
}
}
GGML_ASSERT(!"out of allocated_tensors");
}
static void remove_allocated_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
for (int i = 0; i < 1024; i++) {
if (alloc->allocated_tensors[i] == tensor ||
(alloc->allocated_tensors[i] != NULL && alloc->allocated_tensors[i]->data == tensor->data)) {
alloc->allocated_tensors[i] = NULL;
return;
}
}
printf("tried to free tensor %s not found\n", tensor->name);
GGML_ASSERT(!"tensor not found");
}
#endif
// check if a tensor is allocated by this buffer
static bool ggml_allocr_is_own(struct ggml_allocr * alloc, const struct ggml_tensor * tensor) {
return tensor->buffer == alloc->buffer;
}
static bool ggml_is_view(struct ggml_tensor * t) {
return t->view_src != NULL;
}
void ggml_allocr_alloc(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
GGML_ASSERT(!ggml_is_view(tensor)); // views generally get data pointer from one of their sources
GGML_ASSERT(tensor->data == NULL); // avoid allocating tensor which already has memory allocated
size_t size = ggml_backend_buffer_get_alloc_size(alloc->buffer, tensor);
size = aligned_offset(NULL, size, alloc->alignment);
AT_PRINTF("%s: allocating %s (%zu bytes) - ", __func__, tensor->name, size);
size_t max_avail = 0;
// find the best fitting free block besides the last block
int best_fit_block = -1;
size_t best_fit_size = SIZE_MAX;
for (int i = 0; i < alloc->n_free_blocks - 1; i++) {
struct free_block * block = &alloc->free_blocks[i];
max_avail = MAX(max_avail, block->size);
if (block->size >= size && block->size <= best_fit_size) {
best_fit_block = i;
best_fit_size = block->size;
}
}
AT_PRINTF("block %d\n", best_fit_block);
if (best_fit_block == -1) {
// the last block is our last resort
struct free_block * block = &alloc->free_blocks[alloc->n_free_blocks - 1];
max_avail = MAX(max_avail, block->size);
if (block->size >= size) {
best_fit_block = alloc->n_free_blocks - 1;
} else {
fprintf(stderr, "%s: not enough space in the buffer (needed %zu, largest block available %zu)\n",
__func__, size, max_avail);
GGML_ASSERT(!"not enough space in the buffer");
return;
}
}
struct free_block * block = &alloc->free_blocks[best_fit_block];
void * addr = block->addr;
block->addr = (char*)block->addr + size;
block->size -= size;
if (block->size == 0) {
// remove block if empty
alloc->n_free_blocks--;
for (int j = best_fit_block; j < alloc->n_free_blocks; j++) {
alloc->free_blocks[j] = alloc->free_blocks[j+1];
}
}
tensor->data = addr;
AT_PRINTF("%s: allocated data at %p\n", __func__, tensor->data);
tensor->buffer = alloc->buffer;
ggml_backend_buffer_init_tensor(alloc->buffer, tensor);
#ifdef GGML_ALLOCATOR_DEBUG
add_allocated_tensor(alloc, tensor);
size_t cur_max = (char*)addr - (char*)alloc->data + size;
if (cur_max > alloc->max_size) {
printf("max_size = %.2f MB: tensors: ", cur_max / 1024.0 / 1024.0);
for (int i = 0; i < 1024; i++) {
if (alloc->allocated_tensors[i]) {
printf("%s (%.2f MB) ", alloc->allocated_tensors[i]->name, ggml_nbytes(alloc->allocated_tensors[i]) / 1024.0 / 1024.0);
}
}
printf("\n");
}
#endif
alloc->max_size = MAX(alloc->max_size, (char*)addr - (char*)alloc->data + size);
}
// this is a very naive implementation, but for our case the number of free blocks should be very small
static void ggml_allocr_free_tensor(struct ggml_allocr * alloc, struct ggml_tensor * tensor) {
if (ggml_allocr_is_own(alloc, tensor) == false) {
// the tensor was not allocated in this buffer
// this can happen because the graph allocator will try to free weights and other tensors from different buffers
// the easiest way to deal with this is just to ignore it
AT_PRINTF("ignoring %s (their buffer: %p, our buffer: %p)\n", tensor->name, (void *)tensor->buffer, (void *)alloc->buffer);
return;
}
void * ptr = tensor->data;
size_t size = ggml_backend_buffer_get_alloc_size(alloc->buffer, tensor);
size = aligned_offset(NULL, size, alloc->alignment);
AT_PRINTF("%s: freeing %s at %p (%zu bytes) - n_free_blocks = %d\n", __func__, tensor->name, ptr, size, alloc->n_free_blocks);
ggml_backend_buffer_free_tensor(alloc->buffer, tensor);
#ifdef GGML_ALLOCATOR_DEBUG
remove_allocated_tensor(alloc, tensor);
#endif
// see if we can merge with an existing block
for (int i = 0; i < alloc->n_free_blocks; i++) {
struct free_block * block = &alloc->free_blocks[i];
// check if ptr is at the end of the block
if ((char*)block->addr + block->size == ptr) {
block->size += size;
// check if we can merge with the next block
if (i < alloc->n_free_blocks - 1 && (char*)block->addr + block->size == alloc->free_blocks[i+1].addr) {
block->size += alloc->free_blocks[i+1].size;
alloc->n_free_blocks--;
for (int j = i+1; j < alloc->n_free_blocks; j++) {
alloc->free_blocks[j] = alloc->free_blocks[j+1];
}
}
return;
}
// check if ptr is at the beginning of the block
if ((char*)ptr + size == block->addr) {
block->addr = ptr;
block->size += size;
// check if we can merge with the previous block
if (i > 0 && (char*)alloc->free_blocks[i-1].addr + alloc->free_blocks[i-1].size == block->addr) {
alloc->free_blocks[i-1].size += block->size;
alloc->n_free_blocks--;
for (int j = i; j < alloc->n_free_blocks; j++) {
alloc->free_blocks[j] = alloc->free_blocks[j+1];
}
}
return;
}
}
// otherwise, add a new block
GGML_ASSERT(alloc->n_free_blocks < MAX_FREE_BLOCKS && "out of free blocks");
// insert the new block in the correct position to keep the array sorted by address (to make merging blocks faster)
int insert_pos = 0;
while (insert_pos < alloc->n_free_blocks && alloc->free_blocks[insert_pos].addr < ptr) {
insert_pos++;
}
// shift all blocks from insert_pos onward to make room for the new block
for (int i = alloc->n_free_blocks; i > insert_pos; i--) {
alloc->free_blocks[i] = alloc->free_blocks[i-1];
}
// insert the new block
alloc->free_blocks[insert_pos].addr = ptr;
alloc->free_blocks[insert_pos].size = size;
alloc->n_free_blocks++;
}
void ggml_allocr_set_parse_seq(struct ggml_allocr * alloc, const int * list, int n) {
for (int i = 0; i < n; i++) {
alloc->parse_seq[i] = list[i];
}
alloc->parse_seq_len = n;
}
void ggml_allocr_reset(struct ggml_allocr * alloc) {
alloc->n_free_blocks = 1;
size_t align_offset = aligned_offset(alloc->data, 0, alloc->alignment);
alloc->free_blocks[0].addr = (char *)alloc->data + align_offset;
alloc->free_blocks[0].size = ggml_backend_buffer_get_size(alloc->buffer) - align_offset;
}
struct ggml_allocr * ggml_allocr_new(void * data, size_t size, size_t alignment) {
struct ggml_backend_buffer * buffer = ggml_backend_cpu_buffer_from_ptr(NULL, data, size);
struct ggml_allocr * alloc = (struct ggml_allocr *)malloc(sizeof(struct ggml_allocr));
*alloc = (struct ggml_allocr){
/*.buffer = */ buffer,
/*.buffer_owned = */ true,
/*.base = */ ggml_backend_buffer_get_base(buffer),
/*.alignment = */ alignment,
/*.n_free_blocks = */ 0,
/*.free_blocks = */ {{0}},
/*.hash_table = */ {{0}},
/*.max_size = */ 0,
/*.measure = */ false,
/*.parse_seq = */ {0},
/*.parse_seq_len = */ 0,
#ifdef GGML_ALLOCATOR_DEBUG
/*.allocated_tensors = */ {0},
#endif
};
ggml_allocr_reset(alloc);
return alloc;
}
struct ggml_allocr * ggml_allocr_new_measure(size_t alignment) {
struct ggml_allocr * alloc = ggml_allocr_new((void *)0x1000, (size_t)-0x1001, alignment);
alloc->measure = true;
return alloc;
}
struct ggml_allocr * ggml_allocr_new_from_buffer(struct ggml_backend_buffer * buffer) {
struct ggml_allocr * alloc = (struct ggml_allocr *)malloc(sizeof(struct ggml_allocr));
*alloc = (struct ggml_allocr){
/*.buffer = */ buffer,
/*.buffer_owned = */ false,
/*.base = */ ggml_backend_buffer_get_base(buffer),
/*.alignment = */ ggml_backend_buffer_get_alignment(buffer),
/*.n_free_blocks = */ 0,
/*.free_blocks = */ {{0}},
/*.hash_table = */ {{0}},
/*.max_size = */ 0,
/*.measure = */ false,
/*.parse_seq = */ {0},
/*.parse_seq_len = */ 0,
#ifdef GGML_ALLOCATOR_DEBUG
/*.allocated_tensors = */ {0},
#endif
};
ggml_allocr_reset(alloc);
return alloc;
}
void ggml_allocr_free(struct ggml_allocr * alloc) {
if (alloc->buffer_owned) {
ggml_backend_buffer_free(alloc->buffer);
}
free(alloc);
}
bool ggml_allocr_is_measure(struct ggml_allocr * alloc) {
return alloc->measure;
}
//////////// compute graph allocator
static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
if (a->type != b->type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (a->ne[i] != b->ne[i]) {
return false;
}
if (a->nb[i] != b->nb[i]) {
return false;
}
}
return true;
}
static bool ggml_op_can_inplace(enum ggml_op op) {
switch (op) {
case GGML_OP_SCALE:
case GGML_OP_DIAG_MASK_ZERO:
case GGML_OP_DIAG_MASK_INF:
case GGML_OP_ADD:
case GGML_OP_ADD1:
case GGML_OP_SUB:
case GGML_OP_MUL:
case GGML_OP_DIV:
case GGML_OP_SQR:
case GGML_OP_SQRT:
case GGML_OP_LOG:
case GGML_OP_UNARY:
case GGML_OP_ROPE:
case GGML_OP_RMS_NORM:
case GGML_OP_SOFT_MAX:
return true;
default:
return false;
}
}
static void init_view(struct ggml_allocr * alloc, struct ggml_tensor * view) {
assert(view->view_src != NULL && view->view_src->data != NULL);
view->backend = view->view_src->backend;
view->buffer = view->view_src->buffer;
view->data = (char *)view->view_src->data + view->view_offs;
// FIXME: the view should be initialized by the owning buffer, but currently this breaks the CUDA backend
// due to the ggml_tensor_extra_gpu ring buffer overwriting the KV cache extras
assert(ggml_allocr_is_measure(alloc) || !view->buffer || view->buffer->backend == alloc->buffer->backend);
ggml_backend_buffer_init_tensor(alloc->buffer, view);
}
static void allocate_node(struct ggml_allocr * alloc, struct ggml_tensor * node) {
struct hash_node * ht = alloc->hash_table;
if (node->data == NULL) {
if (ggml_is_view(node)) {
init_view(alloc, node);
} else {
// see if we can reuse a parent's buffer (inplace)
if (ggml_op_can_inplace(node->op)) {
for (int i = 0; i < GGML_MAX_SRC; i++) {
struct ggml_tensor * parent = node->src[i];
if (parent == NULL) {
break;
}
// if the node's data is external, then we cannot re-use it
if (ggml_allocr_is_own(alloc, parent) == false) {
AT_PRINTF("not reusing parent %s for %s as %p is external\n", parent->name, node->name, parent->data);
continue;
}
struct hash_node * p_hn = hash_get(ht, parent);
if (parent->data != NULL && p_hn->n_children == 1 && p_hn->n_views == 0 && ggml_are_same_layout(node, parent)) {
if (ggml_is_view(parent)) {
struct ggml_tensor * view_src = parent->view_src;
struct hash_node * view_src_hn = hash_get(ht, view_src);
if (view_src_hn->n_views == 1 && view_src_hn->n_children == 0 && view_src->data == parent->data) {
// TODO: the offset of the view parent must be kept to ensure that the op doesn't overwrite
// the parent's data that it will need later (same layout requirement). the problem is that then
// we cannot free the tensor because the original address of the allocation is lost.
// adding a view_src pointer to the tensor would solve this and simplify the code dealing with views
// for now, we only reuse the parent's data if the offset is zero (view_src->data == parent->data)
AT_PRINTF("reusing view parent %s (%s) for %s\n", parent->name, view_src->name, node->name);
node->view_src = view_src;
view_src_hn->n_views += 1;
init_view(alloc, node);
return;
}
}
else {
AT_PRINTF("reusing parent %s for %s\n", parent->name, node->name);
node->view_src = parent;
p_hn->n_views += 1;
init_view(alloc, node);
return;
}
}
}
}
ggml_allocr_alloc(alloc, node);
}
}
}
size_t ggml_allocr_alloc_graph_n(
struct ggml_allocr * alloc,
struct ggml_cgraph ** graphs, int n_graphs,
struct ggml_tensor *** inputs, struct ggml_tensor *** outputs) {
// reset hash table
struct hash_node * ht = alloc->hash_table;
memset(ht, 0, sizeof(struct hash_node) * GGML_GRAPH_HASHTABLE_SIZE);
// count number of children and views
for (int g = 0; g < n_graphs; g++) {
struct ggml_cgraph * gf = graphs[g];
for (int i = 0; i < gf->n_nodes; i++) {
struct ggml_tensor * node = gf->nodes[i];
if (ggml_is_view(node)) {
struct ggml_tensor * view_src = node->view_src;
hash_get(ht, view_src)->n_views += 1;
if (node->buffer == NULL && node->data != NULL) {
// view of a pre-allocated tensor, didn't call init_view() yet
init_view(alloc, node);
}
}
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * parent = node->src[j];
if (parent == NULL) {
break;
}
hash_get(ht, parent)->n_children += 1;
if (ggml_is_view(parent) && parent->buffer == NULL && parent->data != NULL) {
init_view(alloc, parent);
}
}
}
}
// allocate tensors
for (int g = 0; g < n_graphs; g++) {
struct ggml_cgraph * gf = graphs[g];
AT_PRINTF("####### graph %d/%d\n", g, n_graphs);
// graph inputs are allocated first to ensure that they are not overwritten by each other
if (inputs != NULL && inputs[g] != NULL) {
for (int i = 0; inputs[g][i] != NULL; i++) {
struct ggml_tensor * input = inputs[g][i];
AT_PRINTF("input: %s\n", input->name);
allocate_node(alloc, input);
}
}
// if we have parse_seq then we allocate nodes following the list, and we only free nodes at barriers
int last_barrier_pos = 0;
int n_nodes = alloc->parse_seq_len ? alloc->parse_seq_len : gf->n_nodes;
for (int ind = 0; ind < n_nodes; ind++) {
// allocate a node if there is no parse_seq or this is not a barrier
if ((alloc->parse_seq_len==0) || alloc->parse_seq[ind] != -1) {
int i = alloc->parse_seq_len ? alloc->parse_seq[ind] : ind;
struct ggml_tensor * node = gf->nodes[i];
// allocate parents (leafs)
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * parent = node->src[j];
if (parent == NULL) {
break;
}
allocate_node(alloc, parent);
}
// allocate node
allocate_node(alloc, node);
AT_PRINTF("exec: %s (%s) <= ", ggml_op_name(node->op), node->name);
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * parent = node->src[j];
if (parent == NULL) {
break;
}
AT_PRINTF("%s", parent->name);
if (j < GGML_MAX_SRC - 1 && node->src[j + 1] != NULL) {
AT_PRINTF(", ");
}
}
AT_PRINTF("\n");
}
// update parents
// update immediately if there is no parse_seq
// update only at barriers if there is parse_seq
if ((alloc->parse_seq_len == 0) || alloc->parse_seq[ind] == -1) {
int update_start = alloc->parse_seq_len ? last_barrier_pos : ind;
int update_end = alloc->parse_seq_len ? ind : ind + 1;
for (int i = update_start; i < update_end; i++) {
int node_i = alloc->parse_seq_len ? alloc->parse_seq[i] : i;
struct ggml_tensor * node = gf->nodes[node_i];
for (int j = 0; j < GGML_MAX_SRC; j++) {
struct ggml_tensor * parent = node->src[j];
if (parent == NULL) {
break;
}
struct hash_node * p_hn = hash_get(ht, parent);
p_hn->n_children -= 1;
//AT_PRINTF("parent %s: %d children, %d views\n", parent->name, parent->n_children, parent->n_views);
if (p_hn->n_children == 0 && p_hn->n_views == 0) {
if (ggml_is_view(parent)) {
struct ggml_tensor * view_src = parent->view_src;
struct hash_node * view_src_hn = hash_get(ht, view_src);
view_src_hn->n_views -= 1;
AT_PRINTF("view_src %s: %d children, %d views\n", view_src->name, view_src_hn->n_children, view_src_hn->n_views);
if (view_src_hn->n_views == 0 && view_src_hn->n_children == 0 && view_src->data != node->data) {
ggml_allocr_free_tensor(alloc, view_src);
}
}
else {
if (parent->data != node->data) {
ggml_allocr_free_tensor(alloc, parent);
}
}
}
}
}
AT_PRINTF("\n");
if (alloc->parse_seq_len) {
last_barrier_pos = ind + 1;
}
}
}
// free graph outputs here that wouldn't be freed otherwise because they have no children
if (outputs != NULL && outputs[g] != NULL) {
for (int i = 0; outputs[g][i] != NULL; i++) {
struct ggml_tensor * output = outputs[g][i];
AT_PRINTF("output: %s\n", output->name);
ggml_allocr_free_tensor(alloc, output);
}
}
}
return alloc->max_size;
}
size_t ggml_allocr_alloc_graph(struct ggml_allocr * alloc, struct ggml_cgraph * graph) {
return ggml_allocr_alloc_graph_n(alloc, &graph, 1, NULL, NULL);
}
size_t ggml_allocr_max_size(struct ggml_allocr * alloc) {
return alloc->max_size;
}
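ggml_allocr_alloc picks a free block with a best-fit scan: every block except the last is considered, the smallest one that still fits wins, and the last block is used only as a fallback. A minimal Go sketch of just that selection step, hedged as an illustration (the real allocator also aligns sizes, splits the chosen block, and merges neighbours on free):

```
package main

import "fmt"

type freeBlock struct {
	addr int // illustrative: offsets instead of real pointers
	size int
}

// bestFit mirrors the selection loop in ggml_allocr_alloc: scan every block
// except the last for the smallest block that still fits, and fall back to
// the last block only when nothing else is large enough.
func bestFit(blocks []freeBlock, size int) int {
	best, bestSize := -1, int(^uint(0)>>1) // index, "SIZE_MAX"
	for i := 0; i < len(blocks)-1; i++ {
		if blocks[i].size >= size && blocks[i].size <= bestSize {
			best, bestSize = i, blocks[i].size
		}
	}
	if best == -1 && len(blocks) > 0 && blocks[len(blocks)-1].size >= size {
		best = len(blocks) - 1
	}
	return best
}

func main() {
	blocks := []freeBlock{{0, 64}, {128, 32}, {512, 4096}}
	fmt.Println(bestFit(blocks, 24)) // 1: the 32-byte block is the tightest fit
}
```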

59
runner/ggml-alloc.h Normal file
View File

@@ -0,0 +1,59 @@
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#pragma once
#include "ggml.h"
#ifdef __cplusplus
extern "C" {
#endif
struct ggml_backend_buffer;
GGML_API struct ggml_allocr * ggml_allocr_new(void * data, size_t size, size_t alignment);
GGML_API struct ggml_allocr * ggml_allocr_new_measure(size_t alignment);
GGML_API struct ggml_allocr * ggml_allocr_new_from_buffer(struct ggml_backend_buffer * buffer);
// tell the allocator to parse nodes following the order described in the list
// you should call this if your graph is optimized to execute out-of-order
GGML_API void ggml_allocr_set_parse_seq(struct ggml_allocr * alloc, const int * list, int n);
GGML_API void ggml_allocr_free (struct ggml_allocr * alloc);
GGML_API bool ggml_allocr_is_measure (struct ggml_allocr * alloc);
GGML_API void ggml_allocr_reset (struct ggml_allocr * alloc);
GGML_API void ggml_allocr_alloc (struct ggml_allocr * alloc, struct ggml_tensor * tensor);
GGML_API size_t ggml_allocr_alloc_graph(struct ggml_allocr * alloc, struct ggml_cgraph * graph);
GGML_API size_t ggml_allocr_max_size (struct ggml_allocr * alloc);
GGML_API size_t ggml_allocr_alloc_graph_n(
struct ggml_allocr * alloc,
struct ggml_cgraph ** graphs, int n_graphs,
struct ggml_tensor *** inputs, struct ggml_tensor *** outputs);
#ifdef __cplusplus
}
#endif

411
runner/ggml-backend.c Normal file
View File

@@ -0,0 +1,411 @@
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#include "ggml-backend.h"
#include "ggml-alloc.h"
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define UNUSED GGML_UNUSED
#define MAX(a, b) ((a) > (b) ? (a) : (b))
// backend buffer
ggml_backend_buffer_t ggml_backend_buffer_init(
struct ggml_backend * backend,
struct ggml_backend_buffer_i iface,
ggml_backend_buffer_context_t context,
size_t size) {
ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
GGML_ASSERT(iface.get_base != NULL);
(*buffer) = (struct ggml_backend_buffer) {
/* .interface = */ iface,
/* .backend = */ backend,
/* .context = */ context,
/* .size = */ size,
};
return buffer;
}
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
if (buffer->iface.free_buffer != NULL) {
buffer->iface.free_buffer(buffer);
}
free(buffer);
}
size_t ggml_backend_buffer_get_alignment(ggml_backend_buffer_t buffer) {
return ggml_backend_get_alignment(buffer->backend);
}
void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer) {
return buffer->iface.get_base(buffer);
}
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
return buffer->size;
}
size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
if (buffer->iface.get_alloc_size) {
return buffer->iface.get_alloc_size(buffer, tensor);
}
return ggml_nbytes(tensor);
}
void ggml_backend_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
if (buffer->iface.init_tensor) {
buffer->iface.init_tensor(buffer, tensor);
}
}
void ggml_backend_buffer_free_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
if (buffer->iface.free_tensor) {
buffer->iface.free_tensor(buffer, tensor);
}
}
// backend
ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor) {
return tensor->buffer->backend;
}
const char * ggml_backend_name(ggml_backend_t backend) {
return backend->iface.get_name(backend);
}
void ggml_backend_free(ggml_backend_t backend) {
backend->iface.free(backend);
}
ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size) {
return backend->iface.alloc_buffer(backend, size);
}
size_t ggml_backend_get_alignment(ggml_backend_t backend) {
return backend->iface.get_alignment(backend);
}
void ggml_backend_tensor_set_async(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.set_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
}
void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.get_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
}
void ggml_backend_tensor_set(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.set_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
ggml_get_backend(tensor)->iface.synchronize(ggml_get_backend(tensor));
}
void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
ggml_get_backend(tensor)->iface.get_tensor_async(ggml_get_backend(tensor), tensor, data, offset, size);
ggml_get_backend(tensor)->iface.synchronize(ggml_get_backend(tensor));
}
void ggml_backend_synchronize(ggml_backend_t backend) {
backend->iface.synchronize(backend);
}
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
return backend->iface.graph_plan_create(backend, cgraph);
}
void ggml_backend_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
backend->iface.graph_plan_free(backend, plan);
}
void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
backend->iface.graph_plan_compute(backend, plan);
}
void ggml_backend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
backend->iface.graph_compute(backend, cgraph);
}
bool ggml_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
return backend->iface.supports_op(backend, op);
}
// backend copy
static bool ggml_are_same_layout(const struct ggml_tensor * a, const struct ggml_tensor * b) {
if (a->type != b->type) {
return false;
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (a->ne[i] != b->ne[i]) {
return false;
}
if (a->nb[i] != b->nb[i]) {
return false;
}
}
return true;
}
void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst) {
//printf("src: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", src->name, (int)src->ne[0], (int)src->ne[1], (int)src->ne[2], (int)src->ne[3], (int)src->nb[0], (int)src->nb[1], (int)src->nb[2], (int)src->nb[3]);
//printf("dst: %s ne: [%d %d %d %d] nb: [%d %d %d %d]\n", dst->name, (int)dst->ne[0], (int)dst->ne[1], (int)dst->ne[2], (int)dst->ne[3], (int)dst->nb[0], (int)dst->nb[1], (int)dst->nb[2], (int)dst->nb[3]);
GGML_ASSERT(ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts");
// printf("cpy tensor %s from %s to %s (%lu bytes)\n", src->name, ggml_backend_name(src->backend), ggml_backend_name(dst->backend), ggml_nbytes(src));
if (src == dst) {
return;
}
// TODO: allow backends to support copy to/from same backend
if (ggml_get_backend(dst)->iface.cpy_tensor_from != NULL) {
ggml_get_backend(dst)->iface.cpy_tensor_from(ggml_get_backend(dst)->context, src, dst);
} else if (ggml_get_backend(src)->iface.cpy_tensor_to != NULL) {
ggml_get_backend(src)->iface.cpy_tensor_to(ggml_get_backend(src)->context, src, dst);
} else {
// shouldn't be hit when copying from/to CPU
#ifndef NDEBUG
fprintf(stderr, "ggml_backend_tensor_copy: neither cpy_tensor_from nor cpy_tensor_to are implemented for backends %s and %s, falling back to get/set\n", ggml_backend_name(src->buffer->backend), ggml_backend_name(dst->buffer->backend));
#endif
size_t nbytes = ggml_nbytes(src);
void * data = malloc(nbytes);
ggml_backend_tensor_get(src, data, 0, nbytes);
ggml_backend_tensor_set(dst, data, 0, nbytes);
free(data);
}
}
// backend CPU
struct ggml_backend_cpu_context {
int n_threads;
void * work_data;
size_t work_size;
};
static const char * ggml_backend_cpu_name(ggml_backend_t backend) {
return "CPU";
UNUSED(backend);
}
static void ggml_backend_cpu_free(ggml_backend_t backend) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
free(cpu_ctx->work_data);
free(cpu_ctx);
free(backend);
}
static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
return (void *)buffer->context;
}
static void ggml_backend_cpu_buffer_free_buffer(ggml_backend_buffer_t buffer) {
free(buffer->context);
UNUSED(buffer);
}
static struct ggml_backend_buffer_i cpu_backend_buffer_i = {
/* .free_buffer = */ ggml_backend_cpu_buffer_free_buffer,
/* .get_base = */ ggml_backend_cpu_buffer_get_base,
/* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
/* .init_tensor = */ NULL, // no initialization required
/* .free_tensor = */ NULL, // no cleanup required
};
// for buffers from ptr, free is not called
static struct ggml_backend_buffer_i cpu_backend_buffer_i_from_ptr = {
/* .free_buffer = */ NULL, // ptr is not owned by the buffer, so it does not need to be freed
/* .get_base = */ ggml_backend_cpu_buffer_get_base,
/* .get_alloc_size = */ NULL, // defaults to ggml_nbytes
/* .init_tensor = */ NULL,
/* .free_tensor = */ NULL,
};
static const size_t TENSOR_ALIGNMENT = 64; // should be enough for AVX 512
static ggml_backend_buffer_t ggml_backend_cpu_alloc_buffer(ggml_backend_t backend, size_t size) {
size += TENSOR_ALIGNMENT; // malloc may return an address that is not aligned
void * data = malloc(size); // TODO: maybe use GGML_ALIGNED_MALLOC?
return ggml_backend_buffer_init(backend, cpu_backend_buffer_i, data, size);
}
static size_t ggml_backend_cpu_get_alignment(ggml_backend_t backend) {
return TENSOR_ALIGNMENT;
UNUSED(backend);
}
static void ggml_backend_cpu_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
memcpy((char *)tensor->data + offset, data, size);
UNUSED(backend);
}
static void ggml_backend_cpu_get_tensor_async(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor read out of bounds");
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
memcpy(data, (const char *)tensor->data + offset, size);
UNUSED(backend);
}
static void ggml_backend_cpu_synchronize(ggml_backend_t backend) {
UNUSED(backend);
}
static void ggml_backend_cpu_cpy_tensor_from(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
ggml_backend_tensor_get(src, dst->data, 0, ggml_nbytes(src));
UNUSED(backend);
}
static void ggml_backend_cpu_cpy_tensor_to(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst) {
// for a backend such as CUDA that can queue async calls, it is ok to do this asynchronously, but it may not be the case for other backends
ggml_backend_tensor_set_async(dst, src->data, 0, ggml_nbytes(src));
UNUSED(backend);
}
struct ggml_backend_plan_cpu {
struct ggml_cplan cplan;
struct ggml_cgraph cgraph;
};
static ggml_backend_graph_plan_t ggml_backend_cpu_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
struct ggml_backend_plan_cpu * cpu_plan = malloc(sizeof(struct ggml_backend_plan_cpu));
cpu_plan->cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
cpu_plan->cgraph = *cgraph;
if (cpu_plan->cplan.work_size > 0) {
cpu_plan->cplan.work_data = malloc(cpu_plan->cplan.work_size);
}
return cpu_plan;
}
static void ggml_backend_cpu_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
free(cpu_plan->cplan.work_data);
free(cpu_plan);
UNUSED(backend);
}
static void ggml_backend_cpu_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
struct ggml_backend_plan_cpu * cpu_plan = (struct ggml_backend_plan_cpu *)plan;
ggml_graph_compute(&cpu_plan->cgraph, &cpu_plan->cplan);
UNUSED(backend);
}
static void ggml_backend_cpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
struct ggml_cplan cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
if (cpu_ctx->work_size < cplan.work_size) {
// TODO: may be faster to free and use malloc to avoid the copy
cpu_ctx->work_data = realloc(cpu_ctx->work_data, cplan.work_size);
cpu_ctx->work_size = cplan.work_size;
}
cplan.work_data = cpu_ctx->work_data;
ggml_graph_compute(cgraph, &cplan);
}
static bool ggml_backend_cpu_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
return true;
UNUSED(backend);
UNUSED(op);
}
static struct ggml_backend_i cpu_backend_i = {
/* .get_name = */ ggml_backend_cpu_name,
/* .free = */ ggml_backend_cpu_free,
/* .alloc_buffer = */ ggml_backend_cpu_alloc_buffer,
/* .get_alignment = */ ggml_backend_cpu_get_alignment,
/* .set_tensor_async = */ ggml_backend_cpu_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_cpu_get_tensor_async,
/* .synchronize = */ ggml_backend_cpu_synchronize,
/* .cpy_tensor_from = */ ggml_backend_cpu_cpy_tensor_from,
/* .cpy_tensor_to = */ ggml_backend_cpu_cpy_tensor_to,
/* .graph_plan_create = */ ggml_backend_cpu_graph_plan_create,
/* .graph_plan_free = */ ggml_backend_cpu_graph_plan_free,
/* .graph_plan_compute = */ ggml_backend_cpu_graph_plan_compute,
/* .graph_compute = */ ggml_backend_cpu_graph_compute,
/* .supports_op = */ ggml_backend_cpu_supports_op,
};
ggml_backend_t ggml_backend_cpu_init(void) {
struct ggml_backend_cpu_context * ctx = malloc(sizeof(struct ggml_backend_cpu_context));
ctx->n_threads = GGML_DEFAULT_N_THREADS;
ctx->work_data = NULL;
ctx->work_size = 0;
ggml_backend_t cpu_backend = malloc(sizeof(struct ggml_backend));
*cpu_backend = (struct ggml_backend) {
/* .interface = */ cpu_backend_i,
/* .context = */ ctx
};
return cpu_backend;
}
bool ggml_backend_is_cpu(ggml_backend_t backend) {
return backend->iface.get_name == ggml_backend_cpu_name;
}
void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads) {
GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));
struct ggml_backend_cpu_context * ctx = (struct ggml_backend_cpu_context *)backend_cpu->context;
ctx->n_threads = n_threads;
}
ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size) {
return ggml_backend_buffer_init(backend_cpu, cpu_backend_buffer_i_from_ptr, ptr, size);
}
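A backend is a struct of function pointers (ggml_backend_i) plus an opaque context, and helpers such as ggml_backend_graph_compute simply dispatch through that table. A rough Go analogue of the same pattern using an interface, with illustrative names that are not the ggml API:

```
package main

import "fmt"

// Backend is an illustrative Go counterpart of struct ggml_backend_i:
// the concrete implementation is hidden behind a uniform call surface.
type Backend interface {
	Name() string
	GraphCompute(graph string)
}

// cpuBackend plays the role of the CPU backend: it carries its own
// context (here just a thread count) behind the interface.
type cpuBackend struct {
	nThreads int
}

func (b *cpuBackend) Name() string { return "CPU" }

func (b *cpuBackend) GraphCompute(graph string) {
	fmt.Printf("computing %q on %s with %d threads\n", graph, b.Name(), b.nThreads)
}

func main() {
	var backend Backend = &cpuBackend{nThreads: 4}
	backend.GraphCompute("llama")
}
```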

169
runner/ggml-backend.h Normal file
View File

@@ -0,0 +1,169 @@
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#pragma once
#include "ggml.h"
#ifdef __cplusplus
extern "C" {
#endif
struct ggml_backend;
struct ggml_backend_buffer;
// type-erased backend-specific types / wrappers
typedef void * ggml_backend_context_t;
typedef void * ggml_backend_graph_plan_t;
typedef void * ggml_backend_buffer_context_t;
// avoid accessing internals of these types
typedef struct ggml_backend * ggml_backend_t;
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
//
// backend buffer
//
struct ggml_backend_buffer_i {
void (*free_buffer) (ggml_backend_buffer_t buffer);
void * (*get_base) (ggml_backend_buffer_t buffer); // get base pointer
size_t (*get_alloc_size)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-allocation callback
void (*init_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // post-allocation callback
void (*free_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor); // pre-free callback
};
// TODO: hide behind API
struct ggml_backend_buffer {
struct ggml_backend_buffer_i iface;
ggml_backend_t backend;
ggml_backend_buffer_context_t context;
size_t size;
};
// backend buffer functions
GGML_API ggml_backend_buffer_t ggml_backend_buffer_init(
struct ggml_backend * backend,
struct ggml_backend_buffer_i iface,
ggml_backend_buffer_context_t context,
size_t size);
GGML_API void ggml_backend_buffer_free (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer);
GGML_API void * ggml_backend_buffer_get_base (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_size (ggml_backend_buffer_t buffer);
GGML_API size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
GGML_API void ggml_backend_buffer_init_tensor (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
GGML_API void ggml_backend_buffer_free_tensor (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
//
// backend
//
struct ggml_backend_i {
const char * (*get_name)(ggml_backend_t backend);
void (*free)(ggml_backend_t backend);
// buffer allocation
ggml_backend_buffer_t (*alloc_buffer)(ggml_backend_t backend, size_t size);
// get buffer alignment
size_t (*get_alignment)(ggml_backend_t backend);
// tensor data access
// these functions can be asynchronous; helper functions that automatically call synchronize are provided for synchronous access
void (*set_tensor_async)(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
void (*get_tensor_async)(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
void (*synchronize) (ggml_backend_t backend);
// (optional) copy tensor between different backends, allow for single-copy tranfers
void (*cpy_tensor_from)(ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
void (*cpy_tensor_to) (ggml_backend_t backend, struct ggml_tensor * src, struct ggml_tensor * dst);
// compute graph with a plan
ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, struct ggml_cgraph * cgraph);
void (*graph_plan_free) (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
void (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
// compute graph without a plan
void (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
// check if the backend supports an operation
bool (*supports_op)(ggml_backend_t backend, const struct ggml_tensor * op);
};
// TODO: hide behind API
struct ggml_backend {
struct ggml_backend_i iface;
ggml_backend_context_t context;
};
// backend helper functions
GGML_API ggml_backend_t ggml_get_backend(const struct ggml_tensor * tensor);
GGML_API const char * ggml_backend_name(ggml_backend_t backend);
GGML_API void ggml_backend_free(ggml_backend_t backend);
GGML_API ggml_backend_buffer_t ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size);
GGML_API size_t ggml_backend_get_alignment(ggml_backend_t backend);
GGML_API void ggml_backend_tensor_set_async( struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_get_async(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_set( struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
GGML_API void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
GGML_API void ggml_backend_synchronize(ggml_backend_t backend);
GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create (ggml_backend_t backend, struct ggml_cgraph * cgraph);
GGML_API void ggml_backend_graph_plan_free (ggml_backend_t backend, ggml_backend_graph_plan_t plan);
GGML_API void ggml_backend_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
GGML_API void ggml_backend_graph_compute (ggml_backend_t backend, struct ggml_cgraph * cgraph);
GGML_API bool ggml_backend_supports_op (ggml_backend_t backend, const struct ggml_tensor * op);
// tensor copy between different backends
GGML_API void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst);
//
// CPU backend
//
GGML_API ggml_backend_t ggml_backend_cpu_init(void);
GGML_API bool ggml_backend_is_cpu(ggml_backend_t backend);
GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
GGML_API ggml_backend_buffer_t ggml_backend_cpu_buffer_from_ptr(ggml_backend_t backend_cpu, void * ptr, size_t size);
#ifdef __cplusplus
}
#endif
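
The synchronous `ggml_backend_tensor_set`/`ggml_backend_tensor_get` helpers declared above wrap the asynchronous interface callbacks and synchronize as needed. A minimal sketch of the host-to-backend data path, assuming `t_in` and `t_out` are tensors placed in a buffer allocated from `backend`, and `input`/`output` are host arrays of matching size:

```
#include "ggml.h"
#include "ggml-backend.h"

static void feed_and_read(ggml_backend_t backend, struct ggml_cgraph * gf,
                          struct ggml_tensor * t_in, struct ggml_tensor * t_out,
                          const float * input, float * output) {
    // copy host data into the backend's buffer (synchronous helper)
    ggml_backend_tensor_set(t_in, input, 0, ggml_nbytes(t_in));
    // run the graph on whichever backend was passed in
    ggml_backend_graph_compute(backend, gf);
    // copy the result back into host memory
    ggml_backend_tensor_get(t_out, output, 0, ggml_nbytes(t_out));
}
```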

7850
runner/ggml-cuda.cu Normal file

File diff suppressed because it is too large

77
runner/ggml-cuda.h Normal file

@@ -0,0 +1,77 @@
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
#ifdef GGML_USE_HIPBLAS
#define GGML_CUDA_NAME "ROCm"
#define GGML_CUBLAS_NAME "hipBLAS"
#else
#define GGML_CUDA_NAME "CUDA"
#define GGML_CUBLAS_NAME "cuBLAS"
#endif
#ifdef __cplusplus
extern "C" {
#endif
#define GGML_CUDA_MAX_DEVICES 16
GGML_API void ggml_init_cublas(void);
GGML_API void * ggml_cuda_host_malloc(size_t size);
GGML_API void ggml_cuda_host_free(void * ptr);
GGML_API bool ggml_cuda_can_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);
GGML_API void ggml_cuda_set_tensor_split(const float * tensor_split);
GGML_API void ggml_cuda_transform_tensor(void * data, struct ggml_tensor * tensor);
GGML_API void ggml_cuda_free_data(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_assign_buffers(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_assign_buffers_no_scratch(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_assign_buffers_force_inplace(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_assign_buffers_no_alloc(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_assign_scratch_offset(struct ggml_tensor * tensor, size_t offset);
GGML_API void ggml_cuda_copy_to_device(struct ggml_tensor * tensor);
GGML_API void ggml_cuda_set_main_device(int main_device);
GGML_API void ggml_cuda_set_mul_mat_q(bool mul_mat_q);
GGML_API void ggml_cuda_set_scratch_size(size_t scratch_size);
GGML_API void ggml_cuda_free_scratch(void);
GGML_API bool ggml_cuda_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor);
GGML_API int ggml_cuda_get_device_count(void);
GGML_API void ggml_cuda_get_device_description(int device, char * description, size_t description_size);
// backend API
GGML_API ggml_backend_t ggml_backend_cuda_init(void); // TODO: take a list of devices to use
#ifdef __cplusplus
}
#endif
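
A hedged sketch of selecting a device and creating the CUDA backend from the declarations above, assuming the runner was built with cuBLAS (or hipBLAS) support; picking device 0 as the main device is only an illustrative choice.

```
#include "ggml-cuda.h"
#include <stddef.h>

static ggml_backend_t init_gpu_backend(void) {
    if (ggml_cuda_get_device_count() == 0) {
        return NULL; // no usable GPU; the caller can fall back to ggml_backend_cpu_init()
    }
    ggml_cuda_set_main_device(0);    // put single-GPU work on device 0
    return ggml_backend_cuda_init(); // same ggml_backend_t interface as the CPU backend
}
```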

134
runner/ggml-metal.h Normal file

@@ -0,0 +1,134 @@
//go:build darwin
/**
* llama.cpp - git 465219b9143ac01db0990bbcb0a081ef72ec2008
*
* MIT License
*
* Copyright (c) 2023 Georgi Gerganov
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
// An interface for computing a ggml_cgraph with Metal
//
// This is a fully functional interface that extends ggml with GPU support for Apple devices.
// A similar interface can be created for other GPU backends (e.g. Vulkan, CUDA, OpenCL, etc.)
//
// How does it work? (a usage sketch follows at the end of this header)
//
// As long as your program can create and evaluate a ggml_cgraph on the CPU, you can use this
// interface to evaluate the same graph on the GPU. Instead of using ggml_graph_compute(), you
// use ggml_metal_graph_compute() (or ggml_vulkan_graph_compute(), etc.)
//
// You only need to make sure that all memory buffers that you used during the graph creation
// are mapped to the device memory with the ggml_metal_add_buffer() function. This mapping is
// used during the graph evaluation to determine the arguments of the compute kernels.
//
// Synchronization between device and host memory (for example for input and output tensors)
// is done with the ggml_metal_set_tensor() and ggml_metal_get_tensor() functions.
//
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
#include <stddef.h>
#include <stdbool.h>
// max memory buffers that can be mapped to the device
#define GGML_METAL_MAX_BUFFERS 16
#define GGML_METAL_MAX_COMMAND_BUFFERS 32
struct ggml_tensor;
struct ggml_cgraph;
#ifdef __cplusplus
extern "C" {
#endif
//
// internal API
// temporary exposed to user-code
//
struct ggml_metal_context;
void ggml_metal_log_set_callback(ggml_log_callback log_callback, void * user_data);
// number of command buffers to use
struct ggml_metal_context * ggml_metal_init(int n_cb);
void ggml_metal_free(struct ggml_metal_context * ctx);
void * ggml_metal_host_malloc(size_t n);
void ggml_metal_host_free (void * data);
// set the number of command buffers to use
void ggml_metal_set_n_cb(struct ggml_metal_context * ctx, int n_cb);
// creates a mapping between a host memory buffer and a device memory buffer
// - make sure to map all buffers used in the graph before calling ggml_metal_graph_compute
// - the mapping is used during computation to determine the arguments of the compute kernels
// - you don't need to keep the host memory buffer allocated as it is never accessed by Metal
// - max_size specifies the maximum size of a tensor and is used to create shared views such
// that it is guaranteed that the tensor will fit in at least one of the views
//
bool ggml_metal_add_buffer(
struct ggml_metal_context * ctx,
const char * name,
void * data,
size_t size,
size_t max_size);
// set data from host memory into the device
void ggml_metal_set_tensor(struct ggml_metal_context * ctx, struct ggml_tensor * t);
// get data from the device into host memory
void ggml_metal_get_tensor(struct ggml_metal_context * ctx, struct ggml_tensor * t);
// try to find operations that can be run concurrently in the graph
// you should run it again if the topology of your graph changes
void ggml_metal_graph_find_concurrency(struct ggml_metal_context * ctx, struct ggml_cgraph * gf, bool check_mem);
// if the graph has been optimized for concurrent dispatch, return the length of the concur_list
int ggml_metal_if_optimized(struct ggml_metal_context * ctx);
// output the concur_list for ggml_alloc
int * ggml_metal_get_concur_list(struct ggml_metal_context * ctx);
// same as ggml_graph_compute but uses Metal
// creates gf->n_threads command buffers in parallel
void ggml_metal_graph_compute(struct ggml_metal_context * ctx, struct ggml_cgraph * gf);
//
// backend API
// user-code should use only these functions
//
GGML_API ggml_backend_t ggml_backend_metal_init(void);
GGML_API bool ggml_backend_is_metal(ggml_backend_t backend);
GGML_API void ggml_backend_metal_set_n_cb(ggml_backend_t backend, int n_cb);
#ifdef __cplusplus
}
#endif
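
The flow described in the header comments (build the graph on the host, map every buffer it touches, then evaluate on the GPU) might look roughly like the sketch below; `gf`, `buf`, `buf_size`, `max_tensor_size`, and `out` are illustrative assumptions, not names from this diff.

```
#include "ggml.h"
#include "ggml-metal.h"

static bool eval_on_metal(struct ggml_cgraph * gf, void * buf, size_t buf_size,
                          size_t max_tensor_size, struct ggml_tensor * out) {
    struct ggml_metal_context * ctx = ggml_metal_init(1); // one command buffer
    if (ctx == NULL) {
        return false;
    }
    // map the host allocation into device memory before computing
    if (!ggml_metal_add_buffer(ctx, "data", buf, buf_size, max_tensor_size)) {
        ggml_metal_free(ctx);
        return false;
    }
    ggml_metal_graph_compute(ctx, gf); // evaluate the same graph, now on the GPU
    ggml_metal_get_tensor(ctx, out);   // copy the result back to host memory
    ggml_metal_free(ctx);
    return true;
}
```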

1670
runner/ggml-metal.m Normal file

File diff suppressed because it is too large

Some files were not shown because too many files have changed in this diff