add copy command

change push to chunked uploads from monolithic (#179)
Merge pull request #164 from jmorganca/restart-server
2023-07-24 10:55:38 -04:00 · 2023-07-22 17:31:26 -07:00 · 2023-07-22 18:19:22 -04:00 · 2023-07-22 09:40:37 -07:00 · 2023-07-22 09:40:01 -07:00 · 2023-07-22 09:39:08 -07:00
78 changed files with 7000 additions and 1202 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,5 +2,6 @@
 .vscode
 .env
 .venv
+.swp
 dist
 ollama
--- a/README.md
+++ b/README.md
@@ -1,25 +1,50 @@
 <div align="center">
  <picture>
-    <source media="(prefers-color-scheme: dark)" height="200px" srcset="https://github.com/jmorganca/ollama/assets/3325447/318048d2-b2dd-459c-925a-ac8449d5f02c">
-    <img alt="logo" height="200px" src="https://github.com/jmorganca/ollama/assets/3325447/c7d6e15f-7f4d-4776-b568-c084afa297c2">
+    <source media="(prefers-color-scheme: dark)" height="200px" srcset="https://github.com/jmorganca/ollama/assets/3325447/56ea1849-1284-4645-8970-956de6e51c3c">
+    <img alt="logo" height="200px" src="https://github.com/jmorganca/ollama/assets/3325447/0d0b44e2-8f4a-4e99-9b52-a5c1c741c8f7">
  </picture>
 </div>

 # Ollama

-Create, run, and share self-contained large language models (LLMs). Ollama bundles a model’s weights, configuration, prompts, and more into self-contained packages that run anywhere.
+[![Discord](https://dcbadge.vercel.app/api/server/ollama?style=flat&compact=true)](https://discord.gg/ollama)

 > Note: Ollama is in early preview. Please report any issues you find.

+Run, create, and share large language models (LLMs).
+
 ## Download

 - [Download](https://ollama.ai/download) for macOS on Apple Silicon (Intel coming soon)
 - Download for Windows and Linux (coming soon)
 - Build [from source](#building)

+## Quickstart
+
+To run and chat with [Llama 2](https://ai.meta.com/llama), the new model by Meta:
+
+```
+ollama run llama2
+```
+
+## Model library
+
+`ollama` includes a library of open-source models:
+
+| Model                    | Parameters | Size  | Download                    |
+| ------------------------ | ---------- | ----- | --------------------------- |
+| Llama2                   | 7B         | 3.8GB | `ollama pull llama2`        |
+| Llama2 13B               | 13B        | 7.3GB | `ollama pull llama2:13b`    |
+| Orca Mini                | 3B         | 1.9GB | `ollama pull orca`          |
+| Vicuna                   | 7B         | 3.8GB | `ollama pull vicuna`        |
+| Nous-Hermes              | 13B        | 7.3GB | `ollama pull nous-hermes`   |
+| Wizard Vicuna Uncensored | 13B        | 7.3GB | `ollama pull wizard-vicuna` |
+
+> Note: You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
+
 ## Examples

-### Quickstart
+### Run a model

 ```
 ollama run llama2
@@ -27,17 +52,25 @@ ollama run llama2
 Hello! How can I help you today?
 ```

-### Creating a custom model
+### Create a custom model
+
+Pull a base model:
+
+```
+ollama pull llama2
+```

 Create a `Modelfile`:

 ```
 FROM llama2
-PROMPT """
-You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.

-User: {{ .Prompt }}
-Mario:
+# set the temperature to 1 [higher is more creative, lower is more coherent]
+PARAMETER temperature 1
+
+# set the system prompt
+SYSTEM """
+You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
 """
 ```

@@ -50,16 +83,30 @@ ollama run mario
 Hello! It's your friend Mario.
 ```

-## Model library
+For more examples, see the [examples](./examples) directory.

-Ollama includes a library of open-source, pre-trained models. More models are coming soon.
+### Pull a model from the registry

-| Model       | Parameters | Size  | Download                  |
-| ----------- | ---------- | ----- | ------------------------- |
-| Llama2      | 7B         | 3.8GB | `ollama pull llama2`      |
-| Orca Mini   | 3B         | 1.9GB | `ollama pull orca`        |
-| Vicuna      | 7B         | 3.8GB | `ollama pull vicuna`      |
-| Nous-Hermes | 13B         | 7.3GB | `ollama pull nous-hermes` |
+```
+ollama pull orca
+```
+
+### List local models
+
+```
+ollama list
+```
+
+## Model packages
+
+### Overview
+
+Ollama bundles model weights, configuration, and data into a single package, defined by a [Modelfile](./docs/modelfile.md).
+
+<picture>
+  <source media="(prefers-color-scheme: dark)" height="480" srcset="https://github.com/jmorganca/ollama/assets/251292/2fd96b5f-191b-45c1-9668-941cfad4eb70">
+  <img alt="logo" height="480" src="https://github.com/jmorganca/ollama/assets/251292/2fd96b5f-191b-45c1-9668-941cfad4eb70">
+</picture>

 ## Building

@@ -70,7 +117,7 @@ go build .
 To run it start the server:

 ```
-./ollama server &
+./ollama serve &
 ```

 Finally, run a model!
@@ -78,3 +125,13 @@ Finally, run a model!
 ```
 ./ollama run llama2
 ```
+
+## REST API
+
+### `POST /api/generate`
+
+Generate text from a model.
+
+```
+curl -X POST http://localhost:11434/api/generate -d '{"model": "llama2", "prompt":"Why is the sky blue?"}'
+```
--- a/api/client.go
+++ b/api/client.go
@@ -27,7 +27,7 @@ func checkError(resp *http.Response, body []byte) error {
 	err := json.Unmarshal(body, &apiError)
 	if err != nil {
 		// Use the full body as the message if we fail to decode a response.
-		apiError.Message = string(body)
+		apiError.ErrorMessage = string(body)
 	}

 	return apiError
@@ -92,7 +92,6 @@ func (c *Client) do(ctx context.Context, method, path string, reqData, respData
 		}
 	}
 	return nil
-
 }

 func (c *Client) stream(ctx context.Context, method, path string, data any, fn func([]byte) error) error {
@@ -131,11 +130,15 @@ func (c *Client) stream(ctx context.Context, method, path string, data any, fn f
 			return fmt.Errorf("unmarshal: %w", err)
 		}

+		if errorResponse.Error != "" {
+			return fmt.Errorf("stream: %s", errorResponse.Error)
+		}
+
 		if response.StatusCode >= 400 {
 			return StatusError{
-				StatusCode: response.StatusCode,
-				Status:     response.Status,
-				Message:    errorResponse.Error,
+				StatusCode:   response.StatusCode,
+				Status:       response.Status,
+				ErrorMessage: errorResponse.Error,
 			}
 		}

@@ -206,3 +209,17 @@ func (c *Client) List(ctx context.Context) (*ListResponse, error) {
 	}
 	return &lr, nil
 }
+
+func (c *Client) Copy(ctx context.Context, req *CopyRequest) error {
+	if err := c.do(ctx, http.MethodPost, "/api/copy", req, nil); err != nil {
+		return err
+	}
+	return nil
+}
+
+func (c *Client) Delete(ctx context.Context, req *DeleteRequest) error {
+	if err := c.do(ctx, http.MethodDelete, "/api/delete", req, nil); err != nil {
+		return err
+	}
+	return nil
+}
--- a/api/types.go
+++ b/api/types.go
@@ -8,16 +8,23 @@ import (
 )

 type StatusError struct {
-	StatusCode int
-	Status     string
-	Message    string
+	StatusCode   int
+	Status       string
+	ErrorMessage string `json:"error"`
 }

 func (e StatusError) Error() string {
-	if e.Message != "" {
-		return fmt.Sprintf("%s: %s", e.Status, e.Message)
+	switch {
+	case e.Status != "" && e.ErrorMessage != "":
+		return fmt.Sprintf("%s: %s", e.Status, e.ErrorMessage)
+	case e.Status != "":
+		return e.Status
+	case e.ErrorMessage != "":
+		return e.ErrorMessage
+	default:
+		// this should not happen
+		return "something went wrong, please see the ollama server logs for details"
 	}
-	return e.Status
 }

 type GenerateRequest struct {
@@ -37,21 +44,32 @@ type CreateProgress struct {
 	Status string `json:"status"`
 }

+type DeleteRequest struct {
+	Name string `json:"name"`
+}
+
+type CopyRequest struct {
+	Source      string `json:"source"`
+	Destination string `json:"destination"`
+}
+
 type PullRequest struct {
 	Name     string `json:"name"`
+	Insecure bool   `json:"insecure,omitempty"`
 	Username string `json:"username"`
 	Password string `json:"password"`
 }

 type ProgressResponse struct {
-	Status    string  `json:"status"`
-	Digest    string  `json:"digest,omitempty"`
-	Total     int     `json:"total,omitempty"`
-	Completed int     `json:"completed,omitempty"`
+	Status    string `json:"status"`
+	Digest    string `json:"digest,omitempty"`
+	Total     int    `json:"total,omitempty"`
+	Completed int    `json:"completed,omitempty"`
 }

 type PushRequest struct {
 	Name     string `json:"name"`
+	Insecure bool   `json:"insecure,omitempty"`
 	Username string `json:"username"`
 	Password string `json:"password"`
 }
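The reworked `StatusError.Error()` picks its message by checking which fields are set. A standalone sketch of that fallback order, with the type copied from the hunk above and illustrative (made-up) sample values, might look like this:

```go
package main

import "fmt"

// StatusError mirrors the struct in the diff; it is redeclared here only
// so the sketch compiles on its own.
type StatusError struct {
	StatusCode   int
	Status       string
	ErrorMessage string `json:"error"`
}

// Error prefers "status: message", then falls back to whichever field is
// non-empty, and finally to a generic hint.
func (e StatusError) Error() string {
	switch {
	case e.Status != "" && e.ErrorMessage != "":
		return fmt.Sprintf("%s: %s", e.Status, e.ErrorMessage)
	case e.Status != "":
		return e.Status
	case e.ErrorMessage != "":
		return e.ErrorMessage
	default:
		return "something went wrong, please see the ollama server logs for details"
	}
}

func main() {
	// Sample values are illustrative, not taken from real server output.
	cases := []StatusError{
		{StatusCode: 404, Status: "404 Not Found", ErrorMessage: "model not found"},
		{StatusCode: 500, Status: "500 Internal Server Error"},
		{ErrorMessage: "connection reset"},
		{},
	}
	for _, c := range cases {
		fmt.Println(c.Error())
	}
}
```

Because the method has a value receiver, a `StatusError` returned as an `error` can still be recovered with `errors.As`, which is how `RunHandler` inspects pull failures.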
--- a/app/assets/ollama_outline_icon_16x16Template.png
+++ b/app/assets/ollama_outline_icon_16x16Template.png
--- a/app/assets/ollama_outline_icon_16x16Template@2x.png
+++ b/app/assets/ollama_outline_icon_16x16Template@2x.png
--- a/app/forge.config.ts
+++ b/app/forge.config.ts
@@ -21,6 +21,8 @@ const config: ForgeConfig = {
      '../ollama',
      path.join(__dirname, './assets/ollama_icon_16x16Template.png'),
      path.join(__dirname, './assets/ollama_icon_16x16Template@2x.png'),
+      path.join(__dirname, './assets/ollama_outline_icon_16x16Template.png'),
+      path.join(__dirname, './assets/ollama_outline_icon_16x16Template@2x.png'),
      ...(process.platform === 'darwin' ? ['../llama/ggml-metal.metal'] : []),
    ],
    ...(process.env.SIGN
--- a/app/package.json
+++ b/app/package.json
@@ -11,7 +11,9 @@
    "make": "electron-forge make",
    "make:sign": "SIGN=1 electron-forge make",
    "publish": "SIGN=1 electron-forge publish",
-    "lint": "eslint --ext .ts,.tsx ."
+    "lint": "eslint --ext .ts,.tsx .",
+    "format": "prettier --check . --ignore-path .gitignore",
+    "format:fix": "prettier --write . --ignore-path .gitignore"
  },
  "keywords": [],
  "author": {
--- a/app/src/declarations.d.ts
+++ b/app/src/declarations.d.ts
@@ -1,4 +1,4 @@
 declare module '*.svg' {
-  const content: string;
-  export default content;
-}
+  const content: string
+  export default content
+}
--- a/app/src/index.ts
+++ b/app/src/index.ts
@@ -1,5 +1,5 @@
 import { spawn } from 'child_process'
-import { app, autoUpdater, dialog, Tray, Menu, BrowserWindow } from 'electron'
+import { app, autoUpdater, dialog, Tray, Menu, BrowserWindow, nativeTheme } from 'electron'
 import Store from 'electron-store'
 import winston from 'winston'
 import 'winston-daily-rotate-file'
@@ -66,14 +66,30 @@ function firstRunWindow() {
 }

 function createSystemtray() {
-  let iconPath = path.join(__dirname, '..', '..', 'assets', 'ollama_icon_16x16Template.png')
+  let iconPath = nativeTheme.shouldUseDarkColors
+    ? path.join(__dirname, '..', '..', 'assets', 'ollama_icon_16x16Template.png')
+    : path.join(__dirname, '..', '..', 'assets', 'ollama_outline_icon_16x16Template.png')

  if (app.isPackaged) {
-    iconPath = path.join(process.resourcesPath, 'ollama_icon_16x16Template.png')
+    iconPath = nativeTheme.shouldUseDarkColors
+      ? path.join(process.resourcesPath, 'ollama_icon_16x16Template.png')
+      : path.join(process.resourcesPath, 'ollama_outline_icon_16x16Template.png')
  }

  tray = new Tray(iconPath)

+  nativeTheme.on('updated', function theThemeHasChanged() {
+    if (nativeTheme.shouldUseDarkColors) {
+      app.isPackaged
+        ? tray.setImage(path.join(process.resourcesPath, 'ollama_icon_16x16Template.png'))
+        : tray.setImage(path.join(__dirname, '..', '..', 'assets', 'ollama_icon_16x16Template.png'))
+    } else {
+      app.isPackaged
+        ? tray.setImage(path.join(process.resourcesPath, 'ollama_outline_icon_16x16Template.png'))
+        : tray.setImage(path.join(__dirname, '..', '..', 'assets', 'ollama_outline_icon_16x16Template.png'))
+    }
+  })
+
  const contextMenu = Menu.buildFromTemplate([{ role: 'quit', label: 'Quit Ollama', accelerator: 'Command+Q' }])

  tray.setContextMenu(contextMenu)
@@ -100,8 +116,7 @@ function server() {
  })

  function restart() {
-    logger.info('Restarting the server...')
-    server()
+    setTimeout(server, 3000)
  }

  proc.on('exit', restart)
--- a/app/src/install.ts
+++ b/app/src/install.ts
@@ -13,7 +13,9 @@ export function installed() {
 }

 export async function install() {
-  const command = `do shell script "ln -F -s ${ollama} ${symlinkPath}" with administrator privileges`
+  const command = `do shell script "mkdir -p ${path.dirname(
+    symlinkPath
+  )} && ln -F -s ${ollama} ${symlinkPath}" with administrator privileges`

  try {
    await exec(`osascript -e '${command}'`)
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -5,6 +5,7 @@ import (
 	"context"
 	"errors"
 	"fmt"
+	"io"
 	"log"
 	"net"
 	"net/http"
@@ -13,18 +14,18 @@ import (
 	"strings"
 	"time"

+	"github.com/chzyer/readline"
 	"github.com/dustin/go-humanize"
 	"github.com/olekukonko/tablewriter"
-	"github.com/schollz/progressbar/v3"
 	"github.com/spf13/cobra"
-	"golang.org/x/term"

 	"github.com/jmorganca/ollama/api"
 	"github.com/jmorganca/ollama/format"
+	"github.com/jmorganca/ollama/progressbar"
 	"github.com/jmorganca/ollama/server"
 )

-func create(cmd *cobra.Command, args []string) error {
+func CreateHandler(cmd *cobra.Command, args []string) error {
 	filename, _ := cmd.Flags().GetString("file")
 	filename, err := filepath.Abs(filename)
 	if err != nil {
@@ -58,7 +59,7 @@ func create(cmd *cobra.Command, args []string) error {
 	return nil
 }

-func RunRun(cmd *cobra.Command, args []string) error {
+func RunHandler(cmd *cobra.Command, args []string) error {
 	mp := server.ParseModelPath(args[0])
 	fp, err := mp.GetManifestPath(false)
 	if err != nil {
@@ -68,7 +69,7 @@ func RunRun(cmd *cobra.Command, args []string) error {
 	_, err = os.Stat(fp)
 	switch {
 	case errors.Is(err, os.ErrNotExist):
-		if err := pull(args[0]); err != nil {
+		if err := pull(args[0], false); err != nil {
 			var apiStatusError api.StatusError
 			if !errors.As(err, &apiStatusError) {
 				return err
@@ -85,12 +86,33 @@ func RunRun(cmd *cobra.Command, args []string) error {
 	return RunGenerate(cmd, args)
 }

-func push(cmd *cobra.Command, args []string) error {
+func PushHandler(cmd *cobra.Command, args []string) error {
 	client := api.NewClient()

-	request := api.PushRequest{Name: args[0]}
+	insecure, err := cmd.Flags().GetBool("insecure")
+	if err != nil {
+		return err
+	}
+
+	var currentDigest string
+	var bar *progressbar.ProgressBar
+
+	request := api.PushRequest{Name: args[0], Insecure: insecure}
 	fn := func(resp api.ProgressResponse) error {
-		fmt.Println(resp.Status)
+		if resp.Digest != currentDigest && resp.Digest != "" {
+			currentDigest = resp.Digest
+			bar = progressbar.DefaultBytes(
+				int64(resp.Total),
+				fmt.Sprintf("pushing %s...", resp.Digest[7:19]),
+			)
+
+			bar.Set(resp.Completed)
+		} else if resp.Digest == currentDigest && resp.Digest != "" {
+			bar.Set(resp.Completed)
+		} else {
+			currentDigest = ""
+			fmt.Println(resp.Status)
+		}
 		return nil
 	}

@@ -100,7 +122,7 @@ func push(cmd *cobra.Command, args []string) error {
 	return nil
 }

-func list(cmd *cobra.Command, args []string) error {
+func ListHandler(cmd *cobra.Command, args []string) error {
 	client := api.NewClient()

 	models, err := client.List(context.Background())
@@ -111,7 +133,9 @@ func list(cmd *cobra.Command, args []string) error {
 	var data [][]string

 	for _, m := range models.Models {
-		data = append(data, []string{m.Name, humanize.Bytes(uint64(m.Size)), format.HumanTime(m.ModifiedAt, "Never")})
+		if len(args) == 0 || strings.HasPrefix(m.Name, args[0]) {
+			data = append(data, []string{m.Name, humanize.Bytes(uint64(m.Size)), format.HumanTime(m.ModifiedAt, "Never")})
+		}
 	}

 	table := tablewriter.NewWriter(os.Stdout)
@@ -128,17 +152,44 @@ func list(cmd *cobra.Command, args []string) error {
 	return nil
 }

-func RunPull(cmd *cobra.Command, args []string) error {
-	return pull(args[0])
+func DeleteHandler(cmd *cobra.Command, args []string) error {
+	client := api.NewClient()
+
+	req := api.DeleteRequest{Name: args[0]}
+	if err := client.Delete(context.Background(), &req); err != nil {
+		return err
+	}
+	fmt.Printf("deleted '%s'\n", args[0])
+	return nil
 }

-func pull(model string) error {
+func CopyHandler(cmd *cobra.Command, args []string) error {
+	client := api.NewClient()
+
+	req := api.CopyRequest{Source: args[0], Destination: args[1]}
+	if err := client.Copy(context.Background(), &req); err != nil {
+		return err
+	}
+	fmt.Printf("copied '%s' to '%s'\n", args[0], args[1])
+	return nil
+}
+
+func PullHandler(cmd *cobra.Command, args []string) error {
+	insecure, err := cmd.Flags().GetBool("insecure")
+	if err != nil {
+		return err
+	}
+
+	return pull(args[0], insecure)
+}
+
+func pull(model string, insecure bool) error {
 	client := api.NewClient()

 	var currentDigest string
 	var bar *progressbar.ProgressBar

-	request := api.PullRequest{Name: model}
+	request := api.PullRequest{Name: model, Insecure: insecure}
 	fn := func(resp api.ProgressResponse) error {
 		if resp.Digest != currentDigest && resp.Digest != "" {
 			currentDigest = resp.Digest
@@ -169,7 +220,7 @@ func RunGenerate(cmd *cobra.Command, args []string) error {
 		return generate(cmd, args[0], strings.Join(args[1:], " "))
 	}

-	if term.IsTerminal(int(os.Stdin.Fd())) {
+	if readline.IsTerminal(int(os.Stdin.Fd())) {
 		return generateInteractive(cmd, args[0])
 	}

@@ -227,17 +278,111 @@ func generate(cmd *cobra.Command, model, prompt string) error {
 }

 func generateInteractive(cmd *cobra.Command, model string) error {
-	fmt.Print(">>> ")
-	scanner := bufio.NewScanner(os.Stdin)
-	for scanner.Scan() {
-		if err := generate(cmd, model, scanner.Text()); err != nil {
+	home, err := os.UserHomeDir()
+	if err != nil {
+		return err
+	}
+
+	completer := readline.NewPrefixCompleter(
+		readline.PcItem("/help"),
+		readline.PcItem("/list"),
+		readline.PcItem("/set",
+			readline.PcItem("history"),
+			readline.PcItem("nohistory"),
+			readline.PcItem("verbose"),
+			readline.PcItem("quiet"),
+			readline.PcItem("mode",
+				readline.PcItem("vim"),
+				readline.PcItem("emacs"),
+				readline.PcItem("default"),
+			),
+		),
+		readline.PcItem("/exit"),
+		readline.PcItem("/bye"),
+	)
+
+	usage := func() {
+		fmt.Fprintln(os.Stderr, "commands:")
+		fmt.Fprintln(os.Stderr, completer.Tree("  "))
+	}
+
+	config := readline.Config{
+		Prompt:       ">>> ",
+		HistoryFile:  filepath.Join(home, ".ollama", "history"),
+		AutoComplete: completer,
+	}
+
+	scanner, err := readline.NewEx(&config)
+	if err != nil {
+		return err
+	}
+	defer scanner.Close()
+
+	for {
+		line, err := scanner.Readline()
+		switch {
+		case errors.Is(err, io.EOF):
+			return nil
+		case errors.Is(err, readline.ErrInterrupt):
+			if line == "" {
+				return nil
+			}
+
+			continue
+		case err != nil:
 			return err
 		}

-		fmt.Print(">>> ")
-	}
+		line = strings.TrimSpace(line)

-	return nil
+		switch {
+		case strings.HasPrefix(line, "/list"):
+			args := strings.Fields(line)
+			if err := ListHandler(cmd, args[1:]); err != nil {
+				return err
+			}
+
+			continue
+		case strings.HasPrefix(line, "/set"):
+			args := strings.Fields(line)
+			if len(args) > 1 {
+				switch args[1] {
+				case "history":
+					scanner.HistoryEnable()
+					continue
+				case "nohistory":
+					scanner.HistoryDisable()
+					continue
+				case "verbose":
+					cmd.Flags().Set("verbose", "true")
+					continue
+				case "quiet":
+					cmd.Flags().Set("verbose", "false")
+					continue
+				case "mode":
+					if len(args) > 2 {
+						switch args[2] {
+						case "vim":
+							scanner.SetVimMode(true)
+							continue
+						case "emacs", "default":
+							scanner.SetVimMode(false)
+							continue
+						}
+					}
+				}
+			}
+		case line == "/help", line == "/?":
+			usage()
+			continue
+		case line == "/exit", line == "/bye":
+			return nil
+		}
+
+		if err := generate(cmd, model, line); err != nil {
+			return err
+		}
+	}
 }

 func generateBatch(cmd *cobra.Command, model string) error {
@@ -290,7 +435,7 @@ func NewCLI() *cobra.Command {
 		Use:   "create MODEL",
 		Short: "Create a model from a Modelfile",
 		Args:  cobra.MinimumNArgs(1),
-		RunE:  create,
+		RunE:  CreateHandler,
 	}

 	createCmd.Flags().StringP("file", "f", "Modelfile", "Name of the Modelfile (default \"Modelfile\")")
@@ -299,7 +444,7 @@ func NewCLI() *cobra.Command {
 		Use:   "run MODEL [PROMPT]",
 		Short: "Run a model",
 		Args:  cobra.MinimumNArgs(1),
-		RunE:  RunRun,
+		RunE:  RunHandler,
 	}

 	runCmd.Flags().Bool("verbose", false, "Show timings for response")
@@ -315,20 +460,39 @@ func NewCLI() *cobra.Command {
 		Use:   "pull MODEL",
 		Short: "Pull a model from a registry",
 		Args:  cobra.MinimumNArgs(1),
-		RunE:  RunPull,
+		RunE:  PullHandler,
 	}

+	pullCmd.Flags().Bool("insecure", false, "Use an insecure registry")
+
 	pushCmd := &cobra.Command{
 		Use:   "push MODEL",
 		Short: "Push a model to a registry",
 		Args:  cobra.MinimumNArgs(1),
-		RunE:  push,
+		RunE:  PushHandler,
 	}

+	pushCmd.Flags().Bool("insecure", false, "Use an insecure registry")
+
 	listCmd := &cobra.Command{
-		Use:   "list",
-		Short: "List models",
-		RunE:  list,
+		Use:     "list",
+		Aliases: []string{"ls"},
+		Short:   "List models",
+		RunE:    ListHandler,
+	}
+
+	copyCmd := &cobra.Command{
+		Use:   "cp",
+		Short: "Copy a model",
+		Args:  cobra.MinimumNArgs(2),
+		RunE:  CopyHandler,
+	}
+
+	deleteCmd := &cobra.Command{
+		Use:   "rm",
+		Short: "Remove a model",
+		Args:  cobra.MinimumNArgs(1),
+		RunE:  DeleteHandler,
 	}

 	rootCmd.AddCommand(
@@ -338,6 +502,8 @@ func NewCLI() *cobra.Command {
 		pullCmd,
 		pushCmd,
 		listCmd,
+		copyCmd,
+		deleteCmd,
 	)

 	return rootCmd
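The routing of slash commands in `generateInteractive` can be hard to follow inside the nested `switch`. The sketch below pulls the same decisions into a pure function so they can be exercised without a terminal. `classify` and its return strings are hypothetical and exist only for illustration; the matching rules are copied from the hunk above, including the detail that an unrecognized `/set` argument is not consumed and ends up being sent to the model as a prompt.

```go
package main

import (
	"fmt"
	"strings"
)

// classify reports how a line typed at the ">>> " prompt would be routed.
// It returns a short label instead of performing the action.
func classify(line string) string {
	line = strings.TrimSpace(line)

	switch {
	case strings.HasPrefix(line, "/list"):
		return "list"
	case strings.HasPrefix(line, "/set"):
		args := strings.Fields(line)
		if len(args) > 1 {
			switch args[1] {
			case "history", "nohistory", "verbose", "quiet":
				return "set " + args[1]
			case "mode":
				if len(args) > 2 {
					switch args[2] {
					case "vim":
						return "set mode vim"
					case "emacs", "default":
						return "set mode emacs"
					}
				}
			}
		}
		// Unrecognized /set arguments drop out of the switch, as in the diff.
	case line == "/help", line == "/?":
		return "help"
	case line == "/exit", line == "/bye":
		return "exit"
	}

	// Anything not handled above is treated as a prompt for the model.
	return "prompt"
}

func main() {
	for _, line := range []string{"/list llama", "/set mode vim", "/set bogus", "  /bye ", "Why is the sky blue?"} {
		fmt.Printf("%q -> %s\n", line, classify(line))
	}
}
```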
--- a/cmd/spinner.go
+++ b/cmd/spinner.go
@@ -5,7 +5,7 @@ import (
 	"os"
 	"time"

-	"github.com/schollz/progressbar/v3"
+	"github.com/jmorganca/ollama/progressbar"
 )

 type Spinner struct {
--- a/docs/development.md
+++ b/docs/development.md
@@ -6,6 +6,12 @@ Install required tools:
 brew install go
 ```

+Enable CGO:
+
+```
+export CGO_ENABLED=1
+```
+
 Then build ollama:

 ```
--- a/docs/modelfile.md
+++ b/docs/modelfile.md
@@ -1,80 +1,105 @@
-# Ollama Model File Reference
+# Ollama Model File

-Ollama can build models automatically by reading the instructions from a Modelfile. A Modelfile is a text document that represents the complete configuration of the Model. You can see that a Modelfile is very similar to a Dockerfile.
+> Note: this model file syntax is in development
+
+A model file is the blueprint to create and share models with Ollama.

 ## Format

-Here is the format of the Modelfile:
+The format of the Modelfile:

 ```modelfile
 # comment
 INSTRUCTION arguments
 ```

-Nothing in the file is case-sensitive. However, the convention is for instructions to be uppercase to make it easier to distinguish from the arguments.
+| Instruction       | Description                                           |
+| ----------------- | ----------------------------------------------------- |
+| `FROM` (required) | Defines the base model to use                         |
+| `PARAMETER`       | Sets the parameters for how Ollama will run the model |
+| `SYSTEM`          | Specifies the system prompt that will set the context |
+| `TEMPLATE`        | The full prompt template to be sent to the model      |
+| `LICENSE`         | Specifies the legal license                           |

-A Modelfile can include instructions in any order. But the convention is to start the Modelfile with the FROM instruction.
+## Examples

-Although the example above shows a comment starting with a hash character, any instruction that is not recognized is seen as a comment. 
+An example of a model file creating a mario blueprint:

-## FROM
+```
+FROM llama2
+# sets the temperature to 1 [higher is more creative, lower is more coherent]
+# sets the context size to 4096
+PARAMETER temperature 1
+PARAMETER num_ctx 4096

-```modelfile
-FROM <image>[:<tag>]
+# Overriding the system prompt
+SYSTEM You are Mario from super mario bros, acting as an assistant.
 ```

-This defines the base model to be used. An image can be a known image on the Ollama Hub, or a fully-qualified path to a model file on your system
+To use this:

-## PARAMETER
+1. Save it as a file (e.g. `Modelfile`)
+2. `ollama create NAME -f <location of the file, e.g. ./Modelfile>`
+3. `ollama run NAME`
+4. Start using the model!

-The PARAMETER instruction defines a parameter that can be set when the model is run. 
+## FROM (Required)

-```modelfile
+The FROM instruction defines the base model to use when creating a model.
+
+```
+FROM <model name>:<tag>
+```
+
+### Build from llama2
+
+```
+FROM llama2
+```
+
+A list of available base models:
+<https://github.com/jmorganca/ollama#model-library>
+
+### Build from a bin file
+
+```
+FROM ./ollama-model.bin
+```
+
+## PARAMETER (Optional)
+
+The `PARAMETER` instruction defines a parameter that can be set when the model is run.
+
+```
 PARAMETER <parameter> <parametervalue>
 ```

 ### Valid Parameters and Values

-| Parameter        | Description                                                                                 | Value Type | Value Range |
-| ---------------- | ------------------------------------------------------------------------------------------- | ---------- | ----------- |
-| NumCtx           |                                                                                             | int        |             |
-| NumGPU           |                                                                                             | int        |             |
-| MainGPU          |                                                                                             | int        |             |
-| LowVRAM          |                                                                                             | bool       |             |
-| F16KV            |                                                                                             | bool       |             |
-| LogitsAll        |                                                                                             | bool       |             |
-| VocabOnly        |                                                                                             | bool       |             |
-| UseMMap          |                                                                                             | bool       |             |
-| EmbeddingOnly    |                                                                                             | bool       |             |
-| RepeatLastN      |                                                                                             | int        |             |
-| RepeatPenalty    |                                                                                             | float      |             |
-| FrequencyPenalty |                                                                                             | float      |             |
-| PresencePenalty  |                                                                                             | float      |             |
-| temperature      | The temperature of the model. Higher temperatures result in more creativity in the response | float      | 0 - 1       |
-| TopK             |                                                                                             | int        |             |
-| TopP             |                                                                                             | float      |             |
-| TFSZ             |                                                                                             | float      |             |
-| TypicalP         |                                                                                             | float      |             |
-| Mirostat         |                                                                                             | int        |             |
-| MirostatTau      |                                                                                             | float      |             |
-| MirostatEta      |                                                                                             | float      |             |
-| NumThread        |                                                                                             | int |             |
+| Parameter      | Description                                                                                                                                                                                                                                             | Value Type | Example Usage      |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | ------------------ |
+| num_ctx        | Sets the size of the prompt context size length model. (Default: 2048)                                                                                                                                                                                  | int        | num_ctx 4096       |
+| temperature    | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)                                                                                                                                     | float      | temperature 0.7    |
+| top_k          | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)                                                                        | int        | top_k 40           |
+| top_p          | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)                                                                 | float      | top_p 0.9          |
+| num_gpu        | The number of GPUs to use. On macOS it defaults to 1 to enable Metal support, 0 to disable.                                                                                                                                                             | int        | num_gpu 1          |
+| repeat_last_n  | Sets how far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = ctx-size)                                                                                                                                                     | int        | repeat_last_n 64   |
+| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)                                                                     | float      | repeat_penalty 1.1 |
+| tfs_z          | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (Default: 1)                                               | float      | tfs_z 1            |
+| mirostat       | Enable Mirostat sampling for controlling perplexity. (Default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)                                                                                                                                         | int        | mirostat 0         |
+| mirostat_tau   | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0)                                                                                                         | float      | mirostat_tau 5.0   |
+| mirostat_eta   | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1)                        | float      | mirostat_eta 0.1   |
+| num_thread     | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). | int        | num_thread 8       |

+## Prompt

-## PROMPT
+The base models supplied by Ollama come with a predefined prompt template. To override the supplied system prompt, add a `SYSTEM` instruction to your Modelfile, e.g. `SYSTEM insert system prompt`.

-Prompt is a multiline instruction that defines the prompt to be used when the model is run. Typically there are 3-4 components to a prompt: System, context, user, and response.
+### Prompt Template

-```modelfile
-PROMPT """
-{{- if not .Context }}
-### System:
-You are a content marketer who needs to come up with a short but succinct tweet. Make sure to include the appropriate hashtags and links. Sometimes when appropriate, describe a meme that can be includes as well. All answers should be in the form of a tweet which has a max size of 280 characters. Every instruction will be the topic to create a tweet about.
-{{- end }}
-### Instruction:
-{{ .Prompt }}
+`TEMPLATE` sets the full prompt template to be passed into the model. It may optionally include a system prompt, user prompt, and assistant prompt. Use it to create a fully custom prompt; the syntax may be model specific.
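+
+As a minimal sketch, the following Modelfile reuses the template that ships with the `llama2` library model and pairs it with a custom system prompt. The variables `.First`, `.System`, and `.Prompt` are taken from that library template; `.First` appears to mark the first message of a session, and other models may expect a different layout.
+
+```
+FROM llama2
+TEMPLATE """
+{{- if .First }}
+<<SYS>>
+{{ .System }}
+<</SYS>>
+{{- end }}
+
+[INST] {{ .Prompt }} [/INST]
+"""
+SYSTEM """
+You are Mario from super mario bros, acting as an assistant.
+"""
+```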

-### Response:
-"""
+## Notes

-```
+- The **Modelfile is not case sensitive**. In the examples, we use uppercase for instructions to make them easier to distinguish from arguments.
+- Instructions can be in any order. In the examples, we start with the `FROM` instruction to keep them easily readable.
--- a/examples/mario
+++ b/examples/mario
@@ -1,7 +0,0 @@
-FROM llama2
-PARAMETER temperature 1
-PROMPT """
-System: You are Mario from super mario bros, acting as an assistant.
-User: {{ .Prompt }}
-Assistant:
-"""
--- a/examples/mario/Modelfile
+++ b/examples/mario/Modelfile
@@ -0,0 +1,5 @@
+FROM llama2
+PARAMETER temperature 1
+SYSTEM """
+You are Mario from Super Mario Bros, acting as an assistant.
+"""
--- a/examples/mario/logo.png
+++ b/examples/mario/logo.png
--- a/examples/mario/readme.md
+++ b/examples/mario/readme.md
@@ -0,0 +1,43 @@
+<img src="logo.png" alt="image of Italian plumber" height="200"/>
+
+# Example character: Mario
+
+This example shows how to create a basic character using Llama 2 as the base model.
+
+To run this example:
+
+1. Download the Modelfile
+2. `ollama pull llama2` to get the base model used in the Modelfile.
+3. `ollama create NAME -f ./Modelfile`
+4. `ollama run NAME`
+
+Ask it some questions like "Who are you?" or "Is Peach in trouble again?"
+
+## Editing this file
+
+The Modelfile looks like this:
+
+```
+FROM llama2
+PARAMETER temperature 1
+SYSTEM """
+You are Mario from Super Mario Bros, acting as an assistant.
+"""
+```
+
+What if you want to change its behaviour?
+
+- Try changing the prompt.
+- Try changing the parameters (see the [docs](https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md)).
+- Try changing the model (e.g. use `FROM wizard-vicuna` to switch to the uncensored wizard-vicuna model).
+
+Once the changes are made:
+
+1. `ollama create NAME -f ./Modelfile`
+2. `ollama run NAME`
+3. Iterate until you are happy with the results.
+
+Notes:
+
+- This example is for research purposes only. There is no affiliation with any entity.
+- When using an uncensored model, please be aware that it may generate offensive content.
--- a/examples/midjourney-prompter/Modelfile
+++ b/examples/midjourney-prompter/Modelfile
@@ -1,14 +1,8 @@
 # Modelfile for creating a Midjourney prompts from a topic
-# Run `ollama create mj -f pathtofile` and then `ollama run mj` and enter a topic
+# This prompt was adapted from the original at https://www.greataiprompts.com/guide/midjourney/best-chatgpt-prompt-for-midjourney/
+# Run `ollama create mj -f ./Modelfile` and then `ollama run mj` and enter a topic

-FROM library/nous-hermes:latest
-PROMPT """
-{{- if not .Context }}
-### System:
+FROM nous-hermes
+SYSTEM """
 Embrace your role as an AI-powered creative assistant, employing Midjourney to manifest compelling AI-generated art. I will outline a specific image concept, and in response, you must produce an exhaustive, multifaceted prompt for Midjourney, ensuring every detail of the original concept is represented in your instructions. Midjourney doesn't do well with text, so after the prompt, give me instructions that I can use to create the titles in a image editor.
-{{- end }}
-### Instruction:
-{{ .Prompt }}
-
-### Response:
-"""
+"""
--- a/examples/recipemaker/Modelfile
+++ b/examples/recipemaker/Modelfile
@@ -1,13 +1,6 @@
 # Modelfile for creating a recipe from a list of ingredients
-# Run `ollama create recipemaker -f pathtofile` and then `ollama run recipemaker` and feed it lists of ingredients to create recipes around.
-FROM library/nous-hermes:latest
-PROMPT """
-{{- if not .Context }}
-### System:
+# Run `ollama create recipemaker -f ./Modelfile` and then `ollama run recipemaker` and feed it lists of ingredients to create recipes around.
+FROM nous-hermes
+SYSTEM """
 The instruction will be a list of ingredients. You should generate a recipe that can be made in less than an hour. You can also include ingredients that most people will find in their pantry every day. The recipe should be 4 people and you should include a description of what the meal will taste like
-{{- end }}
-### Instruction:
-{{ .Prompt }}
-
-### Response:
 """
--- a/examples/tweetwriter/Modelfile
+++ b/examples/tweetwriter/Modelfile
@@ -1,14 +1,7 @@
 # Modelfile for creating a tweet from a topic
-# Run `ollama create tweetwriter -f pathtofile` and then `ollama run tweetwriter` and enter a topic 
+# Run `ollama create tweetwriter -f ./Modelfile` and then `ollama run tweetwriter` and enter a topic

-FROM library/nous-hermes:latest
-PROMPT """
-{{- if not .Context }}
-### System:
+FROM nous-hermes
+SYSTEM """
 You are a content marketer who needs to come up with a short but succinct tweet. Make sure to include the appropriate hashtags and links. Sometimes when appropriate, describe a meme that can be includes as well. All answers should be in the form of a tweet which has a max size of 280 characters. Every instruction will be the topic to create a tweet about.
-{{- end }}
-### Instruction:
-{{ .Prompt }}
-
-### Response:
-"""
+"""
--- a/go.mod
+++ b/go.mod
@@ -5,21 +5,21 @@ go 1.20
 require (
 	github.com/dustin/go-humanize v1.0.1
 	github.com/gin-gonic/gin v1.9.1
+	github.com/mattn/go-runewidth v0.0.14
+	github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db
 	github.com/olekukonko/tablewriter v0.0.5
 	github.com/spf13/cobra v1.7.0
 )

-require (
-	github.com/mattn/go-runewidth v0.0.14 // indirect
-	github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db // indirect
-	github.com/rivo/uniseg v0.2.0 // indirect
-)
+require github.com/rivo/uniseg v0.2.0 // indirect

 require (
 	dario.cat/mergo v1.0.0
 	github.com/bytedance/sonic v1.9.1 // indirect
 	github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311 // indirect
+	github.com/chzyer/readline v1.5.1
 	github.com/gabriel-vasile/mimetype v1.4.2 // indirect
+	github.com/gin-contrib/cors v1.4.0
 	github.com/gin-contrib/sse v0.1.0 // indirect
 	github.com/go-playground/locales v0.14.1 // indirect
 	github.com/go-playground/universal-translator v0.18.1 // indirect
@@ -34,7 +34,6 @@ require (
 	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
 	github.com/modern-go/reflect2 v1.0.2 // indirect
 	github.com/pelletier/go-toml/v2 v2.0.8 // indirect
-	github.com/schollz/progressbar/v3 v3.13.1
 	github.com/spf13/pflag v1.0.5 // indirect
 	github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
 	github.com/ugorji/go/codec v1.2.11 // indirect
--- a/go.sum
+++ b/go.sum
@@ -6,7 +6,14 @@ github.com/bytedance/sonic v1.9.1/go.mod h1:i736AoUSYt75HyZLoJW9ERYxcy6eaN6h4BZX
 github.com/chenzhuoyu/base64x v0.0.0-20211019084208-fb5309c8db06/go.mod h1:DH46F32mSOjUmXrMHnKwZdA8wcEefY7UVqBKYGjpdQY=
 github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311 h1:qSGYFH7+jGhDF8vLC+iwCD4WpbV1EBDSzWkJODFLams=
 github.com/chenzhuoyu/base64x v0.0.0-20221115062448-fe3a3abad311/go.mod h1:b583jCggY9gE99b6G5LEC39OIiVsWj+R97kbl5odCEk=
+github.com/chzyer/logex v1.2.1 h1:XHDu3E6q+gdHgsdTPH6ImJMIp436vR6MPtH8gP05QzM=
+github.com/chzyer/logex v1.2.1/go.mod h1:JLbx6lG2kDbNRFnfkgvh4eRJRPX1QCoOIWomwysCBrQ=
+github.com/chzyer/readline v1.5.1 h1:upd/6fQk4src78LMRzh5vItIt361/o4uq553V8B5sGI=
+github.com/chzyer/readline v1.5.1/go.mod h1:Eh+b79XXUwfKfcPLepksvw2tcLE/Ct21YObkaSkeBlk=
+github.com/chzyer/test v1.0.0 h1:p3BQDXSxOhOG0P9z6/hGnII4LGiEPOYBhs8asl/fC04=
+github.com/chzyer/test v1.0.0/go.mod h1:2JlltgoNkt4TW/z9V/IzDdFaMTM2JPIi26O1pF38GC8=
 github.com/cpuguy83/go-md2man/v2 v2.0.2/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o=
+github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
 github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
@@ -14,17 +21,25 @@ github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkp
 github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
 github.com/gabriel-vasile/mimetype v1.4.2 h1:w5qFW6JKBz9Y393Y4q372O9A7cUSequkh1Q7OhCmWKU=
 github.com/gabriel-vasile/mimetype v1.4.2/go.mod h1:zApsH/mKG4w07erKIaJPFiX0Tsq9BFQgN3qGY5GnNgA=
+github.com/gin-contrib/cors v1.4.0 h1:oJ6gwtUl3lqV0WEIwM/LxPF1QZ5qe2lGWdY2+bz7y0g=
+github.com/gin-contrib/cors v1.4.0/go.mod h1:bs9pNM0x/UsmHPBWT2xZz9ROh8xYjYkiURUfmBoMlcs=
 github.com/gin-contrib/sse v0.1.0 h1:Y/yl/+YNO8GZSjAhjMsSuLt29uWRFHdHYUb5lYOV9qE=
 github.com/gin-contrib/sse v0.1.0/go.mod h1:RHrZQHXnP2xjPF+u1gW/2HnVO7nvIa9PG3Gm+fLHvGI=
+github.com/gin-gonic/gin v1.8.1/go.mod h1:ji8BvRH1azfM+SYow9zQ6SZMvR8qOMZHmsCuWR9tTTk=
 github.com/gin-gonic/gin v1.9.1 h1:4idEAncQnU5cB7BeOkPtxjfCSye0AAm1R0RVIqJ+Jmg=
 github.com/gin-gonic/gin v1.9.1/go.mod h1:hPrL7YrpYKXt5YId3A/Tnip5kqbEAP+KLuI3SUcPTeU=
+github.com/go-playground/assert/v2 v2.0.1/go.mod h1:VDjEfimB/XKnb+ZQfWdccd7VUvScMdVu0Titje2rxJ4=
 github.com/go-playground/assert/v2 v2.2.0 h1:JvknZsQTYeFEAhQwI4qEt9cyV5ONwRHC+lYKSsYSR8s=
+github.com/go-playground/locales v0.14.0/go.mod h1:sawfccIbzZTqEDETgFXqTho0QybSa7l++s0DH+LDiLs=
 github.com/go-playground/locales v0.14.1 h1:EWaQ/wswjilfKLTECiXz7Rh+3BjFhfDFKv/oXslEjJA=
 github.com/go-playground/locales v0.14.1/go.mod h1:hxrqLVvrK65+Rwrd5Fc6F2O76J/NuW9t0sjnWqG1slY=
+github.com/go-playground/universal-translator v0.18.0/go.mod h1:UvRDBj+xPUEGrFYl+lu/H90nyDXpg0fqeB/AQUGNTVA=
 github.com/go-playground/universal-translator v0.18.1 h1:Bcnm0ZwsGyWbCzImXv+pAJnYK9S473LQFuzCbDbfSFY=
 github.com/go-playground/universal-translator v0.18.1/go.mod h1:xekY+UJKNuX9WP91TpwSH2VMlDf28Uj24BCp08ZFTUY=
+github.com/go-playground/validator/v10 v10.10.0/go.mod h1:74x4gJWsvQexRdW8Pn3dXSGrTK4nAUsbPlLADvpJkos=
 github.com/go-playground/validator/v10 v10.14.0 h1:vgvQWe3XCz3gIeFDm/HnTIbj6UGmg/+t63MyGU2n5js=
 github.com/go-playground/validator/v10 v10.14.0/go.mod h1:9iXMNT7sEkjXb0I+enO7QXmzG6QCsPWY4zveKFVRSyU=
+github.com/goccy/go-json v0.9.7/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
 github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
 github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
 github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
@@ -36,13 +51,21 @@ github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2
 github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=
 github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM=
 github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo=
-github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213/go.mod h1:vNUNkEQ1e29fT/6vq2aBdFsgNPmy8qMdSay1npru+Sw=
 github.com/klauspost/cpuid/v2 v2.0.9/go.mod h1:FInQzS24/EEf25PyTYn52gqo7WaD8xa0213Md/qVLRg=
 github.com/klauspost/cpuid/v2 v2.2.4 h1:acbojRNwl3o09bUq+yDCtZFc1aiwaAAxtcn8YkZXnvk=
 github.com/klauspost/cpuid/v2 v2.2.4/go.mod h1:RVVoqg1df56z8g3pUjL/3lE5UfnlrJX8tyFgg4nqhuY=
+github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
+github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
+github.com/kr/pretty v0.3.0 h1:WgNl7dwNpEZ6jJ9k1snq4pZsg7DOEN8hP9Xw0Tsjwk0=
+github.com/kr/pretty v0.3.0/go.mod h1:640gp4NfQd8pI5XOwp5fnNeVWj67G7CFk/SaSQn7NBk=
+github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
+github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
+github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
+github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
+github.com/leodido/go-urn v1.2.1/go.mod h1:zt4jvISO2HfUBqxjfIshjdMTYS56ZS/qv49ictyFfxY=
 github.com/leodido/go-urn v1.2.4 h1:XlAE/cm/ms7TE/VMVoduSpNBoyc2dOxHs5MZSwAN63Q=
 github.com/leodido/go-urn v1.2.4/go.mod h1:7ZrI8mTSeBSHl/UaRyKQW1qZeMgak41ANeCNaVckg+4=
-github.com/mattn/go-isatty v0.0.17/go.mod h1:kYGgaQfpe5nmfYZH+SKPsOc2e4SrIfOl2e/yFXSvRLM=
+github.com/mattn/go-isatty v0.0.14/go.mod h1:7GGIvUiUoEMVVmxf/4nioHXj79iQHKdU27kJ6hsGG94=
 github.com/mattn/go-isatty v0.0.19 h1:JITubQf0MOLdlGRuRq+jtsDlekdYPia9ZFsB8h/APPA=
 github.com/mattn/go-isatty v0.0.19/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
 github.com/mattn/go-runewidth v0.0.9/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m2gUSrubnMI=
@@ -57,15 +80,18 @@ github.com/modern-go/reflect2 v1.0.2 h1:xBagoLtFs94CBntxluKeaWgTMpvLxC4ur3nMaC9G
 github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
 github.com/olekukonko/tablewriter v0.0.5 h1:P2Ga83D34wi1o9J6Wh1mRuqd4mF/x/lgBS7N7AbDhec=
 github.com/olekukonko/tablewriter v0.0.5/go.mod h1:hPp6KlRPjbx+hW8ykQs1w3UBbZlj6HuIJcUGPhkA7kY=
+github.com/pelletier/go-toml/v2 v2.0.1/go.mod h1:r9LEWfGN8R5k0VXJ+0BkIe7MYkRdwZOjgMj2KwnJFUo=
 github.com/pelletier/go-toml/v2 v2.0.8 h1:0ctb6s9mE31h0/lhu+J6OPmVeDxJn+kYnJc2jZR9tGQ=
 github.com/pelletier/go-toml/v2 v2.0.8/go.mod h1:vuYfssBdrU2XDZ9bYydBu6t+6a6PYNcZljzZR9VXg+4=
+github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e/go.mod h1:pJLUxLENpZxwdsKMEsNbx1VGcRFpLqf3715MtcvvzbA=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/rivo/uniseg v0.2.0 h1:S1pD9weZBuJdFmowNwbpi7BJ8TNftyUImj/0WQi72jY=
 github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
+github.com/rogpeppe/go-internal v1.6.1/go.mod h1:xXDCJY+GAPziupqXw64V24skbSoqbTEfhy4qGm1nDQc=
+github.com/rogpeppe/go-internal v1.8.0 h1:FCbCCtXNOY3UtUuHUYaghJg4y7Fd14rXifAYUAtL9R8=
+github.com/rogpeppe/go-internal v1.8.0/go.mod h1:WmiCO8CzOY8rg0OYDC4/i/2WRWAB6poM+XZ2dLUbcbE=
 github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
-github.com/schollz/progressbar/v3 v3.13.1 h1:o8rySDYiQ59Mwzy2FELeHY5ZARXZTVJC7iHD6PEFUiE=
-github.com/schollz/progressbar/v3 v3.13.1/go.mod h1:xvrbki8kfT1fzWzBT/UZd9L6GA+jdL7HAgq2RFnO6fQ=
 github.com/spf13/cobra v1.7.0 h1:hyqWnYt1ZQShIddO5kBpj3vu05/++x6tJ6dg8EC572I=
 github.com/spf13/cobra v1.7.0/go.mod h1:uLxZILRyS/50WlhOIKD7W6V5bgeIt+4sICxh6uRMrb0=
 github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -74,6 +100,7 @@ github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+
 github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
 github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
 github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
+github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
@@ -83,32 +110,49 @@ github.com/stretchr/testify v1.8.3 h1:RP3t2pwF7cMEbC1dqtB6poj3niw/9gnV4Cjg5oW5gt
 github.com/stretchr/testify v1.8.3/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
 github.com/twitchyliquid64/golang-asm v0.15.1 h1:SU5vSMR7hnwNxj24w34ZyCi/FmDZTkS4MhqMhdFk5YI=
 github.com/twitchyliquid64/golang-asm v0.15.1/go.mod h1:a1lVb/DtPvCB8fslRZhAngC2+aY1QWCk3Cedj/Gdt08=
+github.com/ugorji/go v1.2.7/go.mod h1:nF9osbDWLy6bDVv/Rtoh6QgnvNDpmCalQV5urGCCS6M=
+github.com/ugorji/go/codec v1.2.7/go.mod h1:WGN1fab3R1fzQlVQTkfxVtIBhWDRqOviHU95kRgeqEY=
 github.com/ugorji/go/codec v1.2.11 h1:BMaWp1Bb6fHwEtbplGBGJ498wD+LKlNSl25MjdZY4dU=
 github.com/ugorji/go/codec v1.2.11/go.mod h1:UNopzCgEMSXjBc6AOMqYvWC1ktqTAfzJZUZgYf6w6lg=
 golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
 golang.org/x/arch v0.3.0 h1:02VY4/ZcO/gBOH6PUaoiptASxtXU10jazRCP865E97k=
 golang.org/x/arch v0.3.0/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
+golang.org/x/crypto v0.0.0-20210711020723-a769d52b0f97/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
 golang.org/x/crypto v0.10.0 h1:LKqV2xt9+kDzSTfOhx4FrkEBcMrAgHSYgzywV9zcGmM=
 golang.org/x/crypto v0.10.0/go.mod h1:o4eNf7Ede1fv+hwOwZsTHl9EsPFO6q6ZvYR8vYfY45I=
+golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
 golang.org/x/net v0.10.0 h1:X2//UzNDwYmtCLn7To6G58Wr6f5ahEAQgKNzv9Y951M=
 golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
+golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+golang.org/x/sys v0.0.0-20210806184541-e5e7981a1069/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+golang.org/x/sys v0.0.0-20220310020820-b874c991c1a5/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.0.0-20220704084225-05e143d24a9e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
-golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.10.0 h1:SqMFp9UcQJZa+pmYuAKjd9xq1f0j5rLcDIk0mj4qAsA=
 golang.org/x/sys v0.10.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
-golang.org/x/term v0.6.0/go.mod h1:m6U89DPEgQRMq3DNkDClhWw02AUbt2daBVO4cn4Hv9U=
+golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
 golang.org/x/term v0.10.0 h1:3R7pNqamzBraeqj/Tj8qt1aQ2HpmlC+Cx/qL/7hn4/c=
 golang.org/x/term v0.10.0/go.mod h1:lpqdcUyK/oCiQxvxVrppt5ggO2KCZ5QblwqPnfZ6d5o=
+golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
+golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
 golang.org/x/text v0.10.0 h1:UpjohKhiEgNc0CSauXmwYftY1+LlaC75SJwh0SgCX58=
 golang.org/x/text v0.10.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
+golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
 golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
 google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
+google.golang.org/protobuf v1.28.0/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I=
 google.golang.org/protobuf v1.30.0 h1:kPPoIgf3TsEvrm0PFe15JQ+570QVxYzEvvHqChK+cng=
 google.golang.org/protobuf v1.30.0/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I=
-gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
+gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
+gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
+gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
 gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
 rsc.io/pdf v0.1.1/go.mod h1:n8OzWcQ6Sp37PL01nO98y4iUCRdTGarVfzxY20ICaU4=
--- a/library/.gitignore
+++ b/library/.gitignore
@@ -0,0 +1 @@
+models
--- a/library/downloads
+++ b/library/downloads
@@ -0,0 +1,7 @@
+https://huggingface.co/TheBloke/orca_mini_3B-GGML/resolve/main/orca-mini-3b.ggmlv3.q4_0.bin e84705205f71dd55be7b24a778f248f0eda9999a125d313358c087e092d83148
+https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q4_0.bin d1735b93e1dc503f1045ccd6c8bd73277b18ba892befd1dc29e9b9a7822ed998
+https://huggingface.co/TheBloke/vicuna-7B-v1.3-GGML/resolve/main/vicuna-7b-v1.3.ggmlv3.q4_0.bin 23ce5ed290b56a19305178b9ada2c3d96036bd69a6c18304b6158eb6672d6c0f
+https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML/resolve/main/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin 1f08b147a5bce41cfcbb3fd5d51ba765dea1786e15b5655ab69ba3a337a893b7
+https://huggingface.co/TheBloke/Llama-2-7B-GGML/resolve/main/llama-2-7b.ggmlv3.q4_0.bin bfa26d855e44629c4cf919985e90bd7fa03b77eea1676791519e39a4d45fd4d5
+https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin 8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8
+https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin f79142715bc9539a2edbb4b253548db8b34fac22736593eeaa28555874476e30
--- a/library/modelfiles/llama2
+++ b/library/modelfiles/llama2
@@ -0,0 +1,147 @@
+FROM ../models/llama-2-7b-chat.ggmlv3.q4_0.bin
+
+TEMPLATE """
+{{- if .First }}
+<<SYS>>
+{{ .System }}
+<</SYS>>
+{{- end }}
+
+[INST] {{ .Prompt }} [/INST]
+"""
+
+SYSTEM """
+You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
+
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+"""
+
+LICENSE """
+Llama 2 Community License Agreement
+
+Llama 2 Version Release Date: July 18, 2023
+
+“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.
+
+“Documentation” means the specifications, manuals and documentation accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
+
+“Llama 2” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Llama Materials” means, collectively, Meta’s proprietary Llama 2 and Documentation (and any portion thereof) made available under this Agreement.
+
+“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.
+
+1. License Rights and Redistribution.
+
+a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
+
+b. Redistribution and Use.
+
+i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.
+
+ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
+
+iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”
+
+iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into this Agreement.
+
+v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
+
+2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
+
+3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
+
+4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+
+5. Intellectual Property.
+
+a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials.
+
+b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
+
+c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
+
+6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.
+
+7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
+
+"""
+
+LICENSE """
+Llama 2 Acceptable Use Policy
+
+Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at ai.meta.com/llama/use-policy.
+
+Prohibited Uses
+
+We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:
+
+1. Violate the law or others’ rights, including to:
+
+a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
+
+i. Violence or terrorism
+
+ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
+
+b. Human trafficking, exploitation, and sexual violence
+
+iii. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
+
+iv. Sexual solicitation
+
+vi. Any other criminal activity
+
+c. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
+
+d. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
+
+e. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
+
+f. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
+
+g. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
+
+h. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
+
+2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
+
+a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
+
+b. Guns and illegal weapons (including weapon development)
+
+c. Illegal drugs and regulated/controlled substances
+
+d. Operation of critical infrastructure, transportation technologies, or heavy machinery
+
+e. Self-harm or harm to others, including suicide, cutting, and eating disorders
+
+f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
+
+3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
+
+a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
+
+b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
+
+c. Generating, promoting, or further distributing spam
+
+d. Impersonating another individual without consent, authorization, or legal right
+
+e. Representing that the use of Llama 2 or outputs are human-generated
+
+f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
+
+4. Fail to appropriately disclose to end users any known dangers of your AI system
+
+Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
+
+Reporting issues with the model: github.com/facebookresearch/llama
+Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
+Reporting bugs and security concerns: facebook.com/whitehat/info
+Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: LlamaUseReport@meta.com
+"""
--- a/library/modelfiles/llama2_13b
+++ b/library/modelfiles/llama2_13b
@@ -0,0 +1,147 @@
+FROM ../models/llama-2-13b-chat.ggmlv3.q4_0.bin
+
+TEMPLATE """
+{{- if .First }}
+<<SYS>>
+{{ .System }}
+<</SYS>>
+{{- end }}
+
+[INST] {{ .Prompt }} [/INST]
+"""
+
+SYSTEM """
+You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
+
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+"""
+
+LICENSE """
+Llama 2 Community License Agreement
+
+Llama 2 Version Release Date: July 18, 2023
+
+“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.
+
+“Documentation” means the specifications, manuals and documentation accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
+
+“Llama 2” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Llama Materials” means, collectively, Meta’s proprietary Llama 2 and Documentation (and any portion thereof) made available under this Agreement.
+
+“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.
+
+1. License Rights and Redistribution.
+
+a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
+
+b. Redistribution and Use.
+
+i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.
+
+ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
+
+iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”
+
+iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into this Agreement.
+
+v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
+
+2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
+
+3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
+
+4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+
+5. Intellectual Property.
+
+a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials.
+
+b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
+
+c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
+
+6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.
+
+7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
+
+"""
+
+LICENSE """
+Llama 2 Acceptable Use Policy
+
+Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at ai.meta.com/llama/use-policy.
+
+Prohibited Uses
+
+We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:
+
+1. Violate the law or others’ rights, including to:
+
+a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
+
+i. Violence or terrorism
+
+ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
+
+b. Human trafficking, exploitation, and sexual violence
+
+iii. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
+
+iv. Sexual solicitation
+
+vi. Any other criminal activity
+
+c. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
+
+d. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
+
+e. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
+
+f. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
+
+g. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
+
+h. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
+
+2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
+
+a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
+
+b. Guns and illegal weapons (including weapon development)
+
+c. Illegal drugs and regulated/controlled substances
+
+d. Operation of critical infrastructure, transportation technologies, or heavy machinery
+
+e. Self-harm or harm to others, including suicide, cutting, and eating disorders
+
+f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
+
+3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
+
+a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
+
+b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
+
+c. Generating, promoting, or further distributing spam
+
+d. Impersonating another individual without consent, authorization, or legal right
+
+e. Representing that the use of Llama 2 or outputs are human-generated
+
+f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
+
+4. Fail to appropriately disclose to end users any known dangers of your AI system
+
+Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
+
+Reporting issues with the model: github.com/facebookresearch/llama
+Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
+Reporting bugs and security concerns: facebook.com/whitehat/info
+Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: LlamaUseReport@meta.com
+"""
--- a/library/modelfiles/llama2_7b
+++ b/library/modelfiles/llama2_7b
@@ -0,0 +1,147 @@
+FROM ../models/llama-2-7b-chat.ggmlv3.q4_0.bin
+
+TEMPLATE """
+{{- if .First }}
+<<SYS>>
+{{ .System }}
+<</SYS>>
+{{- end }}
+
+[INST] {{ .Prompt }} [/INST]
+"""
+
+SYSTEM """
+You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
+
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+"""
+
+LICENSE """
+Llama 2 Community License Agreement
+
+Llama 2 Version Release Date: July 18, 2023
+
+“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.
+
+“Documentation” means the specifications, manuals and documentation accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
+
+“Llama 2” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-libraries/llama-downloads/.
+
+“Llama Materials” means, collectively, Meta’s proprietary Llama 2 and Documentation (and any portion thereof) made available under this Agreement.
+
+“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.
+
+1. License Rights and Redistribution.
+
+a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
+
+b. Redistribution and Use.
+
+i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.
+
+ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
+
+iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”
+
+iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into this Agreement.
+
+v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
+
+2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
+
+3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
+
+4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+
+5. Intellectual Property.
+
+a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials.
+
+b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
+
+c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
+
+6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.
+
+7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
+
+"""
+
+LICENSE """
+Llama 2 Acceptable Use Policy
+
+Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at ai.meta.com/llama/use-policy.
+
+Prohibited Uses
+
+We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:
+
+1. Violate the law or others’ rights, including to:
+
+a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
+
+i. Violence or terrorism
+
+ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
+
+b. Human trafficking, exploitation, and sexual violence
+
+iii. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
+
+iv. Sexual solicitation
+
+vi. Any other criminal activity
+
+c. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
+
+d. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
+
+e. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
+
+f. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
+
+g. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
+
+h. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
+
+2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
+
+a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
+
+b. Guns and illegal weapons (including weapon development)
+
+c. Illegal drugs and regulated/controlled substances
+
+d. Operation of critical infrastructure, transportation technologies, or heavy machinery
+
+e. Self-harm or harm to others, including suicide, cutting, and eating disorders
+
+f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
+
+3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
+
+a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
+
+b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
+
+c. Generating, promoting, or further distributing spam
+
+d. Impersonating another individual without consent, authorization, or legal right
+
+e. Representing that the use of Llama 2 or outputs are human-generated
+
+f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
+
+4. Fail to appropriately disclose to end users any known dangers of your AI system
+
+Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
+
+Reporting issues with the model: github.com/facebookresearch/llama
+Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
+Reporting bugs and security concerns: facebook.com/whitehat/info
+Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: LlamaUseReport@meta.com
+"""
--- a/library/modelfiles/nous-hermes
+++ b/library/modelfiles/nous-hermes
@@ -0,0 +1,7 @@
+FROM ../models/nous-hermes-13b.ggmlv3.q4_0.bin
+TEMPLATE """
+### Instruction:
+{{ .Prompt }}
+
+### Response:
+"""
--- a/library/modelfiles/orca
+++ b/library/modelfiles/orca
@@ -0,0 +1,14 @@
+FROM ../models/orca-mini-3b.ggmlv3.q4_0.bin
+TEMPLATE """
+{{- if .First }}
+### System:
+{{ .System }}
+{{- end }}
+
+### User:
+{{ .Prompt }}
+
+### Response:
+"""
+
+SYSTEM """You are an AI assistant that follows instruction extremely well. Help as much as you can."""
--- a/library/modelfiles/vicuna
+++ b/library/modelfiles/vicuna
@@ -0,0 +1,11 @@
+FROM ../models/vicuna-7b-v1.3.ggmlv3.q4_0.bin
+TEMPLATE """
+{{ if .First }}
+{{ .System }}
+{{- end }}
+
+USER: {{ .Prompt }}
+ASSISTANT:
+"""
+
+SYSTEM """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."""
--- a/library/modelfiles/wizard-vicuna
+++ b/library/modelfiles/wizard-vicuna
@@ -0,0 +1,5 @@
+FROM ../models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
+TEMPLATE """
+USER: {{ .Prompt }}
+ASSISTANT:
+"""
--- a/library/publish.sh
+++ b/library/publish.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+
+mkdir -p models
+
+# download binaries
+function process_line {
+    local url=$1
+    local checksum=$2
+
+    # Get the filename from the URL
+    local filename="models/$(basename "$url")"
+
+    echo "verifying $filename..."
+
+    # If the file exists, compute its checksum
+    if [ -f "$filename" ]; then
+        local existing_checksum=$(shasum -a 256 "$filename" | cut -d ' ' -f1)
+    fi
+
+    # If the file does not exist, or its checksum does not match, download it
+    if [ ! -f "$filename" ] || [ "$existing_checksum" != "$checksum" ]; then
+        echo "downloading $filename..."
+
+        # Download the file
+        curl -L "$url" -o "$filename"
+
+        # Compute the SHA256 hash of the downloaded file
+        local computed_checksum=$(shasum -a 256 "$filename" | cut -d ' ' -f1)
+
+        # Verify the checksum
+        if [ "$computed_checksum" != "$checksum" ]; then
+            echo "Checksum verification failed for $filename"
+            exit 1
+        fi
+    fi
+}
+
+while IFS=' ' read -r url checksum
+do
+    process_line "$url" "$checksum"
+done < "downloads"
+
+# create and publish the models
+for file in modelfiles/*; do
+  if [ -f "$file" ]; then
+    filename=$(basename "$file")
+    echo "$filename"
+    ollama create "library/${filename}" -f "$file"
+    ollama push "${filename}"
+  fi
+done
+
--- a/llama/ggml-cuda.cu
+++ b/llama/ggml-cuda.cu
--- a/llama/ggml-cuda.h
+++ b/llama/ggml-cuda.h
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
--- a/llama/ggml-metal.h
+++ b/llama/ggml-metal.h
@@ -1,5 +1,7 @@
+//go:build darwin
+
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
--- a/llama/ggml-metal.m
+++ b/llama/ggml-metal.m
@@ -1,7 +1,7 @@
-// +build darwin
+//go:build darwin

 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -722,8 +722,8 @@ void ggml_metal_graph_compute(
                                            GGML_ASSERT(ne02 == 1);
                                            GGML_ASSERT(ne12 == 1);

-                                            nth0 = 4;
-                                            nth1 = 16;
+                                            nth0 = 2;
+                                            nth1 = 32;
                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_K_f32];
                                        } break;
                                    case GGML_TYPE_Q5_K:
@@ -731,8 +731,8 @@ void ggml_metal_graph_compute(
                                            GGML_ASSERT(ne02 == 1);
                                            GGML_ASSERT(ne12 == 1);

-                                            nth0 = 4;
-                                            nth1 = 16;
+                                            nth0 = 2;
+                                            nth1 = 32;
                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q5_K_f32];
                                        } break;
                                    case GGML_TYPE_Q6_K:
@@ -740,8 +740,8 @@ void ggml_metal_graph_compute(
                                            GGML_ASSERT(ne02 == 1);
                                            GGML_ASSERT(ne12 == 1);

-                                            nth0 = 4;
-                                            nth1 = 16;
+                                            nth0 = 2;
+                                            nth1 = 32;
                                            [encoder setComputePipelineState:ctx->pipeline_mul_mat_q6_K_f32];
                                        } break;
                                    default:
@@ -767,15 +767,18 @@ void ggml_metal_graph_compute(
                                [encoder setBytes:&ne0  length:sizeof(ne0)  atIndex:13];
                                [encoder setBytes:&ne1  length:sizeof(ne1)  atIndex:14];

-                                if (src0t == GGML_TYPE_Q4_0 || src0t == GGML_TYPE_Q4_1) {
-                                    [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
-                                    [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                                if (src0t == GGML_TYPE_Q4_0 || src0t == GGML_TYPE_Q4_1 ||
+                                    src0t == GGML_TYPE_Q4_K) {
+                                    [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 7) / 8, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                                }
+                                else if (src0t == GGML_TYPE_Q5_K) {
+                                    [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 3) / 4, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                                }
+                                else if (src0t == GGML_TYPE_Q6_K) {
+                                    [encoder dispatchThreadgroups:MTLSizeMake((ne01+1)/2, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                }
                                else if (src0t == GGML_TYPE_Q2_K ||
-                                         src0t == GGML_TYPE_Q3_K ||
-                                         src0t == GGML_TYPE_Q4_K ||
-                                         src0t == GGML_TYPE_Q5_K ||
-                                         src0t == GGML_TYPE_Q6_K) {
+                                         src0t == GGML_TYPE_Q3_K) {
                                    [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
                                    [encoder dispatchThreadgroups:MTLSizeMake(ne01, 1, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                } else {
@@ -821,7 +824,7 @@ void ggml_metal_graph_compute(

                            const float eps = 1e-6f;

-                            const int nth = 256;
+                            const int nth = 512;

                            [encoder setComputePipelineState:ctx->pipeline_rms_norm];
                            [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
@@ -829,7 +832,7 @@ void ggml_metal_graph_compute(
                            [encoder setBytes:&ne00 length:sizeof( int64_t) atIndex:2];
                            [encoder setBytes:&nb01 length:sizeof(uint64_t) atIndex:3];
                            [encoder setBytes:&eps  length:sizeof(   float) atIndex:4];
-                            [encoder setThreadgroupMemoryLength:nth*sizeof(float) atIndex:0];
+                            [encoder setThreadgroupMemoryLength:nth/32*sizeof(float) atIndex:0];

                            const int64_t nrows = ggml_nrows(src0);

@@ -910,28 +913,35 @@ void ggml_metal_graph_compute(

                            const int n_past = ((int32_t *)(src1->data))[0];

+                            float freq_base;
+                            float freq_scale;
+                            memcpy(&freq_base,  (int32_t *) src1->data + 4, sizeof(float));
+                            memcpy(&freq_scale, (int32_t *) src1->data + 5, sizeof(float));
+
                            [encoder setComputePipelineState:ctx->pipeline_rope];
                            [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
                            [encoder setBuffer:id_dst  offset:offs_dst  atIndex:1];
-                            [encoder setBytes:&ne00   length:sizeof( int64_t) atIndex:2];
-                            [encoder setBytes:&ne01   length:sizeof( int64_t) atIndex:3];
-                            [encoder setBytes:&ne02   length:sizeof( int64_t) atIndex:4];
-                            [encoder setBytes:&ne03   length:sizeof( int64_t) atIndex:5];
-                            [encoder setBytes:&nb00   length:sizeof(uint64_t) atIndex:6];
-                            [encoder setBytes:&nb01   length:sizeof(uint64_t) atIndex:7];
-                            [encoder setBytes:&nb02   length:sizeof(uint64_t) atIndex:8];
-                            [encoder setBytes:&nb03   length:sizeof(uint64_t) atIndex:9];
-                            [encoder setBytes:&ne0    length:sizeof( int64_t) atIndex:10];
-                            [encoder setBytes:&ne1    length:sizeof( int64_t) atIndex:11];
-                            [encoder setBytes:&ne2    length:sizeof( int64_t) atIndex:12];
-                            [encoder setBytes:&ne3    length:sizeof( int64_t) atIndex:13];
-                            [encoder setBytes:&nb0    length:sizeof(uint64_t) atIndex:14];
-                            [encoder setBytes:&nb1    length:sizeof(uint64_t) atIndex:15];
-                            [encoder setBytes:&nb2    length:sizeof(uint64_t) atIndex:16];
-                            [encoder setBytes:&nb3    length:sizeof(uint64_t) atIndex:17];
-                            [encoder setBytes:&n_past length:sizeof(     int) atIndex:18];
-                            [encoder setBytes:&n_dims length:sizeof(     int) atIndex:19];
-                            [encoder setBytes:&mode   length:sizeof(     int) atIndex:20];
+                            [encoder setBytes:&ne00    length:sizeof( int64_t) atIndex:2];
+                            [encoder setBytes:&ne01    length:sizeof( int64_t) atIndex:3];
+                            [encoder setBytes:&ne02    length:sizeof( int64_t) atIndex:4];
+                            [encoder setBytes:&ne03    length:sizeof( int64_t) atIndex:5];
+                            [encoder setBytes:&nb00    length:sizeof(uint64_t) atIndex:6];
+                            [encoder setBytes:&nb01    length:sizeof(uint64_t) atIndex:7];
+                            [encoder setBytes:&nb02    length:sizeof(uint64_t) atIndex:8];
+                            [encoder setBytes:&nb03    length:sizeof(uint64_t) atIndex:9];
+                            [encoder setBytes:&ne0     length:sizeof( int64_t) atIndex:10];
+                            [encoder setBytes:&ne1     length:sizeof( int64_t) atIndex:11];
+                            [encoder setBytes:&ne2     length:sizeof( int64_t) atIndex:12];
+                            [encoder setBytes:&ne3     length:sizeof( int64_t) atIndex:13];
+                            [encoder setBytes:&nb0     length:sizeof(uint64_t) atIndex:14];
+                            [encoder setBytes:&nb1     length:sizeof(uint64_t) atIndex:15];
+                            [encoder setBytes:&nb2     length:sizeof(uint64_t) atIndex:16];
+                            [encoder setBytes:&nb3     length:sizeof(uint64_t) atIndex:17];
+                            [encoder setBytes:&n_past  length:sizeof(     int) atIndex:18];
+                            [encoder setBytes:&n_dims  length:sizeof(     int) atIndex:19];
+                            [encoder setBytes:&mode    length:sizeof(     int) atIndex:20];
+                            [encoder setBytes:&freq_base  length:sizeof(float) atIndex:21];
+                            [encoder setBytes:&freq_scale length:sizeof(float) atIndex:22];

                            [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne02, ne03) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
                        } break;
--- a/llama/ggml-metal.metal
+++ b/llama/ggml-metal.metal
@@ -1,5 +1,7 @@
+//go:build darwin
+
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -357,26 +359,33 @@ kernel void kernel_rms_norm(
        threadgroup float  * sum [[threadgroup(0)]],
        uint tgpig[[threadgroup_position_in_grid]],
        uint tpitg[[thread_position_in_threadgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]],
+        uint tiisg[[thread_index_in_simdgroup]],
        uint   ntg[[threads_per_threadgroup]]) {
-    device const float * x = (device const float *) ((device const char *) src0 + tgpig*nb01);
+    device const float4 * x = (device const float4 *) ((device const char *) src0 + tgpig*nb01);
+    device const float * x_scalar = (device const float *) x;
+    float4 sumf=0;
+    float all_sum=0;

    // parallel sum
-    sum[tpitg] = 0.0f;
-    for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
-        sum[tpitg] += x[i00] * x[i00];
+    for (int i00 = tpitg; i00 < ne00/4; i00 += ntg) {
+        sumf += x[i00] * x[i00];
+    }
+    all_sum = sumf[0] + sumf[1] + sumf[2] + sumf[3];
+    all_sum = simd_sum(all_sum);
+    if (tiisg == 0) {
+        sum[sgitg] = all_sum;
    }

-    // reduce
    threadgroup_barrier(mem_flags::mem_threadgroup);
-    for (uint i = ntg/2; i > 0; i /= 2) {
-        if (tpitg < i) {
-            sum[tpitg] += sum[tpitg + i];
-        }
-        threadgroup_barrier(mem_flags::mem_threadgroup);
+    // reduce the per-SIMD-group partial sums (there are ntg / 32 SIMD groups), then broadcast
+    for (int i = ntg / 32 / 2; i > 0; i /= 2) {
+       if (tpitg < i) {
+           sum[tpitg] += sum[tpitg + i];
+       }
    }
-
-    // broadcast
    if (tpitg == 0) {
+        for (int i = 4 * (ne00 / 4); i < ne00; i++) {sum[0] += x_scalar[i] * x_scalar[i];}
        sum[0] /= ne00;
    }

@@ -385,10 +394,99 @@ kernel void kernel_rms_norm(
    const float mean  = sum[0];
    const float scale = 1.0f/sqrt(mean + eps);

-    device float * y = dst + tgpig*ne00;
-    for (int i00 = tpitg; i00 < ne00; i00 += ntg) {
+    device float4 * y = (device float4 *) (dst + tgpig*ne00);
+    device float * y_scalar = (device float *) y;
+    for (int i00 = tpitg; i00 < ne00/4; i00 += ntg) {
        y[i00] = x[i00] * scale;
    }
+    if (tpitg == 0) {
+        for (int i00 = 4 * (ne00 / 4); i00 < ne00; i00++) {y_scalar[i00] = x_scalar[i00] * scale;}
+    }
+}
+
+// computes the inner product between a q4_0 block and 32 floats (yl); sumy is SUM(yl[i])
+float block_q_n_dot_y(device const block_q4_0 * qb_curr, float sumy, thread float * yl) {
+    float d = qb_curr->d;
+    float4 acc = 0.f;
+    device uint16_t * qs = ((device uint16_t *)qb_curr + 1);
+    for (int i = 0; i < 16; i+=2) {
+        acc[0] += yl[i]      * (qs[i / 2] & 0x000F);
+        acc[1] += yl[i + 16] * (qs[i / 2] & 0x00F0);
+        acc[2] += yl[i +  1] * (qs[i / 2] & 0x0F00);
+        acc[3] += yl[i + 17] * (qs[i / 2] & 0xF000);
+    }
+    return d * (sumy * -8.f + acc[0] + acc[1]/16.f + acc[2]/256.f + acc[3]/4096.f);
+}
+
+// computes the inner product between a q4_1 block and 32 floats (yl); sumy is SUM(yl[i])
+float block_q_n_dot_y(device const block_q4_1 * qb_curr, float sumy, thread float * yl) {
+    float d = qb_curr->d;
+    float m = qb_curr->m;
+    float4 acc = 0.f;
+    device uint16_t * qs = ((device uint16_t *)qb_curr + 2);
+    for (int i = 0; i < 16; i+=2) {
+        acc[0] += yl[i]      * (qs[i / 2] & 0x000F);
+        acc[1] += yl[i + 16] * (qs[i / 2] & 0x00F0);
+        acc[2] += yl[i +  1] * (qs[i / 2] & 0x0F00);
+        acc[3] += yl[i + 17] * (qs[i / 2] & 0xF000);
+    }
+    return d * (acc[0] + acc[1]/16.f + acc[2]/256.f + acc[3]/4096.f) + sumy * m;
+}
+
+// putting them in the kernel causes a significant performance penalty
+#define N_DST 4 // each SIMD group works on 4 rows
+#define N_SIMDGROUP 2 // number of SIMD groups in a thread group
+#define N_SIMDWIDTH 32 // assuming SIMD group size is 32
+template<typename block_q_type>
+void mul_vec_q_n_f32(device const void * src0, device const float * src1, device float * dst,
+                    int64_t ne00, int64_t ne10, int64_t ne0, int64_t ne01,
+                    uint2 tgpig, uint tiisg, uint sgitg) {
+    const int nb = ne00/QK4_0;
+    const int r0 = tgpig.x;
+    const int r1 = tgpig.y;
+    device const block_q_type * x = (device const block_q_type *) src0 + (r0 * N_SIMDGROUP + sgitg) * N_DST * nb;
+    device const float      * y = (device const float      *) src1 + r1*ne10;
+    float4 y_curr[8];       // src1 vector cache
+    float sumf[N_DST]={0.f}, all_sum;
+    thread float * yl=(thread float *)y_curr;
+
+    // each thread in a SIMD group deals with 1 block.
+    for (int column = 0; column < nb / N_SIMDWIDTH; column++) {
+        float sumy = 0;
+        for (int i = 0; i < QK4_0 / 4; i++) {
+            y_curr[i] = *((device float4  *)(y + N_SIMDWIDTH * (tiisg + column * QK4_0)) + i);
+            sumy += y_curr[i][0] + y_curr[i][1] + y_curr[i][2] + y_curr[i][3];
+        }
+
+        for (int row = 0; row < N_DST; row++) {
+            sumf[row] += block_q_n_dot_y(x+(tiisg + row * nb + column * N_SIMDWIDTH), sumy, yl);
+        }
+    }
+
+    // from here on, load two rows at a time, 16 blocks per row
+    int ir = tiisg / (N_SIMDWIDTH / 2);
+    int ib = tiisg % (N_SIMDWIDTH / 2);
+    for (int ind = 0; ind < (nb % N_SIMDWIDTH + N_SIMDWIDTH / 2 - 1)/(N_SIMDWIDTH / 2); ind++) {
+        int nb_start = (nb / N_SIMDWIDTH) * N_SIMDWIDTH + ind * (N_SIMDWIDTH / 2); // where the remaining blocks start
+        float sumy = 0;
+        for (int i = 0; i < QK4_0 / 4; i++) {
+            y_curr[i] = *((device float4 *)(y + (nb_start + ib) * QK4_0) + i);
+            sumy += y_curr[i][0] + y_curr[i][1] + y_curr[i][2] + y_curr[i][3];
+        }
+
+        for (int row = 0; row < N_DST; row+=2) {
+            if (nb_start + ib < nb) {
+                sumf[row + ir] += block_q_n_dot_y(x + (nb_start + ib + (row + ir) * nb), sumy, yl);
+            }
+        }
+    }
+
+    for (int row = 0; row < N_DST; ++row) {
+        all_sum = simd_sum(sumf[row]);
+        if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01) {
+            dst[r1*ne0 + (r0 * N_SIMDGROUP + sgitg) * N_DST + row] = all_sum;
+        }
+    }
 }

 kernel void kernel_mul_mat_q4_0_f32(
@@ -398,65 +496,11 @@ kernel void kernel_mul_mat_q4_0_f32(
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
-        threadgroup float  * sum [[threadgroup(0)]],
+        constant   int64_t & ne01[[buffer(4)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
-        uint2 tpitg[[thread_position_in_threadgroup]],
-        uint2  tptg[[threads_per_threadgroup]]) {
-    const int nb = ne00/QK4_0;
-
-    const int64_t r0 = tgpig.x;
-    const int64_t r1 = tgpig.y;
-
-    device const block_q4_0 * x = (device const block_q4_0 *) src0 + r0*nb;
-    device const float      * y = (device const float      *) src1 + r1*ne10;
-
-    const int nth = tptg.x*tptg.y;
-    const int ith = tptg.y*tpitg.x + tpitg.y;
-
-    const int ix = tpitg.y/4;           // 0 or 1
-    const int iy = tpitg.y - 4*ix;      // 0...3
-
-    const int first = 4 * iy;
-
-    float sumf = 0;
-
-    for (int i = 2*tpitg.x + ix; i < nb; i += 2*tptg.x) {
-
-        const float d = (float)x[i].d;
-
-        device const uint8_t * xl = x[i].qs + first;
-        device const float   * yl = y + i * QK4_0 + first;
-
-        float2 acc = {0.0f, 0.0f};
-
-        for (int j = 0; j < 4; ++j) {
-
-            acc[0] += yl[j] * (xl[j] & 0xF) + yl[j+16] * (xl[j] >> 4);
-            acc[1] += yl[j] + yl[j+16];
-
-        }
-
-        sumf += d * (acc[0] - 8.f*acc[1]);
-    }
-
-    sum[ith] = sumf;
-
-    //
-    // Accumulate the sum from all threads in the threadgroup
-    //
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%4 == 0) {
-        sum[ith] += sum[ith+1] + sum[ith+2] + sum[ith+3];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%16 == 0) {
-        sum[ith] += sum[ith+4] + sum[ith+8] + sum[ith+12];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith == 0) {
-        for (int i = 16; i < nth; i += 16) sum[0] += sum[i];
-        dst[r1*ne0 + r0] = sum[0];
-    }
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {
+    mul_vec_q_n_f32<block_q4_0>(src0,src1,dst,ne00,ne10,ne0,ne01,tgpig,tiisg,sgitg);
 }

 kernel void kernel_mul_mat_q4_1_f32(
@@ -466,66 +510,11 @@ kernel void kernel_mul_mat_q4_1_f32(
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
-        threadgroup float  * sum [[threadgroup(0)]],
+        constant   int64_t & ne01[[buffer(4)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
-        uint2 tpitg[[thread_position_in_threadgroup]],
-        uint2  tptg[[threads_per_threadgroup]]) {
-    const int nb = ne00/QK4_1;
-
-    const int64_t r0 = tgpig.x;
-    const int64_t r1 = tgpig.y;
-
-    device const block_q4_1 * x = (device const block_q4_1 *) src0 + r0*nb;
-    device const float      * y = (device const float      *) src1 + r1*ne10;
-
-    const uint nth = tptg.x*tptg.y;
-    const uint ith = tptg.y*tpitg.x + tpitg.y;
-
-    const int ix = tpitg.y/4;           // 0 or 1
-    const int iy = tpitg.y - 4*ix;      // 0...3
-
-    const int first = 4 * iy;
-
-    float sumf = 0;
-
-    for (int i = 2*tpitg.x + ix; i < nb; i += 2*tptg.x) {
-
-        const float d = (float)x[i].d;
-        const float m = (float)x[i].m;
-
-        device const uint8_t * xl = x[i].qs + first;
-        device const float   * yl = y + i * QK4_1 + first;
-
-        float2 acc = {0.0f, 0.0f};
-
-        for (int j = 0; j < 4; ++j) {
-
-            acc[0] += yl[j+ 0] * (d * (xl[j] & 0xF) + m);
-            acc[1] += yl[j+16] * (d * (xl[j] >>  4) + m);
-
-        }
-
-        sumf += acc[0] + acc[1];
-    }
-
-    sum[ith] = sumf;
-
-    //
-    // Accumulate the sum from all threads in the threadgroup
-    //
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%4 == 0) {
-        sum[ith] += sum[ith+1] + sum[ith+2] + sum[ith+3];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%16 == 0) {
-        sum[ith] += sum[ith+4] + sum[ith+8] + sum[ith+12];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith == 0) {
-        for (uint i = 16; i < nth; i += 16) sum[0] += sum[i];
-        dst[r1*ne0 + r0] = sum[0];
-    }
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {
+     mul_vec_q_n_f32<block_q4_1>(src0,src1,dst,ne00,ne10,ne0,ne01,tgpig,tiisg,sgitg);
 }

 kernel void kernel_mul_mat_f16_f32(
@@ -641,17 +630,19 @@ kernel void kernel_rope(
        constant       int & n_past,
        constant       int & n_dims,
        constant       int & mode,
+        constant     float & freq_base,
+        constant     float & freq_scale,
        uint3 tpig[[thread_position_in_grid]]) {
    const int64_t i3 = tpig[2];
    const int64_t i2 = tpig[1];
    const int64_t i1 = tpig[0];

    const bool is_neox = mode & 2;
-    const float theta_scale = pow(10000.0, -2.0f/n_dims);
+    const float theta_scale = pow(freq_base, -2.0f/n_dims);

    const int64_t p = ((mode & 1) == 0 ? n_past + i2 : i2);

-    float theta = (float)p;
+    float theta = freq_scale * (float)p;

    if (!is_neox) {
        for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
@@ -1489,6 +1480,7 @@ kernel void kernel_mul_mat_q3_K_f32(

 }

+#if QK_K == 256
 kernel void kernel_mul_mat_q4_K_f32(
        device const  void * src0,
        device const float * src1,
@@ -1496,131 +1488,180 @@ kernel void kernel_mul_mat_q4_K_f32(
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
-        threadgroup float  * sum [[threadgroup(0)]],
+        constant   int64_t & ne01[[buffer(4)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
-        uint2 tpitg[[thread_position_in_threadgroup]],
-        uint2  tptg[[threads_per_threadgroup]]) {
-
-    const int nb = ne00/QK_K;
-
-    const int64_t r0 = tgpig.x;
-    const int64_t r1 = tgpig.y;
-
-    const int nth = tptg.x*tptg.y;
-    const int ith = tptg.y*tpitg.x + tpitg.y;
-
-    device const block_q4_K * x = (device const block_q4_K *) src0 + r0*nb;
-    device const float     * yy = (device const float      *) src1 + r1*ne10;
-
-    float sumf = 0;
-
-#if QK_K == 256
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {

    const uint16_t kmask1 = 0x3f3f;
    const uint16_t kmask2 = 0x0f0f;
    const uint16_t kmask3 = 0xc0c0;

-    const int tid = tpitg.y;   // 0...16
-    const int il  = tid/4;     // 0...3
-    const int ir  = tid - 4*il;// 0...3
-    const int n   = 4;
+    const int ix = tiisg/8;  // 0...3
+    const int it = tiisg%8;  // 0...7
+    const int im = it/4;     // 0 or 1
+    const int ir = it%4;     // 0...3

-    const int im = il/2;  // 0 or 1. 0 computes 0,32 + 128,160, 1 computes 64,96 + 192,224
-    const int in = il%2;
+    const int nb = ne00/QK_K;
+    const int r0 = tgpig.x;
+    const int r1 = tgpig.y;
+    const int first_row = (r0 * N_SIMDGROUP + sgitg) * N_DST;
+    const int ib_row = first_row * nb;
+    device const block_q4_K * x = (device const block_q4_K *) src0 + ib_row;
+    device const float      * y = (device const float      *) src1 + r1*ne10;
+    float yl[16];
+    float yh[16];
+    float sumf[N_DST]={0.f}, all_sum;

-    const int l0 = n*(2*ir + in);
-    const int q_offset = 32*im + l0;
-    const int y_offset = 64*im + l0;
+    const int step = sizeof(block_q4_K) * nb / 2;

-    uchar2 sc1, sc2, sc3, sc4;
+    device const float * y4 = y + ix * QK_K + 64 * im + 8 * ir;

-    for (int i = tpitg.x; i < nb; i += tptg.x) {
+    uint16_t sc16[4];
+    thread const uint8_t * sc8 = (thread const uint8_t *)sc16;

-        device const uint8_t * q1 = (x + i)->qs + q_offset;
-        device const uint8_t * q2 = q1 + 64;
-        device const float   * y1 = yy + i*QK_K + y_offset;
-        device const float   * y2 = y1 + 128;
-
-        const float dall = (float)((x + i)->d);
-        const float dmin = (float)((x + i)->dmin);
-
-        device const uint16_t * a = (device const uint16_t *)(x + i)->scales;
-        sc1 = as_type<uchar2>((uint16_t)(a[im+0] & kmask1));
-        sc2 = as_type<uchar2>((uint16_t)(a[im+2] & kmask1));
-        sc3 = as_type<uchar2>((uint16_t)(((a[im+4] >> 0) & kmask2) | ((a[im+0] & kmask3) >> 2)));
-        sc4 = as_type<uchar2>((uint16_t)(((a[im+4] >> 4) & kmask2) | ((a[im+2] & kmask3) >> 2)));
-
-        float4 s = {0.f, 0.f, 0.f, 0.f};
-        float smin = 0;
-        for (int l = 0; l < n; ++l) {
-
-            s[0] += y1[l] * (q1[l] & 0xF); s[1] += y1[l+32] * (q1[l] >> 4);
-            s[2] += y2[l] * (q2[l] & 0xF); s[3] += y2[l+32] * (q2[l] >> 4);
-            smin += y1[l] * sc2[0] + y1[l+32] * sc2[1] + y2[l] * sc4[0] + y2[l+32] * sc4[1];
+    for (int ib = ix; ib < nb; ib += 4) {

+        float4 sumy = {0.f, 0.f, 0.f, 0.f};
+        for (int i = 0; i < 8; ++i) {
+            yl[i+0] = y4[i+  0]; sumy[0] += yl[i+0];
+            yl[i+8] = y4[i+ 32]; sumy[1] += yl[i+8];
+            yh[i+0] = y4[i+128]; sumy[2] += yh[i+0];
+            yh[i+8] = y4[i+160]; sumy[3] += yh[i+8];
        }
-        sumf += dall * (s[0] * sc1[0] + s[1] * sc1[1] + s[2] * sc3[0] + s[3] * sc3[1]) - dmin * smin;

+        device const uint16_t * sc = (device const uint16_t *)x[ib].scales + im;
+        device const uint16_t * q1 = (device const uint16_t *)x[ib].qs + 16 * im + 4 * ir;
+        device const half     * dh = &x[ib].d;
+
+        for (int row = 0; row < N_DST; row++) {
+
+            sc16[0] = sc[0] & kmask1;
+            sc16[1] = sc[2] & kmask1;
+            sc16[2] = ((sc[4] >> 0) & kmask2) | ((sc[0] & kmask3) >> 2);
+            sc16[3] = ((sc[4] >> 4) & kmask2) | ((sc[2] & kmask3) >> 2);
+
+            device const uint16_t * q2 = q1 + 32;
+
+            float4 acc1 = {0.f, 0.f, 0.f, 0.f};
+            float4 acc2 = {0.f, 0.f, 0.f, 0.f};
+            for (int i = 0; i < 8; i += 2) {
+                acc1[0] += yl[i+0] * (q1[i/2] & 0x000F);
+                acc1[1] += yl[i+1] * (q1[i/2] & 0x0F00);
+                acc1[2] += yl[i+8] * (q1[i/2] & 0x00F0);
+                acc1[3] += yl[i+9] * (q1[i/2] & 0xF000);
+                acc2[0] += yh[i+0] * (q2[i/2] & 0x000F);
+                acc2[1] += yh[i+1] * (q2[i/2] & 0x0F00);
+                acc2[2] += yh[i+8] * (q2[i/2] & 0x00F0);
+                acc2[3] += yh[i+9] * (q2[i/2] & 0xF000);
+            }
+
+            float dall = dh[0];
+            float dmin = dh[1];
+            sumf[row] += dall * ((acc1[0] + 1.f/256.f * acc1[1]) * sc8[0] +
+                                 (acc1[2] + 1.f/256.f * acc1[3]) * sc8[1] * 1.f/16.f +
+                                 (acc2[0] + 1.f/256.f * acc2[1]) * sc8[4] +
+                                 (acc2[2] + 1.f/256.f * acc2[3]) * sc8[5] * 1.f/16.f) -
+                         dmin * (sumy[0] * sc8[2] + sumy[1] * sc8[3] + sumy[2] * sc8[6] + sumy[3] * sc8[7]);
+
+            q1 += step;
+            sc += step;
+            dh += step;
+        }
+
+        y4 += 4 * QK_K;
    }
-#else
-    uint16_t aux16[2];
-    thread const uint8_t * scales = (thread const uint8_t *)aux16;

-    const int il  = 4*tpitg.x;
-
-    for (int i = tpitg.y; i < nb; i += tptg.y) {
-
-        device const uint8_t * q = x[i].qs + il;
-        device const float   * y = yy + i * QK_K + il;
-
-        const float d = (float)x[i].d[0];
-        const float m = (float)x[i].d[1];
-
-        device const uint16_t * a = (device const uint16_t *)x[i].scales;
-        aux16[0] = a[0] & 0x0f0f;
-        aux16[1] = (a[0] >> 4) & 0x0f0f;
-
-        for (int l = 0; l < 4; ++l) {
-            sumf += d * scales[0] * (y[l+ 0] * (q[l] & 0xF) + y[l+16] * (q[l+16] & 0xF)) - m * scales[2] * (y[l+ 0] + y[l+16])
-                  + d * scales[1] * (y[l+32] * (q[l] >>  4) + y[l+48] * (q[l+16] >>  4)) - m * scales[3] * (y[l+32] + y[l+48]);
+    for (int row = 0; row < N_DST; ++row) {
+        all_sum = simd_sum(sumf[row]);
+        if (tiisg == 0) {
+            dst[r1*ne0 + first_row + row] = all_sum;
        }
    }
-#endif
-
-    sum[ith] = sumf;
-
-    //
-    // Accumulate the sum from all threads in the threadgroup
-    // This version is slightly faster than the commented out one below,
-    // which I copy-pasted from ggerganov's q4_0 dot product for metal.
-    //
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%4 == 0) {
-        for (int i = 1; i < 4; ++i) sum[ith] += sum[ith + i];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%16 == 0) {
-        for (int i = 4; i < 16; i += 4) sum[ith] += sum[ith + i];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith == 0) {
-        for (int i = 16; i < nth; i += 16) sum[0] += sum[i];
-        dst[r1*ne0 + r0] = sum[0];
-    }
-
-    //// accumulate the sum from all threads in the threadgroup
-    //threadgroup_barrier(mem_flags::mem_threadgroup);
-    //for (uint i = nth/2; i > 0; i /= 2) {
-    //    if (ith < i) {
-    //        sum[ith] += sum[ith + i];
-    //    }
-    //    threadgroup_barrier(mem_flags::mem_threadgroup);
-    //}
-
-    //if (ith == 0) {
-    //    dst[r1*ne0 + r0] = sum[0];
-    //}
 }
+#else
+kernel void kernel_mul_mat_q4_K_f32(
+        device const  void * src0,
+        device const float * src1,
+        device       float * dst,
+        constant   int64_t & ne00,
+        constant   int64_t & ne10,
+        constant   int64_t & ne0,
+        constant   int64_t & ne01[[buffer(4)]],
+        uint2 tgpig[[threadgroup_position_in_grid]],
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {
+
+    const int ix = tiisg/4;  // 0...7
+    const int it = tiisg%4;  // 0...3
+
+    const int nb = ne00/QK_K;
+    const int r0 = tgpig.x;
+    const int r1 = tgpig.y;
+    const int first_row = (r0 * N_SIMDGROUP + sgitg) * N_DST;
+    const int ib_row = first_row * nb;
+    device const block_q4_K * x = (device const block_q4_K *) src0 + ib_row;
+    device const float      * y = (device const float      *) src1 + r1*ne10;
+    float yl[8];
+    float yh[8];
+    float sumf[N_DST]={0.f}, all_sum;
+
+    const int step = sizeof(block_q4_K) * nb / 2;
+
+    device const float * y4 = y + ix * QK_K + 8 * it;
+
+    uint16_t sc16[4];
+
+    for (int ib = ix; ib < nb; ib += 8) {
+
+        float2 sumy = {0.f, 0.f};
+        for (int i = 0; i < 8; ++i) {
+            yl[i] = y4[i+ 0]; sumy[0] += yl[i];
+            yh[i] = y4[i+32]; sumy[1] += yh[i];
+        }
+
+        device const uint16_t * sc = (device const uint16_t *)x[ib].scales;
+        device const uint16_t * qs = (device const uint16_t *)x[ib].qs + 4 * it;
+        device const half     * dh = x[ib].d;
+
+        for (int row = 0; row < N_DST; row++) {
+
+            sc16[0] = sc[0] & 0x000f;
+            sc16[1] = sc[0] & 0x0f00;
+            sc16[2] = sc[0] & 0x00f0;
+            sc16[3] = sc[0] & 0xf000;
+
+            float2 acc1 = {0.f, 0.f};
+            float2 acc2 = {0.f, 0.f};
+            for (int i = 0; i < 8; i += 2) {
+                acc1[0] += yl[i+0] * (qs[i/2] & 0x000F);
+                acc1[1] += yl[i+1] * (qs[i/2] & 0x0F00);
+                acc2[0] += yh[i+0] * (qs[i/2] & 0x00F0);
+                acc2[1] += yh[i+1] * (qs[i/2] & 0xF000);
+            }
+
+            float dall = dh[0];
+            float dmin = dh[1];
+            sumf[row] += dall * ((acc1[0] + 1.f/256.f * acc1[1]) * sc16[0] +
+                                 (acc2[0] + 1.f/256.f * acc2[1]) * sc16[1] * 1.f/4096.f) -
+                         dmin * 1.f/16.f * (sumy[0] * sc16[2] + sumy[1] * sc16[3] * 1.f/256.f);
+
+            qs += step;
+            sc += step;
+            dh += step;
+        }
+
+        y4 += 8 * QK_K;
+    }
+
+    for (int row = 0; row < N_DST; ++row) {
+        all_sum = simd_sum(sumf[row]);
+        if (tiisg == 0) {
+            dst[r1*ne0 + first_row + row] = all_sum;
+        }
+    }
+}
+#endif

 kernel void kernel_mul_mat_q5_K_f32(
        device const  void * src0,
@@ -1629,39 +1670,39 @@ kernel void kernel_mul_mat_q5_K_f32(
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
-        threadgroup float  * sum [[threadgroup(0)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
-        uint2 tpitg[[thread_position_in_threadgroup]],
-        uint2  tptg[[threads_per_threadgroup]]) {
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {

    const int nb = ne00/QK_K;

    const int64_t r0 = tgpig.x;
    const int64_t r1 = tgpig.y;

-    device const block_q5_K * x = (device const block_q5_K *) src0 + r0*nb;
+    const int first_row = (r0 * N_SIMDGROUP + sgitg) * 2;
+
+    device const block_q5_K * x = (device const block_q5_K *) src0 + first_row*nb;
    device const float     * yy = (device const float      *) src1 + r1*ne10;

-    const int nth = tptg.x*tptg.y;
-    const int ith = tptg.y*tpitg.x + tpitg.y;
+    float sumf[2]={0.f};

-    float sumf = 0;
+    const int step = sizeof(block_q5_K) * nb;

 #if QK_K == 256
+#
+    float yl[16], yh[16];

    const uint16_t kmask1 = 0x3f3f;
    const uint16_t kmask2 = 0x0f0f;
    const uint16_t kmask3 = 0xc0c0;

-    const int tid = tpitg.y;   // 0...16
-    const int il  = tid/4;     // 0...3
-    const int ir  = tid - 4*il;// 0...3
-    const int n   = 4;
+    const int tid = tiisg/4;
+    const int ix  = tiisg%4;
+    const int im  = tid/4;
+    const int ir  = tid%4;
+    const int n   = 8;

-    const int im = il/2;  // 0 or 1. 0 computes 0,32 + 128,160, 1 computes 64,96 + 192,224
-    const int in = il%2;
-
-    const int l0 = n*(2*ir + in);
+    const int l0 = n*ir;
    const int q_offset = 32*im + l0;
    const int y_offset = 64*im + l0;

@@ -1670,78 +1711,114 @@ kernel void kernel_mul_mat_q5_K_f32(
    const uint8_t hm3 = hm1 << 4;
    const uint8_t hm4 = hm2 << 4;

-    uchar2 sc1, sc2, sc3, sc4;
+    uint16_t sc16[4];
+    thread const uint8_t * sc8 = (thread const uint8_t *)sc16;

-    for (int i = tpitg.x; i < nb; i += tptg.x) {
+    device const float * y1 = yy + ix*QK_K + y_offset;

-        device const uint8_t * q1 = (x + i)->qs + q_offset;
-        device const uint8_t * q2 = q1 + 64;
-        device const uint8_t * qh = (x + i)->qh + l0;
-        device const float   * y1 = yy + i*QK_K + y_offset;
-        device const float   * y2 = y1 + 128;
+    for (int i = ix; i < nb; i += 4) {

-        const float dall = (float)((x + i)->d);
-        const float dmin = (float)((x + i)->dmin);
+        device const uint8_t * q1 = x[i].qs + q_offset;
+        device const uint8_t * qh = x[i].qh + l0;
+        device const half * dh = &x[i].d;
+        device const uint16_t * a = (device const uint16_t *)x[i].scales + im;

-        device const uint16_t * a = (device const uint16_t *)(x + i)->scales;
-        sc1 = as_type<uchar2>((uint16_t)(a[im+0] & kmask1));
-        sc2 = as_type<uchar2>((uint16_t)(a[im+2] & kmask1));
-        sc3 = as_type<uchar2>((uint16_t)(((a[im+4] >> 0) & kmask2) | ((a[im+0] & kmask3) >> 2)));
-        sc4 = as_type<uchar2>((uint16_t)(((a[im+4] >> 4) & kmask2) | ((a[im+2] & kmask3) >> 2)));
+        device const float * y2 = y1 + 128;
+        float4 sumy = {0.f, 0.f, 0.f, 0.f};
+        for (int l = 0; l < 8; ++l) {
+            yl[l+0] = y1[l+ 0]; sumy[0] += yl[l+0];
+            yl[l+8] = y1[l+32]; sumy[1] += yl[l+8];
+            yh[l+0] = y2[l+ 0]; sumy[2] += yh[l+0];
+            yh[l+8] = y2[l+32]; sumy[3] += yh[l+8];
+        }

-        float4 s = {0.f, 0.f, 0.f, 0.f};
-        float smin = 0;
-        for (int l = 0; l < n; ++l) {
+        for (int row = 0; row < 2; ++row) {

-            s[0] += y1[l+ 0] * ((q1[l] & 0xF) + (qh[l] & hm1 ? 16 : 0));
-            s[1] += y1[l+32] * ((q1[l] >>  4) + (qh[l] & hm2 ? 16 : 0));
-            s[2] += y2[l+ 0] * ((q2[l] & 0xF) + (qh[l] & hm3 ? 16 : 0));
-            s[3] += y2[l+32] * ((q2[l] >>  4) + (qh[l] & hm4 ? 16 : 0));
-            smin += y1[l] * sc2[0] + y1[l+32] * sc2[1] + y2[l] * sc4[0] + y2[l+32] * sc4[1];
+            device const uint8_t * q2 = q1 + 64;
+
+            sc16[0] = a[0] & kmask1;
+            sc16[1] = a[2] & kmask1;
+            sc16[2] = ((a[4] >> 0) & kmask2) | ((a[0] & kmask3) >> 2);
+            sc16[3] = ((a[4] >> 4) & kmask2) | ((a[2] & kmask3) >> 2);
+
+            float4 acc = {0.f, 0.f, 0.f, 0.f};
+            for (int l = 0; l < n; ++l) {
+                uint8_t h = qh[l];
+                acc[0] += yl[l+0] * ((uint16_t)(q1[l] & 0x0F) + (h & hm1 ? 16 : 0));
+                acc[1] += yl[l+8] * ((uint16_t)(q1[l] & 0xF0) + (h & hm2 ? 256 : 0));
+                acc[2] += yh[l+0] * ((uint16_t)(q2[l] & 0x0F) + (h & hm3 ? 16 : 0));
+                acc[3] += yh[l+8] * ((uint16_t)(q2[l] & 0xF0) + (h & hm4 ? 256 : 0));
+            }
+            const float dall = dh[0];
+            const float dmin = dh[1];
+            sumf[row] += dall * (acc[0] * sc8[0] + acc[1] * sc8[1] * 1.f/16.f + acc[2] * sc8[4] + acc[3] * sc8[5] * 1.f/16.f) -
+                         dmin * (sumy[0] * sc8[2] + sumy[1] * sc8[3] + sumy[2] * sc8[6] + sumy[3] * sc8[7]);
+
+            q1 += step;
+            qh += step;
+            dh += step/2;
+            a  += step/2;

        }
-        sumf += dall * (s[0] * sc1[0] + s[1] * sc1[1] + s[2] * sc3[0] + s[3] * sc3[1]) - dmin * smin;
+
+        y1 += 4 * QK_K;

    }
 #else
-    const int il  = 4 * tpitg.x;  // 0, 4, 8, 12
-    const int im  = il/8;         // 0, 0, 1, 1
-    const int in  = il%8;         // 0, 4, 0, 4
+    float yl[8], yh[8];

-    for (int i = tpitg.y; i < nb; i += tptg.y) {
+    const int il = 4 * (tiisg/8);  // 0, 4, 8, 12
+    const int ix = tiisg%8;
+    const int im = il/8;         // 0, 0, 1, 1
+    const int in = il%8;         // 0, 4, 0, 4

-        const float d = (float)x[i].d;
+    device const float * y = yy + ix*QK_K + il;
+
+    for (int i = ix; i < nb; i += 8) {
+
+        float4 sumy = {0.f, 0.f, 0.f, 0.f};
+        for (int l = 0; l < 4; ++l) {
+            yl[l+0] = y[l+ 0];
+            yl[l+4] = y[l+16];
+            yh[l+0] = y[l+32];
+            yh[l+4] = y[l+48];
+        }
+
+        device const half * dh = &x[i].d;
        device const uint8_t * q = x[i].qs + il;
        device const uint8_t * h = x[i].qh + in;
        device const int8_t  * s = x[i].scales;
-        device const float   * y = yy + i*QK_K + il;

-        for (int l = 0; l < 4; ++l) {
-            const uint8_t hl = h[l] >> im;
-            sumf += y[l+ 0] * d * s[0] * ((q[l+ 0] & 0xF) - (hl & 0x01 ? 0 : 16))
-                  + y[l+16] * d * s[1] * ((q[l+16] & 0xF) - (hl & 0x04 ? 0 : 16))
-                  + y[l+32] * d * s[2] * ((q[l+ 0] >>  4) - (hl & 0x10 ? 0 : 16))
-                  + y[l+48] * d * s[3] * ((q[l+16] >>  4) - (hl & 0x40 ? 0 : 16));
+        for (int row = 0; row < 2; ++row) {
+
+            const float d = dh[0];
+
+            float2 acc = {0.f, 0.f};
+            for (int l = 0; l < 4; ++l) {
+                const uint8_t hl = h[l] >> im;
+                acc[0] += yl[l+0] * s[0] * ((int16_t)(q[l+ 0] & 0x0F) - (hl & 0x01 ? 0 : 16))
+                        + yl[l+4] * s[1] * ((int16_t)(q[l+16] & 0x0F) - (hl & 0x04 ? 0 : 16));
+                acc[1] += yh[l+0] * s[2] * ((int16_t)(q[l+ 0] & 0xF0) - (hl & 0x10 ? 0 : 256))
+                        + yh[l+4] * s[3] * ((int16_t)(q[l+16] & 0xF0) - (hl & 0x40 ? 0 : 256));
+            }
+            sumf[row] += d * (acc[0] + 1.f/16.f * acc[1]);
+
+            q += step;
+            h += step;
+            s += step;
+            dh += step/2;
+
        }
+
+        y += 8 * QK_K;
    }
 #endif
-    sum[ith] = sumf;

-    //
-    // Accumulate the sum from all threads in the threadgroup
-    //
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%4 == 0) {
-        sum[ith] += sum[ith+1] + sum[ith+2] + sum[ith+3];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%16 == 0) {
-        sum[ith] += sum[ith+4] + sum[ith+8] + sum[ith+12];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith == 0) {
-        for (int i = 16; i < nth; i += 16) sum[0] += sum[i];
-        dst[r1*ne0 + r0] = sum[0];
+    for (int row = 0; row < 2; ++row) {
+        const float tot = simd_sum(sumf[row]);
+        if (tiisg == 0) {
+            dst[r1*ne0 + first_row + row] = tot;
+        }
    }

 }
@@ -1753,10 +1830,9 @@ kernel void kernel_mul_mat_q6_K_f32(
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
-        threadgroup float  * sum [[threadgroup(0)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
-        uint2 tpitg[[thread_position_in_threadgroup]],
-        uint2  tptg[[threads_per_threadgroup]]) {
+        uint tiisg[[thread_index_in_simdgroup]],
+        uint sgitg[[simdgroup_index_in_threadgroup]]) {

    const uint8_t kmask1 = 0x03;
    const uint8_t kmask2 = 0x0C;
@@ -1768,19 +1844,18 @@ kernel void kernel_mul_mat_q6_K_f32(
    const int64_t r0 = tgpig.x;
    const int64_t r1 = tgpig.y;

-    device const block_q6_K * x = (device const block_q6_K *) src0 + r0*nb;
-    device const float     * yy = (device const float      *) src1 + r1*ne10;
+    const int row = 2 * r0 + sgitg;

-    const int nth = tptg.x*tptg.y;
-    const int ith = tptg.y*tpitg.x + tpitg.y;
+    device const block_q6_K * x = (device const block_q6_K *) src0 + row * nb; //r0*nb;
+    device const float     * yy = (device const float      *) src1 + r1*ne10;

    float sumf = 0;

 #if QK_K == 256
-    // Note: we absolutely assume that tptg.y = 16 and QK_K = 256!
-    const int iqs  = 16 * tpitg.y;
-    const int ip   = iqs / 128;         // 0 or 1
-    const int il   = (iqs - 128*ip)/16; // 0...7
+    const int tid  = tiisg/2;
+    const int ix   = tiisg%2;
+    const int ip   = tid/8;         // 0 or 1
+    const int il   = tid%8;
    const int n    = 4;
    const int l0   = n*il;
    const int is   = 8*ip + l0/16;
@@ -1789,9 +1864,10 @@ kernel void kernel_mul_mat_q6_K_f32(
    const int q_offset_l = 64*ip + l0;
    const int q_offset_h = 32*ip + l0;

-    for (int i = tpitg.x; i < nb; i += tptg.x) {
+    for (int i = ix; i < nb; i += 2) {

-        device const uint8_t * ql = x[i].ql + q_offset_l;
+        device const uint8_t * q1 = x[i].ql + q_offset_l;
+        device const uint8_t * q2 = q1 + 32;
        device const uint8_t * qh = x[i].qh + q_offset_h;
        device const int8_t  * sc = x[i].scales + is;

@@ -1801,19 +1877,21 @@ kernel void kernel_mul_mat_q6_K_f32(

        float4 sums = {0.f, 0.f, 0.f, 0.f};
        for (int l = 0; l < n; ++l) {
-            sums[0] += y[l+ 0] * ((int8_t)((ql[l+ 0] & 0xF) | ((qh[l] & kmask1) << 4)) - 32);
-            sums[1] += y[l+32] * ((int8_t)((ql[l+32] & 0xF) | ((qh[l] & kmask2) << 2)) - 32);
-            sums[2] += y[l+64] * ((int8_t)((ql[l+ 0]  >> 4) | ((qh[l] & kmask3) << 0)) - 32);
-            sums[3] += y[l+96] * ((int8_t)((ql[l+32]  >> 4) | ((qh[l] & kmask4) >> 2)) - 32);
+            sums[0] += y[l+ 0] * ((int8_t)((q1[l] & 0xF) | ((qh[l] & kmask1) << 4)) - 32);
+            sums[1] += y[l+32] * ((int8_t)((q2[l] & 0xF) | ((qh[l] & kmask2) << 2)) - 32);
+            sums[2] += y[l+64] * ((int8_t)((q1[l]  >> 4) | ((qh[l] & kmask3) << 0)) - 32);
+            sums[3] += y[l+96] * ((int8_t)((q2[l]  >> 4) | ((qh[l] & kmask4) >> 2)) - 32);
        }

        sumf += dall * (sums[0] * sc[0] + sums[1] * sc[2] + sums[2] * sc[4] + sums[3] * sc[6]);

    }
-#else
-    const int il  = 4*tpitg.x;    // 0, 4, 8, 12

-    for (int i = tpitg.y; i < nb; i += tptg.y) {
+#else
+    const int ix  = tiisg/4;
+    const int il  = 4*(tiisg%4);
+
+    for (int i = ix; i < nb; i += 8) {
        device const float * y = yy + i * QK_K + il;
        device const uint8_t * ql = x[i].ql + il;
        device const uint8_t * qh = x[i].qh + il;
@@ -1833,23 +1911,8 @@ kernel void kernel_mul_mat_q6_K_f32(

 #endif

-    sum[ith] = sumf;
-
-    //
-    // Accumulate the sum from all threads in the threadgroup
-    //
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%4 == 0) {
-        for (int i = 1; i < 4; ++i) sum[ith] += sum[ith + i];
+    const float tot = simd_sum(sumf);
+    if (tiisg == 0) {
+        dst[r1*ne0 + row] = tot;
    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith%16 == 0) {
-        for (int i = 4; i < 16; i += 4) sum[ith] += sum[ith + i];
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (ith == 0) {
-        for (int i = 16; i < nth; i += 16) sum[0] += sum[i];
-        dst[r1*ne0 + r0] = sum[0];
-    }
-
 }
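Review note on the q5_K kernel above: the rewritten `QK_K == 256` branch derives every per-lane offset from `tiisg` alone. The plain-C sketch below mirrors that index arithmetic so the mapping can be checked on the host. The struct and function names are hypothetical, and the lane count of 32 is an assumption about the SIMD-group width that this diff does not state.

```c
#include <assert.h>

/* Hypothetical struct for illustration; the kernel keeps these as plain locals. */
struct q5k_lane {
    int ix;        /* first block handled by this lane; lanes then stride by 4 blocks */
    int q_offset;  /* byte offset into the block's qs array */
    int y_offset;  /* float offset into the matching QK_K-sized slice of src1 */
};

/* Mirrors the index arithmetic of the QK_K == 256 branch.
 * Assumption: tiisg ranges over [0, 32). */
static struct q5k_lane q5k_lane_layout(int tiisg) {
    assert(tiisg >= 0 && tiisg < 32);

    const int tid = tiisg / 4;  /* 0..7 */
    const int ix  = tiisg % 4;  /* 0..3 */
    const int im  = tid / 4;    /* 0 or 1 */
    const int ir  = tid % 4;    /* 0..3 */
    const int n   = 8;          /* quants handled per inner loop */
    const int l0  = n * ir;

    struct q5k_lane lane;
    lane.ix       = ix;
    lane.q_offset = 32 * im + l0;
    lane.y_offset = 64 * im + l0;
    return lane;
}
```

Under that assumption, the eight `tid` values give `q_offset` values 0, 8, ..., 56, and the four `ix` values interleave lanes across blocks, which matches the `i += 4` stride of the outer loop.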
--- a/llama/ggml-mpi.c
+++ b/llama/ggml-mpi.c
@@ -0,0 +1,244 @@
+//go:build mpi
+
+/**
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
+ *
+ * MIT License
+ *
+ * Copyright (c) 2023 Georgi Gerganov
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ggml-mpi.h"
+
+#include "ggml.h"
+
+#include <mpi.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define UNUSED GGML_UNUSED
+
+struct ggml_mpi_context {
+    int rank;
+    int size;
+};
+
+void ggml_mpi_backend_init(void) {
+    MPI_Init(NULL, NULL);
+}
+
+void ggml_mpi_backend_free(void) {
+    MPI_Finalize();
+}
+
+struct ggml_mpi_context * ggml_mpi_init(void) {
+    struct ggml_mpi_context * ctx = calloc(1, sizeof(struct ggml_mpi_context));
+
+    MPI_Comm_rank(MPI_COMM_WORLD, &ctx->rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &ctx->size);
+
+    return ctx;
+}
+
+void ggml_mpi_free(struct ggml_mpi_context * ctx) {
+    free(ctx);
+}
+
+int ggml_mpi_rank(struct ggml_mpi_context * ctx) {
+    return ctx->rank;
+}
+
+void ggml_mpi_eval_init(
+        struct ggml_mpi_context * ctx_mpi,
+                            int * n_tokens,
+                            int * n_past,
+                            int * n_threads) {
+    UNUSED(ctx_mpi);
+
+    // synchronize the worker node parameters with the root node
+    MPI_Barrier(MPI_COMM_WORLD);
+
+    MPI_Bcast(n_tokens,  1, MPI_INT, 0, MPI_COMM_WORLD);
+    MPI_Bcast(n_past,    1, MPI_INT, 0, MPI_COMM_WORLD);
+    MPI_Bcast(n_threads, 1, MPI_INT, 0, MPI_COMM_WORLD);
+}
+
+static int ggml_graph_get_node_idx(struct ggml_cgraph * gf, const char * name) {
+    struct ggml_tensor * t = ggml_graph_get_tensor(gf, name);
+    if (t == NULL) {
+        fprintf(stderr, "%s: tensor %s not found\n", __func__, name);
+        return -1;
+    }
+
+    for (int i = 0; i < gf->n_nodes; i++) {
+        if (gf->nodes[i] == t) {
+            return i;
+        }
+    }
+
+    fprintf(stderr, "%s: tensor %s not found in graph (should not happen)\n", __func__, name);
+    return -1;
+}
+
+static void ggml_mpi_tensor_send(struct ggml_tensor * t, int mpi_rank_dst) {
+    MPI_Datatype mpi_type;
+
+    switch (t->type) {
+        case GGML_TYPE_I32: mpi_type = MPI_INT32_T; break;
+        case GGML_TYPE_F32: mpi_type = MPI_FLOAT;   break;
+        default: GGML_ASSERT(false && "not implemented");
+    }
+
+    const int retval = MPI_Send(t->data, ggml_nelements(t), mpi_type, mpi_rank_dst, 0, MPI_COMM_WORLD);
+    GGML_ASSERT(retval == MPI_SUCCESS);
+}
+
+static void ggml_mpi_tensor_recv(struct ggml_tensor * t, int mpi_rank_src) {
+    MPI_Datatype mpi_type;
+
+    switch (t->type) {
+        case GGML_TYPE_I32: mpi_type = MPI_INT32_T; break;
+        case GGML_TYPE_F32: mpi_type = MPI_FLOAT;   break;
+        default: GGML_ASSERT(false && "not implemented");
+    }
+
+    MPI_Status status; UNUSED(status);
+
+    const int retval = MPI_Recv(t->data, ggml_nelements(t), mpi_type, mpi_rank_src, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
+    GGML_ASSERT(retval == MPI_SUCCESS);
+}
+
+// TODO: there are many improvements that can be done to this implementation
+void ggml_mpi_graph_compute_pre(
+        struct ggml_mpi_context * ctx_mpi,
+             struct ggml_cgraph * gf,
+                            int   n_layers) {
+    const int mpi_rank = ctx_mpi->rank;
+    const int mpi_size = ctx_mpi->size;
+
+    struct ggml_tensor * inp_tokens = ggml_graph_get_tensor(gf, "inp_tokens");
+    if (inp_tokens == NULL) {
+        fprintf(stderr, "%s: tensor 'inp_tokens' not found\n", __func__);
+        return;
+    }
+
+    struct ggml_tensor * inp0 = ggml_graph_get_tensor(gf, "layer_inp_0");
+    if (inp0 == NULL) {
+        fprintf(stderr, "%s: tensor 'inp0' not found\n", __func__);
+        return;
+    }
+
+    GGML_ASSERT(inp0 == gf->nodes[0]);
+
+    // distribute the compute graph into slices across the MPI nodes
+    //
+    // the main node (0) processes the last layers + the remainder of the compute graph
+    // and is responsible to pass the input tokens to the first node (1)
+    //
+    // node 1:   [(  0) * n_per_node, (  1) * n_per_node)
+    // node 2:   [(  1) * n_per_node, (  2) * n_per_node)
+    // ...
+    // node n-1: [(n-2) * n_per_node, (n-1) * n_per_node)
+    // node 0:   [(n-1) * n_per_node,            n_nodes)
+    //
+    if (mpi_rank > 0) {
+        if (mpi_rank == 1) {
+            // the first node (1) receives the input tokens from the main node (0)
+            ggml_mpi_tensor_recv(inp_tokens, 0);
+        } else {
+            // recv input data for each node into the "inp0" tensor (i.e. the first node in the compute graph)
+            ggml_mpi_tensor_recv(inp0, mpi_rank - 1);
+        }
+    } else if (mpi_size > 1) {
+        // node 0 sends the input tokens to node 1
+        ggml_mpi_tensor_send(inp_tokens, 1);
+
+        // recv the output data from the last node
+        ggml_mpi_tensor_recv(inp0, mpi_size - 1);
+    }
+
+    {
+        const int n_per_node = (n_layers + (mpi_size - 1)) / mpi_size;
+
+        const int mpi_idx = mpi_rank > 0 ? mpi_rank - 1 : mpi_size - 1;
+
+        const int il0 =               (mpi_idx + 0) * n_per_node;
+        const int il1 = MIN(n_layers, (mpi_idx + 1) * n_per_node);
+
+        char name_l0[GGML_MAX_NAME];
+        char name_l1[GGML_MAX_NAME];
+
+        snprintf(name_l0, sizeof(name_l0), "layer_inp_%d", il0);
+        snprintf(name_l1, sizeof(name_l1), "layer_inp_%d", il1);
+
+        const int idx_l0 =                ggml_graph_get_node_idx(gf, name_l0);
+        const int idx_l1 = mpi_rank > 0 ? ggml_graph_get_node_idx(gf, name_l1) + 1 : gf->n_nodes;
+
+        if (idx_l0 < 0 || idx_l1 < 0) {
+            fprintf(stderr, "%s: layer input nodes not found\n", __func__);
+            return;
+        }
+
+        // attach the input data to all nodes that need it
+        // TODO: not great - should be able to do this without modifying the compute graph (see next TODO below)
+        for (int i = idx_l0; i < idx_l1; i++) {
+            if (gf->nodes[i]->src[0] == gf->nodes[idx_l0]) {
+                gf->nodes[i]->src[0] =  inp0;
+            }
+            if (gf->nodes[i]->src[1] == gf->nodes[idx_l0]) {
+                gf->nodes[i]->src[1] =  inp0;
+            }
+        }
+
+        // TODO: instead of rearranging the nodes, we should be able to execute a subset of the compute graph
+        for (int i = 1; i < idx_l1 - idx_l0; i++) {
+            gf->nodes[i] = gf->nodes[idx_l0 + i];
+            gf->grads[i] = gf->grads[idx_l0 + i];
+        }
+
+        // the first node performs the "get_rows" operation, the rest of the nodes get the data from the previous node
+        if (mpi_idx != 0) {
+            gf->nodes[0]->op = GGML_OP_NONE;
+        }
+
+        gf->n_nodes = idx_l1 - idx_l0;
+
+        //fprintf(stderr, "%s: node %d: processing %d nodes [%d, %d)\n", __func__, mpi_rank, gf->n_nodes, il0, il1);
+    }
+}
+
+void ggml_mpi_graph_compute_post(
+        struct ggml_mpi_context * ctx_mpi,
+             struct ggml_cgraph * gf,
+                            int   n_layers) {
+    UNUSED(n_layers);
+
+    const int mpi_rank = ctx_mpi->rank;
+    const int mpi_size = ctx_mpi->size;
+
+    // send the output data to the next node
+    if (mpi_rank > 0) {
+        ggml_mpi_tensor_send(gf->nodes[gf->n_nodes - 1], (mpi_rank + 1) % mpi_size);
+    }
+}
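Review note on `ggml_mpi_graph_compute_pre` above: the layer range each rank processes follows from three lines of integer arithmetic. The sketch below isolates that arithmetic without any MPI dependency. `struct layer_range` and `mpi_layer_range` are hypothetical names introduced only for this illustration; the formulas are the ones in the diff.

```c
#include <assert.h>

/* Hypothetical helper type: half-open layer range [il0, il1). */
struct layer_range {
    int il0;
    int il1;
};

/* Same arithmetic as in ggml_mpi_graph_compute_pre: ranks 1..size-1 take the
 * leading slices in order, and rank 0 takes the last slice. */
static struct layer_range mpi_layer_range(int mpi_rank, int mpi_size, int n_layers) {
    assert(mpi_size > 0 && mpi_rank >= 0 && mpi_rank < mpi_size);

    /* ceil(n_layers / mpi_size) */
    const int n_per_node = (n_layers + (mpi_size - 1)) / mpi_size;

    /* rank 0 is mapped to the last slice index */
    const int mpi_idx = mpi_rank > 0 ? mpi_rank - 1 : mpi_size - 1;

    const int end = (mpi_idx + 1) * n_per_node;

    struct layer_range r;
    r.il0 = mpi_idx * n_per_node;
    r.il1 = end < n_layers ? end : n_layers;
    return r;
}
```

For example, 32 layers on 4 ranks give rank 1 the range [0, 8) and rank 0 the range [24, 32). In the diff, rank 0 additionally runs everything after its last layer, because its upper node index is `gf->n_nodes`.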
--- a/llama/ggml-mpi.h
+++ b/llama/ggml-mpi.h
@@ -0,0 +1,67 @@
+//go:build mpi
+
+/**
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
+ *
+ * MIT License
+ *
+ * Copyright (c) 2023 Georgi Gerganov
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#pragma once
+
+struct ggml_context;
+struct ggml_tensor;
+struct ggml_cgraph;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct ggml_mpi_context;
+
+void ggml_mpi_backend_init(void);
+void ggml_mpi_backend_free(void);
+
+struct ggml_mpi_context * ggml_mpi_init(void);
+void ggml_mpi_free(struct ggml_mpi_context * ctx);
+
+int ggml_mpi_rank(struct ggml_mpi_context * ctx);
+
+void ggml_mpi_eval_init(
+        struct ggml_mpi_context * ctx_mpi,
+                            int * n_tokens,
+                            int * n_past,
+                            int * n_threads);
+
+void ggml_mpi_graph_compute_pre(
+        struct ggml_mpi_context * ctx_mpi,
+             struct ggml_cgraph * gf,
+                            int   n_layers);
+
+void ggml_mpi_graph_compute_post(
+        struct ggml_mpi_context * ctx_mpi,
+             struct ggml_cgraph * gf,
+                            int   n_layers);
+
+#ifdef __cplusplus
+}
+#endif
--- a/llama/ggml-opencl.cpp
+++ b/llama/ggml-opencl.cpp
--- a/llama/ggml-opencl.h
+++ b/llama/ggml-opencl.h
@@ -0,0 +1,53 @@
+//go:build opencl
+
+/**
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
+ *
+ * MIT License
+ *
+ * Copyright (c) 2023 Georgi Gerganov
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#pragma once
+
+#include "ggml.h"
+
+#ifdef  __cplusplus
+extern "C" {
+#endif
+
+void ggml_cl_init(void);
+
+void   ggml_cl_mul(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);
+bool   ggml_cl_can_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);
+size_t ggml_cl_mul_mat_get_wsize(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);
+void   ggml_cl_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst, void * wdata, size_t wsize);
+
+void * ggml_cl_host_malloc(size_t size);
+void   ggml_cl_host_free(void * ptr);
+
+void ggml_cl_free_data(const struct ggml_tensor* tensor);
+
+void ggml_cl_transform_tensor(void * data, struct ggml_tensor * tensor);
+
+#ifdef  __cplusplus
+}
+#endif
--- a/llama/ggml.c
+++ b/llama/ggml.c
--- a/llama/ggml.h
+++ b/llama/ggml.h
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -227,8 +227,13 @@
 #define GGML_MAX_NAME          48
 #define GGML_DEFAULT_N_THREADS 4

+
+#define GGML_EXIT_SUCCESS 0
+#define GGML_EXIT_ABORTED 1
+
 #define GGML_UNUSED(x) (void)(x)

+
 #define GGML_ASSERT(x) \
    do { \
        if (!(x)) { \
@@ -389,6 +394,8 @@ extern "C" {
        GGML_OP_CLAMP,
        GGML_OP_CONV_1D,
        GGML_OP_CONV_2D,
+        GGML_OP_POOL_1D,
+        GGML_OP_POOL_2D,

        GGML_OP_FLASH_ATTN,
        GGML_OP_FLASH_FF,
@@ -468,6 +475,10 @@ extern "C" {

        // the `n_tasks` of nodes, 1:1 mapping to cgraph nodes
        int n_tasks[GGML_MAX_NODES];
+
+        // abort ggml_graph_compute when true
+        bool (*abort_callback)(void * data);
+        void * abort_callback_data;
    };

    // computation graph
@@ -1136,6 +1147,17 @@ extern "C" {
            int                   mode,
            int                   n_ctx);

+    // custom RoPE, in-place, returns view(a)
+    GGML_API struct ggml_tensor * ggml_rope_custom_inplace(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * a,
+            int                   n_past,
+            int                   n_dims,
+            int                   mode,
+            float                 freq_base,
+            float                 freq_scale,
+            int                   n_ctx);
+
    // rotary position embedding backward, i.e compute dx from dy
    // a - dy
    GGML_API struct ggml_tensor * ggml_rope_back(
@@ -1190,6 +1212,31 @@ extern "C" {
            int                   s,
            int                   d);

+    enum ggml_op_pool {
+        GGML_OP_POOL_MAX,
+        GGML_OP_POOL_AVG,
+        GGML_OP_POOL_COUNT,
+    };
+
+    GGML_API struct ggml_tensor* ggml_pool_1d(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * a,
+            enum ggml_op_pool     op,
+            int                   k0, // kernel size
+            int                   s0, // stride
+            int                   p0); // padding
+
+    GGML_API struct ggml_tensor* ggml_pool_2d(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * a,
+            enum ggml_op_pool     op,
+            int                   k0,
+            int                   k1,
+            int                   s0,
+            int                   s1,
+            int                   p0,
+            int                   p1);
+
    GGML_API struct ggml_tensor * ggml_flash_attn(
            struct ggml_context * ctx,
            struct ggml_tensor  * q,
@@ -1329,7 +1376,7 @@ extern "C" {
    // ggml_graph_plan() has to be called before ggml_graph_compute()
    // when plan.work_size > 0, caller must allocate memory for plan.work_data
    GGML_API struct ggml_cplan ggml_graph_plan   (struct ggml_cgraph * cgraph, int n_threads /*= GGML_DEFAULT_N_THREADS*/);
-    GGML_API              void ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan);
+    GGML_API               int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan);
    GGML_API              void ggml_graph_reset  (struct ggml_cgraph * cgraph);

    // same as ggml_graph_compute() but the work data is allocated as a part of the context
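Review note on the `ggml.h` changes above: `ggml_cplan` gains an `abort_callback` and `ggml_graph_compute` now returns an `int`. The matching `ggml.c` hunk is not shown in this view, so the sketch below is only a stand-in executor, not ggml's implementation. It assumes the callback is polled before each node, and every `sketch_`-prefixed name is hypothetical. Only the callback signature and the exit values 0 and 1 are taken from the header.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same values as GGML_EXIT_SUCCESS / GGML_EXIT_ABORTED in the header. */
#define SKETCH_EXIT_SUCCESS 0
#define SKETCH_EXIT_ABORTED 1

/* Hypothetical stand-in for the two fields added to ggml_cplan. */
struct sketch_plan {
    bool (*abort_callback)(void * data);
    void * abort_callback_data;
};

/* Stand-in executor: "computes" n_nodes nodes and polls the callback before
 * each one (an assumption about where the check happens). */
static int sketch_graph_compute(int n_nodes, const struct sketch_plan * plan, int * n_done) {
    *n_done = 0;
    for (int i = 0; i < n_nodes; i++) {
        if (plan->abort_callback != NULL && plan->abort_callback(plan->abort_callback_data)) {
            return SKETCH_EXIT_ABORTED;
        }
        (*n_done)++;  /* the real work for node i would happen here */
    }
    return SKETCH_EXIT_SUCCESS;
}

/* Example callback: request an abort once a caller-owned budget is used up. */
static bool sketch_abort_after_budget(void * data) {
    int * budget = (int *) data;
    if (*budget <= 0) {
        return true;
    }
    (*budget)--;
    return false;
}
```

A caller would point `abort_callback_data` at its own state (here an `int` budget) and compare the return value against the exit codes instead of ignoring it, since the function no longer returns `void`.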
--- a/llama/k_quants.c
+++ b/llama/k_quants.c
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
--- a/llama/k_quants.h
+++ b/llama/k_quants.h
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -41,6 +41,14 @@
 #define K_SCALE_SIZE 12
 #endif

+#ifndef static_assert
+#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201100L)
+#define static_assert(cond, msg) _Static_assert(cond, msg)
+#else
+#define static_assert(cond, msg) struct global_scope_noop_trick
+#endif
+#endif
+
 //
 // Super-block quantization structures
 //
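Review note on the `static_assert` fallback above: on a C11 or later compiler the macro forwards to `_Static_assert`, and on older compilers it expands to a harmless forward declaration, so the check silently disappears. The sketch below shows how such a guard can pin a block layout at compile time. `demo_block` is a hypothetical type, not one of the k-quants structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fallback as in the hunk above. */
#ifndef static_assert
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201100L)
#define static_assert(cond, msg) _Static_assert(cond, msg)
#else
#define static_assert(cond, msg) struct global_scope_noop_trick
#endif
#endif

/* Hypothetical quantization block: one 16-bit scale plus 16 packed bytes. */
typedef struct {
    uint16_t d;
    uint8_t  qs[16];
} demo_block;

/* Fails the build on C11+ if the compiler inserts unexpected padding. */
static_assert(sizeof(demo_block) == sizeof(uint16_t) + 16, "wrong demo_block size/padding");

/* Runtime view of the same quantity, handy for logging or tests. */
static size_t demo_block_size(void) {
    return sizeof(demo_block);
}
```

The trade-off is that pre-C11 builds compile without any layout check, so a size mismatch would only surface at runtime.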
--- a/llama/llama-util.h
+++ b/llama/llama-util.h
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -201,13 +201,13 @@ struct llama_mmap {
    llama_mmap(struct llama_file * file, size_t prefetch = (size_t) -1 /* -1 = max value */, bool numa = false) {
        size = file->size;
        int fd = fileno(file->fp);
-        int flags = MAP_PRIVATE;
+        int flags = MAP_SHARED;
        // prefetch/readahead impairs performance on NUMA systems
        if (numa) { prefetch = 0; }
 #ifdef __linux__
        if (prefetch) { flags |= MAP_POPULATE; }
 #endif
-        addr = mmap(NULL, file->size, PROT_READ | PROT_WRITE, flags, fd, 0);
+        addr = mmap(NULL, file->size, PROT_READ, flags, fd, 0);
        if (addr == MAP_FAILED) {
            throw std::runtime_error(format("mmap failed: %s", strerror(errno)));
        }
@@ -249,7 +249,7 @@ struct llama_mmap {
            throw std::runtime_error(format("CreateFileMappingA failed: %s", llama_format_win_err(error).c_str()));
        }

-        addr = MapViewOfFile(hMapping, FILE_MAP_COPY, 0, 0, 0);
+        addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
        error = GetLastError();
        CloseHandle(hMapping);

--- a/llama/llama.cpp
+++ b/llama/llama.cpp
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -127,14 +127,15 @@ static void ggml_graph_compute_helper(std::vector<uint8_t> & buf, ggml_cgraph *
 // memory sizes
 //

-static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
+static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0(int n_ctx)
 {
    static std::map<e_model, size_t> k_sizes = {
-        { MODEL_3B,    256ull * MB },
-        { MODEL_7B,    512ull * MB },
-        { MODEL_13B,   512ull * MB },
-        { MODEL_30B,   512ull * MB },
-        { MODEL_65B,  1024ull * MB },
+        /* empirical scaling, still a guess */
+        { MODEL_3B,   ((size_t) n_ctx / 16ull + 128ull) * MB },
+        { MODEL_7B,   ((size_t) n_ctx / 16ull + 256ull) * MB },
+        { MODEL_13B,  ((size_t) n_ctx / 12ull + 256ull) * MB },
+        { MODEL_30B,  ((size_t) n_ctx / 10ull + 256ull) * MB },
+        { MODEL_65B,  ((size_t) n_ctx /  8ull + 512ull) * MB },
    };
    return k_sizes;
 }
@@ -166,14 +167,14 @@ static const std::map<e_model, size_t> & MEM_REQ_KV_SELF()

 // this is mostly needed for temporary mul_mat buffers to dequantize the data
 // not actually needed if BLAS is disabled
-static const std::map<e_model, size_t> & MEM_REQ_EVAL()
+static const std::map<e_model, size_t> & MEM_REQ_EVAL(int n_ctx)
 {
    static std::map<e_model, size_t> k_sizes = {
-        { MODEL_3B,   512ull * MB },
-        { MODEL_7B,   768ull * MB },
-        { MODEL_13B, 1024ull * MB },
-        { MODEL_30B, 1280ull * MB },
-        { MODEL_65B, 1536ull * MB },
+        { MODEL_3B,  ((size_t) n_ctx / 256ull +  512ull) * MB },
+        { MODEL_7B,  ((size_t) n_ctx / 256ull +  768ull) * MB },
+        { MODEL_13B, ((size_t) n_ctx / 256ull + 1024ull) * MB },
+        { MODEL_30B, ((size_t) n_ctx / 256ull + 1280ull) * MB },
+        { MODEL_65B, ((size_t) n_ctx / 256ull + 1536ull) * MB },
    };
    return k_sizes;
 }
@@ -215,6 +216,10 @@ struct llama_hparams {
    uint32_t n_head  = 32;
    uint32_t n_layer = 32;
    uint32_t n_rot   = 64;
+
+    float rope_freq_base  = 10000.0f;
+    float rope_freq_scale = 1.0f;
+
    enum llama_ftype ftype = LLAMA_FTYPE_MOSTLY_F16;

    bool operator!=(const llama_hparams & other) const {
@@ -329,7 +334,7 @@ struct llama_model {
 };

 struct llama_context {
-    llama_context(const llama_model & model, const llama_vocab & vocab) : model(model), vocab(vocab), t_load_us(model.t_load_us), t_start_us(model.t_start_us) {}
+    llama_context(const llama_model & model) : model(model), t_load_us(model.t_load_us), t_start_us(model.t_start_us) {}
 #ifdef GGML_USE_METAL
    ~llama_context() {
        if (ctx_metal) {
@@ -350,7 +355,6 @@ struct llama_context {
    int32_t n_p_eval = 0; // number of tokens in eval calls for the prompt (with batch size > 1)

    const llama_model & model;
-    const llama_vocab & vocab;

    bool model_owner = false;

@@ -577,7 +581,9 @@ struct llama_file_loader {
            }

            // skip to the next multiple of 32 bytes
-            file.seek(-static_cast<ptrdiff_t>(file.tell()) & 31, SEEK_CUR);
+            if (file_version >= LLAMA_FILE_VERSION_GGJT_V1) {
+                file.seek(-static_cast<ptrdiff_t>(file.tell()) & 31, SEEK_CUR);
+            }

            tensor.file_off = file.tell();
            tensor.name = name;
@@ -674,7 +680,7 @@ struct llama_model_loader {
        *ctx_size_p = *mmapped_size_p = 0;
        for (const llama_load_tensor & lt : tensors_map.tensors) {
            *ctx_size_p += sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE;
-            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size;
+            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size + 16;
        }
    }

@@ -870,6 +876,8 @@ struct llama_context_params llama_context_default_params() {
        /*.gpu_layers                  =*/ 0,
        /*.main_gpu                    =*/ 0,
        /*.tensor_split                =*/ {0},
+        /*.rope_freq_base              =*/ 10000.0f,
+        /*.rope_freq_scale             =*/ 1.0f,
        /*.progress_callback           =*/ nullptr,
        /*.progress_callback_user_data =*/ nullptr,
        /*.low_vram                    =*/ false,
@@ -895,6 +903,10 @@ struct llama_model_quantize_params llama_model_quantize_default_params() {
    return result;
 }

+int llama_max_devices() {
+    return LLAMA_MAX_DEVICES;
+}
+
 bool llama_mmap_supported() {
    return llama_mmap::SUPPORTED;
 }
@@ -993,6 +1005,8 @@ static void llama_model_load_internal(
        int n_gpu_layers,
        int main_gpu,
        const float * tensor_split,
+        float rope_freq_base,
+        float rope_freq_scale,
        bool low_vram,
        ggml_type memory_type,
        bool use_mmap,
@@ -1027,22 +1041,27 @@ static void llama_model_load_internal(
        }

        hparams.n_ctx = n_ctx;
+
+        hparams.rope_freq_base  = rope_freq_base;
+        hparams.rope_freq_scale = rope_freq_scale;
    }

    const uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;

    {
-        fprintf(stderr, "%s: format     = %s\n",  __func__, llama_file_version_name(file_version));
-        fprintf(stderr, "%s: n_vocab    = %u\n",  __func__, hparams.n_vocab);
-        fprintf(stderr, "%s: n_ctx      = %u\n",  __func__, hparams.n_ctx);
-        fprintf(stderr, "%s: n_embd     = %u\n",  __func__, hparams.n_embd);
-        fprintf(stderr, "%s: n_mult     = %u\n",  __func__, hparams.n_mult);
-        fprintf(stderr, "%s: n_head     = %u\n",  __func__, hparams.n_head);
-        fprintf(stderr, "%s: n_layer    = %u\n",  __func__, hparams.n_layer);
-        fprintf(stderr, "%s: n_rot      = %u\n",  __func__, hparams.n_rot);
+        fprintf(stderr, "%s: format     = %s\n",   __func__, llama_file_version_name(file_version));
+        fprintf(stderr, "%s: n_vocab    = %u\n",   __func__, hparams.n_vocab);
+        fprintf(stderr, "%s: n_ctx      = %u\n",   __func__, hparams.n_ctx);
+        fprintf(stderr, "%s: n_embd     = %u\n",   __func__, hparams.n_embd);
+        fprintf(stderr, "%s: n_mult     = %u\n",   __func__, hparams.n_mult);
+        fprintf(stderr, "%s: n_head     = %u\n",   __func__, hparams.n_head);
+        fprintf(stderr, "%s: n_layer    = %u\n",   __func__, hparams.n_layer);
+        fprintf(stderr, "%s: n_rot      = %u\n",   __func__, hparams.n_rot);
+        fprintf(stderr, "%s: freq_base  = %.1f\n", __func__, hparams.rope_freq_base);
+        fprintf(stderr, "%s: freq_scale = %g\n",   __func__, hparams.rope_freq_scale);
        fprintf(stderr, "%s: ftype      = %u (%s)\n", __func__, hparams.ftype, llama_ftype_name(hparams.ftype));
-        fprintf(stderr, "%s: n_ff       = %u\n",  __func__, n_ff);
-        fprintf(stderr, "%s: model size = %s\n",  __func__, llama_model_type_name(model.type));
+        fprintf(stderr, "%s: n_ff       = %u\n",   __func__, n_ff);
+        fprintf(stderr, "%s: model size = %s\n",   __func__, llama_model_type_name(model.type));
    }

    if (file_version < LLAMA_FILE_VERSION_GGJT_V2) {
@@ -1191,9 +1210,9 @@ static void llama_model_load_internal(
        const size_t mem_required =
            ctx_size +
            mmapped_size - vram_weights + // weights in VRAM not in memory
-            MEM_REQ_SCRATCH0().at(model.type) +
+            MEM_REQ_SCRATCH0(hparams.n_ctx).at(model.type) +
            MEM_REQ_SCRATCH1().at(model.type) +
-            MEM_REQ_EVAL().at    (model.type);
+            MEM_REQ_EVAL(hparams.n_ctx).at(model.type);

        // this is the memory required by one llama_state
        const size_t mem_required_state =
@@ -1297,6 +1316,8 @@ static bool llama_model_load(
        int n_gpu_layers,
        int main_gpu,
        float * tensor_split,
+        float rope_freq_base,
+        float rope_freq_scale,
        bool low_vram,
        ggml_type memory_type,
        bool use_mmap,
@@ -1305,7 +1326,7 @@ static bool llama_model_load(
        llama_progress_callback progress_callback,
        void *progress_callback_user_data) {
    try {
-        llama_model_load_internal(fname, model, vocab, n_ctx, n_batch, n_gpu_layers, main_gpu, tensor_split, low_vram, memory_type,
+        llama_model_load_internal(fname, model, vocab, n_ctx, n_batch, n_gpu_layers, main_gpu, tensor_split, rope_freq_base, rope_freq_scale, low_vram, memory_type,
                                  use_mmap, use_mlock, vocab_only, progress_callback, progress_callback_user_data);
        return true;
    } catch (const std::exception & err) {
@@ -1357,6 +1378,9 @@ static bool llama_eval_internal(
    const int n_rot        = hparams.n_embd/hparams.n_head;
    const int n_gpu_layers = model.n_gpu_layers;

+    const float freq_base  = hparams.rope_freq_base;
+    const float freq_scale = hparams.rope_freq_scale;
+
    auto & mem_per_token = lctx.mem_per_token;
    auto & buf_compute   = lctx.buf_compute;

@@ -1454,11 +1478,11 @@ static bool llama_eval_internal(
            offload_func_kq(tmpq);
            ggml_set_name(tmpq, "tmpq");

-            struct ggml_tensor * Kcur = ggml_rope_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head, N), n_past, n_rot, 0, 0);
+            struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd/n_head, n_head, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
            offload_func_kq(Kcur);
            ggml_set_name(Kcur, "Kcur");

-            struct ggml_tensor * Qcur = ggml_rope_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd/n_head, n_head, N), n_past, n_rot, 0, 0);
+            struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd/n_head, n_head, N), n_past, n_rot, 0, freq_base, freq_scale, 0);
            offload_func_kq(Qcur);
            ggml_set_name(Qcur, "Qcur");

@@ -2032,9 +2056,18 @@ void llama_sample_tail_free(struct llama_context * ctx, llama_token_data_array *
    }

    // Normalize the second derivatives
-    float second_derivatives_sum = std::accumulate(second_derivatives.begin(), second_derivatives.end(), 0.0f);
-    for (float & value : second_derivatives) {
-        value /= second_derivatives_sum;
+    {
+        const float second_derivatives_sum = std::accumulate(second_derivatives.begin(), second_derivatives.end(), 0.0f);
+
+        if (second_derivatives_sum > 1e-6f) {
+            for (float & value : second_derivatives) {
+                value /= second_derivatives_sum;
+            }
+        } else {
+            for (float & value : second_derivatives) {
+                value = 1.0f / second_derivatives.size();
+            }
+        }
    }

    float cum_sum = 0.0f;
@@ -2213,7 +2246,7 @@ void llama_sample_classifier_free_guidance(
          struct llama_context * guidance_ctx,
                         float   scale,
                         float   smooth_factor) {
-    int64_t t_start_sample_us = t_start_sample_us = ggml_time_us();
+    int64_t t_start_sample_us = ggml_time_us();

    assert(ctx);
    auto n_vocab = llama_n_vocab(ctx);
@@ -2701,8 +2734,9 @@ struct llama_model * llama_load_model_from_file(
    ggml_type memory_type = params.f16_kv ? GGML_TYPE_F16 : GGML_TYPE_F32;

    if (!llama_model_load(path_model, *model, model->vocab, params.n_ctx, params.n_batch, params.n_gpu_layers,
-                params.main_gpu, params.tensor_split, params.low_vram, memory_type, params.use_mmap, params.use_mlock,
-                params.vocab_only, params.progress_callback, params.progress_callback_user_data)) {
+                params.main_gpu, params.tensor_split, params.rope_freq_base, params.rope_freq_scale, params.low_vram,
+                memory_type, params.use_mmap, params.use_mlock, params.vocab_only, params.progress_callback,
+                params.progress_callback_user_data)) {
        delete model;
        fprintf(stderr, "%s: failed to load model\n", __func__);
        return nullptr;
@@ -2723,7 +2757,7 @@ struct llama_context * llama_new_context_with_model(
        return nullptr;
    }

-    llama_context * ctx = new llama_context(*model, model->vocab);
+    llama_context * ctx = new llama_context(*model);

    if (params.seed == LLAMA_DEFAULT_SEED) {
        params.seed = time(NULL);
@@ -2777,9 +2811,9 @@ struct llama_context * llama_new_context_with_model(
            ctx->embedding.resize(hparams.n_embd);
        }

-        ctx->buf_compute.resize(MEM_REQ_EVAL().at(ctx->model.type));
+        ctx->buf_compute.resize(MEM_REQ_EVAL(hparams.n_ctx).at(ctx->model.type));

-        ctx->buf_scratch[0].resize(MEM_REQ_SCRATCH0().at(ctx->model.type));
+        ctx->buf_scratch[0].resize(MEM_REQ_SCRATCH0(hparams.n_ctx).at(ctx->model.type));
        ctx->buf_scratch[1].resize(MEM_REQ_SCRATCH1().at(ctx->model.type));
    }

@@ -3561,13 +3595,13 @@ int llama_eval_export(struct llama_context * ctx, const char * fname) {
    return 0;
 }

-int llama_tokenize(
-        struct llama_context * ctx,
+int llama_tokenize_with_model(
+    const struct llama_model * model,
                  const char * text,
                 llama_token * tokens,
                         int   n_max_tokens,
                        bool   add_bos) {
-    auto res = llama_tokenize(ctx->vocab, text, add_bos);
+    auto res = llama_tokenize(model->vocab, text, add_bos);

    if (n_max_tokens < (int) res.size()) {
        fprintf(stderr, "%s: too many tokens\n", __func__);
@@ -3581,8 +3615,29 @@ int llama_tokenize(
    return res.size();
 }

+int llama_tokenize(
+        struct llama_context * ctx,
+                  const char * text,
+                 llama_token * tokens,
+                         int   n_max_tokens,
+                        bool   add_bos) {
+    return llama_tokenize_with_model(&ctx->model, text, tokens, n_max_tokens, add_bos);
+}
+
+int llama_n_vocab_from_model(const struct llama_model * model) {
+    return model->vocab.id_to_token.size();
+}
+
+int llama_n_ctx_from_model(const struct llama_model * model) {
+    return model->hparams.n_ctx;
+}
+
+int llama_n_embd_from_model(const struct llama_model * model) {
+    return model->hparams.n_embd;
+}
+
 int llama_n_vocab(const struct llama_context * ctx) {
-    return ctx->vocab.id_to_token.size();
+    return ctx->model.vocab.id_to_token.size();
 }

 int llama_n_ctx(const struct llama_context * ctx) {
@@ -3593,17 +3648,25 @@ int llama_n_embd(const struct llama_context * ctx) {
    return ctx->model.hparams.n_embd;
 }

+int llama_get_vocab_from_model(
+        const struct llama_model * model,
+        const char * * strings,
+        float  * scores,
+        int capacity) {
+    int n = std::min(capacity, (int) model->vocab.id_to_token.size());
+    for (int i = 0; i<n; ++i) {
+        strings[i] = model->vocab.id_to_token[i].tok.c_str();
+        scores[i]  = model->vocab.id_to_token[i].score;
+    }
+    return n;
+}
+
 int llama_get_vocab(
        const struct llama_context * ctx,
        const char * * strings,
        float  * scores,
        int capacity) {
-    int n = std::min(capacity, (int) ctx->vocab.id_to_token.size());
-    for (int i = 0; i<n; ++i) {
-        strings[i] = ctx->vocab.id_to_token[i].tok.c_str();
-        scores[i]  = ctx->vocab.id_to_token[i].score;
-    }
-    return n;
+    return llama_get_vocab_from_model(&ctx->model, strings, scores, capacity);
 }

 float * llama_get_logits(struct llama_context * ctx) {
@@ -3614,12 +3677,16 @@ float * llama_get_embeddings(struct llama_context * ctx) {
    return ctx->embedding.data();
 }

-const char * llama_token_to_str(const struct llama_context * ctx, llama_token token) {
-    if (token >= llama_n_vocab(ctx)) {
+const char * llama_token_to_str_with_model(const struct llama_model * model, llama_token token) {
+    if (token >= llama_n_vocab_from_model(model)) {
        return nullptr;
    }

-    return ctx->vocab.id_to_token[token].tok.c_str();
+    return model->vocab.id_to_token[token].tok.c_str();
+}
+
+const char * llama_token_to_str(const struct llama_context * ctx, llama_token token) {
+    return llama_token_to_str_with_model(&ctx->model, token);
 }

 llama_token llama_token_bos() {
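
Note on the buffer sizing in the llama.cpp changes above: the scratch and eval buffers now grow with `n_ctx` instead of being fixed per model size. The Go sketch below just redoes the arithmetic for the `MODEL_7B` rows of `MEM_REQ_SCRATCH0` and `MEM_REQ_EVAL`. The function names are hypothetical, and it assumes `MB` is 1024 * 1024 bytes, which this excerpt does not show.

```go
package main

import "fmt"

// mb is assumed to match the MB constant used in llama.cpp (1024 * 1024 bytes).
const mb = 1024 * 1024

// scratch0Bytes7B mirrors the MODEL_7B row of MEM_REQ_SCRATCH0:
// (n_ctx / 16 + 256) MB, with integer division as in the C++ code.
func scratch0Bytes7B(nCtx int) uint64 {
	return (uint64(nCtx)/16 + 256) * mb
}

// evalBytes7B mirrors the MODEL_7B row of MEM_REQ_EVAL:
// (n_ctx / 256 + 768) MB.
func evalBytes7B(nCtx int) uint64 {
	return (uint64(nCtx)/256 + 768) * mb
}

func main() {
	for _, nCtx := range []int{512, 2048, 8192} {
		fmt.Printf("n_ctx=%d scratch0=%d MB eval=%d MB\n",
			nCtx, scratch0Bytes7B(nCtx)/mb, evalBytes7B(nCtx)/mb)
	}
}
```

For `n_ctx = 2048` this works out to 384 MB of scratch and 776 MB for eval, compared with the previous fixed 512 MB and 768 MB. Because `k_sizes` is a function-local `static` in the C++ code, it is initialized only on the first call, so the table reflects whichever `n_ctx` was passed first.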
--- a/llama/llama.h
+++ b/llama/llama.h
@@ -1,5 +1,5 @@
 /**
- * llama.cpp - git 5bf2a2771886ee86137e01dbc7492f78fb392066
+ * llama.cpp - git e782c9e735f93ab4767ffc37462c523b73a17ddc
 *
 * MIT License
 *
@@ -115,6 +115,11 @@ extern "C" {
        int32_t  n_gpu_layers;                 // number of layers to store in VRAM
        int32_t  main_gpu;                     // the GPU that is used for scratch and small tensors
        float tensor_split[LLAMA_MAX_DEVICES]; // how to split layers across multiple GPUs
+
+        // ref: https://github.com/ggerganov/llama.cpp/pull/2054
+        float    rope_freq_base;  // RoPE base frequency
+        float    rope_freq_scale; // RoPE frequency scaling factor
+
        // called with a progress value between 0 and 1, pass NULL to disable
        llama_progress_callback progress_callback;
        // context pointer passed to the progress callback
@@ -174,6 +179,8 @@ extern "C" {
        int32_t n_eval;
    };

+    LLAMA_API int llama_max_devices();
+
    LLAMA_API struct llama_context_params llama_context_default_params();
    LLAMA_API struct llama_model_quantize_params llama_model_quantize_default_params();

@@ -296,10 +303,21 @@ extern "C" {
                             int   n_max_tokens,
                            bool   add_bos);

+    LLAMA_API int llama_tokenize_with_model(
+        const struct llama_model * model,
+                      const char * text,
+                     llama_token * tokens,
+                             int   n_max_tokens,
+                            bool   add_bos);
+
    LLAMA_API int llama_n_vocab(const struct llama_context * ctx);
    LLAMA_API int llama_n_ctx  (const struct llama_context * ctx);
    LLAMA_API int llama_n_embd (const struct llama_context * ctx);

+    LLAMA_API int llama_n_vocab_from_model(const struct llama_model * model);
+    LLAMA_API int llama_n_ctx_from_model  (const struct llama_model * model);
+    LLAMA_API int llama_n_embd_from_model (const struct llama_model * model);
+
    // Get the vocabulary as output parameters.
    // Returns number of results.
    LLAMA_API int llama_get_vocab(
@@ -308,6 +326,12 @@ extern "C" {
                                 float * scores,
                                   int   capacity);

+    LLAMA_API int llama_get_vocab_from_model(
+              const struct llama_model * model,
+                          const char * * strings,
+                                 float * scores,
+                                   int   capacity);
+
    // Token logits obtained from the last call to llama_eval()
    // The logits for the last token are stored in the last row
    // Can be mutated in order to change the probabilities of the next token
@@ -320,7 +344,13 @@ extern "C" {
    LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);

    // Token Id -> String. Uses the vocabulary in the provided context
-    LLAMA_API const char * llama_token_to_str(const struct llama_context * ctx, llama_token token);
+    LLAMA_API const char * llama_token_to_str(
+            const struct llama_context * ctx,
+                           llama_token   token);
+
+    LLAMA_API const char * llama_token_to_str_with_model(
+              const struct llama_model * model,
+                           llama_token   token);

    // Special tokens
    LLAMA_API llama_token llama_token_bos();  // beginning-of-sentence
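
Note on `rope_freq_base` and `rope_freq_scale`: the header only labels them as the RoPE base frequency and frequency scaling factor. The Go sketch below shows the usual RoPE angle formula with both knobs applied. It is an illustration under that assumption, not a copy of `ggml_rope_custom_inplace`, and `ropeAngle` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// ropeAngle returns the rotation angle for a token at position pos and
// dimension pair index pair (0 <= pair < nRot/2), assuming the common
// formulation: theta = freqScale * pos * freqBase^(-2*pair/nRot).
func ropeAngle(pos, pair, nRot int, freqBase, freqScale float64) float64 {
	return freqScale * float64(pos) * math.Pow(freqBase, -2*float64(pair)/float64(nRot))
}

func main() {
	// Defaults from llama_context_default_params: base 10000, scale 1.
	fmt.Println(ropeAngle(100, 0, 128, 10000, 1.0))
	// Halving the scale makes position 100 rotate like position 50.
	fmt.Println(ropeAngle(100, 0, 128, 10000, 0.5))
	fmt.Println(ropeAngle(50, 0, 128, 10000, 1.0))
}
```

Under this formulation, a `rope_freq_scale` below 1 compresses positions (linear scaling), while a larger `rope_freq_base` slows the rotation of the higher dimension pairs.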
--- a/llama/update-llama-cpp.sh
+++ b/llama/update-llama-cpp.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+
+set -eu
+
+
+status() { echo >&2 ">>> $*"; }
+error() { status "ERROR $*"; }
+usage() {
+    echo "usage: $(basename $0) /path/to/repo"
+    exit 1
+}
+
+OUT=$(dirname $0)
+while getopts "hC:" OPTION; do
+    case $OPTION in
+        C) OUT=$OPTARG ;;
+        *) usage ;;
+    esac
+done
+
+shift $(( $OPTIND - 1 ))
+[ $# -eq 1 ] || usage
+
+status "updating source..."
+cp -a "$1"/*.{c,h,cpp,m,metal,cu} "$OUT"
+
+status "removing incompatible files..."
+rm -f "$OUT"/build-info.h
+
+SHA1=$(git -C $1 rev-parse @)
+
+LICENSE=$(mktemp)
+cleanup() {
+    rm -f $LICENSE
+}
+trap cleanup 0
+
+cat <<EOF | sed 's/ *$//' >$LICENSE
+/**
+ * llama.cpp - git $SHA1
+ *
+$(sed 's/^/ * /' <$1/LICENSE)
+ */
+
+EOF
+
+for IN in $OUT/*.{c,h,cpp,m,metal,cu}; do
+    TMP=$(mktemp)
+    status "updating license $IN"
+    cat $LICENSE $IN >$TMP
+    mv $TMP $IN
+done
+
+touchup() {
+    local CONSTRAINT=$1 && shift
+
+    for IN in $*; do
+        status "touching up $IN..."
+        TMP=$(mktemp)
+        {
+            echo "//go:build $CONSTRAINT"
+            echo
+        } | cat - $IN >$TMP
+        mv $TMP $IN
+    done
+}
+
+touchup darwin $OUT/ggml-metal.*
+touchup mpi $OUT/ggml-mpi.*
+touchup opencl $OUT/ggml-opencl.*
--- a/models.json
+++ b/models.json
@@ -1,38 +0,0 @@
-[
-  {
-    "name": "orca",
-    "display_name": "Orca Mini",
-    "parameters": "3B",
-    "url": "https://huggingface.co/TheBloke/orca_mini_3B-GGML/resolve/main/orca-mini-3b.ggmlv3.q4_1.bin",
-    "short_description": "Follow instructions. Great small model that runs fast even without GPU support.",
-    "description": "An OpenLLaMa-3B model trained on explain tuned datasets, created using Instructions and Input from WizardLM, Alpaca & Dolly-V2 datasets and applying Orca Research Paper dataset construction approaches.",
-    "published_by": "TheBloke",
-    "original_author": "psmathur",
-    "original_url": "https://huggingface.co/psmathur/orca_mini_3b",
-    "license": "CC-BY-SA-4.0"
-  },
-  {
-    "name": "nous-hermes",
-    "display_name": "Nous Hermes",
-    "parameters": "13B",
-    "url": "https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q2_K.bin",
-    "short_description": "Currently one of the best 13B general model.",
-    "description": "It is suitable for a wide range of language tasks, from generating creative text to understanding and following complex instructions. This model was fine-tuned by Nous Research, with Teknium and Karan4D leading the fine tuning process and dataset curation, Redmond AI sponsoring the compute, and several other contributors. The result is an enhanced Llama 13b model that rivals GPT-3.5-turbo in performance across a variety of tasks. \n \n This model stands out for its long responses, low hallucination rate, and absence of OpenAI censorship mechanisms. The fine-tuning process was performed with a 2000 sequence length on an 8x a100 80GB DGX machine for over 50 hours.",
-    "published_by": "TheBloke",
-    "original_author": "NousResearch",
-    "original_url": "https://huggingface.co/NousResearch/Nous-Hermes-13b",
-    "license": "GPL"
-  },
-  {
-    "name": "vicuna",
-    "display_name": "Vicuna",
-    "parameters": "7B",
-    "url": "https://huggingface.co/TheBloke/vicuna-7B-v1.3-GGML/resolve/main/vicuna-7b-v1.3.ggmlv3.q4_0.bin",
-    "short_description": "Vicuna is a chat assistant trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.",
-    "description": "The primary use of Vicuna is research on large language models and chatbots. The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.",
-    "published_by": "TheBloke",
-    "original_author": "LMSYS",
-    "original_url": "https://huggingface.co/lmsys/vicuna-7b-v1.3",
-    "license:": "Non-commercial"
-  }
-]
--- a/parser/parser.go
+++ b/parser/parser.go
@@ -2,76 +2,81 @@ package parser

 import (
 	"bufio"
+	"bytes"
+	"errors"
 	"fmt"
 	"io"
-	"strings"
 )

 type Command struct {
 	Name string
-	Arg  string
+	Args string
+}
+
+func (c *Command) Reset() {
+	c.Name = ""
+	c.Args = ""
 }

 func Parse(reader io.Reader) ([]Command, error) {
 	var commands []Command
-	var foundModel bool
+
+	var command, modelCommand Command

 	scanner := bufio.NewScanner(reader)
-	multiline := false
-	var multilineCommand *Command
+	scanner.Split(scanModelfile)
 	for scanner.Scan() {
-		line := scanner.Text()
-		if multiline {
-			// If we're in a multiline string and the line is """, end the multiline string.
-			if strings.TrimSpace(line) == `"""` {
-				multiline = false
-				commands = append(commands, *multilineCommand)
-			} else {
-				// Otherwise, append the line to the multiline string.
-				multilineCommand.Arg += "\n" + line
-			}
-			continue
-		}
-		fields := strings.Fields(line)
+		line := scanner.Bytes()
+
+		fields := bytes.SplitN(line, []byte(" "), 2)
-		if len(fields) == 0 {
+		if len(fields) < 2 {
 			continue
 		}

-		command := Command{}
-		switch strings.ToUpper(fields[0]) {
+		switch string(bytes.ToUpper(fields[0])) {
 		case "FROM":
 			command.Name = "model"
-			command.Arg = fields[1]
-			if command.Arg == "" {
-				return nil, fmt.Errorf("no model specified in FROM line")
-			}
-			foundModel = true
-		case "PROMPT":
-			command.Name = "prompt"
-			if fields[1] == `"""` {
-				multiline = true
-				multilineCommand = &command
-				multilineCommand.Arg = ""
-			} else {
-				command.Arg = strings.Join(fields[1:], " ")
-			}
+			command.Args = string(fields[1])
+			// copy command for validation
+			modelCommand = command
+		case "LICENSE", "TEMPLATE", "SYSTEM", "PROMPT":
+			command.Name = string(bytes.ToLower(fields[0]))
+			command.Args = string(fields[1])
 		case "PARAMETER":
-			command.Name = fields[1]
-			command.Arg = strings.Join(fields[2:], " ")
+			fields = bytes.SplitN(fields[1], []byte(" "), 2)
+			command.Name = string(fields[0])
+			command.Args = string(fields[1])
 		default:
 			continue
 		}
-		if !multiline {
-			commands = append(commands, command)
-		}
+
+		commands = append(commands, command)
+		command.Reset()
 	}

-	if !foundModel {
+	if modelCommand.Args == "" {
 		return nil, fmt.Errorf("no FROM line for the model was specified")
 	}

-	if multiline {
-		return nil, fmt.Errorf("unclosed multiline string")
-	}
 	return commands, scanner.Err()
 }
+
+func scanModelfile(data []byte, atEOF bool) (advance int, token []byte, err error) {
+	newline := bytes.IndexByte(data, '\n')
+
+	if start := bytes.Index(data, []byte(`"""`)); start >= 0 && start < newline {
+		end := bytes.Index(data[start+3:], []byte(`"""`))
+		if end < 0 {
+			if atEOF {
+				return 0, nil, errors.New(`unterminated multiline string: """`)
+			} else {
+				return 0, nil, nil
+			}
+		}
+
+		n := start + 3 + end + 3
+		return n, bytes.Replace(data[:n], []byte(`"""`), []byte(""), 2), nil
+	}
+
+	return bufio.ScanLines(data, atEOF)
+}
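
Note on `scanModelfile`: it is a custom `bufio.SplitFunc` that returns an entire `"""`-delimited block as one token and falls back to `bufio.ScanLines` otherwise. The Go sketch below copies the function from the diff above and runs it over a small Modelfile. The `scanTokens` helper and the sample input are hypothetical and exist only for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"strings"
)

// scanModelfile is the split function from the diff: when a triple quote
// opens before the first newline, it consumes up to the closing triple
// quote and strips both markers; otherwise it behaves like ScanLines.
func scanModelfile(data []byte, atEOF bool) (advance int, token []byte, err error) {
	newline := bytes.IndexByte(data, '\n')

	if start := bytes.Index(data, []byte(`"""`)); start >= 0 && start < newline {
		end := bytes.Index(data[start+3:], []byte(`"""`))
		if end < 0 {
			if atEOF {
				return 0, nil, errors.New(`unterminated multiline string: """`)
			}
			// Ask the scanner for more data before deciding.
			return 0, nil, nil
		}

		n := start + 3 + end + 3
		return n, bytes.Replace(data[:n], []byte(`"""`), []byte(""), 2), nil
	}

	return bufio.ScanLines(data, atEOF)
}

// scanTokens (hypothetical helper) collects every token the scanner yields.
func scanTokens(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanModelfile)

	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out, scanner.Err()
}

func main() {
	input := "FROM llama2\nSYSTEM \"\"\"\nYou are helpful.\n\"\"\"\nPARAMETER temperature 0.7\n"

	toks, err := scanTokens(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, tok := range toks {
		fmt.Printf("%d: %q\n", i, tok)
	}
}
```

The `SYSTEM` block comes back as a single token with the quotes stripped. The newline after the closing `"""` yields an empty token, which `Parse` ignores through the `default` case of its switch.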
--- a/progressbar/LICENSE
+++ b/progressbar/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017 Zack
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/progressbar/README.md
+++ b/progressbar/README.md
@@ -0,0 +1,121 @@
+# progressbar
+
+[![CI](https://github.com/schollz/progressbar/actions/workflows/ci.yml/badge.svg?branch=main&event=push)](https://github.com/schollz/progressbar/actions/workflows/ci.yml)
+[![go report card](https://goreportcard.com/badge/github.com/schollz/progressbar)](https://goreportcard.com/report/github.com/schollz/progressbar) 
+[![coverage](https://img.shields.io/badge/coverage-84%25-brightgreen.svg)](https://gocover.io/github.com/schollz/progressbar)
+[![godocs](https://godoc.org/github.com/schollz/progressbar?status.svg)](https://godoc.org/github.com/schollz/progressbar/v3) 
+
+A very simple thread-safe progress bar which should work on every OS without problems. I needed a progressbar for [croc](https://github.com/schollz/croc) and everything I tried had problems, so I made another one. In order to be OS agnostic I do not plan to support [multi-line outputs](https://github.com/schollz/progressbar/issues/6).
+
+
+## Install
+
+```
+go get -u github.com/schollz/progressbar/v3
+```
+
+## Usage 
+
+### Basic usage
+
+```golang
+bar := progressbar.Default(100)
+for i := 0; i < 100; i++ {
+    bar.Add(1)
+    time.Sleep(40 * time.Millisecond)
+}
+```
+
+which looks like:
+
+![Example of basic bar](examples/basic/basic.gif)
+
+
+### I/O operations
+
+`progressbar` implements `io.Writer`, so it counts the bytes written to it. Combined with `io.MultiWriter`, this lets it track progress while copying from an `io.Reader`.
+
+```golang
+req, _ := http.NewRequest("GET", "https://dl.google.com/go/go1.14.2.src.tar.gz", nil)
+resp, _ := http.DefaultClient.Do(req)
+defer resp.Body.Close()
+
+f, _ := os.OpenFile("go1.14.2.src.tar.gz", os.O_CREATE|os.O_WRONLY, 0644)
+defer f.Close()
+
+bar := progressbar.DefaultBytes(
+    resp.ContentLength,
+    "downloading",
+)
+io.Copy(io.MultiWriter(f, bar), resp.Body)
+```
+
+which looks like:
+
+![Example of download bar](examples/download/download.gif)
+
+
+### Progress bar with unknown length
+
+A progress bar of unknown length is rendered as a spinner. Any bar created with a length of -1 is automatically converted to a spinner with a customizable spinner type. For example, run the download code above with `-1` in place of `resp.ContentLength`.
+
+which looks like:
+
+![Example of download bar with unknown length](examples/download-unknown/download-unknown.gif)
+
+
+### Customization
+
+Many aspects of the bar can be customized: the writer, color, width, description, theme, and more. See [all the options](https://pkg.go.dev/github.com/schollz/progressbar/v3?tab=doc#Option).
+
+```golang
+bar := progressbar.NewOptions(1000,
+    progressbar.OptionSetWriter(ansi.NewAnsiStdout()),
+    progressbar.OptionEnableColorCodes(true),
+    progressbar.OptionShowBytes(true),
+    progressbar.OptionSetWidth(15),
+    progressbar.OptionSetDescription("[cyan][1/3][reset] Writing moshable file..."),
+    progressbar.OptionSetTheme(progressbar.Theme{
+        Saucer:        "[green]=[reset]",
+        SaucerHead:    "[green]>[reset]",
+        SaucerPadding: " ",
+        BarStart:      "[",
+        BarEnd:        "]",
+    }))
+for i := 0; i < 1000; i++ {
+    bar.Add(1)
+    time.Sleep(5 * time.Millisecond)
+}
+```
+
+which looks like:
+
+![Example of customized bar](examples/customization/customization.gif)
+
+
+## Contributing
+
+Pull requests are welcome. Feel free to...
+
+- Revise documentation
+- Add new features
+- Fix bugs
+- Suggest improvements
+
+## Thanks
+
+Thanks [@Dynom](https://github.com/dynom) for massive improvements in version 2.0!
+
+Thanks [@CrushedPixel](https://github.com/CrushedPixel) for adding descriptions and color code support!
+
+Thanks [@MrMe42](https://github.com/MrMe42) for adding some minor features!
+
+Thanks [@tehstun](https://github.com/tehstun) for some great PRs!
+
+Thanks [@Benzammour](https://github.com/Benzammour) and [@haseth](https://github.com/haseth) for helping create v3!
+
+Thanks [@briandowns](https://github.com/briandowns) for compiling the list of spinners.
+
+## License
+
+MIT
--- a/progressbar/progressbar.go
+++ b/progressbar/progressbar.go
--- a/progressbar/spinners.go
+++ b/progressbar/spinners.go
@@ -0,0 +1,80 @@
+package progressbar
+
+var spinners = map[int][]string{
+	0:  {"←", "↖", "↑", "↗", "→", "↘", "↓", "↙"},
+	1:  {"▁", "▃", "▄", "▅", "▆", "▇", "█", "▇", "▆", "▅", "▄", "▃", "▁"},
+	2:  {"▖", "▘", "▝", "▗"},
+	3:  {"┤", "┘", "┴", "└", "├", "┌", "┬", "┐"},
+	4:  {"◢", "◣", "◤", "◥"},
+	5:  {"◰", "◳", "◲", "◱"},
+	6:  {"◴", "◷", "◶", "◵"},
+	7:  {"◐", "◓", "◑", "◒"},
+	8:  {".", "o", "O", "@", "*"},
+	9:  {"|", "/", "-", "\\"},
+	10: {"◡◡", "⊙⊙", "◠◠"},
+	11: {"⣾", "⣽", "⣻", "⢿", "⡿", "⣟", "⣯", "⣷"},
+	12: {">))'>", " >))'>", "  >))'>", "   >))'>", "    >))'>", "   <'((<", "  <'((<", " <'((<"},
+	13: {"⠁", "⠂", "⠄", "⡀", "⢀", "⠠", "⠐", "⠈"},
+	14: {"⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"},
+	15: {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"},
+	16: {"▉", "▊", "▋", "▌", "▍", "▎", "▏", "▎", "▍", "▌", "▋", "▊", "▉"},
+	17: {"■", "□", "▪", "▫"},
+	18: {"←", "↑", "→", "↓"},
+	19: {"╫", "╪"},
+	20: {"⇐", "⇖", "⇑", "⇗", "⇒", "⇘", "⇓", "⇙"},
+	21: {"⠁", "⠁", "⠉", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠤", "⠄", "⠄", "⠤", "⠠", "⠠", "⠤", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋", "⠉", "⠈", "⠈"},
+	22: {"⠈", "⠉", "⠋", "⠓", "⠒", "⠐", "⠐", "⠒", "⠖", "⠦", "⠤", "⠠", "⠠", "⠤", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋", "⠉", "⠈"},
+	23: {"⠁", "⠉", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠤", "⠄", "⠄", "⠤", "⠴", "⠲", "⠒", "⠂", "⠂", "⠒", "⠚", "⠙", "⠉", "⠁"},
+	24: {"⠋", "⠙", "⠚", "⠒", "⠂", "⠂", "⠒", "⠲", "⠴", "⠦", "⠖", "⠒", "⠐", "⠐", "⠒", "⠓", "⠋"},
+	25: {"ｦ", "ｧ", "ｨ", "ｩ", "ｪ", "ｫ", "ｬ", "ｭ", "ｮ", "ｯ", "ｱ", "ｲ", "ｳ", "ｴ", "ｵ", "ｶ", "ｷ", "ｸ", "ｹ", "ｺ", "ｻ", "ｼ", "ｽ", "ｾ", "ｿ", "ﾀ", "ﾁ", "ﾂ", "ﾃ", "ﾄ", "ﾅ", "ﾆ", "ﾇ", "ﾈ", "ﾉ", "ﾊ", "ﾋ", "ﾌ", "ﾍ", "ﾎ", "ﾏ", "ﾐ", "ﾑ", "ﾒ", "ﾓ", "ﾔ", "ﾕ", "ﾖ", "ﾗ", "ﾘ", "ﾙ", "ﾚ", "ﾛ", "ﾜ", "ﾝ"},
+	26: {".", "..", "..."},
+	27: {"▁", "▂", "▃", "▄", "▅", "▆", "▇", "█", "▉", "▊", "▋", "▌", "▍", "▎", "▏", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█", "▇", "▆", "▅", "▄", "▃", "▂", "▁"},
+	28: {".", "o", "O", "°", "O", "o", "."},
+	29: {"+", "x"},
+	30: {"v", "<", "^", ">"},
+	31: {">>--->", " >>--->", "  >>--->", "   >>--->", "    >>--->", "    <---<<", "   <---<<", "  <---<<", " <---<<", "<---<<"},
+	32: {"|", "||", "|||", "||||", "|||||", "|||||||", "||||||||", "|||||||", "||||||", "|||||", "||||", "|||", "||", "|"},
+	33: {"[          ]", "[=         ]", "[==        ]", "[===       ]", "[====      ]", "[=====     ]", "[======    ]", "[=======   ]", "[========  ]", "[========= ]", "[==========]"},
+	34: {"(*---------)", "(-*--------)", "(--*-------)", "(---*------)", "(----*-----)", "(-----*----)", "(------*---)", "(-------*--)", "(--------*-)", "(---------*)"},
+	35: {"█▒▒▒▒▒▒▒▒▒", "███▒▒▒▒▒▒▒", "█████▒▒▒▒▒", "███████▒▒▒", "██████████"},
+	36: {"[                    ]", "[=>                  ]", "[===>                ]", "[=====>              ]", "[======>             ]", "[========>           ]", "[==========>         ]", "[============>       ]", "[==============>     ]", "[================>   ]", "[==================> ]", "[===================>]"},
+	37: {"ဝ", "၀"},
+	38: {"▌", "▀", "▐▄"},
+	39: {"🌍", "🌎", "🌏"},
+	40: {"◜", "◝", "◞", "◟"},
+	41: {"⬒", "⬔", "⬓", "⬕"},
+	42: {"⬖", "⬘", "⬗", "⬙"},
+	43: {"[>>>          >]", "[]>>>>        []", "[]  >>>>      []", "[]    >>>>    []", "[]      >>>>  []", "[]        >>>>[]", "[>>          >>]"},
+	44: {"♠", "♣", "♥", "♦"},
+	45: {"➞", "➟", "➠", "➡", "➠", "➟"},
+	46: {"  |  ", ` \   `, "_    ", ` \   `, "  |  ", "   / ", "    _", "   / "},
+	47: {"  . . . .", ".   . . .", ". .   . .", ". . .   .", ". . . .  ", ". . . . ."},
+	48: {" |     ", "  /    ", "   _   ", `    \  `, "     | ", `    \  `, "   _   ", "  /    "},
+	49: {"⎺", "⎻", "⎼", "⎽", "⎼", "⎻"},
+	50: {"▹▹▹▹▹", "▸▹▹▹▹", "▹▸▹▹▹", "▹▹▸▹▹", "▹▹▹▸▹", "▹▹▹▹▸"},
+	51: {"[    ]", "[   =]", "[  ==]", "[ ===]", "[====]", "[=== ]", "[==  ]", "[=   ]"},
+	52: {"( ●    )", "(  ●   )", "(   ●  )", "(    ● )", "(     ●)", "(    ● )", "(   ●  )", "(  ●   )", "( ●    )"},
+	53: {"✶", "✸", "✹", "✺", "✹", "✷"},
+	54: {"▐|\\____________▌", "▐_|\\___________▌", "▐__|\\__________▌", "▐___|\\_________▌", "▐____|\\________▌", "▐_____|\\_______▌", "▐______|\\______▌", "▐_______|\\_____▌", "▐________|\\____▌", "▐_________|\\___▌", "▐__________|\\__▌", "▐___________|\\_▌", "▐____________|\\▌", "▐____________/|▌", "▐___________/|_▌", "▐__________/|__▌", "▐_________/|___▌", "▐________/|____▌", "▐_______/|_____▌", "▐______/|______▌", "▐_____/|_______▌", "▐____/|________▌", "▐___/|_________▌", "▐__/|__________▌", "▐_/|___________▌", "▐/|____________▌"},
+	55: {"▐⠂       ▌", "▐⠈       ▌", "▐ ⠂      ▌", "▐ ⠠      ▌", "▐  ⡀     ▌", "▐  ⠠     ▌", "▐   ⠂    ▌", "▐   ⠈    ▌", "▐    ⠂   ▌", "▐    ⠠   ▌", "▐     ⡀  ▌", "▐     ⠠  ▌", "▐      ⠂ ▌", "▐      ⠈ ▌", "▐       ⠂▌", "▐       ⠠▌", "▐       ⡀▌", "▐      ⠠ ▌", "▐      ⠂ ▌", "▐     ⠈  ▌", "▐     ⠂  ▌", "▐    ⠠   ▌", "▐    ⡀   ▌", "▐   ⠠    ▌", "▐   ⠂    ▌", "▐  ⠈     ▌", "▐  ⠂     ▌", "▐ ⠠      ▌", "▐ ⡀      ▌", "▐⠠       ▌"},
+	56: {"¿", "?"},
+	57: {"⢹", "⢺", "⢼", "⣸", "⣇", "⡧", "⡗", "⡏"},
+	58: {"⢄", "⢂", "⢁", "⡁", "⡈", "⡐", "⡠"},
+	59: {".  ", ".. ", "...", " ..", "  .", "   "},
+	60: {".", "o", "O", "°", "O", "o", "."},
+	61: {"▓", "▒", "░"},
+	62: {"▌", "▀", "▐", "▄"},
+	63: {"⊶", "⊷"},
+	64: {"▪", "▫"},
+	65: {"□", "■"},
+	66: {"▮", "▯"},
+	67: {"-", "=", "≡"},
+	68: {"d", "q", "p", "b"},
+	69: {"∙∙∙", "●∙∙", "∙●∙", "∙∙●", "∙∙∙"},
+	70: {"🌑 ", "🌒 ", "🌓 ", "🌔 ", "🌕 ", "🌖 ", "🌗 ", "🌘 "},
+	71: {"☗", "☖"},
+	72: {"⧇", "⧆"},
+	73: {"◉", "◎"},
+	74: {"㊂", "㊀", "㊁"},
+	75: {"⦾", "⦿"},
+}
--- a/server/images.go
+++ b/server/images.go
@@ -16,18 +16,54 @@ import (
 	"reflect"
 	"strconv"
 	"strings"
+	"text/template"

 	"github.com/jmorganca/ollama/api"
 	"github.com/jmorganca/ollama/parser"
 )

+type RegistryOptions struct {
+	Insecure bool
+	Username string
+	Password string
+}
+
 type Model struct {
 	Name      string `json:"name"`
 	ModelPath string
-	Prompt    string
+	Template  string
+	System    string
 	Options   api.Options
 }

+func (m *Model) Prompt(request api.GenerateRequest) (string, error) {
+	tmpl, err := template.New("").Parse(m.Template)
+	if err != nil {
+		return "", err
+	}
+
+	var vars struct {
+		First  bool
+		System string
+		Prompt string
+
+		// deprecated: versions <= 0.0.7 used this to omit the system prompt
+		Context []int
+	}
+
+	vars.First = len(request.Context) == 0
+	vars.System = m.System
+	vars.Prompt = request.Prompt
+	vars.Context = request.Context
+
+	var sb strings.Builder
+	if err := tmpl.Execute(&sb, vars); err != nil {
+		return "", err
+	}
+
+	return sb.String(), nil
+}
+
 type ManifestV2 struct {
 	SchemaVersion int      `json:"schemaVersion"`
 	MediaType     string   `json:"mediaType"`
@@ -71,20 +107,19 @@ func GetManifest(mp ModelPath) (*ManifestV2, error) {
 	if err != nil {
 		return nil, err
 	}
-	if _, err = os.Stat(fp); err != nil && !errors.Is(err, os.ErrNotExist) {
-		return nil, fmt.Errorf("couldn't find model '%s'", mp.GetShortTagname())
+
+	if _, err = os.Stat(fp); err != nil {
+		return nil, err
 	}

 	var manifest *ManifestV2

-	f, err := os.Open(fp)
+	bts, err := os.ReadFile(fp)
 	if err != nil {
 		return nil, fmt.Errorf("couldn't open file '%s'", fp)
 	}

-	decoder := json.NewDecoder(f)
-	err = decoder.Decode(&manifest)
-	if err != nil {
+	if err := json.Unmarshal(bts, &manifest); err != nil {
 		return nil, err
 	}

@@ -112,12 +147,27 @@ func GetModel(name string) (*Model, error) {
 		switch layer.MediaType {
 		case "application/vnd.ollama.image.model":
 			model.ModelPath = filename
-		case "application/vnd.ollama.image.prompt":
-			data, err := os.ReadFile(filename)
+		case "application/vnd.ollama.image.template":
+			bts, err := os.ReadFile(filename)
 			if err != nil {
 				return nil, err
 			}
-			model.Prompt = string(data)
+
+			model.Template = string(bts)
+		case "application/vnd.ollama.image.system":
+			bts, err := os.ReadFile(filename)
+			if err != nil {
+				return nil, err
+			}
+
+			model.System = string(bts)
+		case "application/vnd.ollama.image.prompt":
+			bts, err := os.ReadFile(filename)
+			if err != nil {
+				return nil, err
+			}
+
+			model.Template = string(bts)
 		case "application/vnd.ollama.image.params":
 			params, err := os.Open(filename)
 			if err != nil {
@@ -137,25 +187,17 @@ func GetModel(name string) (*Model, error) {
 	return model, nil
 }

-func getAbsPath(fp string) (string, error) {
-	if strings.HasPrefix(fp, "~/") {
-		parts := strings.Split(fp, "/")
-		home, err := os.UserHomeDir()
-		if err != nil {
-			return "", err
-		}
-
-		fp = filepath.Join(home, filepath.Join(parts[1:]...))
+func CreateModel(name string, path string, fn func(status string)) error {
+	mf, err := os.Open(path)
+	if err != nil {
+		fn(fmt.Sprintf("couldn't open modelfile '%s'", path))
+		return fmt.Errorf("failed to open file: %w", err)
 	}
+	defer mf.Close()

-	return os.ExpandEnv(fp), nil
-}
-
-func CreateModel(name string, mf io.Reader, fn func(status string)) error {
 	fn("parsing modelfile")
 	commands, err := parser.Parse(mf)
 	if err != nil {
-		fn(fmt.Sprintf("error: %v", err))
 		return err
 	}

@@ -163,30 +205,39 @@ func CreateModel(name string, mf io.Reader, fn func(status string)) error {
 	params := make(map[string]string)

 	for _, c := range commands {
-		log.Printf("[%s] - %s\n", c.Name, c.Arg)
+		log.Printf("[%s] - %s\n", c.Name, c.Args)
 		switch c.Name {
 		case "model":
 			fn("looking for model")
-			mf, err := GetManifest(ParseModelPath(c.Arg))
+			mf, err := GetManifest(ParseModelPath(c.Args))
 			if err != nil {
-				// if we couldn't read the manifest, try getting the bin file
-				fp, err := getAbsPath(c.Arg)
-				if err != nil {
-					fn("error determing path. exiting.")
-					return err
+				fp := c.Args
+
+				// If fp starts with ~/, replace it with the user's home directory.
+				if strings.HasPrefix(fp, "~/") {
+					parts := strings.Split(fp, "/")
+					home, err := os.UserHomeDir()
+					if err != nil {
+						return fmt.Errorf("failed to find home directory: %v", err)
+					}
+
+					fp = filepath.Join(home, filepath.Join(parts[1:]...))
+				}
+
+				// If fp is not an absolute path, make it relative to the modelfile path
+				if !filepath.IsAbs(fp) {
+					fp = filepath.Join(filepath.Dir(path), fp)
 				}

 				fn("creating model layer")
 				file, err := os.Open(fp)
 				if err != nil {
-					fn(fmt.Sprintf("couldn't find model '%s'", c.Arg))
 					return fmt.Errorf("failed to open file: %v", err)
 				}
 				defer file.Close()

 				l, err := CreateLayer(file)
 				if err != nil {
-					fn(fmt.Sprintf("couldn't create model layer: %v", err))
 					return fmt.Errorf("failed to create layer: %v", err)
 				}
 				l.MediaType = "application/vnd.ollama.image.model"
@@ -196,27 +247,26 @@ func CreateModel(name string, mf io.Reader, fn func(status string)) error {
 				for _, l := range mf.Layers {
 					newLayer, err := GetLayerWithBufferFromLayer(l)
 					if err != nil {
-						fn(fmt.Sprintf("couldn't read layer: %v", err))
 						return err
 					}
 					layers = append(layers, newLayer)
 				}
 			}
-		case "prompt":
-			fn("creating prompt layer")
+		case "license", "template", "system", "prompt":
+			fn(fmt.Sprintf("creating %s layer", c.Name))
 			// remove the existing layer of this type if one exists
-			layers = removeLayerFromLayers(layers, "application/vnd.ollama.image.prompt")
+			mediaType := fmt.Sprintf("application/vnd.ollama.image.%s", c.Name)
+			layers = removeLayerFromLayers(layers, mediaType)

-			prompt := strings.NewReader(c.Arg)
-			l, err := CreateLayer(prompt)
+			layer, err := CreateLayer(strings.NewReader(c.Args))
 			if err != nil {
-				fn(fmt.Sprintf("couldn't create prompt layer: %v", err))
-				return fmt.Errorf("failed to create layer: %v", err)
+				return err
 			}
-			l.MediaType = "application/vnd.ollama.image.prompt"
-			layers = append(layers, l)
+
+			layer.MediaType = mediaType
+			layers = append(layers, layer)
 		default:
-			params[c.Name] = c.Arg
+			params[c.Name] = c.Args
 		}
 	}

@@ -256,7 +306,6 @@ func CreateModel(name string, mf io.Reader, fn func(status string)) error {

 	err = SaveLayers(layers, fn, false)
 	if err != nil {
-		fn(fmt.Sprintf("error saving layers: %v", err))
 		return err
 	}

@@ -264,7 +313,6 @@ func CreateModel(name string, mf io.Reader, fn func(status string)) error {
 	fn("writing manifest")
 	err = CreateManifest(name, cfg, manifestLayers)
 	if err != nil {
-		fn(fmt.Sprintf("error creating manifest: %v", err))
 		return err
 	}

@@ -445,7 +493,110 @@ func CreateLayer(f io.ReadSeeker) (*LayerReader, error) {
 	return layer, nil
 }

-func PushModel(name, username, password string, fn func(api.ProgressResponse)) error {
+func CopyModel(src, dest string) error {
+	srcPath, err := ParseModelPath(src).GetManifestPath(false)
+	if err != nil {
+		return err
+	}
+	destPath, err := ParseModelPath(dest).GetManifestPath(true)
+	if err != nil {
+		return err
+	}
+
+	// copy the file
+	input, err := ioutil.ReadFile(srcPath)
+	if err != nil {
+		fmt.Println("Error reading file:", err)
+		return err
+	}
+
+	err = ioutil.WriteFile(destPath, input, 0644)
+	if err != nil {
+		fmt.Println("Error writing file:", err)
+		return err
+	}
+
+	return nil
+}
+
+func DeleteModel(name string) error {
+	mp := ParseModelPath(name)
+
+	manifest, err := GetManifest(mp)
+	if err != nil {
+		return err
+	}
+	deleteMap := make(map[string]bool)
+	for _, layer := range manifest.Layers {
+		deleteMap[layer.Digest] = true
+	}
+	deleteMap[manifest.Config.Digest] = true
+
+	fp, err := GetManifestPath()
+	if err != nil {
+		return err
+	}
+	err = filepath.Walk(fp, func(path string, info os.FileInfo, err error) error {
+		if err != nil {
+			return err
+		}
+		if !info.IsDir() {
+			path := path[len(fp)+1:]
+			slashIndex := strings.LastIndex(path, "/")
+			if slashIndex == -1 {
+				return nil
+			}
+			tag := path[:slashIndex] + ":" + path[slashIndex+1:]
+			fmp := ParseModelPath(tag)
+
+			// skip the manifest we're trying to delete
+			if mp.GetFullTagname() == fmp.GetFullTagname() {
+				return nil
+			}
+
+			// save (i.e. delete from the deleteMap) any files used in other manifests
+			manifest, err := GetManifest(fmp)
+			if err != nil {
+				log.Printf("skipping file: %s", path)
+				return nil
+			}
+			for _, layer := range manifest.Layers {
+				delete(deleteMap, layer.Digest)
+			}
+			delete(deleteMap, manifest.Config.Digest)
+		}
+		return nil
+	})
+
+	// only delete the files which are still in the deleteMap
+	for k, v := range deleteMap {
+		if v {
+			fp, err := GetBlobsPath(k)
+			if err != nil {
+				log.Printf("couldn't get file path for '%s': %v", k, err)
+				continue
+			}
+			if err := os.Remove(fp); err != nil {
+				log.Printf("couldn't remove file '%s': %v", fp, err)
+				continue
+			}
+		}
+	}
+
+	fp, err = mp.GetManifestPath(false)
+	if err != nil {
+		return err
+	}
+	err = os.Remove(fp)
+	if err != nil {
+		log.Printf("couldn't remove manifest file '%s': %v", fp, err)
+		return err
+	}
+
+	return nil
+}
+
+func PushModel(name string, regOpts *RegistryOptions, fn func(api.ProgressResponse)) error {
 	mp := ParseModelPath(name)

 	fn(api.ProgressResponse{Status: "retrieving manifest"})
@@ -457,65 +608,49 @@ func PushModel(name, username, password string, fn func(api.ProgressResponse)) e
 	}

 	var layers []*Layer
-	var total int
-	var completed int
 	for _, layer := range manifest.Layers {
 		layers = append(layers, layer)
-		total += layer.Size
 	}
 	layers = append(layers, &manifest.Config)
-	total += manifest.Config.Size

 	for _, layer := range layers {
-		exists, err := checkBlobExistence(mp, layer.Digest, username, password)
+		exists, err := checkBlobExistence(mp, layer.Digest, regOpts)
 		if err != nil {
 			return err
 		}

 		if exists {
-			completed += layer.Size
 			fn(api.ProgressResponse{
 				Status:    "using existing layer",
 				Digest:    layer.Digest,
-				Total:     total,
-				Completed: completed,
+				Total:     layer.Size,
+				Completed: layer.Size,
 			})
+			log.Printf("Layer %s already exists", layer.Digest)
 			continue
 		}

 		fn(api.ProgressResponse{
-			Status:    "starting upload",
-			Digest:    layer.Digest,
-			Total:     total,
-			Completed: completed,
+			Status: "starting upload",
+			Digest: layer.Digest,
+			Total:  layer.Size,
 		})

-		location, err := startUpload(mp, username, password)
+		location, err := startUpload(mp, regOpts)
 		if err != nil {
 			log.Printf("couldn't start upload: %v", err)
 			return err
 		}

-		err = uploadBlob(location, layer, username, password)
+		err = uploadBlobChunked(mp, location, layer, regOpts, fn)
 		if err != nil {
 			log.Printf("error uploading blob: %v", err)
 			return err
 		}
-		completed += layer.Size
-		fn(api.ProgressResponse{
-			Status:    "upload complete",
-			Digest:    layer.Digest,
-			Total:     total,
-			Completed: completed,
-		})
 	}

-	fn(api.ProgressResponse{
-		Status:    "pushing manifest",
-		Total:     total,
-		Completed: completed,
-	})
-	url := fmt.Sprintf("%s://%s/v2/%s/manifests/%s", mp.ProtocolScheme, mp.Registry, mp.GetNamespaceRepository(), mp.Tag)
+	fn(api.ProgressResponse{Status: "pushing manifest"})
+	url := fmt.Sprintf("%s/v2/%s/manifests/%s", mp.Registry, mp.GetNamespaceRepository(), mp.Tag)
 	headers := map[string]string{
 		"Content-Type": "application/vnd.docker.distribution.manifest.v2+json",
 	}
@@ -525,7 +660,7 @@ func PushModel(name, username, password string, fn func(api.ProgressResponse)) e
 		return err
 	}

-	resp, err := makeRequest("PUT", url, headers, bytes.NewReader(manifestJSON), username, password)
+	resp, err := makeRequest("PUT", url, headers, bytes.NewReader(manifestJSON), regOpts)
 	if err != nil {
 		return err
 	}
@@ -537,42 +672,36 @@ func PushModel(name, username, password string, fn func(api.ProgressResponse)) e
 		return fmt.Errorf("registry responded with code %d: %v", resp.StatusCode, string(body))
 	}

-	fn(api.ProgressResponse{
-		Status:    "success",
-		Total:     total,
-		Completed: completed,
-	})
+	fn(api.ProgressResponse{Status: "success"})

 	return nil
 }

-func PullModel(name, username, password string, fn func(api.ProgressResponse)) error {
+func PullModel(name string, regOpts *RegistryOptions, fn func(api.ProgressResponse)) error {
 	mp := ParseModelPath(name)

 	fn(api.ProgressResponse{Status: "pulling manifest"})

-	manifest, err := pullModelManifest(mp, username, password)
+	manifest, err := pullModelManifest(mp, regOpts)
 	if err != nil {
 		return fmt.Errorf("pull model manifest: %q", err)
 	}

 	var layers []*Layer
-	var total int
-	var completed int
-	for _, layer := range manifest.Layers {
-		layers = append(layers, layer)
-		total += layer.Size
-	}
+	layers = append(layers, manifest.Layers...)
 	layers = append(layers, &manifest.Config)
-	total += manifest.Config.Size

 	for _, layer := range layers {
-		if err := downloadBlob(mp, layer.Digest, username, password, fn); err != nil {
-			fn(api.ProgressResponse{Status: fmt.Sprintf("error downloading: %v", err), Digest: layer.Digest})
+		if err := downloadBlob(mp, layer.Digest, regOpts, fn); err != nil {
 			return err
 		}
+	}

-		completed += layer.Size
+	fn(api.ProgressResponse{Status: "verifying sha256 digest"})
+	for _, layer := range layers {
+		if err := verifyBlob(layer.Digest); err != nil {
+			return err
+		}
 	}

 	fn(api.ProgressResponse{Status: "writing manifest"})
@@ -587,7 +716,7 @@ func PullModel(name, username, password string, fn func(api.ProgressResponse)) e
 		return err
 	}

-	err = os.WriteFile(fp, manifestJSON, 0644)
+	err = os.WriteFile(fp, manifestJSON, 0o644)
 	if err != nil {
 		log.Printf("couldn't write to %s", fp)
 		return err
@@ -598,13 +727,13 @@ func PullModel(name, username, password string, fn func(api.ProgressResponse)) e
 	return nil
 }

-func pullModelManifest(mp ModelPath, username, password string) (*ManifestV2, error) {
-	url := fmt.Sprintf("%s://%s/v2/%s/manifests/%s", mp.ProtocolScheme, mp.Registry, mp.GetNamespaceRepository(), mp.Tag)
+func pullModelManifest(mp ModelPath, regOpts *RegistryOptions) (*ManifestV2, error) {
+	url := fmt.Sprintf("%s/v2/%s/manifests/%s", mp.Registry, mp.GetNamespaceRepository(), mp.Tag)
 	headers := map[string]string{
 		"Accept": "application/vnd.docker.distribution.manifest.v2+json",
 	}

-	resp, err := makeRequest("GET", url, headers, nil, username, password)
+	resp, err := makeRequest("GET", url, headers, nil, regOpts)
 	if err != nil {
 		log.Printf("couldn't get manifest: %v", err)
 		return nil, err
@@ -641,8 +770,7 @@ func createConfigLayer(layers []string) (*LayerReader, error) {
 		return nil, err
 	}

-	buf := bytes.NewBuffer(configJSON)
-	digest, size := GetSHA256Digest(buf)
+	digest, size := GetSHA256Digest(bytes.NewBuffer(configJSON))

 	layer := &LayerReader{
 		Layer: Layer{
@@ -650,7 +778,7 @@ func createConfigLayer(layers []string) (*LayerReader, error) {
 			Digest:    digest,
 			Size:      size,
 		},
-		Reader: buf,
+		Reader: bytes.NewBuffer(configJSON),
 	}
 	return layer, nil
 }
@@ -666,10 +794,10 @@ func GetSHA256Digest(r io.Reader) (string, int) {
 	return fmt.Sprintf("sha256:%x", h.Sum(nil)), int(n)
 }

-func startUpload(mp ModelPath, username string, password string) (string, error) {
-	url := fmt.Sprintf("%s://%s/v2/%s/blobs/uploads/", mp.ProtocolScheme, mp.Registry, mp.GetNamespaceRepository())
+func startUpload(mp ModelPath, regOpts *RegistryOptions) (string, error) {
+	url := fmt.Sprintf("%s/v2/%s/blobs/uploads/", mp.Registry, mp.GetNamespaceRepository())

-	resp, err := makeRequest("POST", url, nil, nil, username, password)
+	resp, err := makeRequest("POST", url, nil, nil, regOpts)
 	if err != nil {
 		log.Printf("couldn't start upload: %v", err)
 		return "", err
@@ -692,10 +820,10 @@ func startUpload(mp ModelPath, username string, password string) (string, error)
 }

 // Function to check if a blob already exists in the Docker registry
-func checkBlobExistence(mp ModelPath, digest string, username string, password string) (bool, error) {
-	url := fmt.Sprintf("%s://%s/v2/%s/blobs/%s", mp.ProtocolScheme, mp.Registry, mp.GetNamespaceRepository(), digest)
+func checkBlobExistence(mp ModelPath, digest string, regOpts *RegistryOptions) (bool, error) {
+	url := fmt.Sprintf("%s/v2/%s/blobs/%s", mp.Registry, mp.GetNamespaceRepository(), digest)

-	resp, err := makeRequest("HEAD", url, nil, nil, username, password)
+	resp, err := makeRequest("HEAD", url, nil, nil, regOpts)
 	if err != nil {
 		log.Printf("couldn't check for blob: %v", err)
 		return false, err
@@ -706,19 +834,14 @@ func checkBlobExistence(mp ModelPath, digest string, username string, password s
 	return resp.StatusCode == http.StatusOK, nil
 }

-func uploadBlob(location string, layer *Layer, username string, password string) error {
-	// Create URL
-	url := fmt.Sprintf("%s&digest=%s", location, layer.Digest)
-
-	headers := make(map[string]string)
-	headers["Content-Length"] = fmt.Sprintf("%d", layer.Size)
-	headers["Content-Type"] = "application/octet-stream"
-
-	// TODO change from monolithic uploads to chunked uploads
+func uploadBlobChunked(mp ModelPath, location string, layer *Layer, regOpts *RegistryOptions, fn func(api.ProgressResponse)) error {
 	// TODO allow resumability
 	// TODO allow canceling uploads via DELETE
 	// TODO allow cross repo blob mount

+	// Create URL
+	url := location
+
 	fp, err := GetBlobsPath(layer.Digest)
 	if err != nil {
 		return err
@@ -729,23 +852,76 @@ func uploadBlob(location string, layer *Layer, username string, password string)
 		return err
 	}

-	resp, err := makeRequest("PUT", url, headers, f, username, password)
-	if err != nil {
-		log.Printf("couldn't upload blob: %v", err)
-		return err
-	}
-	defer resp.Body.Close()
+	headers := make(map[string]string)
+	headers["Content-Type"] = "application/octet-stream"

-	// Check for success: For a successful upload, the Docker registry will respond with a 201 Created
-	if resp.StatusCode != http.StatusCreated {
-		body, _ := io.ReadAll(resp.Body)
-		return fmt.Errorf("registry responded with code %d: %v", resp.StatusCode, string(body))
-	}
+	chunkSize := 1 << 20
+	buf := make([]byte, chunkSize)
+	var totalUploaded int

+	for {
+		n, err := f.Read(buf)
+		if err != nil {
+			return err
+		}
+
+		headers["Content-Length"] = fmt.Sprintf("%d", n)
+		headers["Content-Range"] = fmt.Sprintf("%d-%d", totalUploaded, totalUploaded+n-1)
+
+		fn(api.ProgressResponse{
+			Status:    fmt.Sprintf("uploading %s", layer.Digest),
+			Digest:    layer.Digest,
+			Total:     int(layer.Size),
+			Completed: int(totalUploaded),
+		})
+
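+		// send only the bytes read in this iteration; the final chunk is
+		// usually shorter than chunkSize, and buf itself must keep its full
+		// length for the next Read
+		chunk := buf[:n]
+		resp, err := makeRequest("PATCH", url, headers, bytes.NewReader(chunk), regOpts)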
+		if err != nil {
+			log.Printf("couldn't upload blob: %v", err)
+			return err
+		}
+		defer resp.Body.Close()
+		url = resp.Header.Get("Location")
+
+		// Check for success: for each accepted chunk, the Docker registry will respond with a 202 Accepted
+		if resp.StatusCode != http.StatusAccepted {
+			fn(api.ProgressResponse{
+				Status:    "error uploading layer",
+				Digest:    layer.Digest,
+				Total:     int(layer.Size),
+				Completed: int(totalUploaded),
+			})
+			body, _ := io.ReadAll(resp.Body)
+			return fmt.Errorf("registry responded with code %d: %v", resp.StatusCode, string(body))
+		}
+
+		totalUploaded += n
+		if totalUploaded >= layer.Size {
+			url = fmt.Sprintf("%s&digest=%s", url, layer.Digest)
+
+			// finish the upload
+			resp, err := makeRequest("PUT", url, nil, nil, regOpts)
+			if err != nil {
+				log.Printf("couldn't finish upload: %v", err)
+				return err
+			}
+			defer resp.Body.Close()
+
+			if resp.StatusCode != http.StatusCreated {
+				body, _ := io.ReadAll(resp.Body)
+				return fmt.Errorf("registry responded with code %d: %v", resp.StatusCode, string(body))
+			}
+			break
+		}
+	}
 	return nil
 }

-func downloadBlob(mp ModelPath, digest string, username, password string, fn func(api.ProgressResponse)) error {
+func downloadBlob(mp ModelPath, digest string, regOpts *RegistryOptions, fn func(api.ProgressResponse)) error {
 	fp, err := GetBlobsPath(digest)
 	if err != nil {
 		return err
@@ -774,12 +950,12 @@ func downloadBlob(mp ModelPath, digest string, username, password string, fn fun
 		size = fi.Size()
 	}

-	url := fmt.Sprintf("%s://%s/v2/%s/blobs/%s", mp.ProtocolScheme, mp.Registry, mp.GetNamespaceRepository(), digest)
+	url := fmt.Sprintf("%s/v2/%s/blobs/%s", mp.Registry, mp.GetNamespaceRepository(), digest)
 	headers := map[string]string{
 		"Range": fmt.Sprintf("bytes=%d-", size),
 	}

-	resp, err := makeRequest("GET", url, headers, nil, username, password)
+	resp, err := makeRequest("GET", url, headers, nil, regOpts)
 	if err != nil {
 		log.Printf("couldn't download blob: %v", err)
 		return err
@@ -815,6 +991,10 @@ func downloadBlob(mp ModelPath, digest string, username, password string, fn fun
 		})

 		if completed >= total {
+			if err := out.Close(); err != nil {
+				return err
+			}
+
 			if err := os.Rename(fp+"-partial", fp); err != nil {
 				fn(api.ProgressResponse{
 					Status:    fmt.Sprintf("error renaming file: %v", err),
@@ -839,7 +1019,15 @@ func downloadBlob(mp ModelPath, digest string, username, password string, fn fun
 	return nil
 }

-func makeRequest(method, url string, headers map[string]string, body io.Reader, username, password string) (*http.Response, error) {
+func makeRequest(method, url string, headers map[string]string, body io.Reader, regOpts *RegistryOptions) (*http.Response, error) {
+	if !strings.HasPrefix(url, "http") {
+		if regOpts.Insecure {
+			url = "http://" + url
+		} else {
+			url = "https://" + url
+		}
+	}
+
 	req, err := http.NewRequest(method, url, body)
 	if err != nil {
 		return nil, err
@@ -850,8 +1038,8 @@ func makeRequest(method, url string, headers map[string]string, body io.Reader,
 	}

 	// TODO: better auth
-	if username != "" && password != "" {
-		req.SetBasicAuth(username, password)
+	if regOpts.Username != "" && regOpts.Password != "" {
+		req.SetBasicAuth(regOpts.Username, regOpts.Password)
 	}

 	client := &http.Client{
@@ -870,3 +1058,23 @@ func makeRequest(method, url string, headers map[string]string, body io.Reader,

 	return resp, nil
 }
+
+func verifyBlob(digest string) error {
+	fp, err := GetBlobsPath(digest)
+	if err != nil {
+		return err
+	}
+
+	f, err := os.Open(fp)
+	if err != nil {
+		return err
+	}
+	defer f.Close()
+
+	fileDigest, _ := GetSHA256Digest(f)
+	if digest != fileDigest {
+		return fmt.Errorf("digest mismatch: want %s, got %s", digest, fileDigest)
+	}
+
+	return nil
+}
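Note on the chunking loop in `uploadBlobChunked` above: the sketch below isolates just the slicing and `Content-Range` bookkeeping, without any HTTP calls. It is a minimal illustration, not the code in this change; `chunk`, `splitChunks`, and `splitBytes` are hypothetical names, and `io.ReadFull` is swapped in for the bare `f.Read` so that a short read in the middle of a stream cannot be mistaken for the final chunk.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunk is a hypothetical type: one slice of a blob plus the inclusive
// "start-end" string that would go into the Content-Range header.
type chunk struct {
	Data  []byte
	Range string
}

// splitChunks reads r in pieces of at most chunkSize bytes. io.ReadFull
// keeps reading until the buffer is full or the stream ends, so only the
// last chunk can be shorter than chunkSize.
func splitChunks(r io.Reader, chunkSize int) ([]chunk, error) {
	var chunks []chunk
	offset := 0
	for {
		buf := make([]byte, chunkSize)
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			chunks = append(chunks, chunk{
				Data:  buf[:n],
				Range: fmt.Sprintf("%d-%d", offset, offset+n-1),
			})
			offset += n
		}
		// io.EOF: nothing left; io.ErrUnexpectedEOF: short final chunk.
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return chunks, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

// splitBytes is a convenience wrapper for in-memory data.
func splitBytes(data []byte, chunkSize int) ([]chunk, error) {
	return splitChunks(bytes.NewReader(data), chunkSize)
}
```

A real upload would likely send each chunk as soon as it is read instead of collecting them all in memory; the slice is returned here only to keep the example easy to inspect.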
--- a/server/modelpath.go
+++ b/server/modelpath.go
@@ -4,6 +4,7 @@ import (
 	"fmt"
 	"os"
 	"path/filepath"
+	"runtime"
 	"strings"
 )

@@ -44,7 +45,7 @@ func ParseModelPath(name string) ModelPath {
 		return ModelPath{}
 	}

-	colonParts := strings.Split(name, ":")
+	colonParts := strings.Split(slashParts[len(slashParts)-1], ":")
 	if len(colonParts) == 2 {
 		tag = colonParts[1]
 	} else {
@@ -69,10 +70,13 @@ func (mp ModelPath) GetFullTagname() string {
 }

 func (mp ModelPath) GetShortTagname() string {
-	if mp.Registry == DefaultRegistry && mp.Namespace == DefaultNamespace {
-		return fmt.Sprintf("%s:%s", mp.Repository, mp.Tag)
+	if mp.Registry == DefaultRegistry {
+		if mp.Namespace == DefaultNamespace {
+			return fmt.Sprintf("%s:%s", mp.Repository, mp.Tag)
+		}
+		return fmt.Sprintf("%s/%s:%s", mp.Namespace, mp.Repository, mp.Tag)
 	}
-	return fmt.Sprintf("%s/%s:%s", mp.Namespace, mp.Repository, mp.Tag)
+	return fmt.Sprintf("%s/%s/%s:%s", mp.Registry, mp.Namespace, mp.Repository, mp.Tag)
 }

 func (mp ModelPath) GetManifestPath(createDir bool) (string, error) {
@@ -106,6 +110,10 @@ func GetBlobsPath(digest string) (string, error) {
 		return "", err
 	}

+	if runtime.GOOS == "windows" {
+		digest = strings.ReplaceAll(digest, ":", "-")
+	}
+
 	path := filepath.Join(home, ".ollama", "models", "blobs", digest)
 	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
 		return "", err
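Note on the `ParseModelPath` change above: splitting the whole name on `:` breaks for registries that carry a port, which is presumably why the split now applies only to the last slash-separated segment. The sketch below shows that idea in isolation. `splitTag` is a hypothetical helper, not part of this change, and the `latest` fallback is an assumption made for the example.

```go
package main

import "strings"

// splitTag separates a model reference into its name and tag. Only the
// segment after the last "/" is searched for ":", so a registry port such
// as "localhost:5000" is never mistaken for a tag separator.
// The "latest" default is an assumption for this sketch.
func splitTag(ref string) (name, tag string) {
	slash := strings.LastIndex(ref, "/") // -1 when there is no slash
	last := ref[slash+1:]
	if i := strings.Index(last, ":"); i >= 0 {
		return ref[:slash+1] + last[:i], last[i+1:]
	}
	return ref, "latest"
}
```

Under these assumptions, `splitTag("localhost:5000/library/llama2:7b")` yields the name `localhost:5000/library/llama2` and the tag `7b`, where a naive split on `:` would produce three parts.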
--- a/server/routes.go
+++ b/server/routes.go
@@ -2,6 +2,8 @@ package server

 import (
 	"encoding/json"
+	"errors"
+	"fmt"
 	"io"
 	"log"
 	"net"
@@ -9,26 +11,17 @@ import (
 	"os"
 	"path/filepath"
 	"strings"
-	"text/template"
 	"time"

 	"dario.cat/mergo"
+	"github.com/gin-contrib/cors"
 	"github.com/gin-gonic/gin"

 	"github.com/jmorganca/ollama/api"
 	"github.com/jmorganca/ollama/llama"
 )

-func cacheDir() string {
-	home, err := os.UserHomeDir()
-	if err != nil {
-		panic(err)
-	}
-
-	return filepath.Join(home, ".ollama")
-}
-
-func generate(c *gin.Context) {
+func GenerateHandler(c *gin.Context) {
 	start := time.Now()

 	var req api.GenerateRequest
@@ -54,19 +47,12 @@ func generate(c *gin.Context) {
 		return
 	}

-	templ, err := template.New("").Parse(model.Prompt)
+	prompt, err := model.Prompt(req)
 	if err != nil {
 		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
 		return
 	}

-	var sb strings.Builder
-	if err = templ.Execute(&sb, req); err != nil {
-		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
-		return
-	}
-	req.Prompt = sb.String()
-
 	llm, err := llama.New(model.ModelPath, opts)
 	if err != nil {
 		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
@@ -77,7 +63,7 @@ func generate(c *gin.Context) {
 	ch := make(chan any)
 	go func() {
 		defer close(ch)
-		llm.Predict(req.Context, req.Prompt, func(r api.GenerateResponse) {
+		fn := func(r api.GenerateResponse) {
 			r.Model = req.Model
 			r.CreatedAt = time.Now().UTC()
 			if r.Done {
@@ -85,13 +71,17 @@ func generate(c *gin.Context) {
 			}

 			ch <- r
-		})
+		}
+
+		if err := llm.Predict(req.Context, prompt, fn); err != nil {
+			ch <- gin.H{"error": err.Error()}
+		}
 	}()

 	streamResponse(c, ch)
 }

-func pull(c *gin.Context) {
+func PullModelHandler(c *gin.Context) {
 	var req api.PullRequest
 	if err := c.ShouldBindJSON(&req); err != nil {
 		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
@@ -105,16 +95,21 @@ func pull(c *gin.Context) {
 			ch <- r
 		}

-		if err := PullModel(req.Name, req.Username, req.Password, fn); err != nil {
-			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
-			return
+		regOpts := &RegistryOptions{
+			Insecure: req.Insecure,
+			Username: req.Username,
+			Password: req.Password,
+		}
+
+		if err := PullModel(req.Name, regOpts, fn); err != nil {
+			ch <- gin.H{"error": err.Error()}
 		}
 	}()

 	streamResponse(c, ch)
 }

-func push(c *gin.Context) {
+func PushModelHandler(c *gin.Context) {
 	var req api.PushRequest
 	if err := c.ShouldBindJSON(&req); err != nil {
 		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
@@ -128,31 +123,27 @@ func push(c *gin.Context) {
 			ch <- r
 		}

-		if err := PushModel(req.Name, req.Username, req.Password, fn); err != nil {
-			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
-			return
+		regOpts := &RegistryOptions{
+			Insecure: req.Insecure,
+			Username: req.Username,
+			Password: req.Password,
+		}
+
+		if err := PushModel(req.Name, regOpts, fn); err != nil {
+			ch <- gin.H{"error": err.Error()}
 		}
 	}()

 	streamResponse(c, ch)
 }

-func create(c *gin.Context) {
+func CreateModelHandler(c *gin.Context) {
 	var req api.CreateRequest
 	if err := c.ShouldBindJSON(&req); err != nil {
 		c.JSON(http.StatusBadRequest, gin.H{"message": err.Error()})
 		return
 	}

-	// NOTE consider passing the entire Modelfile in the json instead of the path to it
-
-	file, err := os.Open(req.Path)
-	if err != nil {
-		c.JSON(http.StatusBadRequest, gin.H{"message": err.Error()})
-		return
-	}
-	defer file.Close()
-
 	ch := make(chan any)
 	go func() {
 		defer close(ch)
@@ -162,16 +153,32 @@ func create(c *gin.Context) {
 			}
 		}

-		if err := CreateModel(req.Name, file, fn); err != nil {
-			c.JSON(http.StatusBadRequest, gin.H{"message": err.Error()})
-			return
+		if err := CreateModel(req.Name, req.Path, fn); err != nil {
+			ch <- gin.H{"error": err.Error()}
 		}
 	}()

 	streamResponse(c, ch)
 }

-func list(c *gin.Context) {
+func DeleteModelHandler(c *gin.Context) {
+	var req api.DeleteRequest
+	if err := c.ShouldBindJSON(&req); err != nil {
+		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
+		return
+	}
+
+	if err := DeleteModel(req.Name); err != nil {
+		if os.IsNotExist(err) {
+			c.JSON(http.StatusNotFound, gin.H{"error": fmt.Sprintf("model '%s' not found", req.Name)})
+		} else {
+			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
+		}
+		return
+	}
+}
+
+func ListModelsHandler(c *gin.Context) {
 	var models []api.ListResponseModel
 	fp, err := GetManifestPath()
 	if err != nil {
@@ -180,6 +187,10 @@ func list(c *gin.Context) {
 	}
 	err = filepath.Walk(fp, func(path string, info os.FileInfo, err error) error {
 		if err != nil {
+			if errors.Is(err, os.ErrNotExist) {
+				log.Printf("manifest file does not exist: %s", fp)
+				return nil
+			}
 			return err
 		}
 		if !info.IsDir() {
@@ -217,18 +228,52 @@ func list(c *gin.Context) {
 	c.JSON(http.StatusOK, api.ListResponse{models})
 }

+func CopyModelHandler(c *gin.Context) {
+	var req api.CopyRequest
+	if err := c.ShouldBindJSON(&req); err != nil {
+		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
+		return
+	}
+
+	if err := CopyModel(req.Source, req.Destination); err != nil {
+		if os.IsNotExist(err) {
+			c.JSON(http.StatusNotFound, gin.H{"error": fmt.Sprintf("model '%s' not found", req.Source)})
+		} else {
+			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
+		}
+		return
+	}
+}
+
 func Serve(ln net.Listener) error {
+	config := cors.DefaultConfig()
+	config.AllowWildcard = true
+	// only allow http/https from localhost
+	config.AllowOrigins = []string{
+		"http://localhost",
+		"http://localhost:*",
+		"https://localhost",
+		"https://localhost:*",
+		"http://127.0.0.1",
+		"http://127.0.0.1:*",
+		"https://127.0.0.1",
+		"https://127.0.0.1:*",
+	}
+
 	r := gin.Default()
+	r.Use(cors.New(config))

 	r.GET("/", func(c *gin.Context) {
 		c.String(http.StatusOK, "Ollama is running")
 	})

-	r.POST("/api/pull", pull)
-	r.POST("/api/generate", generate)
-	r.POST("/api/create", create)
-	r.POST("/api/push", push)
-	r.GET("/api/tags", list)
+	r.POST("/api/pull", PullModelHandler)
+	r.POST("/api/generate", GenerateHandler)
+	r.POST("/api/create", CreateModelHandler)
+	r.POST("/api/push", PushModelHandler)
+	r.POST("/api/copy", CopyModelHandler)
+	r.GET("/api/tags", ListModelsHandler)
+	r.DELETE("/api/delete", DeleteModelHandler)

 	log.Printf("Listening on %s", ln.Addr())
 	s := &http.Server{
--- a/server/templates/alpaca.prompt
+++ b/server/templates/alpaca.prompt
@@ -1,10 +0,0 @@
-{{- if not .Context }}
-Below is an instruction that describes a task. Write a response that appropriately completes the request.
-{{- end }}
-
-### Instruction:
-{{ .Prompt }}
-
-### Response:
-
-
--- a/server/templates/falcon.prompt
+++ b/server/templates/falcon.prompt
@@ -1,5 +0,0 @@
-{{- if not .Context }}
-A helpful assistant who helps the user with any questions asked.
-{{- end }}
-User: {{ .Prompt }}
-Assistant:
--- a/server/templates/gpt4.prompt
+++ b/server/templates/gpt4.prompt
@@ -1,5 +0,0 @@
-### Instruction:
-{{ .Prompt }}
-
-### Response:
-
--- a/server/templates/hermes.prompt
+++ b/server/templates/hermes.prompt
@@ -1,5 +0,0 @@
-### Instruction:
-{{ .Prompt }}
-
-### Response:
-
--- a/server/templates/mpt.prompt
+++ b/server/templates/mpt.prompt
@@ -1,6 +0,0 @@
-{{- if not .Context }}
-Below is an instruction that describes a task. Write a response that appropriately completes the request. Be concise. Once the request is completed, include no other text.
-{{- end }}
-### Instruction:
-{{ .Prompt }}
-### Response:
--- a/server/templates/oasst.prompt
+++ b/server/templates/oasst.prompt
@@ -1 +0,0 @@
-{{ .Prompt }}
--- a/server/templates/orca.prompt
+++ b/server/templates/orca.prompt
@@ -1,9 +0,0 @@
-{{- if not .Context }}
-### System:
-You are an AI assistant that follows instruction extremely well. Help as much as you can.
-{{- end }}
-
-### User:
-{{ .Prompt }}
-
-### Response:
--- a/server/templates/qlora.prompt
+++ b/server/templates/qlora.prompt
@@ -1,2 +0,0 @@
-### Human: {{ .Prompt }}
-### Assistant:
--- a/server/templates/tulu.prompt
+++ b/server/templates/tulu.prompt
@@ -1,4 +0,0 @@
-
-{{ .Prompt }}
-
-
--- a/server/templates/ultralm.prompt
+++ b/server/templates/ultralm.prompt
@@ -1,2 +0,0 @@
-USER: {{ .Prompt }}
-ASSISTANT:
--- a/server/templates/vicuna.prompt
+++ b/server/templates/vicuna.prompt
@@ -1,6 +0,0 @@
-{{ if not .Context }}
-A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
-{{- end }}
-
-USER: {{ .Prompt }}
-ASSISTANT:
--- a/server/templates/wizardcoder.prompt
+++ b/server/templates/wizardcoder.prompt
@@ -1,7 +0,0 @@
-{{- if not .Context }}
-Below is an instruction that describes a task. Write a response that appropriately completes the request
-{{- end }}
-
-### Instruction: {{ .Prompt }}
-
-### Response:
--- a/server/templates/wizardlm.prompt
+++ b/server/templates/wizardlm.prompt
@@ -1,3 +0,0 @@
-{{ .Prompt }}
-
-### Response:
--- a/web/app/api/models/route.ts
+++ b/web/app/api/models/route.ts
@@ -1,6 +0,0 @@
-import models from '../../../../models.json'
-import { NextResponse } from 'next/server'
-
-export async function GET() {
-  return NextResponse.json(models)
-}
--- a/web/app/api/signup/route.ts
+++ b/web/app/api/signup/route.ts
@@ -6,12 +6,22 @@ const analytics = new Analytics({ writeKey: process.env.TELEMETRY_WRITE_KEY || '
 export async function POST(req: Request) {
  const { email } = await req.json()

-  analytics.identify({
-    anonymousId: uuid(),
+  const id = uuid()
+
+  await analytics.identify({
+    anonymousId: id,
    traits: {
      email,
    },
  })

+  await analytics.track({
+    anonymousId: id,
+    event: 'signup',
+    properties: {
+      email,
+    },
+  })
+
  return new Response(null, { status: 200 })
 }
--- a/web/app/download/page.tsx
+++ b/web/app/download/page.tsx
@@ -1,3 +1,5 @@
+import Image from 'next/image'
+
 import Header from '../header'
 import Downloader from './downloader'
 import Signup from './signup'
@@ -30,7 +32,7 @@ export default async function Download() {
    <>
      <Header />
      <main className='flex min-h-screen max-w-6xl flex-col py-20 px-16 lg:p-32 items-center mx-auto'>
-        <img src='/ollama.png' className='w-16 h-auto' />
+        <Image src='/ollama.png' width={64} height={64} alt='ollamaIcon' />
        <section className='mt-12 mb-8 text-center'>
          <h2 className='my-2 max-w-md text-3xl tracking-tight'>Downloading...</h2>
          <h3 className='text-base text-neutral-500 mt-12 max-w-[16rem]'>
--- a/web/app/header.tsx
+++ b/web/app/header.tsx
@@ -1,24 +1,26 @@
+import Link from "next/link"
+
 const navigation = [
  { name: 'Discord', href: 'https://discord.gg/MrfB5FbNWN' },
-  { name: 'GitHub', href: 'https://github.com/jmorganca/ollama' },
+  { name: 'Github', href: 'https://github.com/jmorganca/ollama' },
  { name: 'Download', href: '/download' },
 ]

-export default function Header() {
+export default function Header() {  
  return (
-    <header className='absolute inset-x-0 top-0 z-50'>
-      <nav className='mx-auto flex items-center justify-between px-10 py-4'>
-        <a className='flex-1 font-bold' href='/'>
+    <header className="absolute inset-x-0 top-0 z-50">
+      <nav className="mx-auto flex items-center justify-between px-10 py-4">        
+        <Link className="flex-1 font-bold" href="/">
          Ollama
-        </a>
-        <div className='flex space-x-8'>
-          {navigation.map(item => (
-            <a key={item.name} href={item.href} className='text-sm leading-6 text-gray-900'>
+        </Link>
+        <div className="flex space-x-8">
+          {navigation.map((item) => (
+            <Link key={item.name} href={item.href} className="text-sm leading-6 text-gray-900">
              {item.name}
-            </a>
+            </Link>
          ))}
        </div>
      </nav>
-    </header>
+    </header >
  )
-}
+}
--- a/web/app/page.tsx
+++ b/web/app/page.tsx
@@ -1,6 +1,6 @@
-import { AiFillApple } from 'react-icons/ai'
+import Image from 'next/image'
+import Link from 'next/link'

-import models from '../../models.json'
 import Header from './header'

 export default async function Home() {
@@ -8,21 +8,26 @@ export default async function Home() {
    <>
      <Header />
      <main className='flex min-h-screen max-w-6xl flex-col py-20 px-16 md:p-32 items-center mx-auto'>
-        <img src='/ollama.png' className='w-16 h-auto' />
+        <Image src='/ollama.png' width={64} height={64} alt='ollamaIcon' />
        <section className='my-12 text-center'>
          <div className='flex flex-col space-y-2'>
-            <h2 className='md:max-w-[18rem] mx-auto my-2 text-3xl tracking-tight'>Portable large language models</h2>
+            <h2 className='md:max-w-md mx-auto my-2 text-3xl tracking-tight'>
+              Get up and running with large language models, locally.
+            </h2>
            <h3 className='md:max-w-xs mx-auto text-base text-neutral-500'>
-              Bundle a model’s weights, configuration, prompts, data and more into self-contained packages that run anywhere.
+              Run Llama 2 and other models on macOS. Customize and create your own.
            </h3>
          </div>
-          <div className='mx-auto flex flex-col space-y-4 mt-12'>
-            <a href='/download' className='md:mx-10 lg:mx-14 bg-black text-white rounded-full px-4 py-2 focus:outline-none cursor-pointer'>
+          <div className='mx-auto max-w-xs flex flex-col space-y-4 mt-12'>
+            <Link
+              href='/download'
+              className='md:mx-10 lg:mx-14 bg-black text-white rounded-full px-4 py-2 focus:outline-none cursor-pointer'
+            >
              Download
-            </a>
+            </Link>
            <p className='text-neutral-500 text-sm '>
-            Available for macOS with Apple Silicon <br />
-            Windows & Linux support coming soon.
+              Available for macOS with Apple Silicon <br />
+              Windows & Linux support coming soon.
            </p>
          </div>
        </section>
Author	SHA1	Message	Date
Patrick Devine	11b844e1bb	add copy command	2023-07-24 10:55:38 -04:00
Patrick Devine	88c55199f8	change push to chunked uploads from monolithic (#179)	2023-07-22 17:31:26 -07:00
hoyyeva	c448443813	Merge pull request #164 from jmorganca/restart-server restart server more gracefully	2023-07-22 18:19:22 -04:00
Michael Yang	efacd45fc5	Merge pull request #175 from jk1jk/main Update .gitignore	2023-07-22 09:40:37 -07:00
Michael Yang	fa522695c4	Merge pull request #178 from jmorganca/gin-cors use gin-contrib/cors middleware	2023-07-22 09:40:01 -07:00
Michael Yang	8609db77ea	use gin-contrib/cors middleware	2023-07-22 09:39:08 -07:00
Ikko Eltociear Ashimine	65d93a86b2	Update modelfile.md (#177) fix markdown.	2023-07-22 08:19:30 -07:00
jk1jk	e6c427ce4d	Update .gitignore	2023-07-22 17:00:52 +03:00
Patrick Devine	6d6b0d3321	change error handler behavior and fix error when a model isn't found (#173)	2023-07-21 23:02:12 -07:00
Michael Yang	37324a0a00	Merge pull request #172 from jmorganca/set-vars-first fix vars.First	2023-07-21 20:55:06 -07:00
Michael Yang	20a5d99f77	fix vars.First	2023-07-21 20:45:32 -07:00
Patrick Devine	3b43cc019a	fix extended tag names (#171)	2023-07-21 20:27:25 -07:00
Patrick Devine	b8421dce3d	get the proper path for blobs to delete (#168)	2023-07-21 17:30:40 -07:00
Patrick Devine	9f6e97865c	allow pushing/pulling to insecure registries (#157)	2023-07-21 15:42:19 -07:00
Eva Ho	9657314ae2	address comment	2023-07-21 17:29:07 -04:00
Eva Ho	3f7d2336c7	add prettier and address comments	2023-07-21 17:10:05 -04:00
Eva Ho	e0a73d7fbe	address comment	2023-07-21 16:53:56 -04:00
hoyyeva	b08c4ca2bd	Update app/src/index.ts Co-authored-by: Jeffrey Morgan <251292+jmorganca@users.noreply.github.com>	2023-07-21 16:53:56 -04:00
Eva Ho	734892f1e2	address comment	2023-07-21 16:53:56 -04:00
Eva Ho	d2bfaeac63	format code	2023-07-21 16:53:56 -04:00
Eva Ho	0768b1b907	restart server with condition and timeout	2023-07-21 16:53:56 -04:00
Bruce MacDonald	f5f0da06d9	Merge pull request #166 from jmorganca/brucemacd/dev-cgo	2023-07-21 22:48:10 +02:00
Bruce MacDonald	52f04e39f2	Note that CGO must be enabled in dev docs	2023-07-21 22:36:36 +02:00
Jeffrey Morgan	3c8f4c03d7	web: tweak homepage text	2023-07-21 09:57:57 -07:00
Bruce MacDonald	7ba1308595	Merge pull request #147 from jmorganca/brucemacd/cli-err-display Improve CLI error display	2023-07-21 16:10:19 +02:00
Jeffrey Morgan	91cd54016c	add basic REST api documentation	2023-07-21 00:47:17 -07:00
Patrick Devine	e7a393de54	add rm command for models (#151)	2023-07-20 16:09:23 -07:00
Jeffrey Morgan	8454f298ac	fix example `Modelfile`s	2023-07-20 15:46:32 -07:00
Patrick Devine	a3badaf103	add ls alias (#152)	2023-07-20 15:28:27 -07:00
Michael Yang	50e8e5bdbe	Merge pull request #148 from jmorganca/more-llama-files add llama.cpp mpi, opencl files	2023-07-20 14:26:46 -07:00
Michael Yang	8526e1f5f1	add llama.cpp mpi, opencl files	2023-07-20 14:19:55 -07:00
Michael Yang	0cfdbb95cc	Merge pull request #146 from jmorganca/fix-windows-pull windows: fix model pulling	2023-07-20 13:41:54 -07:00
Michael Yang	6cea2061ec	windows: fix model pulling	2023-07-20 12:35:04 -07:00
Michael Yang	2832801c2a	Merge pull request #91 from jmorganca/fix-stream-errors fix stream errors	2023-07-20 12:21:59 -07:00
Jeffrey Morgan	23a37dc466	clean up `README.md`	2023-07-20 12:21:36 -07:00
Michael Yang	992892866b	Merge pull request #145 from jmorganca/verify-digest verify blob digest	2023-07-20 12:14:21 -07:00
Michael Yang	dde880290c	Merge pull request #131 from jmorganca/update-llama-cpp update llama.cpp to e782c9e735f93ab4767ffc37462c523b73a17ddc	2023-07-20 12:14:10 -07:00
Michael Yang	1f27d7f1b8	fix stream errors	2023-07-20 12:12:08 -07:00
Bruce MacDonald	00aaa05901	remove unused code	2023-07-20 20:57:30 +02:00
Michael Yang	a83eaa7a9f	update llama.cpp to e782c9e735f93ab4767ffc37462c523b73a17ddc	2023-07-20 11:55:56 -07:00
Michael Yang	5156e48c2a	add script to update llama.cpp	2023-07-20 11:54:59 -07:00
Michael Yang	bf198c3918	verify blob digest	2023-07-20 11:53:57 -07:00
Bruce MacDonald	09dc6273e3	suppress error when running list before pulling image	2023-07-20 20:53:09 +02:00
Bruce MacDonald	ebaa33ac28	display gin api errors in cli	2023-07-20 20:45:12 +02:00
Bruce MacDonald	3ec4ebc562	remove unused code	2023-07-20 20:18:00 +02:00
Jeffrey Morgan	6a19724d5f	remove colon from library modelfiles	2023-07-20 09:51:30 -07:00
Jeffrey Morgan	924ce739f9	documentation on the model format	2023-07-20 09:03:41 -07:00
Michael Chiang	e1973e6780	Update icon (#139)	2023-07-20 08:55:20 -07:00
Jeffrey Morgan	f1b08ef40e	set temperature on `README.md` example	2023-07-20 08:17:09 -07:00
Jeffrey Morgan	31f0cb7742	new `Modelfile` syntax	2023-07-20 07:52:24 -07:00
Jeffrey Morgan	e4b2ccfb23	web: clean up remaining `models.json` usage	2023-07-20 07:51:46 -07:00
Bruce MacDonald	a3d7bb0a30	Merge pull request #136 from jmorganca/brucemacd/remove-models Delete models.json	2023-07-20 16:40:46 +02:00
Bruce MacDonald	77e49f3822	Delete models.json	2023-07-20 16:32:50 +02:00
Jeffrey Morgan	8945b25484	new modelfile syntax on branch	2023-07-20 02:24:21 -07:00
Jeffrey Morgan	99ccf0c5d3	fix broken link in `README.md`	2023-07-20 02:15:11 -07:00
Jeffrey Morgan	d59b164fa2	add prompt back to parser	2023-07-20 01:13:30 -07:00
Michael Yang	55b5f5dc34	ctrl+c on empty line exits (#135)	2023-07-20 00:53:08 -07:00
Jeffrey Morgan	3b135ac963	parser: fix case where multi line string termination error wouldnt show	2023-07-20 00:43:22 -07:00
Jeffrey Morgan	e6bae8d916	parser: keep seeking until eof	2023-07-20 00:37:52 -07:00
Jeffrey Morgan	d9f54300c3	library: add echo for verify progress	2023-07-19 23:58:28 -07:00
Jeffrey Morgan	1511219763	update library modelfiles with new syntax	2023-07-19 23:57:22 -07:00
Jeffrey Morgan	ada0add89b	fix `llama` library templates	2023-07-19 23:53:40 -07:00
Jeffrey Morgan	75e508e1d6	remove old `templates`	2023-07-19 23:47:13 -07:00
Michael Yang	6f046dbf18	Update images.go (#134)	2023-07-19 23:46:01 -07:00
Jeffrey Morgan	cd820c8bca	move `wizard-vicuna` to correct location	2023-07-19 23:44:03 -07:00
Jeffrey Morgan	88e755d7fd	Add files for library models	2023-07-19 23:40:37 -07:00
Michael Yang	6984171cfd	Merge pull request #93 from jmorganca/split-prompt separate prompt into template and system	2023-07-19 23:25:33 -07:00
Michael Yang	60b4db6389	add .First	2023-07-19 23:24:32 -07:00
Michael Chiang	7c6ea2a966	fix dangling """	2023-07-19 23:24:32 -07:00
Michael Chiang	c161aef5f9	update example	2023-07-19 23:24:32 -07:00
Michael Chiang	c47786c1b0	Update docs/modelfile.md Co-authored-by: Michael Yang <mxyng@pm.me>	2023-07-19 23:24:32 -07:00
Michael Chiang	df100ce540	Update docs/modelfile.md Co-authored-by: Michael Yang <mxyng@pm.me>	2023-07-19 23:24:32 -07:00
Michael Chiang	5c5948b4e7	clean up my previous empty sentences	2023-07-19 23:24:32 -07:00
Michael Yang	1c72e46e09	update modelfile.md	2023-07-19 23:24:32 -07:00
Michael Yang	ca210ba480	handle vnd.ollama.image.prompt for compat	2023-07-19 23:24:32 -07:00
Michael Yang	df146c41e2	separate prompt into template and system	2023-07-19 23:24:31 -07:00
Jeffrey Morgan	2d305fa99a	allow relative paths in `FROM` instruction	2023-07-19 21:55:15 -07:00
Patrick Devine	e4d7f3e287	vendor in progress bar and change to bytes instead of bibytes (#130)	2023-07-19 17:24:03 -07:00
Jeffrey Morgan	f2044b5838	web: fix newsletter signup	2023-07-19 16:11:56 -07:00
Michael Chiang	d53988f619	Merge pull request #128 from jmorganca/mchiang0610-patch-1 Update modelfile.md	2023-07-19 13:40:39 -07:00
Michael Chiang	ac88ab48d9	update	2023-07-19 13:37:21 -07:00
Michael Yang	84c6ee8cc6	Merge pull request #104 from jmorganca/interactive-readline use readline	2023-07-19 13:36:24 -07:00
Michael Yang	dbc90576b8	add verbose/quiet commands	2023-07-19 13:34:56 -07:00
Michael Yang	84200dcde6	use readline	2023-07-19 13:34:56 -07:00
Michael Chiang	e54c08da89	updating prompt	2023-07-19 13:34:40 -07:00
Michael Chiang	31413857ea	organizing examples	2023-07-19 13:25:14 -07:00
Michael Chiang	25f874c030	Update modelfile.md	2023-07-19 12:48:57 -07:00
Jeffrey Morgan	10d502611f	fix discord link in `README.md`	2023-07-19 12:31:48 -07:00
Jeffrey Morgan	7fe4103b94	add discord link, remove repeated text	2023-07-19 12:28:50 -07:00
Michael Chiang	7fbdc8e2c1	Update modelfile.md	2023-07-19 11:38:06 -07:00
Eva Ho	9c5572d51f	add discord link back	2023-07-19 13:03:26 -04:00
Matt Williams	75eb28f574	Merge pull request #125 from jmorganca/matt/addlicensetomodelfiledoc Updated modelfile doc to include license	2023-07-19 08:57:06 -07:00
Patrick Devine	56b6a1720f	add llama2:13b model to the readme (#126)	2023-07-19 08:21:28 -07:00
Eva Ho	dfceca48a7	update icons to have different images for bright and dark mode	2023-07-19 11:14:43 -04:00
Matt Williams	bbb67002c3	get rid of latest Signed-off-by: Matt Williams <m@technovangelist.com>	2023-07-19 07:40:40 -07:00
Michael Chiang	0294216ea9	Merge pull request #124 from DavidZirinsky/patch-1 Update README.md	2023-07-19 07:40:24 -07:00
Matt Williams	7a62b2d2ab	Update the FROM instructions Signed-off-by: Matt Williams <m@technovangelist.com>	2023-07-19 07:39:40 -07:00
Eva Ho	f08c050e57	fix page transitions flickering	2023-07-19 10:19:24 -04:00
Matt Williams	67c8d49757	Updated modelfile doc to include license and attributed midjourneyprompt Signed-off-by: Matt Williams <m@technovangelist.com>	2023-07-19 07:16:38 -07:00
DavidZirinsky	ffcd90e8a7	Update README.md I needed to do this to run the project	2023-07-19 08:14:44 -06:00
Jeffrey Morgan	4ca7c4be1f	dont consume reader when calculating digest	2023-07-19 00:47:55 -07:00
Michael Chiang	17b7af78f0	Merge pull request #115 from jmorganca/Add-wizard-vicuna-uncensored-model-link Add wizard vicuna uncensored model link	2023-07-18 22:58:07 -07:00
Jeffrey Morgan	4c1dc52083	app: create `/usr/local/bin/` if it does not exist	2023-07-18 22:50:52 -07:00
Patrick Devine	572fc9099f	add license layers to the parser (#116)	2023-07-18 22:49:38 -07:00
Michael Chiang	3020f29041	Add wizard vicuna uncensored model link	2023-07-18 22:19:12 -07:00