This was leading to ambiguity and confusion in ollama.com, and is not
used anywhere in ollama at the moment. Once manifests are addressable by
digest, we can add this back in, and in a way that is more tailored to
the concept of addressing a manifest by digest.
This change fixes the handling of keep_alive so that if client
request omits the setting, we only set this on initial load. Once
the model is loaded, if new requests leave this unset, we'll keep
whatever keep_alive was there.
Users may not realize the siny new model they're trying to load
fits on their disk, but can't load into system+GPU memory. Today
we crash, but with this fix, we'll give them a better error message
before even trying to load it.
When ollama is running a long time, tmp cleaners can remove the
runners. This tightens up a few corner cases on arm macs where
we failed with "server cpu not listed in available servers map[]"
On windows, if the model dir contained unicode characters
clip models would fail to load. This fixes the file name
handling in clip.cpp to support utf16 on windows.
Refine the way we log GPU discovery to improve the non-debug
output, and report more actionable log messages when possible
to help users troubleshoot on their own.
* OpenAI v1 models
* Refactor Writers
* Add Test
Co-Authored-By: Attila Kerekes
* Credit Co-Author
Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com>
* Empty List Testing
* Use Namespace for Ownedby
* Update Test
* Add back envconfig
* v1/models docs
* Use ModelName Parser
* Test Names
* Remove Docs
* Clean Up
* Test name
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Add Middleware for Chat and List
* Completions Endpoint
* Testing Cleanup
* Test with Fatal
* Add functionality to chat test
* Rename function
* float types
* type cleanup
* cleaning
* more cleaning
* Extra test cases
* merge conflicts
* merge conflicts
* merge conflicts
* merge conflicts
* cleaning
* cleaning
---------
Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* OpenAI v1 models
* Refactor Writers
* Add Test
Co-Authored-By: Attila Kerekes
* Credit Co-Author
Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com>
* Empty List Testing
* Use Namespace for Ownedby
* Update Test
* Add back envconfig
* v1/models docs
* Use ModelName Parser
* Test Names
* Remove Docs
* Clean Up
* Test name
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Add Middleware for Chat and List
* Testing Cleanup
* Test with Fatal
* Add functionality to chat test
* OpenAI: /v1/models/{model} compatibility (#5028)
* Retrieve Model
* OpenAI Delete Model
* Retrieve Middleware
* Remove Delete from Branch
* Update Test
* Middleware Test File
* Function name
* Cleanup
* Test Update
* Test Update
---------
Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in a
not-okay amount of syscalls/disk I/O.
The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.
This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.
Also, this fixes a broken test that was not encoding valid GGUF.
This commit changes the 'ollama run' command to defer fetching model
information until it really needs it. That is, when in interactive mode.
It also removes one such case where the model information is fetch in
duplicate, just before calling generateInteractive and then again, first
thing, in generateInteractive.
This positively impacts the performance of the command:
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.168 total
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.220 total
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.217 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 4% cpu 0.652 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.01s user 0.01s system 5% cpu 0.498 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
./after run llama3 'hi' 0.01s user 0.01s system 3% cpu 0.479 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total
Until ROCm v6.2 ships, we wont be able to get accurate free memory
reporting on windows, which makes automatic concurrency too risky.
Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs have accurate VRAM reporting wired
up now, so we can turn on concurrency by default.
This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these by the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs
so this change also refines the algorithm so that when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.
* API Show Extended
* Initial Draft of Information
Co-Authored-By: Patrick Devine <pdevine@sonic.net>
* Clean Up
* Descriptive arg error messages and other fixes
* Second Draft of Show with Projectors Included
* Remove Chat Template
* Touches
* Prevent wrapping from files
* Verbose functionality
* Docs
* Address Feedback
* Lint
* Resolve Conflicts
* Function Name
* Tests for api/show model info
* Show Test File
* Add Projector Test
* Clean routes
* Projector Check
* Move Show Test
* Touches
* Doc update
---------
Co-authored-by: Patrick Devine <pdevine@sonic.net>
The Digest type in its current form is awkward to work with and presents
challenges with regard to how it serializes via String using the '-'
prefix.
We currently only use this in ollama.com, so we'll move our specific
needs around digest parsing and validation there.
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
* docs: add missing instruction for powershell build
The powershell script for building Ollama on Windows now requires the `ThreadJob` module. Add this to the instructions and dependency list.
* Update development.md
This implements the release logic we want via gh cli
to support updating releases with rc tags in place and retain
release notes and other community reactions.
While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait
to avoid races that lead to OOMs
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one. This exposes that
tunable behavior.
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
multiple templates may appear in a model if a model is created from
another model that 1) has an autodetected template and 2) defines a
custom template
* Added instructions to easily install specific versions on faq.md
* Small typo
* Moved instructions on how to install specific version to linux.md
* Update docs/linux.md
* Update docs/linux.md
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* envconfig/config.go: Fix wrong description of OLLAMA_LLM_LIBRARY
Signed-off-by: Lei Jitang <leijitang@outlook.com>
* serve: Add more env to help message of ollama serve
Add more enviroment variables to `ollama serve --help`
to let users know what can be configurated.
Signed-off-by: Lei Jitang <leijitang@outlook.com>
---------
Signed-off-by: Lei Jitang <leijitang@outlook.com>
* integrated runewidth for display management - fixed cursor movement for mutli-width char
* updated input and deletion of multi-byte chars
* fixed line history with some exceptions
* improved insert and add
* fixed issues with moving across lines
* end of line extra space tracking'
* saved changes
* fixed end of line issues with empty spaces
* worked some more
* worked on end of line
* fixed failed test
* fixed minor inserting bug
* fixed movement hotkeys
* adjusted hotkeys
* removed comments
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* deleted comments and duplicate code
* removed duplicate code
* added comments, refactored add function to use addChar
* added helper to retrieve lineSpacing, renamed lineFlags for clarity
* fixed remove()
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
On some systems, 1 minute isn't sufficient to finish the load after it
hits 100% This creates 2 distinct timers, although they're both set to
the same value for now so we can refine the timeouts further.
* ensure kernel modules are loaded in `install.sh` script and at startup
* indentation
* use `SUDO` variable
* restart if nouveau is detected
* consistent success message for AMD
If the client closes the connection before we finish loading the model
we abort, so lets make the log message clearer why to help users
understand this failure mode
This test needs to be able to adjust the queue size down from
our default setting for a reliable test, so it needs to skip on
remote test execution mode.
* Fixed the API endpoint /api/tags to return {models: []} instead of {models: null} when the model list is empty.
* Update server/routes.go
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* rename `--quantization` to `--quantize`
* backwards
* Update api/types.go
Co-authored-by: Michael Yang <mxyng@pm.me>
---------
Co-authored-by: Michael Yang <mxyng@pm.me>
The GPU drivers take a while to update their free memory reporting, so we need
to wait until the values converge with what we're expecting before proceeding
to start another runner in order to get an accurate picture.
* Add preflight OPTIONS handling and update CORS config
- Implement early return with HTTP 204 (No Content) for OPTIONS requests in allowedHostsMiddleware to optimize preflight handling.
- Extend CORS configuration to explicitly allow 'Authorization' headers and 'OPTIONS' method when OLLAMA_ORIGINS environment variable is set.
* allow auth, content-type, and user-agent headers
* Update routes.go
* api: fill up API documentation
Followup for #2878
Now that the documentation is more complete, mention it in the README.
Updates #2840
* fix typo/lint
* Update README.md
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
- add test for number formatting
- fix bug where 1B and 1M were not stored correctly
- display 2 decimal points for million param sizes
- display 1 decimal point for billion param sizes
* add details on kubernetes deployment and separate the testing process
* Update examples/kubernetes/README.md
thanks for suggesting this change, I agree with you and let's make this project better together !
Co-authored-by: JonZeolla <Zeolla@gmail.com>
---------
Co-authored-by: QIN Mélony <MQN1@dsone.3ds.com>
Co-authored-by: JonZeolla <Zeolla@gmail.com>
* note on naming restrictions
else push would fail with cryptic
retrieving manifest
Error: file does not exist
==> maybe change that in code too
* Update docs/import.md
---------
Co-authored-by: C-4-5-3 <154636388+C-4-5-3@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Update api.md
Changed the calculation of tps (token/s) in the documentation
* Update docs/api.md
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
- return descriptive error messages when unauthorized to create blob or push a model
- display the local public key associated with the request that was denied
* app: restart server on failure
* fix linter
* address comments
* refactor log directory creation to be where logs are written
* check all log dir creation errors
Prior refactoring passes accidentally removed the logic to bypass VRAM
checks for CPU loads. This adds that back, along with test coverage.
This also fixes loaded map access in the unit test to be behind the mutex which was
likely the cause of various flakes in the tests.
download-artifact path was being used incorrectly. It is where to
extract the zip not the files in the zip to extract. Default is
workspace dir which is what we want, so omit it
This allows users of a valid Digest to know it has a minimum of 2
characters in the hash part for use when sharding.
This is a reasonable restriction as the hash part is a SHA256 hash which
is 64 characters long, which is the common hash used. There is no
anticipation of using a hash with less than 2 characters.
Also, add MustParseDigest.
Also, replace Digest.Type with Digest.Split for getting both the type
and hash parts together, which is most the common case when asking for
either.
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround. Recent multi-gpu
refactoring accidentally removed it, so this adds it back.
Now that the llm runner is an executable and not just a dll, more users are facing
problems with security policy configurations on windows that prevent users
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on windows and shifts them
over to be packaged in the installer and discovered based on the executables location.
This also adds a new zip file for people who want to "roll their own" installation model.
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
* add chat (web UI) for LLM
I have used chat with llama3 in local successfully and the code is MIT licensed.
* Update README.md
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Added MindsDB information
Added more details to MindsDB so that Ollama users can know that they can connect their Ollama model with 200+ databases and apps
* updated text for mindsdb
This commit adds path helpers for working with Names in URL and file
paths. The new helpers are ParseNameFromPath, ParseNameFromFilePath,
Name.Path, and Name.FilePath.
This commit also adds Name.DisplayLongest, and Name.DisplayLong.
Also, be it updates a place where strings.StripPrefix is more consistent
with the surrounding code.
Also, replace Parts with specific methods
This is a quick fix to help users who are stuck on the "pull" step at
99%.
In the near future we're introducing a new registry client that
should/will hopefully be smarter. In the meantime, this should unblock
the users hitting issue #1736.
This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.
This script also provides nicer feedback to the user about what is
happening during the build process.
At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
This also moves the checkServerHeartbeat call out of the "RunE" Cobra
stuff (that's the only word I have for that) to on-site where it's after
the check for OLLAMA_MODELS, which allows the helpful error message to
be printed before the server heartbeat check. This also arguably makes
the code more readable without the magic/superfluous "pre" function
caller.
Leaving the cudart library loaded kept ~30m of memory
pinned in the GPU in the main process. This change ensures
we don't hold GPU resources when idle.
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
We had started using rocky linux 8, but they've updated to GCC 10.3,
which breaks NVCC. 10.2 is compatible (or 10.4, but that's not
available from rocky linux 8 repos yet)
If expanding the runners fails, don't leave a corrupt/incomplete payloads dir
We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs
and remove this as long as there isn't still a process running.
This fixes some brittle, simple equality checks to use errors.Is. Since
go1.13, errors.Is is the idiomatic way to check for errors.
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
This fixes a few bugs in the new sysfs discovery logic. iGPUs are now
correctly identified by their <1G VRAM reported. the sysfs IDs are off
by one compared to what HIP wants due to the CPU being reported
in amdgpu, but HIP only cares about GPUs.
This allows people who package up ollama on their own to place
the rocm dependencies in a peer directory to the ollama executable
much like our windows install flow.
The recent ROCm change partially removed idempotent
payloads, but the ggml-metal.metal file for mac was still
idempotent. This finishes switching to always extract
the payloads, and now that idempotentcy is gone, the
version directory is no longer useful.
Also, document OLLAMA_HOST client semantics per command that honors it.
This looks nicer than having a general puprose environment variable
section in the root usage which was showing up after the "addition help
topics" section outputed by Cobra's default template.
It was decided this was easier to work with than using a custom template
for Cobra right now.
This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var, that defaults to `~/.ollama` The logic was already
idempotenent, so this should speed up startups after the first time a
new release is deployed. It also cleans up after itself.
We now build only a single ROCm version (latest major) on both windows
and linux. Given the large size of ROCms tensor files, we split the
dependency out. It's bundled into the installer on windows, and a
separate download on windows. The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.
For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.
* Add Odin Runes to README
Add Odin Runes to README
This commit adds Odin Runes to the "Community Integrations" section of the README. Odin Runes is a Java-based GPT client designed to provide seamless interaction with GPT models, enhancing productivity in prompt engineering and text generation tasks. This addition highlights the integration between Odin Runes and Ollama, offering users the flexibility to leverage large language models locally within their development workflow.
* Update README.md
this commit applies the comments of the reviewer.
* read iogpu.wired_limit_mb on macOS
Fix for https://github.com/ollama/ollama/issues/1826
* improved determination of available vram on macOS
read the recommended maximal vram on macOS via Metal API
* Removed macOS-specific logging
* Remove logging from gpu_darwin.go
* release Core Foundation object
fixes a possible memory leak
* [fix] /bye and /exit are now treated as prefixes
instead of being treated as entire lines which doesn't align with the way the rest of the commands are treated
* Update cmd/interactive.go
Fixing whitespace
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Even though we weren't setting it to on, somewhere in the cmake config
it was getting toggled on. By explicitly setting it to off, we get `/arch:AVX`
as intended.
This wires up some new logic to start using sysfs to discover AMD GPU
information and detects old cards we can't yet support so we can fallback to CPU mode.
At least with the ROCm libraries, its possible to have the library
present with zero GPUs. This fix avoids a divide by zero bug in llm.go
when we try to calculate GPU memory with zero GPUs.
We build the GPU libraries with AVX enabled to ensure that if not all
layers fit on the GPU we get better performance in a mixed mode.
If the user is using a virtualization/emulation system that lacks AVX
this used to result in an illegal instruction error and crash before this
fix. Now we will report a warning in the server log, and just use
CPU mode to ensure we don't crash.
Fix an ordering glitch of dlerr/dlclose and add more logging to help
root cause some crashes users are hitting. This also refines the
function pointer names to use the underlying function names instead
of simplified names for readability.
This adds additional calls to both CUDA and ROCm management libraries to
discover additional attributes about the GPU(s) detected in the system, and
wires up runtime verbosity selection. When users hit problems with GPUs we can
ask them to run with `OLLAMA_DEBUG=1 ollama serve` and share the results.
If a VERSION is not specified, this will generate a version string that
represents the state of the repo. For example `0.1.21-12-gffaf52e-dirty`
representing 12 commits away from 0.1.21 tag, on commit gffaf52e
and the tree is dirty.
The linux build now support parallel CPU builds to speed things up.
This also exposes AMD GPU targets as an optional setting for advaced
users who want to alter our default set.
This renames Dockerfile.build to Dockerfile, and adds some new stages
to support 2 modes of building - the build_linux.sh script uses
intermediate stages to extract the artifacts for ./dist, and the default
build generates a container image usable by both cuda and rocm cards.
This required transitioniing the x86 base to the rocm image to avoid
layer bloat.
Update gpu.go initGPUHandles() to declare gpuHandles variable before
reading it. This resolves an "invalid memory address or nil pointer
dereference" error.
Update dyn_ext_server.c to avoid setting the RTLD_DEEPBIND flag under
__TERMUX__ (Android).
Upstream llama.cpp has added a new dependency with the
NVIDIA CUDA Driver Libraries (libcuda.so) which is part of the
driver distribution, not the general cuda libraries, and is not
available as an archive, so we can not statically link it. This may
introduce some additional compatibility challenges which we'll
need to keep an eye on.
After executing the `userdel ollama` command, I saw this message:
```sh
$ sudo userdel ollama
userdel: group ollama not removed because it has other members.
```
Which reminded me that I had to remove the dangling group too. For completeness, the uninstall instructions should do this too.
Thanks!
The memory changes and multi-variant change had some merge
glitches I missed. This fixes them so we actually get the cpu llm lib
and best variant for the given system.
This reduces the built-in linux version to not use any vector extensions
which enables the resulting builds to run under Rosetta on MacOS in
Docker. Then at runtime it checks for the actual CPU vector
extensions and loads the best CPU library available
In some cases we may want multiple variants for a given GPU type or CPU.
This adds logic to have an optional Variant which we can use to select
an optimal library, but also allows us to try multiple variants in case
some fail to load.
This can be useful for scenarios such as ROCm v5 vs v6 incompatibility
or potentially CPU features.
* increase minimum cuda overhead and fix minimum overhead for multi-gpu
* fix multi gpu overhead
* limit overhead to 10% of all gpus
* better wording
* allocate fixed amount before layers
* fixed only includes graph alloc
When there are multiple management libraries installed on a system
not every one will be compatible with the current driver. This change
improves our management library algorithm to build up a set of discovered
libraries based on glob patterns, and then try all of them until we're able to
load one without error.
This can help speed up incremental builds when you're only testing one
archicture, like amd64. E.g.
BUILD_ARCH=amd64 ./scripts/build_linux.sh && scp ./dist/ollama-linux-amd64 test-system:
If you attempt to run the current CUDA build on compute capability 5.2
cards, you'll hit the following failure:
cuBLAS error 15 at ggml-cuda.cu:7956: the requested functionality is not supported
* select layers based on estimated model memory usage
* always account for scratch vram
* dont load +1 layers
* better estmation for graph alloc
* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
If we try to load the CUDA library on an old GPU, it panics and crashes
the server. This checks the compute capability before we load the
library so we can gracefully fall back to CPU mode.
This enhances our regex to support windows style paths. The regex will
match invalid path specifications, but we'll still validate file
existence and filter out mismatches
On linux, we link the CPU library in to the Go app and fall back to it
when no GPU match is found. On windows we do not link in the CPU library
so that we can better control our dependencies for the CLI. This fixes
the logic so we correctly fallback to the dynamic CPU library
on windows.
Go embed doesn't like when there's no matching files, so put
a dummy placeholder in to allow building without any GPU support
If no "server" library is found, it's safely ignored at runtime.
This prevents users from accidentally installing on WSL1 with instructions
guiding how to upgrade their WSL instance to version 2. Once running WSL2
if you have an NVIDIA card, you can follow their instructions to set up
GPU passthrough and run models on the GPU. This is not possible on WSL1.
Refactor where we store build outputs, and support a fully dynamic loading
model on windows so the base executable has no special dependencies thus
doesn't require a special PATH.
This changes the model for llama.cpp inclusion so we're not applying a patch,
but instead have the C++ code directly in the ollama tree, which should make it
easier to refine and update over time.
Most of the examples needed updates of Readmes to show how to run them. Some of the requirements.txt files had extra content that wasn't needed, or missing altogether. Apparently some folks like to run npm start
to run typescript, so a script was added to all typescript examples which
hadn't been done before.
Basically just a lot of cleanup.
Signed-off-by: Matt Williams <m@technovangelist.com>
By default builds will now produce non-debug and non-verbose binaries.
To enable verbose logs in llama.cpp and debug symbols in the
native code, set `CGO_CFLAGS=-g`
The default thread count logic was broken and resulted in 2x the number
of threads as it should on a hyperthreading CPU
resulting in thrashing and poor performance.
The windows native setup still needs some more work, but this gets it building
again and if you set the PATH properly, you can run the resulting exe on a cuda system.
This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.
This also bumps the ROCm library to version 6 given 5.7 builds don't work
on the latest ROCm library that just shipped.
The build tags rocm or cuda must be specified to both go generate and go build.
ROCm builds should have both ROCM_PATH set (and the ROCM SDK present) as well
as CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the
CLBlast cmake directory (likely /usr/lib/cmake/CLBlast). Build tags are also
used to switch VRAM detection between cuda and rocm implementations, using
added "accelerator_foo.go" files which contain architecture specific functions
and variables. accelerator_none is used when no tags are set, and a helper
function addRunner will ignore it if it is the chosen accelerator. Fix go
generate commands, thanks @deadmeu for testing.
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in the case automatic pull fails
Co-Authored-By: Jeffrey Morgan <jmorganca@gmail.com>
* restore model load duration on generate response
- set model load duration on generate and chat done response
- calculate createAt time when response created
* remove checkpoints predict opts
* Update routes.go
previous layer creation was not ideal because:
1. it required reading the input file multiple times, once to calculate
the sha256 checksum, another to write it to disk, and potentially one
more to decode the underlying gguf
2. used io.ReadSeeker which is prone to user error. if the file isn't
reset correctly or in the right place, it could end up reading an
empty file
there are also some brittleness when reading existing layers else
writing the inherited layers will error reading an already closed file
this commit aims to fix these issues by restructuring layer creation.
1. it will now write the layer to a temporary file as well as the hash
function and move it to the final location on Commit
2. layers are read once once when copied to the destination. exception
is raw model files which still requires a second read to decode the
model metadata
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
- context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
When CUDA peer access is enabled, multi-gpu inference will produce
garbage output. This is a known bug of llama.cpp (or nvidia). Until the
upstream bug is fixed, we can disable CUDA peer access temporarily
to ensure correct output.
See #961.
Moved the arch package and someone has added a pr for brew.
that needs to get updated to be a link.
Signed-off-by: Matt Williams <m@technovangelist.com>
| Orca Mini | 3B | 1.9GB | `ollama run orca-mini` |
| Vicuna | 7B | 3.8GB | `ollama run vicuna` |
| LLaVA | 7B | 4.5GB | `ollama run llava` |
| Solar | 10.7B | 6.1GB | `ollama run solar` |
> Note: You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
> Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
## Customize your own model
## Customize a model
### Import from GGUF
@@ -88,21 +96,21 @@ See the [guide](docs/import.md) on importing models for more information.
### Customize a prompt
Models from the Ollama library can be customized with a prompt. For example, to customize the `llama2` model:
Models from the Ollama library can be customized with a prompt. For example, to customize the `llama3` model:
```
ollama pull llama2
ollama pull llama3
```
Create a `Modelfile`:
```
FROM llama2
FROM llama3
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# set the system prompt
# set the system message
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""
@@ -125,10 +133,14 @@ For more examples, see the [examples](examples) directory. For more information
`ollama create` is used to create a model from a Modelfile.
```
ollama create mymodel -f ./Modelfile
```
### Pull a model
```
ollama pull llama2
ollama pull llama3
```
> This command can also be used to update a local model. Only the diff will be pulled.
@@ -136,13 +148,13 @@ ollama pull llama2
### Remove a model
```
ollama rm llama2
ollama rm llama3
```
### Copy a model
```
ollama cp llama2 my-llama2
ollama cp llama3 my-model
```
### Multiline input
@@ -156,13 +168,26 @@ For multiline input, you can wrap text with `"""`:
I'm a basic program that prints the famous "Hello, world!" message to the console.
```
### Pass in prompt as arguments
### Multimodal models
```
$ ollama run llama2 "Summarize this file: $(cat README.md)"
>>> What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.
```
### Pass the prompt as an argument
```
$ ollama run llama3 "Summarize this file: $(cat README.md)"
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
```
### Show model information
```
ollama show llama3
```
### List models on your computer
```
@@ -175,18 +200,9 @@ ollama list
## Building
Install `cmake` and `go`:
See the [developer guide](https://github.com/ollama/ollama/blob/main/docs/development.md)
```
brew install cmake go
```
Then generate dependencies and build:
```
go generate ./...
go build .
```
### Running local builds
Next, start the server:
@@ -197,34 +213,86 @@ Next, start the server:
Finally, in a separate shell, run a model:
```
./ollama run llama2
./ollama run llama3
```
## REST API
Ollama has a REST API for running and managing models.
For example, to generate text from a model:
### Generate a response
```
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama2",
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt":"Why is the sky blue?"
}'
```
### Chat with a model
```
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
}'
```
See the [API documentation](./docs/api.md) for all endpoints.
- [MindsDB](https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/ollama_handler/README.md) (Connects Ollama models with nearly 200 data platforms and apps)
- [chromem-go](https://github.com/philippgille/chromem-go/blob/v0.5.0/embed_ollama.go) with [example](https://github.com/philippgille/chromem-go/tree/v0.5.0/examples/rag-wikipedia-ollama)
- [LangChain](https://python.langchain.com/docs/integrations/llms/ollama) and [LangChain.js](https://js.langchain.com/docs/modules/model_io/models/llms/integrations/ollama) with [example](https://js.langchain.com/docs/use_cases/question_answering/local_retrieval_qa)
- [LangChainGo](https://github.com/tmc/langchaingo/) with [example](https://github.com/tmc/langchaingo/tree/main/examples/ollama-completion-example)
- [LangChain4j](https://github.com/langchain4j/langchain4j) with [example](https://github.com/langchain4j/langchain4j-examples/tree/main/ollama-examples/src/main/java)
- [LangChainRust](https://github.com/Abraxas-365/langchain-rust) with [example](https://github.com/Abraxas-365/langchain-rust/blob/main/examples/llm_ollama.rs)
- [PromptingTools.jl](https://github.com/svilupp/PromptingTools.jl) with an [example](https://svilupp.github.io/PromptingTools.jl/dev/examples/working_with_ollama)
- [Discord AI chat/moderation bot](https://github.com/rapmd73/Companion) Chat/moderation bot written in python. Uses Ollama to create personalities.
- [Headless Ollama](https://github.com/nischalj10/headless-ollama) (Scripts to automatically install ollama client & models on any OS for apps that depends on ollama server)
### Supported backends
- [llama.cpp](https://github.com/ggerganov/llama.cpp) project founded by Georgi Gerganov.
; NOTE: if the user has a custom OLLAMA_MODELS it will be preserved
[Messages]
WizardReady=Ollama Windows Preview
ReadyLabel1=%nLet's get you up and running with your own large language models.
SetupAppRunningError=Another Ollama installer is running.%n%nPlease cancel or finish the other installer, then click OK to continue with this install, or Cancel to exit.
;FinishedHeadingLabel=Run your first model
;FinishedLabel=%nRun this command in a PowerShell or cmd terminal.%n%n%n ollama run llama3
- Install cmake or (optionally, required tools for GPUs)
- run `go generate ./...`
- run `go build .`
Install required tools:
- cmake version 3.24 or higher
- go version 1.20 or higher
- go version 1.22 or higher
- gcc version 11.4.0 or higher
### MacOS
```bash
brew install go cmake gcc
```
Get the required libraries:
Optionally enable debugging and more verbose logging:
```bash
# At build time
exportCGO_CFLAGS="-g"
# At runtime
exportOLLAMA_DEBUG=1
```
Get the required libraries and build the native LLM code:
```bash
go generate ./...
@@ -32,8 +40,111 @@ Now you can run `ollama`:
./ollama
```
## Building on Linux with GPU support
### Linux
- Install cmake and nvidia-cuda-toolkit
- run `go generate ./...`
- run `go build .`
#### Linux CUDA (NVIDIA)
_Your operating system distribution may already have packages for NVIDIA CUDA. Distro packages are often preferable, but instructions are distro-specific. Please consult distro-specific docs for dependencies if available!_
Install `cmake` and `golang` as well as [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads)
development and runtime packages.
Typically the build scripts will auto-detect CUDA, however, if your Linux distro
or installation approach uses unusual paths, you can specify the location by
specifying an environment variable `CUDA_LIB_DIR` to the location of the shared
libraries, and `CUDACXX` to the location of the nvcc compiler. You can customize
a set of target CUDA architectures by setting `CMAKE_CUDA_ARCHITECTURES` (e.g. "50;60;70")
Then generate dependencies:
```
go generate ./...
```
Then build the binary:
```
go build .
```
#### Linux ROCm (AMD)
_Your operating system distribution may already have packages for AMD ROCm and CLBlast. Distro packages are often preferable, but instructions are distro-specific. Please consult distro-specific docs for dependencies if available!_
Install [CLBlast](https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md) and [ROCm](https://rocm.docs.amd.com/en/latest/) development packages first, as well as `cmake` and `golang`.
Typically the build scripts will auto-detect ROCm, however, if your Linux distro
or installation approach uses unusual paths, you can specify the location by
specifying an environment variable `ROCM_PATH` to the location of the ROCm
install (typically `/opt/rocm`), and `CLBlast_DIR` to the location of the
CLBlast install (typically `/usr/lib/cmake/CLBlast`). You can also customize
the AMD GPU targets by setting AMDGPU_TARGETS (e.g. `AMDGPU_TARGETS="gfx1101;gfx1102"`)
```
go generate ./...
```
Then build the binary:
```
go build .
```
ROCm requires elevated privileges to access the GPU at runtime. On most distros you can add your user account to the `render` group, or run as root.
#### Advanced CPU Settings
By default, running `go generate ./...` will compile a few different variations
of the LLM library based on common CPU families and vector math capabilities,
including a lowest-common-denominator which should run on almost any 64 bit CPU
somewhat slowly. At runtime, Ollama will auto-detect the optimal variation to
load. If you would like to build a CPU-based build customized for your
processor, you can set `OLLAMA_CUSTOM_CPU_DEFS` to the llama.cpp flags you would
like to use. For example, to compile an optimized binary for an Intel i9-9880H,
you might use:
```
OLLAMA_CUSTOM_CPU_DEFS="-DGGML_AVX=on -DGGML_AVX2=on -DGGML_F16C=on -DGGML_FMA=on" go generate ./...
go build .
```
#### Containerized Linux Build
If you have Docker available, you can build linux binaries with `./scripts/build_linux.sh` which has the CUDA and ROCm dependencies included. The resulting binary is placed in `./dist`
### Windows
Note: The Windows build for Ollama is still under development.
First, install required tools:
- MSVC toolchain - C/C++ and cmake as minimal requirements
- Go version 1.22 or higher
- MinGW (pick one variant) with GCC.
- [MinGW-w64](https://www.mingw-w64.org/)
- [MSYS2](https://www.msys2.org/)
- The `ThreadJob` Powershell module: `Install-Module -Name ThreadJob -Scope CurrentUser`
Then, build the `ollama` binary:
```powershell
$env:CGO_ENABLED="1"
gogenerate./...
gobuild.
```
#### Windows CUDA (NVIDIA)
In addition to the common Windows development tools described above, install CUDA after installing MSVC.
Lastly, add `ninja.exe` included with MSVC to the system path (e.g. `C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja`).
Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version [manually](https://ollama.com/download/).
On Linux, re-run the install script:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
## How can I view the logs?
On macOS:
Review the [Troubleshooting](./troubleshooting.md) docs for more about using logs.
## Is my GPU compatible with Ollama?
Please refer to the [GPU docs](./gpu.md).
## How can I specify the context window size?
By default, Ollama uses a context window size of 2048 tokens.
To change this when using `ollama run`, use `/set parameter`:
```
cat ~/.ollama/logs/server.log
/set parameter num_ctx 4096
```
On Linux:
When using the API, specify the `num_ctx` parameter:
```
journalctl -u ollama
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"options": {
"num_ctx": 4096
}
}'
```
If you're running `ollama serve` directly, the logs will be printed to the console.
## How can I tell if my model was loaded onto the GPU?
Use the `ollama ps` command to see what models are currently loaded into memory.
```shell
ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3:70b bcfb190ca3a7 42 GB 100% GPU 4 minutes from now
```
The `Processor` column will show which memory the model was loaded in to:
*`100% GPU` means the model was loaded entirely into the GPU
*`100% CPU` means the model was loaded entirely in system memory
*`48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory
## How do I configure Ollama server?
Ollama server can be configured with environment variables.
### Setting environment variables on Mac
If Ollama is run as a macOS application, environment variables should be set using `launchctl`:
1. For each environment variable, call `launchctl setenv`.
```bash
launchctl setenv OLLAMA_HOST "0.0.0.0"
```
2. Restart Ollama application.
### Setting environment variables on Linux
If Ollama is run as a systemd service, environment variables should be set using `systemctl`:
1. Edit the systemd service by calling `systemctl edit ollama.service`. This will open an editor.
2. For each environment variable, add a line `Environment` under section `[Service]`:
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
3. Save and exit.
4. Reload `systemd` and restart Ollama:
```bash
systemctl daemon-reload
systemctl restart ollama
```
### Setting environment variables on Windows
On Windows, Ollama inherits your user and system environment variables.
1. First Quit Ollama by clicking on it in the task bar.
2. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for _environment variables_.
3. Click on _Edit environment variables for your account_.
4. Edit or create a new variable for your user account for `OLLAMA_HOST`, `OLLAMA_MODELS`, etc.
5. Click OK/Apply to save.
6. Start the Ollama application from the Windows Start menu.
## How do I use Ollama behind a proxy?
Ollama is compatible with proxy servers if `HTTP_PROXY` or `HTTPS_PROXY` are configured. When using either variables, ensure it is set where `ollama serve` can access the values. When using `HTTPS_PROXY`, ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
### How do I use Ollama behind a proxy in Docker?
The Ollama Docker container image can be configured to use a proxy by passing `-e HTTPS_PROXY=https://proxy.example.com` when starting the container.
Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on [macOS](https://docs.docker.com/desktop/settings/mac/#proxies), [Windows](https://docs.docker.com/desktop/settings/windows/#proxies), and [Linux](https://docs.docker.com/desktop/settings/linux/#proxies), and Docker [daemon with systemd](https://docs.docker.com/config/daemon/systemd/#httphttps-proxy).
Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```
## Does Ollama send my prompts and answers back to ollama.com?
No. Ollama runs locally, and conversation data does not leave your machine.
## How can I expose Ollama on my network?
Ollama binds to 127.0.0.1 port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.
Ollama binds 127.0.0.1 port 11434 by default. Change the bind address with the `OLLAMA_HOST` environment variable.
On macOS:
Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
```bash
OLLAMA_HOST=0.0.0.0:11435 ollama serve
## How can I use Ollama with a proxy server?
Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:
```
server {
listen 80;
server_name example.com; # Replace with your domain or IP
location / {
proxy_pass http://localhost:11434;
proxy_set_header Host localhost:11434;
}
}
```
On Linux:
## How can I use Ollama with ngrok?
Create a `systemd` drop-in directory and set `Environment=OLLAMA_HOST`
Ollama can be accessed using a range of tools for tunneling tools. For example with Ngrok:
Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
## Where are models stored?
- macOS: Raw model data is stored under `~/.ollama/models`.
- Linux: Raw model data is stored under `/usr/share/ollama/.ollama/models`
- macOS: `~/.ollama/models`
- Linux: `/usr/share/ollama/.ollama/models`
- Windows: `C:\Users\%username%\.ollama\models`
### How do I set them to a different location?
If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.
Below the models directory you will find a structure similar to the following:
Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.
## How can I use Ollama in Visual Studio Code?
There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of [extensions & plugins](https://github.com/ollama/ollama#extensions--plugins) at the bottom of the main repository readme.
## How do I use Ollama with GPU acceleration in Docker?
The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). See [ollama/ollama](https://hub.docker.com/r/ollama/ollama) for more details.
GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.
## Why is networking slow in WSL2 on Windows 10?
This can impact both installing Ollama, as well as downloading models.
Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` on the left panel. Find the `vEthernel (WSL)` adapter, right click and select `Properties`.
Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these
properties.
## How can I preload a model into Ollama to get faster response times?
If you are using the API you can preload a model by sending the Ollama server an empty request. This works with both the `/api/generate` and `/api/chat` API endpoints.
To preload the mistral model using the generate endpoint, use:
There is a `manifests/registry.ollama.ai/namespace` path. In example above, the user has downloaded models from the official `library`, `f0rodo`, `mattw`, and `saikatkumardey` namespaces. Within each of those directories, you will find directories for each of the models downloaded. And in there you will find a file name representing each tag. Each tag file is the manifest for the model.
The manifest lists all the layers used in this model. You will see a `media type` for each layer, along with a digest. That digest corresponds with a file in the `models/blobs directory`.
To preload a model using the CLI, use the command:
```shell
ollama run llama3 ""
```
### How can I change where Ollama stores models?
## How do I keep a model loaded in memory or make it unload immediately?
To modify where models are stored, you can use the `OLLAMA_MODELS` environment variable. Note that on Linux this means defining `OLLAMA_MODELS` in a drop-in `/etc/systemd/system/ollama.service.d` service file, reloading systemd, and restarting the ollama service.
By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory.
The `keep_alive` parameter can be set to:
* a duration string (such as "10m" or "24h")
* a number in seconds (such as 3600)
* any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
* '0' which will unload the model immediately after generating a response
For example, to preload a model and leave it in memory use:
Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` API parameter with the `/api/generate` or `/api/chat` API endpoints.
## How do I manage the maximum number of requests the Ollama server can queue?
If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queue by setting `OLLAMA_MAX_QUEUE`.
## How does Ollama handle concurrent requests?
Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.
The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
- `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512
Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM.
This guide walks through importing a GGUF, PyTorch or Safetensors model.
GGUF models and select Safetensors models can be imported directly into Ollama.
## Importing (GGUF)
## ImportGGUF
### Step 1: Write a `Modelfile`
A binary GGUF file can be imported directly into Ollama through a Modelfile.
Start by creating a `Modelfile`. This file is the blueprint for your model, specifying weights, parameters, prompt templates and more.
```
FROM ./mistral-7b-v0.1.Q4_0.gguf
```dockerfile
FROM/path/to/file.gguf
```
(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
## Import Safetensors
```
FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
If the model being imported is one of these architectures, it can be imported directly into Ollama through a Modelfile:
- LlamaForCausalLM
- MistralForCausalLM
- GemmaForCausalLM
```dockerfile
FROM/path/to/safetensors/directory
```
### Step 2: Create the Ollama model
For architectures not directly convertable by Ollama, see llama.cpp's [guide](https://github.com/ggerganov/llama.cpp/blob/master/README.md#prepare-and-quantize) on conversion. After conversion, see [Import GGUF](#import-gguf).
Finally, create a model from your `Modelfile`:
## Automatic Quantization
> [!NOTE]
> Automatic quantization requires v0.1.35 or higher.
Ollama is capable of quantizing FP16 or FP32 models to any of the supported quantizations with the `-q/--quantize` flag in `ollama create`.
```dockerfile
FROM/path/to/my/gemma/f16/model
```
ollama create example -f Modelfile
```
### Step 3: Run your model
Next, test the model with `ollama run`:
```
ollama run example "What is your favourite condiment?"
```
## Importing (PyTorch & Safetensors)
### Supported models
Ollama supports a set of model architectures, with support for more coming soon:
- Llama & Mistral
- Falcon & RW
- GPT-NeoX
- BigCode
To view a model's architecture, check the `config.json` file in its HuggingFace repo. You should see an entry under `architectures` (e.g. `LlamaForCausalLM`).
### Step 1: Clone the HuggingFace repository (optional)
If the model is currently hosted in a HuggingFace repository, first clone that repository to download the raw model.
### Step 2: Convert and quantize to a `.bin` file (optional, for PyTorch and Safetensors)
If the model is in PyTorch or Safetensors format, a [Docker image](https://hub.docker.com/r/ollama/quantize) with the tooling required to convert and quantize models is available.
docker run --rm -v .:/model ollama/quantize -q q4_0 /model
```
This will output two files into the directory:
-`f16.bin`: the model converted to GGUF
-`q4_0.bin` the model quantized to a 4-bit quantization (we will use this file to create the Ollama model)
### Step 3: Write a `Modelfile`
Next, create a `Modelfile` for your model:
```
FROM ./q4_0.bin
```
(Optional) many chat models require a prompt template in order to answer correctly. A default prompt template can be specified with the `TEMPLATE` instruction in the `Modelfile`:
```
FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```
### Step 4: Create the Ollama model
Finally, create a model from your `Modelfile`:
```
ollama create example -f Modelfile
```
### Step 5: Run your model
Next, test the model with `ollama run`:
```
ollama run example "What is your favourite condiment?"
```
## Publishing your model (optional – early alpha)
Publishing models is in early alpha. If you'd like to publish your model to share with others, follow these steps:
1. Create [an account](https://ollama.ai/signup)
2. Run `cat ~/.ollama/id_ed25519.pub` to view your Ollama public key. Copy this to the clipboard.
3. Add your public key to your [Ollama account](https://ollama.ai/settings/keys)
Next, copy your model to your username's namespace:
```
ollama cp example <your username>/example
```
Then push the model:
```
ollama push <your username>/example
```
After publishing, your model will be available at `https://ollama.ai/<your username>/example`.
## Quantization reference
The quantization options are as follow (from highest highest to lowest levels of quantization). Note: some architectures such as Falcon do not support K quants.
-`q2_K`
-`q3_K`
-`q3_K_S`
-`q3_K_M`
-`q3_K_L`
-`q4_0` (recommended)
-`q4_1`
-`q4_K`
-`q4_K_S`
-`q4_K_M`
-`q5_0`
-`q5_1`
-`q5_K`
-`q5_K_S`
-`q5_K_M`
-`q6_K`
-`q8_0`
## Manually converting & quantizing models
### Prerequisites
Start by cloning the `llama.cpp` repo to your machine in another directory:
Use `OLLAMA_VERSION` environment variable with the install script to install a specific version of Ollama, including pre-releases. You can find the version numbers in the [releases page](https://github.com/ollama/ollama/releases).
For example:
```
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.32 sh
```
## Viewing logs
To view logs of Ollama running as a startup service, run:
```bash
journalctl -u ollama
journalctl -e -u ollama
```
## Uninstall
@@ -109,8 +134,10 @@ Remove the ollama binary from your bin directory (either `/usr/local/bin`, `/usr
sudo rm $(which ollama)
```
Remove the downloaded models and Ollama service user:
Remove the downloaded models and Ollama service user and group:
| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1) | float | mirostat_eta 0.1 |
| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0) | float | mirostat_tau 5.0 |
| num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048) | int | num_ctx 4096 |
| num_gqa | The number of GQA groups in the transformer layer. Required for some models, for example it is 8 for llama2:70b | int | num_gqa 1 |
| num_gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. | int | num_gpu 50 |
| num_thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). | int | num_thread 8 |
| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 |
| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the readme for that model.
`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system message, a user's message and the response from the model. Note: syntax may be model specific. Templates use Go [template syntax](https://pkg.go.dev/text/template).
| `{{ .System }}` | The system message used to specify custom behavior. |
| `{{ .Prompt }}` | The user prompt message. |
| `{{ .Response }}` | The response from the model. When generating a response, text after this variable is omitted. |
```modelfile
TEMPLATE """
### System:
{{ .System }}
### User:
{{ .Prompt }}
### Response:
```
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
SYSTEM """<system message>"""
```
### SYSTEM
The `SYSTEM` instruction specifies the system prompt to be used in the template, if applicable.
The `SYSTEM` instruction specifies the system message to be used in the template, if applicable.
```modelfile
SYSTEM """<system message>"""
@@ -154,7 +172,7 @@ SYSTEM """<system message>"""
### ADAPTER
The `ADAPTER` instruction specifies the LoRA adapter to apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile and the file must be in a GGML file format. The adapter should be tuned from the base model otherwise the behaviour is undefined.
The `ADAPTER` instruction is an optional instruction that specifies any LoRA adapter that should apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile and the file must be in a GGML file format. The adapter should be tuned from the base model otherwise the behaviour is undefined.
```modelfile
ADAPTER ./ollama-lora.bin
@@ -170,7 +188,38 @@ LICENSE """
"""
```
### MESSAGE
The `MESSAGE` instruction allows you to specify a message history for the model to use when responding. Use multiple iterations of the MESSAGE command to build up a conversation which will guide the model to answer in a similar way.
> **Note:** OpenAI compatibility is experimental and is subject to major adjustments including breaking changes. For fully-featured access to the Ollama API, see the Ollama [Python library](https://github.com/ollama/ollama-python), [JavaScript library](https://github.com/ollama/ollama-js) and [REST API](https://github.com/ollama/ollama/blob/main/docs/api.md).
Ollama provides experimental compatibility with parts of the [OpenAI API](https://platform.openai.com/docs/api-reference) to help connect existing applications to Ollama.
Sometimes Ollama may not perform as expected. One of the best ways to figure out what happened is to take a look at the logs. Find the logs on **Mac** by running the command:
```shell
cat ~/.ollama/logs/server.log
```
On **Linux** systems with systemd, the logs can be found with this command:
```shell
journalctl -u ollama
```
When you run Ollama in a **container**, the logs go to stdout/stderr in the container:
```shell
docker logs <container-name>
```
(Use `docker ps` to find the container name)
If manually running `ollama serve` in a terminal, the logs will be on that terminal.
When you run Ollama on **Windows**, there are a few different locations. You can view them in the explorer window by hitting `<cmd>+R` and type in:
- `explorer %LOCALAPPDATA%\Ollama` to view logs. The most recent server logs will be in `server.log` and older logs will be in `server-#.log`
- `explorer %LOCALAPPDATA%\Programs\Ollama` to browse the binaries (The installer adds this to your user PATH)
- `explorer %HOMEPATH%\.ollama` to browse where models and configuration is stored
- `explorer %TEMP%` where temporary executable files are stored in one or more `ollama*` directories
To enable additional debug logging to help troubleshoot problems, first **Quit the running app from the tray menu** then in a powershell terminal
```powershell
$env:OLLAMA_DEBUG="1"
& "ollama app.exe"
```
Join the [Discord](https://discord.gg/ollama) for help interpreting the logs.
## LLM libraries
Ollama includes multiple LLM libraries compiled for different GPUs and CPU vector features. Ollama tries to pick the best one based on the capabilities of your system. If this autodetection has problems, or you run into other problems (e.g. crashes in your GPU) you can workaround this by forcing a specific LLM library. `cpu_avx2` will perform the best, followed by `cpu_avx` an the slowest but most compatible is `cpu`. Rosetta emulation under MacOS will work with the `cpu` library.
In the server log, you will see a message that looks something like this (varies from release to release):
```
Dynamic LLM libraries [rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11 rocm_v5]
```
**Experimental LLM Library Override**
You can set OLLAMA_LLM_LIBRARY to any of the available LLM libraries to bypass autodetection, so for example, if you have a CUDA card, but want to force the CPU LLM library with AVX2 vector support, use:
```
OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve
```
You can see what features your CPU has with the following.
```
cat /proc/cpuinfo| grep flags | head -1
```
## Installing older or pre-release versions on Linux
If you run into problems on Linux and want to install an older version, or you'd like to try out a pre-release before it's officially released, you can tell the install script which version to install.
```sh
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION="0.1.29" sh
```
## Linux tmp noexec
If your system is configured with the "noexec" flag where Ollama stores its temporary executable files, you can specify an alternate location by setting OLLAMA_TMPDIR to a location writable by the user ollama runs as. For example OLLAMA_TMPDIR=/usr/share/ollama/
## NVIDIA GPU Discovery
When Ollama starts up, it takes inventory of the GPUs present in the system to determine compatibility and how much VRAM is available. Sometimes this discovery can fail to find your GPUs. In general, running the latest driver will yield the best results.
### Linux NVIDIA Troubleshooting
If you are using a container to run Ollama, make sure you've set up the container runtime first as described in [docker.md](./docker.md)
Sometimes the Ollama can have difficulties initializing the GPU. When you check the server logs, this can show up as various error codes, such as "3" (not initialized), "46" (device unavailable), "100" (no device), "999" (unknown), or others. The following troubleshooting techniques may help resolve the problem
- If you are using a container, is the container runtime working? Try `docker run --gpus all ubuntu nvidia-smi` - if this doesn't work, Ollama wont be able to see your NVIDIA GPU.
- Is the uvm driver loaded? `sudo nvidia-modprobe -u`
- Try reloading the nvidia_uvm driver - `sudo rmmod nvidia_uvm` then `sudo modprobe nvidia_uvm`
- Try rebooting
- Make sure you're running the latest nvidia drivers
If none of those resolve the problem, gather additional information and file an issue:
- Set `CUDA_ERROR_LEVEL=50` and try again to get more diagnostic logs
- Check dmesg for any errors `sudo dmesg | grep -i nvrm` and `sudo dmesg | grep -i nvidia`
## Windows Terminal Errors
Older versions of Windows 10 (e.g., 21H1) are known to have a bug where the standard terminal program does not display control characters correctly. This can result in a long string of strings like `←[?25h←[?25l` being displayed, sometimes erroring with `The parameter is incorrect` To resolve this problem, please update to Win 10 22H1 or newer.
Ollama runs with little to no configuration on [Fly.io GPU instances](https://fly.io/docs/gpus/gpu-quickstart/). If you don't have access to GPUs yet, you'll need to [apply for access](https://fly.io/gpu/) on the waitlist. Once you're accepted, you'll get an email with instructions on how to get started.
Create a new app with `fly apps create`:
```bash
fly apps create
```
Then create a `fly.toml` file in a new folder that looks like this:
```toml
app = "sparkling-violet-709"
primary_region = "ord"
vm.size = "a100-40gb" # see https://fly.io/docs/gpus/gpu-quickstart/ for more info
[build]
image = "ollama/ollama"
[http_service]
internal_port = 11434
force_https = false
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
processes = ["app"]
[mounts]
source = "models"
destination = "/root/.ollama"
initial_size = "100gb"
```
Then create a [new private IPv6 address](https://fly.io/docs/reference/private-networking/#flycast-private-load-balancing) for your app:
```bash
fly ips allocate-v6 --private
```
Then deploy your app:
```bash
fly deploy
```
And finally you can access it interactively with a new Fly.io Machine:
```
fly machine run -e OLLAMA_HOST=http://your-app-name.flycast --shell ollama/ollama
```
```bash
$ ollama run openchat:7b-v3.5-fp16
>>> How do I bake chocolate chip cookies?
To bake chocolate chip cookies, follow these steps:
1. Preheat the oven to 375°F (190°C) and line a baking sheet with parchment paper or silicone baking mat.
2. In a large bowl, mix together 1 cup of unsalted butter (softened), 3/4 cup granulated sugar, and 3/4
cup packed brown sugar until light and fluffy.
3. Add 2 large eggs, one at a time, to the butter mixture, beating well after each addition. Stir in 1
teaspoon of pure vanilla extract.
4. In a separate bowl, whisk together 2 cups all-purpose flour, 1/2 teaspoon baking soda, and 1/2 teaspoon
salt. Gradually add the dry ingredients to the wet ingredients, stirring until just combined.
5. Fold in 2 cups of chocolate chips (or chunks) into the dough.
6. Drop rounded tablespoons of dough onto the prepared baking sheet, spacing them about 2 inches apart.
7. Bake for 10-12 minutes, or until the edges are golden brown. The centers should still be slightly soft.
8. Allow the cookies to cool on the baking sheet for a few minutes before transferring them to a wire rack
to cool completely.
Enjoy your homemade chocolate chip cookies!
```
When you set it up like this, it will automatically turn off when you're done using it. Then when you access it again, it will automatically turn back on. This is a great way to save money on GPU instances when you're not using them. If you want a persistent wake-on-use connection to your Ollama instance, you can set up a [connection to your Fly network using WireGuard](https://fly.io/docs/reference/private-networking/#discovering-apps-through-dns-on-a-wireguard-connection). Then you can access your Ollama instance at `http://your-app-name.flycast`.
@@ -5,28 +5,28 @@ In this tutorial, we are going to use JavaScript with LangChain and Ollama to le
To get started, let's just use **LangChain** to ask a simple question to a model. To do this with JavaScript, we need to install **LangChain**:
```bash
npm install langchain
npm install @langchain/community
```
Now we can start building out our JavaScript:
```javascript
import { Ollama } from "langchain/llms/ollama";
import { Ollama } from "@langchain/community/llms/ollama";
const ollama = new Ollama({
baseUrl: "http://localhost:11434",
model: "llama2",
model: "llama3",
});
const answer = await ollama.call(`why is the sky blue?`);
const answer = await ollama.invoke(`why is the sky blue?`);
console.log(answer);
```
That will get us the same thing as if we ran `ollama run llama2 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's install **Cheerio** and build that part of the app.
That will get us the same thing as if we ran `ollama run llama3 "why is the sky blue"` in the terminal. But we want to load a document from the web to ask a question against. **Cheerio** is a great library for ingesting a webpage, and **LangChain** uses it in their **CheerioWebBaseLoader**. So let's install **Cheerio** and build that part of the app.
@@ -12,15 +12,17 @@ So let's figure out how we can use **LangChain** with Ollama to ask our question
Let's start by asking a simple question that we can get an answer to from the **Llama2** model using **Ollama**. First, we need to install the **LangChain** package:
It's split up, but we have to find the relevant splits and then submit those to the model. We can do this by creating embeddings and storing them in a vector database. For now, we don't have embeddings built in to Ollama, though we will be adding that soon, so for now, we can use the GPT4All library for that. We will use ChromaDB in this example for a vector database. `pip install GPT4All chromadb`
It's split up, but we have to find the relevant splits and then submit those to the model. We can do this by creating embeddings and storing them in a vector database. We can use Ollama directly to instantiate an embedding model. We will use ChromaDB in this example for a vector database. `pip install chromadb`
We also need to pull embedding model: `ollama pull nomic-embed-text`
```python
from langchain.embeddings import GPT4AllEmbeddings
Now let's ask a question from the document. **Who was Neleus, and who is in his family?** Neleus is a character in the Odyssey, and the answer can be found in our text.
@@ -65,7 +68,8 @@ The next thing is to send the question and the relevant parts of the docs to the
With some minor configuration, Ollama runs well on [NVIDIA Jetson Devices](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/). The following has been tested on [JetPack 5.1.2](https://developer.nvidia.com/embedded/jetpack).
Ollama runs well on [NVIDIA Jetson Devices](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/) and should run out of the box with the standard installation instructions.
NVIDIA Jetson devices are Linux-based embedded AI computers that are purpose-built for AI applications.
The following has been tested on [JetPack 5.1.2](https://developer.nvidia.com/embedded/jetpack), but should also work on JetPack 6.0.
Jetsons have an integrated GPU that is wired directly to the memory controller of the machine. For this reason, the `nvidia-smi` command is unrecognized, and Ollama proceeds to operate in "CPU only"
mode. This can be verified by using a monitoring tool like jtop.
In order to address this, we simply pass the path to the Jetson's pre-installed CUDA libraries into `ollama serve` (while in a tmux session). We then hardcode the num_gpu parameters into a cloned
version of our target model.
Prerequisites:
- curl
- tmux
Here are the steps:
- Install Ollama via standard Linux command (ignore the 404 error): `curl https://ollama.ai/install.sh | sh`
- Stop the Ollama service: `sudo systemctl stop ollama`
- Start Ollama serve in a tmux session called ollama_jetson and reference the CUDA libraries path: `tmux has-session -t ollama_jetson 2>/dev/null || tmux new-session -d -s ollama_jetson
- Install Ollama via standard Linux command (ignore the 404 error): `curl https://ollama.com/install.sh | sh`
- Pull the model you want to use (e.g. mistral): `ollama pull mistral`
- Create a new Modelfile specifically for enabling GPU support on the Jetson: `touch ModelfileMistralJetson`
- In the ModelfileMistralJetson file, specify the FROM model and the num_gpu PARAMETER as shown below:
```
FROM mistral
PARAMETER num_gpu 999
```
- Create a new model from your Modelfile: `ollama create mistral-jetson -f ./ModelfileMistralJetson`
- Run the new model: `ollama run mistral-jetson`
If you run a monitoring tool like jtop you should now see that Ollama is using the Jetson's integrated GPU.
- Start an interactive session: `ollama run mistral`
And that's it!
# Running Ollama in Docker
When running GPU accelerated applications in Docker, it is highly recommended to use [dusty-nv jetson-containers repo](https://github.com/dusty-nv/jetson-containers).
When calling `ollama`, you can pass it a file to run all the prompts in the file, one after the other:
`ollama run llama2 < sourcequestions.txt`
This concept is used in the following example.
## Compare Models
`comparemodels.sh` is a script that runs all the questions in `sourcequestions.txt` using any 4 models you choose that you have already pulled from the Ollama library or have created locally.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.