This logic was causing issues for me when importing a gguf that had some padding at the end of the file. The valid gguf would be read, but then it would try to read the offset as a different gguf file. This does not seem right.
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.
This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.
Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
If there is a CUDA OOM, we currently don't check the return value
and will evetually segfault. This checks for the problem and generates
a Go error. At the moment, this will still result in a panic but having
the error is the first step to being able to handle it more gracefully.
improves model loading times on network-based filesystems
such as GCS fuse by creating a dedicated file descriptor for each
section of the file being read, reducing seeking
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
No functional change. Many different done reasons can be set at the runner
level, so rather than obsuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.
Co-authored-by: Jesse Gross <jesse@ollama.com>
When truncating inputs to the the context window at the beginning of
a sequence, we remove the minimum amount possible. However, this
may cause us to truncate to the middle of a set of inputs that
the model specified should not be split up. To avoid this, we
need to remove the rest of the partial batch.
With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Retuning these from the show endpoint will allow clients to easily see what a model can do.
Clear KV cache when shift operation is not supported by model.
Added KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to full cache clear while preserving logical token history to
maintain expected behavior when context window fills up.
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.
Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
If we have an error after creating a new sequence but before
finding a slot for it, we return without releasing the semaphore.
This reduces our parallel sequences and eventually leads to deadlock.
In practice this should never happen because once we have acquired
the semaphore, we should always be able to find a slot. However, the
code is clearly not correct.
With the llama runner, we can generate up to NUM_PARALLEL batches
at once, which will then get broken up to into individual batches
to get executed by llama.cpp (i.e. we add up to 2048 tokens and
this gets split into 4 batches of 512 tokens at default settings).
This splitting can improve parallelism on multi-GPU systems because
the individual batches can move though the pipeline without blocking
on the first one to fully complete. However, we don't yet support
this in the Ollama runner, partially because it makes it hard to
enforce model-specified batch constraints, which didn't exist
previously.
The result is that we will try to execute the full, unsplit batch.
This could result in out of memory or insufficient KV cache space
errors.
This triggers batch breaking when the total inputs from all sequences
exceeds the batch size, rather than per-sequence. In order to ensure
fairness, it also reintroduces round-robinning around sequences so
that we don't let one busy sequence starve the others.
Model implementations should use Input for all of their tensors
supplied to the model. This includes tensors that relate to the
outputs, which is confusing since there is also an Output funciton.
Since Output is only used internally in GGML and not used by any
model implementations, we can remove it from the interface to
reduce confusion.
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.
Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.
This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.
Fixes#9730Fixes#9890
When computing the size of the cache for sliding window attention,
we don't need to multiple the batch size by the number of parallel
sequences - the batch size is constant.
This also simplifies the check for whether to allocate the cache
size based on capacity or window size as the batch size is already
incorporated into the capacity when handled by the runner.
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.
Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.
Also rename 'total' to 'received' for clarity.
Currently sliding window attention allocates and uses the full
context size and just masks out any tokens that are outside of the
window. However, we really only need (roughly) the sliding window
size.
At large context sizes this improves two things:
- Memory allocated - since the fully context size is allocated up front,
memory requirements drop substantially. On Gemma3:4b with a 32k
context window, total memory usage (including weights and non-sliding
layers) drops from ~20GB to ~8GB.
- Computation - ranges that are completely outside of the sliding
window are now removed from the tensors that are returned from the
cache rather than simply being masked out. This results in more
efficient processing, scaling with the size of the context that
has actually been used.
Notable, this does not update the scheduler for any model to be aware of
the smaller memory requirements. This is difficult for Gemma3 because
the layers are heterogeneous between sliding and non-sliding attention.
As a result, while actual memory consumption will be reduced, the
scheduler will over-estimate the requirements of the model. This means
that splitting between GPUs or GPUs and CPUs will still be suboptimal.
Bug #9730
Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times number of
parallel sequences.
Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.
This enables the runner to report progress back to the Ollama server,
both for showing status to the user and also to prevent the server
from killing the runner if it thinks things have stalled.
Most of the infrastructure was already there, this extends it to
be available to the backends.
Defragging the KV cache can generate a lot of operations, so we
need to be careful that we don't overflow the number that the graph
can support. We currently account for all of the nodes that we add
to the graph for each move but we also need to include the original
cache tensors as well.
Fixes#9904
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.
Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of tensor but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.
Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
If the chunksums response is missing a chunk, the client should fail
the download. This changes the client to check that all bytes are
accounted for in the chunksums response.
It is possible there are overlaps or gaps in the chunksums response and
so the size is not the only thing left to check, but this provides
enough coverage for now. We may want to check that chunks are contiguous
later.
When converting a ggml model if there is a failure to read tensor data a nil error value was being returned. It should be assigned to the actual error from reading.
Models can specify that a group of inputs need to be handled a single
batch. However, context shifting didn't respect this and could trigger
a break anyways. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch.
Note that there still some corner cases:
- A long prompt that exceeds the context window can get truncated
in the middle of an image. With the current models, this will
result in the model not recognizing the image at all, which is
pretty much the expected result with truncation.
- The context window is set less than the minimum batch size. The
only solution to this is to refuse to load the model with these
settings. However, this can never occur with current models and
default settings.
Since users are unlikely to run into these scenarios, fixing them is
left as a follow up.
We do not need to bypass the prompt caching in the ollama runner yet, as
only embedding models needed to bypass the prompt caching. When embedding
models are implemented they can skip initializing this cache completely.
This fixes the case where a FROM line in previous modelfile points to a
file which may/may not be present in a different ollama instance. We
shouldn't be relying on the filename though and instead just check if
the FROM line was instead a valid model name and point to that instead.
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.
Note: The version is obtained from the build info, instead of the
version in version.Version, which should not longer be necessary, but we
can remove in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.
This can no longer happen:
- The vision encoder is now on the GPU so both images would be
processed on the GPU.
- We require images to be fully contained in a batch and each
image including its special tokens is over half the batch size.
As a result, we will never get two images in the same batch.
Fixes#9731
Currently there is a single context per sequence, shared all by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.
This changes to use a separate context for each image, ensuring
that available resource limits are consistent.
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.
Fixes#9697
This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.
The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.
This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.
The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
Softcap isn't in the whitepaper/implementation for the language model so we should remove it. There is no discernible difference in output with it removed.
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.
Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.
Most of this is simply moving the input data structures to a new
package to avoid import cycles.
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
This change bring in various interface cleanups along with greatly improving the performance of the sampler.
Tested with llama3.2 on local machine.
Improves performance from ~ 70 tokens/s -> 135 tokens/s with topK(40) enabled.
Without topK performance is ~ 110 tokens/s
The problem with default.target is that it always points to the target that is currently started. So if you boot into single user mode or the rescue mode still Ollama tries to start.
I noticed this because either tried (and failed) to start all the time during a system update, where Ollama definitely is not wanted.
Various vision models have different requirements for how they
receive their inputs. For example:
- Mllama wants images together with text and the image embeddings
don't themselves have positions or get stored in the main KV cache
- Llava-style models feed in embeddings similar to tokens and
images correspond to a varying number of tokens in the cache.
In addition, the strategy for providing inputs must support batching
and multiple sequences, which are managed by the runner. At the same
time, we want to keep data handling fully in the model so that new
architectures are not bottlenecked by runner code which does not
understand their particular requirements.
This provides a method for models to edit the input stream so that
it meets their needs while still being in a format that the runner
understands. This allows the runner to avoid special processing
for different models.
In addition, this fixes a regression where non-vision models may
try to incorrectly interpret images.
We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.
The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable include "client2".
Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.
This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than the its predecessor in all other cases.
- output backend system info when initializing the backend. this ensures
this information is always present without needing to be called
explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.
Issues with the current approach:
1. Users may accidentally delete a model (`ollama rm llama3.3` instead
of `ollama rm llama3.2`), requiring a full re-download unless another
model references the same blobs.
2. Users may assume a deleted model is still referenced elsewhere, but
due to prior updates or deletions, the references no longer exist,
leading to unnecessary re-downloads.
Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.
Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.
Example usage:
# Run basic blob GC
$ ollama prune
# Remove unreferenced blobs older than 7 days
$ ollama prune --age 7d
# Remove all blobs, referenced or not, older than 7 days (and their manifests?)
$ ollama prune --age 7d --all
# Remove all unreferenced blobs immediately
$ ollama prune --age 0 --all
# Remove all blobs
$ ollama prune --age 0 --all
This should provide a safer and more predictable cleanup process.
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful but guards the use of the synctest
package with build tags.
synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.
The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
Previously, using a Registry required a DiskCache to be passed in for
use in various methods. This was a bit cumbersome, as the DiskCache is
required for most operations, and the DefaultCache is used in most of
those cases. This change makes the DiskCache an optional field on the
Registry struct.
This also changes DefaultCache to initialize on first use. This is to
not burden clients with the cost of creating a new cache per use, or
having to hold onto a cache for the lifetime of the Registry.
Also, slip in some minor docs updates for Trace.
Reverts ccache installation to be done manually via curl instead of
using the dnf package manager as this has side effects of prepending
ccache's install directory to the front of the PATH
The extended name format is a superset of the name format that only the
client needs to know about, not the server or other dependents of the
name package, so move the split logic into the client package.
Also, take advantage of knowing about the extended name format to allow
the client to use the extended name format when unlinking to verify they
are unlinking the manifest with the content they intend.
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.
Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.
The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.
Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.
To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
- for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.
This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.
Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.
Also, this package name is not permanent. This another step in the
on-going process of refactoring the server code, and at some point it
will most likely be renamed/moved.
More validation during the safetensor creation process.
Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
Add comprehensive test coverage for various paths
No functionality changes for valid inputs - existing workflows remain unaffected
Leverages Go 1.24's new os.Root functionality for secure containment
Focuses initial Blackwell support on compute capability 12.0
which includes the 50x series of GeForce cards. In the future
additional compute capabilities may be added
Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".
Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.
Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.
Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.
Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.
As are adding support for weighted sampling we have seen some performance
regressions, bypassing the sampler logic for now and defaulting to greedy
until we can benchmark the new sampler logic.
update Context.Forward to accept multiple tensors to match
Context.Compute signature
update Context.Forward to return Context such that it can be chained
with Context.Compute
This fixes panics introduced in 2412adf42b
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
* Windows ARM build
Skip cmake, and note it's unused in the developer docs.
* Win: only check for ninja when we need it
On windows ARM, the cim lookup fails, but we don't need ninja anyway.
* Add cuda Blackwell architecture for v12
* Win: Split rocm out to separate zip file
* Reduce CC matrix
The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space.
The 5.0 should be forward compatible with 5.2 and 5.3.
During work on our new registry client, I ran into frustrations with CI
where a misspelling in a comment caused the linter to fail, which caused
the tests to not run, which caused the build to not be cached, which
caused the next run to be slow, which caused me to be sad.
This commit address these issues, and pulls in some helpful changes
we've had in CI on ollama.com for some time now.
They are:
* Always run tests, even if the other checks fail.
Tests are the most important part of CI, and should always run. Failures
in tests can be correlated with failures in other checks, and can help
surface the root cause of the failure sooner. This is especially
important when the failure is platform specific, and the tests are not
platform independent.
* Check that `go generate` is clean.
This prevents 'go generate' abuse regressions. This codebase used to use
it to generate platform specific binary build artifacts. Let's make sure
that does not happen again and this powerful tool is used correctly, and
the generated code is checked in.
Also, while adding `go generate` the check, it was revealed that the
generated metal code was putting dates in the comments, resulting in
non-deterministic builds. This is a bad practice, and this commit fixes
that. Git tells us the most important date: the commit date along with
other associated changes.
* Check that `go mod tidy` is clean.
A new job to check that `go mod tidy` is clean was added, to prevent
easily preventable merge conflicts or go.mod changes being deferred to a
future PR that is unrelated to the change that caused the go.mod to
change.
* More robust caching.
We now cache the go build cache, and the go mod download cache
independently. This is because the download cache contains zips that can
be unpacked in parallel faster than they can be fetched and extracted by
tar. This speeds up the build significantly.
The linter is hostile enough. It does not need to also punish us with
longer build times due to small failures like misspellings.
* Bump cuda and rocm versions
Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
Yum has some silent failure modes, so largely switch to dnf.
* Fix windows build script
centos-7 images have been deprecated upstream and replaced with
almalinux-8 images instead, requiring some small extra work.
Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into the ollama as a replacement
for the pushing, and pulling of models, and management of the cache they
are pushed and pulled from.
New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.
Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
There are two benefits to doing this:
- Provide a library function that models can use, reducing code for
each model implementation
- Enables a single place to drop in optimized implementations of
attention based on the backend or other factors. One is provided for
GGML.
On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.
Changing the model definition in this way improves token generation
performance by approximately 8%.
We don't need to create and destroy the GGML scheduler for every
context. This introduces extra CPU overhead for every forward
pass and extra memory for contexts that don't actually get scheduled
(for example, KV caches). We can instead just have one scheduler
for the backend and reset it each time we call Compute.
This improves token generation performance by 1-2% and removes
scheduler create/destroy from profile traces.
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit
This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods.
Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
clang outputs are faster. we were previously building with clang via gcc
wrapper in cgo but this was missed during the build updates so there was
a drop in performance
sapphire rapids has amx support but it ends up having a negative
performance impact.
emerald rapids also has amx support with a positive performance impact
however there's no reasonable way in ggml to differentiate between the
two. the impact is small (~6%) so disable amx entirely for simplicity
set owner and group when building the linux tarball so extracted files
are consistent. this is the behaviour of release tarballs in version
0.5.7 and lower
The previous commit fixed flickering in the progress bar itself. Cursor
flickering is harder to address.
Cursor flickering could be fixed by hiding the cursor altogether while
the progress bar is displayed. The downside of this is that if the
program is killed in such a way that it can't clean up its state, it
would leave the cursor invisible.
Instead, this commit introduces an output buffer. All of the escape
codes and content for a single progress update are written to a buffer,
which is then flushed to the terminal all at once. This significantly
decreases the time during which the terminal has seen the cursor-hiding
code but has not yet seen the cursor-showing code, thus minimizing (but
not 100% eliminating) cursor flickering.
For more context, see:
https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501
Previous code cleared the display before writing new content, creating a
window where the terminal could (and in some cases did) render empty lines.
Instead, we now write new content over the old content, only clearing
the trailing end of lines for cases where the new line is shorter.
Fixes#1664
We currently print system info before the GGML backends are loaded.
This results in only getting information about the default lowest
common denominator runner. If we move up the GGML init then we can
see what we are actually running.
Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24
After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
provides a better approach to #9088 that will attempt to
evaluate symlinks (important for macOS where 'ollama' is
often a symlink), but use the result of os.Executable()
as a fallback in scenarios where filepath.EvalSymlinks
fails due to permission erorrs or other issues
In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors which results in an error when running models. It also
doesn't work well on long paths on windows, also resulting in
errors. This change removes filepath.EvalSymlinks when accessing
os.Executable() altogether
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.
In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal modals
Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:
Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve
Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
Special tokens are currently read as uint32 from the model metadata.
However, all other parts of the system (including the tokenizer) use
int32 to represent tokens so it is impossible to represent the high
portion of the unsigned range. For consistency and to avoid casts,
we should just use int32 everywhere.
Currently, if a model uses an interface for its data structures (as mllama
does) then the tensor data in the structs implementing that interface will
not get loaded.
We need to sync before retrieving data after async computation.
It is also important to ensure that the Go buffer is not moved by
the GC across function calls so we do a synchronous copy.
Passing in a Go buffer is not safe because the garbage collector could
free or move the memory while the context is still open. However, if
we pass in the size and a nil pointer then GGML will allocate it from
the C side.
Most tensor backends try to optimize performance by using a lower
precision for matmuls. However, some operations (such as kq) on
some models are sensitive to this and require full precision.
There are two cases where we may not have an output after computing:
- Prompt processing where the length of the input exceeds the batch
size
- Internal memory management operations such as cache defrag and shift
Currently there is a mixture of int and int64 used when dealing with
tensor dimensions and shapes, which causes unnecessary conversions -
they all should be the same type.
In general, most interfaces (such as Pytorch) use int64 for
generality but most implementations (such as CUDA) use int32 for
performance. There isn't much benefit to us to being more flexible
than the implementations we are likely to run on.
In addition, as a practical matter, a model with a tensor with a single
dimension larger than 32 bits is unlikely to run on a 32-bit machine.
It is not common to return errors with close/free operations - most
people won't check it and even if they did there's probably not much
that can do. It's better to not give implementations false expectations.
feat: add new Ollama engine using ggml through cgo
This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.
- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
This is the first implementation of the new engine. Follow up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
ollama requires vcruntime140_1.dll which isn't found on 2019. previously
the job used the windows runner (2019) but it explicitly installs
2022 to build the app. since the sign job doesn't actually build
anything, it can use the windows-2022 runner instead.
Shield the code processing the embedding result
from subsequent calls that may overwrite the same
buffer to process a second input when retrieving
model embeddings.
Removed redundant checks and streamlined the switch-case structure.
Added test cases for both HumanBytes and HumanBytes2 to cover a wide range of scenarios.
In some cases, downloads slow due to disk i/o or other factors,
causing the download to restart a part. This causes the download
to "reverse" in percent completion. By increasing the timeout to 30s,
this should happen less frequently.
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner
This ensures the GPU specific environment variables are set properly
* explicitly set CXX compiler for HIP
* Update build_windows.ps1
This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unncessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
These fields are deprecated, but specifying them will not do anything. Removing them as the other deprecated fields will still work, but these do not, so they dont match our existing pattern.
This change adds a test to catch a regression in schema_to_grammar where
the order of keys in the JSON schema is not preserved in the generated
grammar, which is critical for step-by-step reasoning.
Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not being able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.
In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.
Fixes#7949
Changes in #8002 introduced fixes for bugs with mangling JSON Schemas.
It also fixed a bug where the server would silently fail when clients
requested invalid formats. It also, unfortunately, introduced a bug
where the server would reject requests with an empty format, which
should be allowed.
The change in #8127 updated the code to allow the empty format, but also
reintroduced the regression where the server would silently fail when
the format was set, but invalid.
This commit fixes both regressions. The server does not reject the empty
format, but it does reject invalid formats. It also adds tests to help
us catch regressions in the future.
Also, the updated code provides a more detailed error message when a
client sends a non-empty, but invalid format, echoing the invalid format
in the response.
This commits also takes the opportunity to remove superfluous linter
checks.
In 0.5.2 we simplified packaging to have avx only for macos x86. It looks like
there may still be some non-AVX systems out there, so this puts back the prior
logic of building no-AVX for the primary binary, and now 2 runners for avx and avx2.
These will be packaged in the App bundle only, so the stand-alone binary will now be
without AVX support on macos. On arm, we'll also see these runners reported
as available in the log, but they're dormant and will never be used at runtime.
* docs: switch around database integrations order and link to quickstart
* docs: link to blog post in example readme
* chore: link to main readme
* readme: removing example to link externally
readme: removing example to link externally so we don't have to keep this example up-to-date
---------
Previously we decoded and re-encoded JSON schemas during validation,
which served no purpose since json.RawMessage already validates JSON
syntax. Worse, the re-encoding lost field ordering from the original
schema, which affects inference quality during step-by-step reasoning.
While fixing this ordering issue by using json.RawMessage directly,
testing revealed that schema_to_grammar (from llama.cpp) also fails to
preserve field order during grammar generation. This appears to be the
root cause of inference degradation.
This change prevents us from mangling the user's original schema order,
but we still need to address the ordering issue in schema_to_grammar.
That will be a separate change.
Updates #7978
upload-artifacts strips off leading common paths so when
the ./build/ artifacts were removed, the ./dist/windows-amd64
prefix became common and was stripped, making the
later download-artifacts place them in the wrong location
The final implementation of #7499 removed dynamic vector requirements
in favor of a simpler filename based model, and this was left over logic that
is no longer needed.
* llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.
* build: Make target improvements
Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash. If the user builds a customized set, we omit the naming
scheme and don't check for compatibility. This avoids checking
requirements at runtime, so that logic has been removed as well. This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
New lines can be an important part of a user's prompt and trimming
it can alter the results. We previously only trimmed prompts with
images but refactoring brought this behavior to all prompts, where
it became more noticable.
The /generate endpoint adds less whitespace and therefore doesn't
need to trim it out - this brings the same behavior to /chat.
Thanks to @gabe-l-hart for spotting the issue!
Fixes#7795
When processing a prompt, we look for image tags of the form
[img-0], which are inserted by the Ollama server process.
However, this can cause errors if the original prompt has these
tags - typically an image not found error is returned.
This changes tag searching behavior to be similar to the 0.3.x
series, which will largely avoid these problems. However,they can
still happen when input text with these tags is used with image
models. The correct solution is to escape the tags but this is a
larger issue with special sequences in general so this is an
incremental fix that should avoid the problem for the majority
of cases.
This also makes it easier to truncate long inputs the same as
shifting but does not actually implement it. This type of
truncation has a trade off between quality and time to first
token.
This changes makeRequest to update the http client Transport if and only
if testMakeRequestDialContext is set. This is to avoid overriding the
default Transport when testMakeRequestDialContext is nil, which broke
existing behavior, included proxies, timeouts, and other behaviors.
Fixes#7829Fixes#7788
After a user pushes their model it is not clear what to do next. Add a link
to the output of `ollama push` that tells the user where their model can now
be found.
If there are no avilable slots for new sequences then a request
will not be added to the processing queue but will continue on
to wait for a response that never comes. Besides never giving a
response to the request, this prevents the model from being
unloaded due to the outstanding request.
To prevent this, there are semaphores that prevent more requests
from being processed than there are slots - one in the Ollama
server and one in the runner.
- The Ollama server one works but it is not designed to protect
the runner's data internal structures and the runner can return a
final response before clearing its data structures.
- The internal runner semaphore has similar behavior where it
can release the semaphore when it issues a response. This is
wrong - it should only release the semaphore after it has
cleared the data structure.
In addition, we should return an error if a slot is not found
rather than deadlocking in the event we ever get to this spot.
Fixes#7779
In the past the ollama.com server would return a JWT that contained
information about the user being authenticated. This was used to return
different error messages to the user. This is no longer possible since the
token used to authenticate does not contain information about the user
anymore. Removing this code that no longer works.
Follow up changes will improve the error messages returned here, but good to
clean up first.
Users get confused by "Failed to acquire semaphore" error="context canceled"
messages in the logs, which are actually clients giving up. While there could be
a legitimate hang bug in the system, sometimes this is just short client timeouts
with an overloaded system, so this should help users understand what's going on
better.
This avoids emitting the progress indicators to stderr, and the interactive
prompts to the output file or pipe. Running "ollama run model > out.txt"
now exits immediately, and "echo hello | ollama run model > out.txt"
produces zero stderr output and a typical response in out.txt
OrionChat is a free web-based chat interface that simplifies interactions
with multiple AI model providers. It provides a unified platform for chatting
and exploring multiple large language models (LLMs).
Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur then invariants
would be broken, causing crashes or infinite loops.
Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.
However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.
Fixes#7762
We need to track which tokens are in the cache ourselves. We currently
add tokens to the cache tracker when we add them to batch but they are
not actually in the cache until we call Decode. This can cause
confusion when we are shifting the cache.
Avoids "could not find a KV slot for the batch" issues.
Bug #7545
We try to recover from errors by dropping the tokens that caused the
problem and re-trying. However, dropping the tokens is not correct
and continuing often leads to infinite loops. To avoid, this we
end the sequence if such a condition is detected, which is also
surprising.
At this point, it is better to just report the error. This will make
it easier to find problems and the alternatives are perhaps even more
surprising to users.
This is not a very satisfactory solution either - we should isolate
the error and return it to the user without killing the whole process.
However, this is an incremental step and consistent with most other
failures (which either manifest as abort() or panic).
Fragmentation of the KV cache can occur due to cache shifting or
different sequences getting processed. Decode uses a heuristic to
decide if it should defrag. However, this heuristic isn't 100%
accurate, so decoding can sometimes fail by surprise.
For these cases, if decode indicates that there is no KV cache space,
we should defrag and then try again.
Many model crashes are masked behind "An existing connection was forcibly closed by the remote host"
This captures that common error message and wires in any detected errors from the log.
This also adds the deepseek context shift error to the known errors we capture.
This change allows for mixed-case model names to be pushed, pulled,
copied, and created, which was previously disallowed because the Ollama
registry was backed by a Docker registry that enforced a naming
convention that disallowed mixed-case names, which is no longer the
case.
This does not break existing, intended, behaviors.
Also, make TestCase test a story of creating, updating, pulling, and
copying a model with case variations, ensuring the model's manifest is
updated correctly, and not duplicated across different files with
different case variations.
This is a partial revert of 8a35bb92
"runner.go: Increase survivability of main processing loop", removing
the panic handler.
Although we want to avoid errors taking down the runner, we also
should make the user aware of problems when they happen. In the
future, we can restructure things so both parts are true.
Currently, if an error occurs during the prep stages (such as
tokenizing) of a single request, it will only affect that request.
However, if an error happens during decoding, it can take down the
entire runner.
Instead, it's better to drop the tokens that triggered the error and try to
keep going. However, we also need to stop when we run out of tokens,
otherwise, this just causes an infinite loop. This is likely the cause
of at least some of the hanging issues that have been reported.
Bug #7573
It's possible to get prompts that consist entirely of whitespace -
this is most likely to happen when generating embeddings. Currently,
we will trim this away, leaving an empty prompt, which will then
generate an error.
Generating embeddings from whitespace should not trigger an error,
as this may break pipelines. It's better to just leave the whitespace
in place and process what we are given. This is consistent with
past versions of Ollama.
Bug #7578
NUM_PARALEL is currently enforced by the Ollama server process - it
will only issue requests to the runner if the maximum number of
concurrent requests has not been exceeded. Although this should
be sufficient, it is good for the runner to protect its own data
structures. Currently, if too many requests get through to the
runner, they will just get stuck and never return.
This may help with reports of Ollama hanging, though it is unclear
how it would actually occur.
Bug #7573
Docker uses the container filesystem for name resolution, so we can't guide users
to use the name of the host group. Instead they must specify the numeric ID.
The structure of the accounting for KV cache shifting was carried
over from the old runner but it now doesn't feel natural with the new
runner. There are a number of invariants that should hold true but
are difficult to reason about. There is at least one bug report
that would imply that the invariants are not holding.
This reduces the number of implicit assumptions and is more forgiving
of unexpected situations. It also improves behavior around which input
tokens are kept when truncation occurs.
Bug #7545
The Go runner does not have a problem with supporting parallel
requests for most multimodal models. Now that we won't be potentially
falling back to server.cpp, this restriction can be lifted.
However, the new mllama model can't support parallel requests, so we
will need to keep a restriction for that.
One potential failure mode is an empty file which bubbles up as an EOF error,
leading to all pulls and listing operations failing. Instead, continue and
warn about the corrupt manifest. This also allows re-pulling the corrupt
manifest to repair the system.
Currently we assume that images take 768 tokens of context size for
the purposes of clipping old messages that exceed the context window.
However, our mllama implementation stores the full image embedding
in a single token. As a result, there is significant waste of context
space.
Ideally, we would handle this more generically and have the
implementation report the number of tokens. However, at the moment
this would just result in a similar set of 'if' conditions in the
runner plus APIs to report it back. So for now, we just keep this
simple.
This leverages caching, and some reduced installer scope to try
to speed up builds. It also tidies up some windows build logic
that was only relevant for the older generate/cmake builds.
Check for NULL return values from llama.cpp in more places and
convert them into Go errors, which should make debugging easier
in the future rather than having hidden surprises in our data
structures.
Mllama has large embeddings (100 MB per image) and each embedding is
represented as 1 token when passed to llama.cpp. Batches are pre-
allocated for the size of the tokens times the batch size, so this
results in allocations of over 50 GB at the default batch size.
On some systems, these mallocs will fail.
Since an image is represented as a single token and mllama doesn't
support more than 1 image per request, we only need to allocate a
batch size of 1, which is much more reasonable. In addition, for
non-multimodal models, we don't need to allocate the embedding
batches at all.
Fixes#7464
Currently if an input has embeddings at any point then we will set
cross attention to true from the beginning. This means that any
tokens before the embeddings are sent will incorrectly have cross
attention layers applied.
This only sets cross attention when we have an embedding, either
previously in this sequence or in the cache. It also makes cross
attention capable of supporting parallelism at the runner level,
though the mllama implementation doesn't support that yet.
* Give unicode test more time to run
Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test
* Give more time for concurrency test
CPU inference can be very slow under stress
-Update mllama to take the cross attention state as embeddings in
a batch, more similar to how Llava handles it. This improves
integration with the input cache.
-Pass locations in a prompt for embeddings using tags similar to Llava.
-Abstract interface to vision models so the main runner accesses Clip
and Mllama similarly
Co-authored-by: Michael Yang <mxyng@pm.me>
This will no longer error if built with regular gcc on windows. To help
triage issues that may come in related to different compilers, the runner now
reports the compier used by cgo.
* windows: Support alt install paths
Advanced users are leveraging innosetup's /DIR switch to target
an alternate location, but we get confused by things not existing in the LocalAppData dir.
This also hardens the server path lookup code for a future attempt to unify with a ./bin prefix
* Fit and finish improvements for windows app
Document alternate install location instructions for binaries and model.
Pop up progress UI for upgrades (automatic, with cancel button).
Expose non-default port in menu to disambiguate mutiple instances.
Set minimum Windows version to 10 22H2
* Switch over to clang for deepseek on windows
The patch for deepseek requires clang on windows. gcc on windows
has a buggy c++ library and can't handle the unicode characters
* Fail fast with wrong compiler on windows
Avoid users mistakenly building with GCC when we need clang
Llama.cpp sometimes returns NULL as a return value to report an
error. We should explicitly check for this and convert it to a Go
error rather than putting NULL in our data structures and waiting
for it to blow up later.
* Better support for AMD multi-GPU
This resolves a number of problems related to AMD multi-GPU setups on linux.
The numeric IDs used by rocm are not the same as the numeric IDs exposed in
sysfs although the ordering is consistent. We have to count up from the first
valid gfx (major/minor/patch with non-zero values) we find starting at zero.
There are 3 different env vars for selecting GPUs, and only ROCR_VISIBLE_DEVICES
supports UUID based identification, so we should favor that one, and try
to use UUIDs if detected to avoid potential ordering bugs with numeric IDs
* ROCR_VISIBLE_DEVICES only works on linux
Use the numeric ID only HIP_VISIBLE_DEVICES on windows
On windows ensure windows version define is properly set for rocm.
Remove duplicate rocm arch flags.
Resolve wildcards in the targets so parallel builds don't race.
Use readlink to resolve rocm dependencies since wildcards omit libelf
Keep windows rocm deps aligned with unified packaging model
We check for partial unicode characters and accumulate them before
sending. However, when we did send, we still sent each individual piece
separately, leading to broken output. This combines everything into
a single group, which is also more efficient.
This also switches to the built-in check for valid unicode characters,
which is stricter. After this, we should never send back an invalid
sequence.
Fixes#7290
* fix(ext_server): Port llama.cpp sampling refactors to ext_server
This was a fairly large changeset. I closely followed the changes here:
df270ef745
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Bump llama.cpp to the latest master with `granite` support
This does not yet have granite MoE support, but that can come in a
follow up PR
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(solar): Update solar patch for llama.cpp bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump llama.cpp for granitemoe support
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump llama.cpp for granitemoe support
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(solar): Update the solar-pro patch for latest llama.cpp bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump to the latest master of llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(patches): Update all patches for latest bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama): Always run sync.sh from the right directory
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama/patches): Update llama patches
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama)!: Rough sync with llama.cpp submodule
There are a number of changes that will need to be propagated to llama.go
before any of this works!
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama/patches): Add a patch and update for missing ggml-impl.h include
This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declartion in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Add missing log.cpp
This was added as part of the logging overhaul done in llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Overhaul use of sampling module for llama.cpp changes
The changes here reflect the changes made in the big llama.cpp sampling PR
https://github.com/ggerganov/llama.cpp/pull/9294
The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses c++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to
access a pure-C interface.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Fix the impl of SampleTokenGreedy for new sampling
I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.
Branch: IBMGraniteArchitectureSupport
* fix(llama): Remove unused SampleTokenGreedy
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(sync): Remove bash-specific change to sync.sh
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* chore(gofumpt): Format on llama.go to pass linting
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llm): Fix missing <thread> include in ext_server
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Remove TODO about grammar_first
This feature was not used/needed previously so should be fine without
plumbing it through now.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Better naming for sampling wrapper and args
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Fix patch 05 to use new wrapper api and re-sync
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* runner: Flush pending responses before returning
If there are any pending reponses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.
Fixes#6707
* fix(llama/sampling): Use gpt_sampler with a forward declaration
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Remove unnecessary patch for gguf impl header
This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llm): Remove use of deprecated --log-disable flag
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Having v11 support hard-coded into the cgo settings causes warnings
for newer Xcode versions. This should help keep the build clean for users
building from source with the latest tools, while still allow us to target
the older OS via our CI processes.
Currently the CLI only sends images from the most recent image-
containing message. This prevents doing things like sending
one message with an image and then a follow message with a
second image and asking for comparision based on additional
information not present in any text that was output.
It's possible that some models have a problem with this but the
CLI is not the right place to do this since any adjustments are
model-specific and should affect all clients.
Both llava:34b and minicpm-v do reasonable things with multiple
images in the history.
When a single token contains both text to be return and a stop
sequence, this causes an out of bounds error when we update the
cache to match our text. This is because we currently assume that
the removing the stop sequence will consume at least one token.
This also inverts the logic to deal with positive numbers, rather
than a value to be subtracted, which is easier to reason about.
Fixes#7153
Close can be called on an LLM server if the runner subprocess dies.
However, the Ollama scheduler code may not know about this yet and
still try to access it. In this case, it is important that 'cmd'
is still available as it is used to check on the status of the
subprocess. If this happens, Kill may be called twice on the subprocess -
that is fine.
In addition, model unloading may race with new accesses, so we should
hold a lock around this. This may result in the model being reloaded
after the first close call - this is also fine as close will be called
again later.
The recent change to applying patches leaves the submodule dirty based on
"new commits" being present. This ensures we clean up so the tree no longer
reports dirty after a `go generate ./...` run.
The Makefile was being a bit too aggressive in cleaning things up and would result in deleting the placeholder files which someone might accidentally commit.
* Re-introduce the llama package
This PR brings back the llama package, making it possible to call llama.cpp and
ggml APIs from Go directly via CGo. This has a few advantages:
- C APIs can be called directly from Go without needing to use the previous
"server" REST API
- On macOS and for CPU builds on Linux and Windows, Ollama can be built without
a go generate ./... step, making it easy to get up and running to hack on
parts of Ollama that don't require fast inference
- Faster build times for AVX,AVX2,CUDA and ROCM (a full build of all runners
takes <5 min on a fast CPU)
- No git submodule making it easier to clone and build from source
This is a big PR, but much of it is vendor code except for:
- llama.go CGo bindings
- example/: a simple example of running inference
- runner/: a subprocess server designed to replace the llm/ext_server package
- Makefile an as minimal as possible Makefile to build the runner package for
different targets (cpu, avx, avx2, cuda, rocm)
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
* cache: Clear old KV cache entries when evicting a slot
When forking a cache entry, if no empty slots are available we
evict the least recently used one and copy over the KV entries
from the closest match. However, this copy does not overwrite
existing values but only adds new ones. Therefore, we need to
clear the old slot first.
This change fixes two issues:
- The KV cache fills up and runs out of space even though we think
we are managing it correctly
- Performance gets worse over time as we use new cache entries that
are not hot in the processor caches
* doc: explain golang objc linker warning (#6830)
* llama: gather transitive dependencies for rocm for dist packaging (#6848)
* Refine go server makefiles to be more DRY (#6924)
This breaks up the monolithic Makefile for the Go based runners into a
set of utility files as well as recursive Makefiles for the runners.
Files starting with the name "Makefile" are buildable, while files that
end with ".make" are utilities to include in other Makefiles. This
reduces the amount of nearly identical targets and helps set a pattern
for future community contributions for new GPU runner architectures.
When we are ready to switch over to the Go runners, these files should
move to the top of the repo, and we should add targets for the main CLI,
as well as a helper "install" (put all the built binaries on the local
system in a runnable state) and "dist" target (generate the various
tar/zip files for distribution) for local developer use.
* llama: don't create extraneous directories (#6988)
* llama: Exercise the new build in CI (#6989)
Wire up some basic sanity testing in CI for the Go runner. GPU runners are not covered yet.
* llama: Refine developer docs for Go server (#6842)
This enhances the documentation for development focusing on the new Go
server. After we complete the transition further doc refinements
can remove the "transition" discussion.
* runner.go: Allocate batches for all sequences during init
We should tell the model that we could have full batches for all
sequences. We already do this when we allocate the batches but it was
missed during initialization.
* llama.go: Don't return nil from Tokenize on zero length input
Potentially receiving nil in a non-error condition is surprising to
most callers - it's better to return an empty slice.
* runner.go: Remove stop tokens from cache
If the last token is EOG then we don't return this and it isn't
present in the cache (because it was never submitted to Decode).
This works well for extending the cache entry with a new sequence.
However, for multi-token stop sequences, we won't return any of the
tokens but all but the last one will be in the cache. This means
when the conversation continues the cache will contain tokens that
don't overlap with the new prompt.
This works (we will pick up the portion where there is overlap) but
it causes unnecessary cache thrashing because we will fork the original
cache entry as it is not a perfect match.
By trimming the cache to the tokens that we actually return this
issue can be avoided.
* runner.go: Simplify flushing of pending tokens
* runner.go: Update TODOs
* runner.go: Don't panic when processing sequences
If there is an error processing a sequence, we should return a
clean HTTP error back to Ollama rather than panicing. This will
make us more resilient to transient failures.
Panics can still occur during startup as there is no way to serve
requests if that fails.
Co-authored-by: jmorganca <jmorganca@gmail.com>
* runner.go: More accurately capture timings
Currently prompt processing time doesn't capture the that it takes
to tokenize the input, only decoding time. We should capture the
full process to more accurately reflect reality. This is especially
true once we start processing images where the initial processing
can take significant time. This is also more consistent with the
existing C++ runner.
* runner.go: Support for vision models
In addition to bringing feature parity with the C++ runner, this also
incorporates several improvements:
- Cache prompting works with images, avoiding the need to re-decode
embeddings for every message in a conversation
- Parallelism is supported, avoiding the need to restrict to one
sequence at a time. (Though for now Ollama will not schedule
them while we might need to fall back to the old runner.)
Co-authored-by: jmorganca <jmorganca@gmail.com>
* runner.go: Move Unicode checking code and add tests
* runner.go: Export external cache members
Runner and cache are in the same package so the change doesn't
affect anything but it is more internally consistent.
* runner.go: Image embedding cache
Generating embeddings from images can take significant time (on
my machine between 100ms and 8s depending on the model). Although
we already cache the result of decoding these images, the embeddings
need to be regenerated every time. This is not necessary if we get
the same image over and over again, for example, during a conversation.
This currently uses a very small cache with a very simple algorithm
but it is easy to improve as is warranted.
* llama: catch up on patches
Carry forward solar-pro and cli-unicode patches
* runner.go: Don't re-allocate memory for every batch
We can reuse memory allocated from batch to batch since batch
size is fixed. This both saves the cost of reallocation as well
keeps the cache lines hot.
This results in a roughly 1% performance improvement for token
generation with Nvidia GPUs on Linux.
* runner.go: Default to classic input cache policy
The input cache as part of the go runner implemented a cache
policy that aims to maximize hit rate in both single and multi-
user scenarios. When there is a cache hit, the response is
very fast.
However, performance is actually slower when there is an input
cache miss due to worse GPU VRAM locality. This means that
performance is generally better overall for multi-user scenarios
(better input cache hit rate, locality was relatively poor already).
But worse for single users (input cache hit rate is about the same,
locality is now worse).
This defaults the policy back to the old one to avoid a regression
but keeps the new one available through an environment variable
OLLAMA_MULTIUSER_CACHE. This is left undocumented as the goal is
to improve this in the future to get the best of both worlds
without user configuration.
For inputs that result in cache misses, on Nvidia/Linux this
change improves performance by 31% for prompt processing and
13% for token generation.
* runner.go: Increase size of response channel
Generally the CPU can easily keep up with handling reponses that
are generated but there's no reason not to let generation continue
and handle things in larger batches if needed.
* llama: Add CI to verify all vendored changes have patches (#7066)
Make sure we don't accidentally merge changes in the vendored code
that aren't also reflected in the patches.
* llama: adjust clip patch for mingw utf-16 (#7065)
* llama: adjust clip patch for mingw utf-16
* llama: ensure static linking of runtime libs
Avoid runtime dependencies on non-standard libraries
* runner.go: Enable llamafile (all platforms) and BLAS (Mac OS)
These are two features that are shown on llama.cpp's system info
that are currently different between the two runners. On my test
systems the performance difference is very small to negligible
but it is probably still good to equalize the features.
* llm: Don't add BOS/EOS for tokenize requests
This is consistent with what server.cpp currently does. It affects
things like token processing counts for embedding requests.
* runner.go: Don't cache prompts for embeddings
Our integration with server.cpp implicitly disables prompt caching
because it is not part of the JSON object being parsed, this makes
the Go runner behavior similarly.
Prompt caching has been seen to affect the results of text completions
on certain hardware. The results are not wrong either way but they
are non-deterministic. However, embeddings seem to be affected even
on hardware that does not show this behavior for completions. For
now, it is best to maintain consistency with the existing behavior.
* runner.go: Adjust debug log levels
Add system info printed at startup and quiet down noisier logging.
* llama: fix compiler flag differences (#7082)
Adjust the flags for the new Go server to more closely match the
generate flow
* llama: refine developer docs (#7121)
* llama: doc and example clean up (#7122)
* llama: doc and example clean up
* llama: Move new dockerfile into llama dir
Temporary home until we fully transition to the Go server
* llama: runner doc cleanup
* llama.go: Add description for Tokenize error case
---------
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
This change closes the response body when an error occurs in
makeRequestWithRetry. Previously, the first, non-200 response body was
not closed before reattempting the request. This change ensures that
the response body is closed in all cases where an error occurs,
preventing leaks of file descriptors.
Fixes#6974
GPUs handled the dependency path properly, but CPU runners didn't which
results in missing vc redist libraries on systems where the user didn't
already have it installed from some other app.
* Unified arm/x86 windows installer
This adjusts the installer payloads to be architecture aware so we can cary
both amd64 and arm64 binaries in the installer, and install only the applicable
architecture at install time.
* Include arm64 in official windows build
* Harden schedule test for slow windows timers
This test seems to be a bit flaky on windows, so give it more time to converge
The rocm CI step for RCs was incorrectly tagging them as the latest rocm build.
The multiarch manifest was incorrectly tagged twice (with and without the
prefix "v"). Static windows artifacts weren't being carried between build
jobs. This also fixes the latest tagging script.
* Optimize container images for startup
This change adjusts how to handle runner payloads to support
container builds where we keep them extracted in the filesystem.
This makes it easier to optimize the cpu/cuda vs cpu/rocm images for
size, and should result in faster startup times for container images.
* Refactor payload logic and add buildx support for faster builds
* Move payloads around
* Review comments
* Converge to buildx based helper scripts
* Use docker buildx action for release
If there are any pending reponses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.
Fixes#6707
This adds back a check which was lost many releases back to verify /dev/kfd permissions
which when lacking, can lead to confusing failure modes of:
"rocBLAS error: Could not initialize Tensile host: No devices found"
This implementation does not hard fail the serve command but instead will fall back to CPU
with an error log. In the future we can include this in the GPU discovery UX to show
detected but unsupported devices we discovered.
Provide a mechanism for users to set aside an amount of VRAM on each GPU
to make room for other applications they want to start after Ollama, or workaround
memory prediction bugs
It looks like driver 525 (aka, cuda driver 12.0) has problems with the cuda v12 library
we compile against, so run v11 on those older drivers if detected.
/Users/au/src/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.
During rebasing, the ordering was inverted causing the cuda version
selection logic to break, with driver version being evaluated as zero
incorrectly causing a downgrade to v11.
* Fix embeddings memory corruption
The patch was leading to a buffer overrun corruption. Once removed though, parallism
in server.cpp lead to hitting an assert due to slot/seq IDs being >= token count. To
work around this, only use slot 0 for embeddings.
* Fix embed integration test assumption
The token eval count has changed with recent llama.cpp bumps (0.3.5+)
We're over budget for github's maximum release artifact size with rocm + 2 cuda
versions. This splits rocm back out as a discrete artifact, but keeps the layout so it can
be extracted into the same location as the main bundle.
This adjusts linux to follow a similar model to windows with a discrete archive
(zip/tgz) to cary the primary executable, and dependent libraries. Runners are
still carried as payloads inside the main binary
Darwin retain the payload model where the go binary is fully self contained.
- Updated setIcon method to include tooltip text for the system tray icon.
- Added NIF_TIP flag and set the tooltip text using UTF16 encoding.
Resolves: #6372
* Fix typo and improve readability
Summary:
* Rename updatAvailableMenuID to updateAvailableMenuID
* Replace unused cmd parameter with _ in RunServer function
* Fix typos in comments
(cherry picked from commit 5b8715f0b04773369e8eb1f9e6737995a0ab3ba7)
* Update api/client.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
The previous value of 64 was WAY too high and unnecessary. It reached
diminishing returns and blew past it. This is a more reasonable number
for _most_ normal cases. For users on cloud servers with excellent
network quality, this will keep screaming for them, without hitting our
CDN limits. For users with relatively poor network quality, this will
keep them from saturating their network and causing other issues.
For simplicity, perform parallelization of embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler as it builds on existing parallel requests, similar to existing text completion functionality.
Don't allow loading models that would lead to memory exhaustion (across vram, system memory and disk paging). This check was already applied on Linux but should also be applied on Windows as well.
Commit 1829fb61 ("manifest: Fix crash on startup when trying to clean up
unused files (#5840)") changed the config layer stored in manifests
from a pointer to a value. This was done in order to avoid potential
nil pointer dereferences after it is deserialized from JSON in the
event that the field is missing.
This changes the Layers slice to also be stored by value. This enables
consistency in handling across the two objects.
When creating a model the config layer is appended to the list of
layers and then the last layer is used as the config when writing the
manifest. This change directly uses the config layer to write the
manifest. There is no behavior change but it is less error prone.
Currently if the config field is missing in the manifest file (or
corrupted), Ollama will crash when it tries to read it. This can
happen at startup or when pulling new models.
This data is mostly just used for showing model information so we
can be tolerant of it not being present - it is not required to
run the models. Besides avoiding crashing, this also gives us the
ability to restructure the config in the future by pulling it
into the main manifest file.
The change b732beba6 makes all files text files and sets lf as eol. This
will automatically change all files to have lf if they are touched by
git (e.g. via git status). This change cannot be stashed and makes it
hard to work with the repo (rebase and checkout don't really work). See
also #6183.
Here, we set the offending files (*.png and *.ico, but that might be
more in the future) to be treated as binary files and not be changed by
git.
If there is an error when opening a manifest file (corrupted, permission denied, etc.)
then the referenced layers will not be included in the list of active
layers. This causes them to be deleted when pruning happens at startup
or a model is pulled.
In such a situation, we should prefer to preserve data in the hopes that
it can be recovered rather than being agressive about deletion.
The file.Truncate call on windows will write the whole file
unless you set the sparse flag, leading to heavy I/O at the
beginning of download. This should improve our
I/O behavior on windows and put less stress on the users disk.
If the system has multiple numa nodes, enable numa support in llama.cpp
If we detect numactl in the path, use that, else use the basic "distribute" mode.
In mult-brand GPU setups, if we couldn't fully load the model we
would fall through the scheduler and mistakenly try to load across
a mix of brands. This makes sure we find the set of GPU(s) that
best fit for the partial load.
This fixes various data races scattered throughout the download/pull
client where the client was accessing the download state concurrently.
This commit is mostly a hot-fix and will be replaced by a new client one
day soon.
Also, remove the unnecessary opts argument from downloadChunk.
This changes the registry client to reuse the original download URL
it gets on the first redirect response for all subsequent requests,
preventing thundering herd issues when hot new LLMs are released.
If we detect an NVIDIA GPU, but nvidia doesn't support the os/arch,
this will report a better error for the user and point them to docs
to self-install the drivers if possible.
Make sure if something goes wrong spawning the process, the user gets
enough info to be able to try to self correct, or at least file a bug
with details so we can fix it. Once the process starts, we immediately
change back to the recommended setting to prevent the blocking dialog.
This ensures if the model fails to load (OOM, unsupported model type,
etc.) the process will exit quickly and we can scan the stdout/stderr
of the subprocess for the reason to report via API.
The OLLAMA_MAX_VRAM env var was a temporary workaround for OOM
scenarios. With Concurrency this was no longer wired up, and the simplistic
value doesn't map to multi-GPU setups. Users can still set `num_gpu`
to limit memory usage to avoid OOM if we get our predictions wrong.
On windows, the exit status winds up being the search term many
users search for and end up piling in on issues that are unrelated.
This refines the reporting so that if we have a more detailed message
we'll suppress the exit status portion of the message.
The v5 hip library returns unsupported GPUs which wont enumerate at
inference time in the runner so this makes sure we align discovery. The
gfx906 cards are no longer supported so we shouldn't compile with that
GPU type as it wont enumerate at runtime.
* reopen pr
* tools
* remove tc from stream for now
* ID and Function
* openai expects arguments to be a string (#5739)
* mutually exclusive content and tool calls
* clean up
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Initial Batch Embedding
* Revert "Initial Batch Embedding"
This reverts commit c22d54895a.
* Initial Draft
* mock up notes
* api/embed draft
* add server function
* check normalization
* clean up
* normalization
* playing around with truncate stuff
* Truncation
* Truncation
* move normalization to go
* Integration Test Template
* Truncation Integration Tests
* Clean up
* use float32
* move normalize
* move normalize test
* refactoring
* integration float32
* input handling and handler testing
* Refactoring of legacy and new
* clear comments
* merge conflicts
* touches
* embedding type 64
* merge conflicts
* fix hanging on single string
* refactoring
* test values
* set context length
* clean up
* testing clean up
* testing clean up
* remove function closure
* Revert "remove function closure"
This reverts commit 55d48c6ed1.
* remove function closure
* remove redundant error check
* clean up
* more clean up
* clean up
* OpenAI v1 models
* Refactor Writers
* Add Test
Co-Authored-By: Attila Kerekes
* Credit Co-Author
Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com>
* Empty List Testing
* Use Namespace for Ownedby
* Update Test
* Add back envconfig
* v1/models docs
* Use ModelName Parser
* Test Names
* Remove Docs
* Clean Up
* Test name
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Add Middleware for Chat and List
* Testing Cleanup
* Test with Fatal
* Add functionality to chat test
* Support image input for OpenAI chat
* Decoding
* Fix message processing logic
* openai vision test
* type errors
* clean up
* redundant check
* merge conflicts
* merge conflicts
* merge conflicts
* flattening and smaller image
* add test
* support python and js SDKs and mandate prefixing
* clean up
---------
Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* llm: avoid loading model if system memory is too small
* update log
* Instrument swap free space
On linux and windows, expose how much swap space is available
so we can take that into consideration when scheduling models
* use `systemSwapFreeMemory` in check
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
This also adjusts our algorithm to favor our bundled ROCm.
I've confirmed VRAM reporting still doesn't work properly so we
can't yet enable concurrency by default.
This adds logic to detect skew between the driver and
management library which can be attributed to OS overhead
and records that so we can adjust subsequent management
library free VRAM updates and avoid OOM scenarios.
This was leading to ambiguity and confusion in ollama.com, and is not
used anywhere in ollama at the moment. Once manifests are addressable by
digest, we can add this back in, and in a way that is more tailored to
the concept of addressing a manifest by digest.
This change fixes the handling of keep_alive so that if client
request omits the setting, we only set this on initial load. Once
the model is loaded, if new requests leave this unset, we'll keep
whatever keep_alive was there.
Users may not realize the siny new model they're trying to load
fits on their disk, but can't load into system+GPU memory. Today
we crash, but with this fix, we'll give them a better error message
before even trying to load it.
When ollama is running a long time, tmp cleaners can remove the
runners. This tightens up a few corner cases on arm macs where
we failed with "server cpu not listed in available servers map[]"
On windows, if the model dir contained unicode characters
clip models would fail to load. This fixes the file name
handling in clip.cpp to support utf16 on windows.
Refine the way we log GPU discovery to improve the non-debug
output, and report more actionable log messages when possible
to help users troubleshoot on their own.
* OpenAI v1 models
* Refactor Writers
* Add Test
Co-Authored-By: Attila Kerekes
* Credit Co-Author
Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com>
* Empty List Testing
* Use Namespace for Ownedby
* Update Test
* Add back envconfig
* v1/models docs
* Use ModelName Parser
* Test Names
* Remove Docs
* Clean Up
* Test name
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Add Middleware for Chat and List
* Completions Endpoint
* Testing Cleanup
* Test with Fatal
* Add functionality to chat test
* Rename function
* float types
* type cleanup
* cleaning
* more cleaning
* Extra test cases
* merge conflicts
* merge conflicts
* merge conflicts
* merge conflicts
* cleaning
* cleaning
---------
Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* OpenAI v1 models
* Refactor Writers
* Add Test
Co-Authored-By: Attila Kerekes
* Credit Co-Author
Co-Authored-By: Attila Kerekes <439392+keriati@users.noreply.github.com>
* Empty List Testing
* Use Namespace for Ownedby
* Update Test
* Add back envconfig
* v1/models docs
* Use ModelName Parser
* Test Names
* Remove Docs
* Clean Up
* Test name
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Add Middleware for Chat and List
* Testing Cleanup
* Test with Fatal
* Add functionality to chat test
* OpenAI: /v1/models/{model} compatibility (#5028)
* Retrieve Model
* OpenAI Delete Model
* Retrieve Middleware
* Remove Delete from Branch
* Update Test
* Middleware Test File
* Function name
* Cleanup
* Test Update
* Test Update
---------
Co-authored-by: Attila Kerekes <439392+keriati@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in a
not-okay amount of syscalls/disk I/O.
The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro
m3.
This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.
Also, this fixes a broken test that was not encoding valid GGUF.
This commit changes the 'ollama run' command to defer fetching model
information until it really needs it. That is, when in interactive mode.
It also removes one such case where the model information is fetch in
duplicate, just before calling generateInteractive and then again, first
thing, in generateInteractive.
This positively impacts the performance of the command:
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.168 total
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.220 total
; time ./before run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./before run llama3 'hi' 0.02s user 0.01s system 2% cpu 1.217 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 4% cpu 0.652 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.01s user 0.01s system 5% cpu 0.498 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
./after run llama3 'hi' 0.01s user 0.01s system 3% cpu 0.479 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total
; time ./after run llama3 'hi'
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
./after run llama3 'hi' 0.02s user 0.01s system 5% cpu 0.507 total
Until ROCm v6.2 ships, we wont be able to get accurate free memory
reporting on windows, which makes automatic concurrency too risky.
Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs have accurate VRAM reporting wired
up now, so we can turn on concurrency by default.
This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these by the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs
so this change also refines the algorithm so that when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.
* API Show Extended
* Initial Draft of Information
Co-Authored-By: Patrick Devine <pdevine@sonic.net>
* Clean Up
* Descriptive arg error messages and other fixes
* Second Draft of Show with Projectors Included
* Remove Chat Template
* Touches
* Prevent wrapping from files
* Verbose functionality
* Docs
* Address Feedback
* Lint
* Resolve Conflicts
* Function Name
* Tests for api/show model info
* Show Test File
* Add Projector Test
* Clean routes
* Projector Check
* Move Show Test
* Touches
* Doc update
---------
Co-authored-by: Patrick Devine <pdevine@sonic.net>
The Digest type in its current form is awkward to work with and presents
challenges with regard to how it serializes via String using the '-'
prefix.
We currently only use this in ollama.com, so we'll move our specific
needs around digest parsing and validation there.
The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports what library instead of always saying "offloading to gpu" when
using CPU.
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
* docs: add missing instruction for powershell build
The powershell script for building Ollama on Windows now requires the `ThreadJob` module. Add this to the instructions and dependency list.
* Update development.md
This implements the release logic we want via gh cli
to support updating releases with rc tags in place and retain
release notes and other community reactions.
While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait
to avoid races that lead to OOMs
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one. This exposes that
tunable behavior.
Still not complete, needs some refinement to our prediction to understand the
discrete GPUs available space so we can see how many layers fit in each one
since we can't split one layer across multiple GPUs we can't treat free space
as one logical block
multiple templates may appear in a model if a model is created from
another model that 1) has an autodetected template and 2) defines a
custom template
* Added instructions to easily install specific versions on faq.md
* Small typo
* Moved instructions on how to install specific version to linux.md
* Update docs/linux.md
* Update docs/linux.md
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* envconfig/config.go: Fix wrong description of OLLAMA_LLM_LIBRARY
Signed-off-by: Lei Jitang <leijitang@outlook.com>
* serve: Add more env to help message of ollama serve
Add more enviroment variables to `ollama serve --help`
to let users know what can be configurated.
Signed-off-by: Lei Jitang <leijitang@outlook.com>
---------
Signed-off-by: Lei Jitang <leijitang@outlook.com>
* integrated runewidth for display management - fixed cursor movement for mutli-width char
* updated input and deletion of multi-byte chars
* fixed line history with some exceptions
* improved insert and add
* fixed issues with moving across lines
* end of line extra space tracking'
* saved changes
* fixed end of line issues with empty spaces
* worked some more
* worked on end of line
* fixed failed test
* fixed minor inserting bug
* fixed movement hotkeys
* adjusted hotkeys
* removed comments
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update readline/buffer.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* deleted comments and duplicate code
* removed duplicate code
* added comments, refactored add function to use addChar
* added helper to retrieve lineSpacing, renamed lineFlags for clarity
* fixed remove()
---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
On some systems, 1 minute isn't sufficient to finish the load after it
hits 100% This creates 2 distinct timers, although they're both set to
the same value for now so we can refine the timeouts further.
* ensure kernel modules are loaded in `install.sh` script and at startup
* indentation
* use `SUDO` variable
* restart if nouveau is detected
* consistent success message for AMD
If the client closes the connection before we finish loading the model
we abort, so lets make the log message clearer why to help users
understand this failure mode
This test needs to be able to adjust the queue size down from
our default setting for a reliable test, so it needs to skip on
remote test execution mode.
* Fixed the API endpoint /api/tags to return {models: []} instead of {models: null} when the model list is empty.
* Update server/routes.go
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* rename `--quantization` to `--quantize`
* backwards
* Update api/types.go
Co-authored-by: Michael Yang <mxyng@pm.me>
---------
Co-authored-by: Michael Yang <mxyng@pm.me>
The GPU drivers take a while to update their free memory reporting, so we need
to wait until the values converge with what we're expecting before proceeding
to start another runner in order to get an accurate picture.
* Add preflight OPTIONS handling and update CORS config
- Implement early return with HTTP 204 (No Content) for OPTIONS requests in allowedHostsMiddleware to optimize preflight handling.
- Extend CORS configuration to explicitly allow 'Authorization' headers and 'OPTIONS' method when OLLAMA_ORIGINS environment variable is set.
* allow auth, content-type, and user-agent headers
* Update routes.go
* api: fill up API documentation
Followup for #2878
Now that the documentation is more complete, mention it in the README.
Updates #2840
* fix typo/lint
* Update README.md
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
- add test for number formatting
- fix bug where 1B and 1M were not stored correctly
- display 2 decimal points for million param sizes
- display 1 decimal point for billion param sizes
* add details on kubernetes deployment and separate the testing process
* Update examples/kubernetes/README.md
thanks for suggesting this change, I agree with you and let's make this project better together !
Co-authored-by: JonZeolla <Zeolla@gmail.com>
---------
Co-authored-by: QIN Mélony <MQN1@dsone.3ds.com>
Co-authored-by: JonZeolla <Zeolla@gmail.com>
* note on naming restrictions
else push would fail with cryptic
retrieving manifest
Error: file does not exist
==> maybe change that in code too
* Update docs/import.md
---------
Co-authored-by: C-4-5-3 <154636388+C-4-5-3@users.noreply.github.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Update api.md
Changed the calculation of tps (token/s) in the documentation
* Update docs/api.md
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly
This moves all the env var reading into one central module
and logs the loaded config once at startup which should
help in troubleshooting user server logs
- return descriptive error messages when unauthorized to create blob or push a model
- display the local public key associated with the request that was denied
* app: restart server on failure
* fix linter
* address comments
* refactor log directory creation to be where logs are written
* check all log dir creation errors
Prior refactoring passes accidentally removed the logic to bypass VRAM
checks for CPU loads. This adds that back, along with test coverage.
This also fixes loaded map access in the unit test to be behind the mutex which was
likely the cause of various flakes in the tests.
description:What happened? What did you expect to happen?
validations:
required:true
- type:textarea
id:logs
attributes:
label:Relevant log output
description:Please copy and paste any relevant log output. See [Troubleshooting Guide](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) for details.
"Regular expression describing AMDGPU_TARGETS not supported on Windows. Override to force building these targets. Default \"^gfx(906|908|90a|1200|1201):xnack[+-]$\"."
Thank you for your interest in contributing to Ollama! Here are a few guidelines to help get you started.
## Set up
See the [development documentation](./docs/development.md) for instructions on how to build and run Ollama locally.
### Ideal issues
* [Bugs](https://github.com/ollama/ollama/issues?q=is%3Aissue+is%3Aopen+label%3Abug): issues where Ollama stops working or where it results in an unexpected error.
* [Performance](https://github.com/ollama/ollama/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance): issues to make Ollama faster at model inference, downloading or uploading.
* [Security](https://github.com/ollama/ollama/blob/main/SECURITY.md): issues that could lead to a security vulnerability. As mentioned in [SECURITY.md](https://github.com/ollama/ollama/blob/main/SECURITY.md), please do not disclose security vulnerabilities publicly.
### Issues that are harder to review
* New features: new features (e.g. API fields, environment variables) add surface area to Ollama and make it harder to maintain in the long run as they cannot be removed without potentially breaking users in the future.
* Refactoring: large code improvements are important, but can be harder or take longer to review and merge.
* Documentation: small updates to fill in or correct missing documentation is helpful, however large documentation additions can be hard to maintain over time.
### Issues that may not be accepted
* Changes that break backwards compatibility in Ollama's API (including the OpenAI-compatible API)
* Changes that add significant friction to the user experience
* Changes that create a large future maintenance burden for maintainers and contributors
## Proposing a (non-trivial) change
> By "non-trivial", we mean a change that is not a bug fix or small
> documentation update. If you are unsure, please ask us on our [Discord
> server](https://discord.gg/ollama).
Before opening a non-trivial Pull Request, please open an issue to discuss the change and
get feedback from the maintainers. This helps us understand the context of the
change and how it fits into Ollama's roadmap and prevents us from duplicating
work or you from spending time on a change that we may not be able to accept.
Tips for proposals:
* Explain the problem you are trying to solve, not what you are trying to do.
* Explain why the change is important.
* Explain how the change will be used.
* Explain how the change will be tested.
Additionally, for bonus points: Provide draft documentation you would expect to
see if the change were accepted.
## Pull requests
**Commit messages**
The title should look like:
<package>: <shortdescription>
The package is the most affected Go package. If the change does not affect Go
code, then use the directory name instead. Changes to a single well-known
file in the root directory may use the file name.
The short description should start with a lowercase letter and be a
continuation of the sentence:
"This changes Ollama to..."
Examples:
llm/backend/mlx: support the llama architecture
CONTRIBUTING: provide clairity on good commit messages, and bad
Bad Examples:
feat: add more emoji
fix: was not using famous web framework
chore: generify code
**Tests**
Please include tests. Strive to test behavior, not implementation.
**New dependencies**
Dependencies should be added sparingly. If you are adding a new dependency,
please explain why it is necessary and what other ways you attempted that
did not work without it.
## Need help?
If you need help with anything, feel free to reach out to us on our [Discord server](https://discord.gg/ollama).
> Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
> [!NOTE]
> You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
## Customize a model
@@ -78,32 +96,32 @@ Ollama supports importing GGUF models in the Modelfile:
2. Create the model in Ollama
```
```shell
ollama create example -f Modelfile
```
3. Run the model
```
```shell
ollama run example
```
### Import from PyTorch or Safetensors
### Import from Safetensors
See the [guide](docs/import.md) on importing models for more information.
### Customize a prompt
Models from the Ollama library can be customized with a prompt. For example, to customize the `llama3` model:
Models from the Ollama library can be customized with a prompt. For example, to customize the `llama3.2` model:
```
ollama pull llama3
```shell
ollama pull llama3.2
```
Create a `Modelfile`:
```
FROM llama3
FROM llama3.2
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
@@ -123,7 +141,7 @@ ollama run mario
Hello! It's your friend Mario.
```
For more examples, see the [examples](examples) directory. For more information on working with a Modelfile, see the [Modelfile](docs/modelfile.md) documentation.
For more information on working with a Modelfile, see the [Modelfile](docs/modelfile.md) documentation.
## CLI Reference
@@ -131,28 +149,28 @@ For more examples, see the [examples](examples) directory. For more information
`ollama create` is used to create a model from a Modelfile.
```
```shell
ollama create mymodel -f ./Modelfile
```
### Pull a model
```
ollama pull llama3
```shell
ollama pull llama3.2
```
> This command can also be used to update a local model. Only the diff will be pulled.
### Remove a model
```
ollama rm llama3
```shell
ollama rm llama3.2
```
### Copy a model
```
ollama cp llama3 my-model
```shell
ollama cp llama3.2 my-model
```
### Multiline input
@@ -169,61 +187,63 @@ I'm a basic program that prints the famous "Hello, world!" message to the consol
### Multimodal models
```
>>> What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.
ollama run llava "What's in this image? /Users/jmorgan/Desktop/smile.png"
```
### Pass in prompt as arguments
> **Output**: The image features a yellow smiley face, which is likely the central focus of the picture.
### Pass the prompt as an argument
```shell
ollama run llama3.2 "Summarize this file: $(cat README.md)"
```
$ ollama run llama3 "Summarize this file: $(cat README.md)"
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
> **Output**: Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
### Show model information
```shell
ollama show llama3.2
```
### List models on your computer
```
```shell
ollama list
```
### List which models are currently loaded
```shell
ollama ps
```
### Stop a model which is currently running
```shell
ollama stop llama3.2
```
### Start Ollama
`ollama serve` is used when you want to start ollama without running the desktop application.
## Building
Install `cmake` and `go`:
```
brew install cmake go
```
Then generate dependencies:
```
go generate ./...
```
Then build the binary:
```
go build .
```
More detailed instructions can be found in the [developer guide](https://github.com/ollama/ollama/blob/main/docs/development.md)
See the [developer guide](https://github.com/ollama/ollama/blob/main/docs/development.md)
### Running local builds
Next, start the server:
```
```shell
./ollama serve
```
Finally, in a separate shell, run a model:
```
./ollama run llama3
```shell
./ollama run llama3.2
```
## REST API
@@ -232,18 +252,18 @@ Ollama has a REST API for running and managing models.
### Generate a response
```
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"model": "llama3.2",
"prompt":"Why is the sky blue?"
}'
```
### Chat with a model
```
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"model": "llama3.2",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
@@ -257,18 +277,21 @@ See the [API documentation](./docs/api.md) for all endpoints.
- [TagSpaces](https://www.tagspaces.org) (A platform for file based apps, [utilizing Ollama](https://docs.tagspaces.org/ai/) for the generation of tags and descriptions)
- [QA-Pilot: Chat with Code Repository](https://github.com/reid41/QA-Pilot)
- [ChatOllama: Open Source Chatbot based on Ollama with Knowledge Bases](https://github.com/sugarforever/chat-ollama)
- [CRAG Ollama Chat: Simple Web Search with Corrective RAG](https://github.com/Nagi-ovo/CRAG-Ollama-Chat)
- [RAGFlow: Open-source Retrieval-Augmented Generation engine based on deep document understanding](https://github.com/infiniflow/ragflow)
- [chat: chat web app for teams](https://github.com/swuecho/chat)
- [IntelliBar](https://intellibar.app/) (AI-powered assistant for macOS)
- [QA-Pilot](https://github.com/reid41/QA-Pilot) (Interactive chat tool that can leverage Ollama models for rapid understanding and navigation of GitHubcode repositories)
- [ChatOllama](https://github.com/sugarforever/chat-ollama) (Open Source Chatbot based on Ollama with Knowledge Bases)
- [CRAG Ollama Chat](https://github.com/Nagi-ovo/CRAG-Ollama-Chat) (Simple Web Search with Corrective RAG)
- [RAGFlow](https://github.com/infiniflow/ragflow) (Open-source Retrieval-Augmented Generation engine based on deep document understanding)
- [chat](https://github.com/swuecho/chat) (chat web app for teams)
- [Lobe Chat](https://github.com/lobehub/lobe-chat) with [Integrating Doc](https://lobehub.com/docs/self-hosting/examples/ollama)
- [Ollama RAG Chatbot: Local Chat with multiples PDFs using Ollama and RAG.](https://github.com/datvodinh/rag-chatbot.git)
- [Ollama RAG Chatbot](https://github.com/datvodinh/rag-chatbot.git) (Local Chat with multiple PDFs using Ollama and RAG)
- [BrainSoup](https://www.nurgo-software.com/products/brainsoup) (Flexible native client with RAG & multi-agent automation)
- [macai](https://github.com/Renset/macai) (macOS client for Ollama, ChatGPT, and other compatible API back-ends)
- [RWKV-Runner](https://github.com/josStorer/RWKV-Runner) (RWKV offline LLM deployment tool, also usable as a client for ChatGPT and Ollama)
- [Ollama Grid Search](https://github.com/dezoito/ollama-grid-search) (app to evaluate and compare models)
- [Olpaka](https://github.com/Otacon/olpaka) (User-friendly Flutter Web App for Ollama)
- [Casibase](https://casibase.org) (An open source AI knowledge base and dialogue system combining the latest RAG, SSO, ollama support and multiple large language models.)
- [OllamaSpring](https://github.com/CrazyNeil/OllamaSpring) (Ollama Client for macOS)
- [LLocal.in](https://github.com/kartikm7/llocal) (Easy to use Electron Desktop Client for Ollama)
- [Shinkai Desktop](https://github.com/dcSpark/shinkai-apps) (Two click install Local AI using Ollama + Files + RAG)
- [AiLama](https://github.com/zeyoyt/ailama) (A Discord User App that allows you to interact with Ollama anywhere in discord )
- [Ollama with Google Mesop](https://github.com/rapidarchitect/ollama_mesop/) (Mesop Chat Client implementation with Ollama)
- [Local Multimodal AI Chat](https://github.com/Leon-Sander/Local-Multimodal-AI-Chat) (Ollama-based LLM Chat with support for multiple features, including PDF RAG, voice chat, image-based interactions, and integration with OpenAI.)
- [ARGO](https://github.com/xark-argo/argo) (Locally download and run Ollama and Huggingface models with RAG on Mac/Windows/Linux)
- [OrionChat](https://github.com/EliasPereirah/OrionChat) - OrionChat is a web interface for chatting with different AI providers
- [G1](https://github.com/bklieger-groq/g1) (Prototype of using prompting strategies to improve the LLM's reasoning through o1-like reasoning chains.)
- [Perfect Memory AI](https://www.perfectmemory.ai/) (Productivity AI assists personalized by what you have seen on your screen, heard and said in the meetings)
- [Hexabot](https://github.com/hexastack/hexabot) (A conversational AI builder)
- [Reddit Rate](https://github.com/rapidarchitect/reddit_analyzer) (Search and Rate Reddit topics with a weighted summation)
- [OpenTalkGpt](https://github.com/adarshM84/OpenTalkGpt) (Chrome Extension to manage open-source models supported by Ollama, create custom models, and chat with models from a user-friendly UI)
- [VT](https://github.com/vinhnx/vt.ai) (A minimal multimodal AI chat app, with dynamic conversation routing. Supports local models via Ollama)
- [Nosia](https://github.com/nosia-ai/nosia) (Easy to install and use RAG platform based on Ollama)
- [Witsy](https://github.com/nbonamy/witsy) (An AI Desktop application available for Mac/Windows/Linux)
- [Abbey](https://github.com/US-Artificial-Intelligence/abbey) (A configurable AI interface server with notebooks, document storage, and YouTube support)
- [Minima](https://github.com/dmayboroda/minima) (RAG with on-premises or fully local workflow)
- [aidful-ollama-model-delete](https://github.com/AidfulAI/aidful-ollama-model-delete) (User interface for simplified model cleanup)
- [Perplexica](https://github.com/ItzCrazyKns/Perplexica) (An AI-powered search engine & an open-source alternative to Perplexity AI)
- [Ollama Chat WebUI for Docker ](https://github.com/oslook/ollama-webui) (Support for local docker deployment, lightweight ollama webui)
- [AI Toolkit for Visual Studio Code](https://aka.ms/ai-tooklit/ollama-docs) (Microsoft-official VSCode extension to chat, test, evaluate models with Ollama support, and use them in your AI applications.)
- [MinimalNextOllamaChat](https://github.com/anilkay/MinimalNextOllamaChat) (Minimal Web UI for Chat and Model Control)
- [Chipper](https://github.com/TilmanGriesel/chipper) AI interface for tinkerers (Ollama, Haystack RAG, Python)
- [ChibiChat](https://github.com/CosmicEventHorizon/ChibiChat) (Kotlin-based Android app to chat with Ollama and Koboldcpp API endpoints)
- [LocalLLM](https://github.com/qusaismael/localllm) (Minimal Web-App to run ollama models on it with a GUI)
- [Ollamazing](https://github.com/buiducnhat/ollamazing) (Web extension to run Ollama models)
- [OpenDeepResearcher-via-searxng](https://github.com/benhaotang/OpenDeepResearcher-via-searxng) (A Deep Research equivent endpoint with Ollama support for running locally)
- [1Panel](https://github.com/1Panel-dev/1Panel/) (Web-based Linux Server Management Tool)
- [AstrBot](https://github.com/Soulter/AstrBot/) (User-friendly LLM-based multi-platform chatbot with a WebUI, supporting RAG, LLM agents, and plugins integration)
- [Reins](https://github.com/ibrahimcetin/reins) (Easily tweak parameters, customize system prompts per chat, and enhance your AI experiments with reasoning model support.)
- [Ellama](https://github.com/zeozeozeo/ellama) (Friendly native app to chat with an Ollama instance)
- [screenpipe](https://github.com/mediar-ai/screenpipe) Build agents powered by your screen history
- [Ollamb](https://github.com/hengkysteen/ollamb) (Simple yet rich in features, cross-platform built with Flutter and designed for Ollama. Try the [web demo](https://hengkysteen.github.io/demo/ollamb/).)
- [Writeopia](https://github.com/Writeopia/Writeopia) (Text editor with integration with Ollama)
- [Ollama Mixture of Experts (MOE) in 50 lines of code](https://github.com/rapidarchitect/ollama_moe)
- [vim-intelligence-bridge](https://github.com/pepo-ec/vim-intelligence-bridge) Simple interaction of "Ollama" with the Vim editor
- [x-cmd ollama](https://x-cmd.com/mod/ollama)
- [bb7](https://github.com/drunkwcodes/bb7)
- [SwollamaCLI](https://github.com/marcusziade/Swollama) bundled with the Swollama Swift package. [Demo](https://github.com/marcusziade/Swollama?tab=readme-ov-file#cli-usage)
- [aichat](https://github.com/sigoden/aichat) All-in-one LLM CLI tool featuring Shell Assistant, Chat-REPL, RAG, AI tools & agents, with access to OpenAI, Claude, Gemini, Ollama, Groq, and more.
- [PowershAI](https://github.com/rrg92/powershai) PowerShell module that brings AI to terminal on Windows, including support for Ollama
- [DeepShell](https://github.com/Abyss-c0re/deepshell) Your self-hosted AI assistant. Interactive Shell, Files and Folders analysis.
- [orbiton](https://github.com/xyproto/orbiton) Configuration-free text editor and IDE with support for tab completion with Ollama.
- [orca-cli](https://github.com/molbal/orca-cli) Ollama Registry CLI Application - Browse, pull and download models from Ollama Registry in your terminal.
- [GGUF-to-Ollama](https://github.com/jonathanhecl/gguf-to-ollama) - Importing GGUF to Ollama made easy (multiplatform)
### Apple Vision Pro
- [SwiftChat](https://github.com/aws-samples/swift-chat) (Cross-platform AI chat app supporting Apple Vision Pro via "Designed for iPad")
- [pgai](https://github.com/timescale/pgai) - PostgreSQL as a vector database (Create and search embeddings from Ollama models using pgvector)
- [Get started guide](https://github.com/timescale/pgai/blob/main/docs/vectorizer-quick-start.md)
- [MindsDB](https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/ollama_handler/README.md) (Connects Ollama models with nearly 200 data platforms and apps)
- [chromem-go](https://github.com/philippgille/chromem-go/blob/v0.5.0/embed_ollama.go) with [example](https://github.com/philippgille/chromem-go/tree/v0.5.0/examples/rag-wikipedia-ollama)
- [Kangaroo](https://github.com/dbkangaroo/kangaroo) (AI-powered SQL client and admin tool for popular databases)
- [LangChain](https://python.langchain.com/docs/integrations/llms/ollama) and [LangChain.js](https://js.langchain.com/docs/modules/model_io/models/llms/integrations/ollama) with [example](https://js.langchain.com/docs/use_cases/question_answering/local_retrieval_qa)
- [LangChain](https://python.langchain.com/docs/integrations/llms/ollama) and [LangChain.js](https://js.langchain.com/docs/integrations/chat/ollama/) with [example](https://js.langchain.com/docs/tutorials/local_rag/)
- [Yacana](https://remembersoftwares.github.io/yacana/) (User-friendly multi-agent framework for brainstorming and executing predetermined flows with built-in tool integration)
- [Spring AI](https://github.com/spring-projects/spring-ai) with [reference](https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html) and [example](https://github.com/tzolov/ollama-tools)
- [LangChainGo](https://github.com/tmc/langchaingo/) with [example](https://github.com/tmc/langchaingo/tree/main/examples/ollama-completion-example)
- [LangChain4j](https://github.com/langchain4j/langchain4j) with [example](https://github.com/langchain4j/langchain4j-examples/tree/main/ollama-examples/src/main/java)
- [LangChainRust](https://github.com/Abraxas-365/langchain-rust) with [example](https://github.com/Abraxas-365/langchain-rust/blob/main/examples/llm_ollama.rs)
- [LangChain for .NET](https://github.com/tryAGI/LangChain) with [example](https://github.com/tryAGI/LangChain/blob/main/examples/LangChain.Samples.OpenAI/Program.cs)
- [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/ollama/) and [LlamaIndexTS](https://ts.llamaindex.ai/modules/llms/available_llms/ollama)
- [LiteLLM](https://github.com/BerriAI/litellm)
- [OllamaFarm for Go](https://github.com/presbrey/ollamafarm)
- [OllamaSharp for .NET](https://github.com/awaescher/OllamaSharp)
- [Ollama for Ruby](https://github.com/gbaptista/ollama-ai)
- [Ollama-rs for Rust](https://github.com/pepperoni21/ollama-rs)
- [Ollama4j for Java](https://github.com/amithkoujalgi/ollama4j)
- [Ollama-hpp for C++](https://github.com/jmont-dev/ollama-hpp)
- [Ollama4j for Java](https://github.com/ollama4j/ollama4j)
- [PromptingTools.jl](https://github.com/svilupp/PromptingTools.jl) with an [example](https://svilupp.github.io/PromptingTools.jl/dev/examples/working_with_ollama)
- [Agents-Flex for Java](https://github.com/agents-flex/agents-flex) with [example](https://github.com/agents-flex/agents-flex/tree/main/agents-flex-llm/agents-flex-llm-ollama/src/test/java/com/agentsflex/llm/ollama)
- [Parakeet](https://github.com/parakeet-nest/parakeet) is a GoLang library, made to simplify the development of small generative AI applications with Ollama.
- [Haverscript](https://github.com/andygill/haverscript) with [examples](https://github.com/andygill/haverscript/tree/main/examples)
- [Ollama for Swift](https://github.com/mattt/ollama-swift)
- [Swollama for Swift](https://github.com/marcusziade/Swollama) with [DocC](https://marcusziade.github.io/Swollama/documentation/swollama/)
- [Ollama App](https://github.com/JHubi1/ollama-app) (Modern and easy-to-use multi-platform client for Ollama)
- [ConfiChat](https://github.com/1runeberg/confichat) (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)
- [Ollama Android Chat](https://github.com/sunshine0523/OllamaServer) (No need for Termux, start the Ollama service with one click on an Android device)
- [Reins](https://github.com/ibrahimcetin/reins) (Easily tweak parameters, customize system prompts per chat, and enhance your AI experiments with reasoning model support.)
- [Plasmoid Ollama Control](https://github.com/imoize/plasmoid-ollamacontrol) (KDE Plasma extension that allows you to quickly manage/control Ollama model)
- [AI Telegram Bot](https://github.com/tusharhero/aitelegrambot) (Telegram bot using Ollama in backend)
- [AI ST Completion](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (Sublime Text 4 AI assistant plugin with Ollama support)
- [ChatGPTBox: All in one browser extension](https://github.com/josStorer/chatGPTBox) with [Integrating Tutorial](https://github.com/josStorer/chatGPTBox/issues/616#issuecomment-1975186467)
- [Discord AI chat/moderation bot](https://github.com/rapmd73/Companion) Chat/moderation bot written in python. Uses Ollama to create personalities.
- [Headless Ollama](https://github.com/nischalj10/headless-ollama) (Scripts to automatically install ollama client & models on any OS for apps that depends on ollama server)
- [Terraform AWS Ollama & Open WebUI](https://github.com/xuyangbocn/terraform-aws-self-host-llm) (A Terraform module to deploy on AWS a ready-to-use Ollama service, together with its front end Open WebUI service.)
- [Local AI Helper](https://github.com/ivostoykov/localAI) (Chrome and Firefox extensions that enable interactions with the active tab and customisable API endpoints. Includes secure storage for user prompts.)
- [vnc-lm](https://github.com/jake83741/vnc-lm) (Discord bot for messaging with LLMs through Ollama and LiteLLM. Seamlessly move between local and flagship models.)
- [LSP-AI](https://github.com/SilasMarvin/lsp-ai) (Open-source language server for AI-powered functionality)
- [QodeAssist](https://github.com/Palm1r/QodeAssist) (AI-powered coding assistant plugin for Qt Creator)
- [LLM Telegram Bot](https://github.com/innightwolfsleep/llm_telegram_bot) (telegram bot, primary for RP. Oobabooga-like buttons, [A1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) API integration e.t.c)
- [mcp-llm](https://github.com/sammcj/mcp-llm) (MCP Server to allow LLMs to call other LLMs)
### Supported backends
- [llama.cpp](https://github.com/ggerganov/llama.cpp) project founded by Georgi Gerganov.
### Observability
- [Opik](https://www.comet.com/docs/opik/cookbook/ollama) is an open-source platform to debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards. Opik supports native intergration to Ollama.
- [Lunary](https://lunary.ai/docs/integrations/ollama) is the leading open-source LLM observability platform. It provides a variety of enterprise-grade features such as real-time analytics, prompt templates management, PII masking, and comprehensive agent tracing.
- [OpenLIT](https://github.com/openlit/openlit) is an OpenTelemetry-native tool for monitoring Ollama Applications & GPUs using traces and metrics.
- [HoneyHive](https://docs.honeyhive.ai/integrations/ollama) is an AI observability and evaluation platform for AI agents. Use HoneyHive to evaluate agent performance, interrogate failures, and monitor quality in production.
- [Langfuse](https://langfuse.com/docs/integrations/ollama) is an open source LLM observability platform that enables teams to collaboratively monitor, evaluate and debug AI applications.
- [MLflow Tracing](https://mlflow.org/docs/latest/llms/tracing/index.html#automatic-tracing) is an open source LLM observability tool with a convenient API to log and visualize traces, making it easy to debug and evaluate GenAI applications.
The Ollama maintainer team takes security seriously and will actively work to resolve security issues.
## Reporting a vulnerability
If you discover a security vulnerability, please do not open a public issue. Instead, please report it by emailing hello@ollama.com. We ask that you give us sufficient time to investigate and address the vulnerability before disclosing it publicly.
Please include the following details in your report:
- A description of the vulnerability
- Steps to reproduce the issue
- Your assessment of the potential impact
- Any possible mitigations
## Security best practices
While the maintainer team does their best to secure Ollama, users are encouraged to implement their own security best practices, such as:
- Regularly updating to the latest version of Ollama
- Securing access to hosted instances of Ollama
- Monitoring systems for unusual activity
## Contact
For any other questions or concerns related to security, please contact us at hello@ollama.com
NSMenuItem*chooseModelDirectoryItem=[[NSMenuItemalloc]initWithTitle:@"Choose model directory..."action:@selector(chooseModelDirectory)keyEquivalent:@""];
ReadyLabel1=%nLet's get you up and running with your own large language models.
SetupAppRunningError=Another Ollama installer is running.%n%nPlease cancel or finish the other installer, then click OK to continue with this install, or Cancel to exit.
;FinishedHeadingLabel=Run your first model
;FinishedLabel=%nRun this command in a PowerShell or cmd terminal.%n%n%n ollama run llama3
;FinishedLabel=%nRun this command in a PowerShell or cmd terminal.%n%n%n ollama run llama3.2
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.