Compare commits

1 commit: v0.1.16...mattw/howt

| Author | SHA1 | Date | |
|---|---|---|---|
|   | 4522109b11 | 
| @@ -124,7 +124,7 @@ PARAMETER <parameter> <parametervalue> | |||||||
| | repeat_last_n  | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)                                                                                                                                           | int        | repeat_last_n 64     | | | repeat_last_n  | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)                                                                                                                                           | int        | repeat_last_n 64     | | ||||||
| | repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)                                                                     | float      | repeat_penalty 1.1   | | | repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)                                                                     | float      | repeat_penalty 1.1   | | ||||||
| | temperature    | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)                                                                                                                                     | float      | temperature 0.7      | | | temperature    | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)                                                                                                                                     | float      | temperature 0.7      | | ||||||
| | seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 | | | seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. | int | seed 42 | | ||||||
| | stop           | Sets the stop sequences to use.                                                                                                                                                                                                                         | string     | stop "AI assistant:" | | | stop           | Sets the stop sequences to use.                                                                                                                                                                                                                         | string     | stop "AI assistant:" | | ||||||
| | tfs_z          | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1)                                               | float      | tfs_z 1              | | | tfs_z          | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1)                                               | float      | tfs_z 1              | | ||||||
| | num_predict    | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)                                                                                                                                   | int        | num_predict 42       | | | num_predict    | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)                                                                                                                                   | int        | num_predict 42       | | ||||||
|   | |||||||
| @@ -1,6 +1,6 @@ | |||||||
| # How to Quantize a Model | # How to Quantize a Model | ||||||
|  |  | ||||||
| Sometimes the model you want to work with is not available at [https://ollama.ai/library](https://ollama.ai/library). If you want to try out that model before we have a chance to quantize it, you can use this process. | Sometimes the model you want to work with is not available at [https://ollama.ai/library](https://ollama.ai/library). | ||||||
|  |  | ||||||
| ## Figure out if we can run the model? | ## Figure out if we can run the model? | ||||||
|  |  | ||||||
| @@ -37,6 +37,20 @@ This will output two files into the directory. First is a f16.bin file that is t | |||||||
|  |  | ||||||
| You can find the repository for the Docker container here: [https://github.com/mxyng/quantize](https://github.com/mxyng/quantize) | You can find the repository for the Docker container here: [https://github.com/mxyng/quantize](https://github.com/mxyng/quantize) | ||||||
|  |  | ||||||
|  | For instance, if you wanted to convert the Mistral 7B model to a Q4 quantized model, then you could go through the following steps: | ||||||
|  |  | ||||||
|  | 1. First verify the model will potentially work. | ||||||
|  | 2. Now clone Mistral 7B to your machine. You can find the command to run when you click the three vertical dots button on the model page, then click **Clone Repository**. | ||||||
|  |    1. For this repo, the command is: | ||||||
|  |  | ||||||
|  |       ```shell | ||||||
|  |       git lfs install | ||||||
|  |       git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 | ||||||
|  |       ``` | ||||||
|  |  | ||||||
|  |    2. Navigate into the new directory and run `docker run --rm -v .:/repo ollama/quantize -q q4_0 /repo` | ||||||
|  |    3. Now you can create a modelfile using the q4_0.bin file that was created. | ||||||
|  |  | ||||||
| ## Convert and Quantize Manually | ## Convert and Quantize Manually | ||||||
|  |  | ||||||
| ### Clone llama.cpp to your machine | ### Clone llama.cpp to your machine | ||||||
| @@ -48,6 +62,7 @@ If we know the model has a chance of working, then we need to convert and quanti | |||||||
|  [`git clone https://github.com/ggerganov/llama.cpp.git`](https://github.com/ggerganov/llama.cpp.git) |  [`git clone https://github.com/ggerganov/llama.cpp.git`](https://github.com/ggerganov/llama.cpp.git) | ||||||
|     1. If you don't have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip |     1. If you don't have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip | ||||||
| 3. Install the Python dependencies: `pip install torch transformers sentencepiece` | 3. Install the Python dependencies: `pip install torch transformers sentencepiece` | ||||||
|  | 4. Run 'make' to build the project and the quantize executable. | ||||||
|  |  | ||||||
| ### Convert the model to GGUF | ### Convert the model to GGUF | ||||||
|  |  | ||||||
|   | |||||||
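
The last step added in this change ("create a modelfile using the q4_0.bin file that was created") is not spelled out in the diff. A minimal sketch of what it might look like follows, assuming the quantized file sits in the current directory as `q4_0.bin`. The model name `mistral-q4` is hypothetical, and the `PARAMETER` lines simply reuse example values from the parameter table in the first hunk.

```shell
#!/bin/sh
# Sketch only: write a Modelfile that points at the quantized weights.
# Assumes q4_0.bin is in the current directory (output of the quantize step).
set -eu

cat > Modelfile <<'EOF'
FROM ./q4_0.bin
PARAMETER temperature 0.7
PARAMETER seed 42
EOF

# Build the model only when ollama is installed and the weights exist.
# "mistral-q4" is a hypothetical model name chosen for illustration.
if command -v ollama >/dev/null 2>&1 && [ -f q4_0.bin ]; then
    ollama create mistral-q4 -f Modelfile
fi
```

If the create step succeeds, `ollama run mistral-q4` should start the model under that name.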