TL;DR
By default, the Inference Node uses as much available memory as possible in order to deliver the fastest inference speed. When the node starts, most of the VRAM (or RAM on macOS) left after loading your model is allocated to caching. Q: What if you plan to use your computer for other tasks in parallel, like gaming, video rendering, graphic design, and other resource-intensive applications? A: Customisation and resource management. You can limit how many resources the node is allowed to use by doing two things:
- Pick a model of a suitable size.
- Set a custom KV Cache configuration that suits the model and, combined with the model size, leaves enough memory free.
KV Cache in Fortytwo App
The easiest way to control cache size.
KV Cache in Fortytwo CLI
A convenient but more time-consuming way to control cache size.
What Is KV Cache?
KV Cache is a reserved memory area used to store the model’s internal key–value tensors during inference. These tensors allow the model to avoid recalculating previous tokens, which significantly speeds up generation and enables longer context windows. Because KV Cache grows as the history/context (prompt) becomes longer, it directly affects:
- Maximum context length — larger KV Cache allows the model to store more tokens.
- Inference speed — cached states reduce repeated computation.
- Total memory consumption — the node must hold both the model and its KV Cache at the same time.
- System stability — if KV Cache grows too large, other applications may experience slowdowns.
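As an illustration of why KV Cache grows with context, its per-token footprint in a typical transformer can be estimated from the model's architecture. This is a generic sketch with hypothetical model parameters, not the specs of any particular Fortytwo model:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV Cache size: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

# Hypothetical 32-layer model with 8 KV heads of dimension 128,
# cached in 16-bit precision over an 8,192-token context:
size_gb = kv_cache_bytes(8192, 32, 8, 128, 2) / 1024**3
print(f"{size_gb:.2f} GB")  # -> 1.00 GB
```

Doubling the cached context doubles this figure, which is why a long-context workload needs a correspondingly larger KV Cache limit.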
How KV Cache Works with the Model
The KV Cache limit can be specified in GB or in tokens. This number represents the maximum KV Cache size the node is allowed to use: the amount of memory the node will try to claim on top of what the model itself occupies. Thus:
- If your system has enough available memory -> the node will use the full value you provided.
- Otherwise -> it will only take what is left, and it might not leave enough space for other processes. This might lead to freezes and lags if you intend to run other programs alongside your node.
Case A
- Available system memory: 20 GB
- Model size: 5 GB
- Custom KV Cache limit: 4 GB
- You get some free resources to run other processes.
- The model is small, and if it is fast as well, you might avoid freezes or lags entirely while your node generates inference.
- If you do not intend to run other programs alongside your node, you still have 11 GB left free. You might want to increase the KV Cache size to make the most of it.
Case B
- Available system memory: 20 GB
- Model size: 15 GB
- Custom KV Cache limit: 10 GB
- Node loads the model first.
- Then it uses up to 5 GB of leftover memory.
- You have no free resources to run other processes.
- Your system resources are fully utilized for quality inference.
- This is still a safe way to run the node and corresponds to `--kv-size-mode auto`.
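Both cases follow the same arithmetic. A minimal sketch of the budgeting rule as described above (illustrative only, not the node's actual implementation):

```python
def node_budget(available_gb, model_gb, kv_limit_gb):
    """The model loads first; KV Cache then claims up to its limit
    from the leftover memory; the remainder stays free for other apps."""
    leftover = available_gb - model_gb
    kv_used = min(kv_limit_gb, max(leftover, 0))
    free = leftover - kv_used
    return kv_used, free

print(node_budget(20, 5, 4))    # Case A -> (4, 11): 11 GB left for other apps
print(node_budget(20, 15, 10))  # Case B -> (5, 0): memory fully utilized
```

In Case B the 10 GB limit is never reached: only 5 GB remain after the model loads, so that is all the cache can take.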
Simple Rules
To ensure smooth node running:
- Determine how much memory you can afford to allocate to your node.
Example: 10 GB
- Define a KV Cache size that is 20%-30% of your memory pool.
Example: 2 GB is 20% of 10 GB
- Select a model that works with 2 GB of KV Cache and is at least 2 GB smaller than your memory limit.
Example: 10 - 2 = 8 GB for the model
- Start your node.
- Start other applications. Your node should not take more resources than allocated to it.
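The rules above reduce to a quick budget calculation. The 20%-30% ratio comes from this guide; the helper itself is only an illustrative sketch:

```python
def plan_budget(pool_gb, kv_fraction=0.2):
    """Split an affordable memory pool into a KV Cache size
    and a remaining budget for the model itself."""
    assert 0.2 <= kv_fraction <= 0.3, "guide recommends 20%-30% for KV Cache"
    kv_gb = pool_gb * kv_fraction
    model_gb = pool_gb - kv_gb
    return kv_gb, model_gb

print(plan_budget(10))  # -> (2.0, 8.0): 2 GB cache, up to 8 GB for the model
```

Anything your model occupies beyond that budget eats into the memory you meant to leave for other applications.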
Remember that we are only talking about memory allocation here. If you pick a heavy and slow model, GPU utilization will spike during inference rounds and may cause freezes and lags if you use something resource-consuming alongside your node.
How To Set Up Custom KV Cache in Fortytwo App and Fortytwo CLI
Options
- Mode: Auto
Recommended by default. The node takes as much memory as it requires or as is available, but leaves some of it free so that you can work comfortably alongside it.
- Mode: Min
Limits the node to 33% of your currently available memory. This is a dynamic value; it changes each time the node restarts.
- Mode: Medium
Limits the node to 66% of your currently available memory. This is a dynamic value; it changes each time the node restarts.
- Mode: Max
Use it when you dedicate your entire system to inference alone. Allows the node to use 100% of all available system memory.
- Custom
Lets you manually define the exact KV Cache size your node is allowed to consume:
- Size in Tokens
The best way to work with models, if you know how token-based sizing works. Study the models that you run and define optimal sizes for the ideal performance balance.
- Size in GB
The best shortcut: it gives you a static, understandable limit in GB. However, it is not ideal for performance balancing; you need to manually adapt it to the models you run if they vary drastically in size or you encounter poor performance.
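The three percentage-based presets can be summarized as fractions of currently available memory, using the 33%/66%/100% figures from the list above. Auto follows the node's own heuristic rather than a fixed fraction, so it is omitted from this sketch:

```python
# Fraction of currently available memory each preset may claim,
# per the mode descriptions in this guide.
MODE_FRACTION = {"min": 0.33, "medium": 0.66, "max": 1.0}

def mode_limit_gb(mode, available_gb):
    """Dynamic limit: recomputed from available memory at each restart."""
    return available_gb * MODE_FRACTION[mode]

print(round(mode_limit_gb("min", 20), 2))     # 6.6 GB on a 20 GB system
print(round(mode_limit_gb("medium", 20), 2))  # 13.2 GB on a 20 GB system
```

Because these limits are recalculated from *available* memory at startup, the same mode can yield a different cap after a restart if other applications are already running.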
Setup
Fortytwo App
Fortytwo CLI
Choose your preferred mode:
Auto | Min | Medium | Max, or define a custom size with the Custom in GB or Custom in Tokens options.