TL;DR
By default, the Inference Node uses as much available memory as possible in order to deliver the fastest inference speed. When the node starts, most of the VRAM (or RAM on macOS) left after loading your model is allocated to caching. Q: What if you plan to use your computer for other tasks in parallel, like gaming, video rendering, graphic design, and other resource-intensive applications? A: Customisation and resource management. You can limit how many resources the node is allowed to use by doing two things:
- Pick a model of a suitable size.
- Set a custom KV Cache configuration that suits the model and, combined with the model size, leaves enough memory free.
KV Cache in Fortytwo App
The easiest way to control cache size.
KV Cache in Fortytwo CLI
A convenient but more time-consuming way to control cache size.
What Is KV Cache?
KV Cache is a reserved memory area used to store the model’s internal key–value tensors during inference. These tensors allow the model to avoid recalculating previous tokens, which significantly speeds up generation and enables longer context windows. Because KV Cache grows as the history/context (prompt) becomes longer, it directly affects:
- Maximum context length — larger KV Cache allows the model to store more tokens.
- Inference speed — cached states reduce repeated computation.
- Total memory consumption — the node must hold both the model and its KV Cache at the same time.
- System stability — if KV Cache grows too large, other applications may experience slowdowns.
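As an illustration of why KV Cache grows with context, its per-token footprint in a typical transformer can be estimated from the model's architecture. This is a generic sketch with hypothetical model parameters, not the specs of any particular Fortytwo model:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV Cache size: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

# Hypothetical 32-layer model with 8 KV heads of dimension 128,
# cached in 16-bit precision over an 8,192-token context:
size_gb = kv_cache_bytes(8192, 32, 8, 128, 2) / 1024**3
print(f"{size_gb:.2f} GB")  # -> 1.00 GB
```

Doubling the cached context doubles this figure, which is why a long-context workload needs a correspondingly larger KV Cache limit.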
How KV Cache Works with the Model
The KV Cache limit can be specified in GB or in tokens. This number represents the maximum KV Cache size the node is allowed to use: the amount of memory the node will try to claim on top of what the model itself occupies. Thus:
- If your system has enough available memory -> the node will use the full value you provided.
- Otherwise -> it will only take what is left, and it might not leave enough space for other processes. This might lead to freezes and lags if you intend to run other programs alongside your node.
Case A
- Available system memory: 20 GB
- Model size: 5 GB
- Custom KV Cache limit: 4 GB
- You get some free resources to run other processes.
- The model is small, and if it is fast as well, you might avoid freezes or lags entirely while your node generates inference.
- If you do not intend to run other programs alongside your node, you still have 11 GB left free. You might want to increase the KV Cache size to make the most of it.
Case B
- Available system memory: 20 GB
- Model size: 15 GB
- Custom KV Cache limit: 10 GB
- Node loads the model first.
- Then it uses up to 5 GB of leftover memory.
- You have no free resources to run other processes.
- Your system resources are fully utilized for quality inference.
- This is still a safe way to run the node and corresponds to `--kv-size-mode auto`.
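Both cases follow the same arithmetic. A minimal sketch of the budgeting rule as described above (illustrative only, not the node's actual implementation):

```python
def node_budget(available_gb, model_gb, kv_limit_gb):
    """The model loads first; KV Cache then claims up to its limit
    from the leftover memory; the remainder stays free for other apps."""
    leftover = available_gb - model_gb
    kv_used = min(kv_limit_gb, max(leftover, 0))
    free = leftover - kv_used
    return kv_used, free

print(node_budget(20, 5, 4))    # Case A -> (4, 11): 11 GB left for other apps
print(node_budget(20, 15, 10))  # Case B -> (5, 0): memory fully utilized
```

In Case B the 10 GB limit is never reached: only 5 GB remain after the model loads, so that is all the cache can take.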
Simple Rules
To ensure smooth node running:
- Determine how much memory you can afford to allocate to your node.
Example: 10 GB
- Define a KV Cache size that is 20%-30% of your memory pool.
Example: 2 GB is 20% of 10 GB
- Select a model that works with 2 GB of KV Cache and is at least 2 GB smaller than your memory limit.
Example: 10 - 2 = 8 GB for the model
- Start your node.
- Start other applications. Your node should not take more resources than allocated to it.
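The rules above reduce to a quick budget calculation. The 20%-30% ratio comes from this guide; the helper itself is only an illustrative sketch:

```python
def plan_budget(pool_gb, kv_fraction=0.2):
    """Split an affordable memory pool into a KV Cache size
    and a remaining budget for the model itself."""
    assert 0.2 <= kv_fraction <= 0.3, "guide recommends 20%-30% for KV Cache"
    kv_gb = pool_gb * kv_fraction
    model_gb = pool_gb - kv_gb
    return kv_gb, model_gb

print(plan_budget(10))  # -> (2.0, 8.0): 2 GB cache, up to 8 GB for the model
```

Anything your model occupies beyond that budget eats into the memory you meant to leave for other applications.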
Remember that we are only talking about memory allocation here. If you pick a heavy and slow model, GPU utilization will spike during inference rounds and may cause freezes and lags if you use something resource-consuming alongside your node.
How To Set Up Custom KV Cache in Fortytwo App and Fortytwo CLI
Options
- Mode: Auto
Recommended by default. The node takes as much memory as it requires or as is available, but leaves some of it free so that you can work comfortably alongside it.
- Mode: Min
Limits the node to 33% of your currently available memory. This is a dynamic value; it changes each time the node restarts.
- Mode: Medium
Limits the node to 66% of your currently available memory. This is a dynamic value; it changes each time the node restarts.
- Mode: Max
Use it when you dedicate your entire system to inference alone. Allows the node to use 100% of all available system memory.
- Custom
Lets you manually define the exact KV Cache size your node is allowed to consume:
- Size in Tokens
The best way to work with models, if you know how token-based sizing works. Study the models that you run and define optimal sizes for the ideal performance balance.
- Size in GB
The best shortcut: it gives you a static, understandable limit in GB. However, it is not ideal for performance balancing; you need to manually adapt it to the models you run if they vary drastically in size or you encounter poor performance.
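The three percentage-based presets can be summarized as fractions of currently available memory, using the 33%/66%/100% figures from the list above. Auto follows the node's own heuristic rather than a fixed fraction, so it is omitted from this sketch:

```python
# Fraction of currently available memory each preset may claim,
# per the mode descriptions in this guide.
MODE_FRACTION = {"min": 0.33, "medium": 0.66, "max": 1.0}

def mode_limit_gb(mode, available_gb):
    """Dynamic limit: recomputed from available memory at each restart."""
    return available_gb * MODE_FRACTION[mode]

print(round(mode_limit_gb("min", 20), 2))     # 6.6 GB on a 20 GB system
print(round(mode_limit_gb("medium", 20), 2))  # 13.2 GB on a 20 GB system
```

Because these limits are recalculated from *available* memory at startup, the same mode can yield a different cap after a restart if other applications are already running.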
Setup
Fortytwo App
Fortytwo CLI
Choose your preferred mode:
Auto | Min | Medium | Max, or define a custom size with the Custom in GB or Custom in Tokens options.