TL;DR
When selecting your model manually, begin with the following:
- Know your available VRAM (Windows, Linux) or available RAM on ARM devices (macOS).
- Research your model. Source repositories usually contain the key information, such as model speed, required context length, and how model variations compare in speed and quality. A Q4_K_M variation is usually the best choice, as it is well balanced in size and quality.
- Choose newer models over older ones.
- KV Cache size:
- either select a model that is 20-30% smaller than your free VRAM/RAM,
- or leave at least 3 GB of VRAM/RAM free for the context (KV Cache). For example, if you have 10 GB of VRAM/RAM, select a model smaller than 7 GB. A short sketch after this list illustrates the math.
- Aim for a context length of 20,000+ tokens.
- Aim for 35-40 tokens per second so your node can join most inference rounds on time.
- Match the expertise the network currently needs most. You can identify it by monitoring how the models you select perform.
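As a rough illustration of the sizing rules above, here is a minimal sketch in Python. The 20-30% headroom, the 3 GB reserve, and the 10 GB example come from this list; the function name and the rest of the code are ours and only approximate.

```python
def max_model_size_gb(free_memory_gb: float) -> float:
    """Rough model-size budget from the TL;DR rules:
    keep the model 20-30% smaller than free VRAM/RAM,
    and never leave less than ~3 GB for the KV Cache."""
    headroom_rule = free_memory_gb * 0.7   # 30% headroom (conservative end)
    reserve_rule = free_memory_gb - 3.0    # keep at least 3 GB for the context
    return max(0.0, min(headroom_rule, reserve_rule))


if __name__ == "__main__":
    for vram in (8, 10, 16, 24):
        print(f"{vram} GB free -> pick a model under ~{max_model_size_gb(vram):.1f} GB")
```

With 10 GB free this yields a 7 GB budget, matching the example above.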
How Models Work
When you load a model into your system’s memory (VRAM for GPUs, RAM for ARM-based systems), both the model weights and the Key-Value Cache (KV Cache) are loaded. The KV Cache is necessary to accelerate generation, and its size directly depends on the size of your context window, that is, the number of tokens a model can process at once before providing an answer. Based on this, we arrive at two key points:
- The complexity and diversity of the queries your node can process depend on how big your context window is. A bigger context window means your node can take on more complex queries.
- Higher TPS (Tokens Per Second), or response generation speed, means your node has a higher chance of providing the network with an answer on time and earning consistently.
Memory is reserved differently depending on your system type (a rough sketch of the calculation follows this list):
- GPU-based systems (primarily Windows, Linux): 90% of idle VRAM.
- ARM-based systems with unified memory (primarily macOS): 80-85% of leftover RAM.
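A minimal sketch of how that reservation translates into a memory budget, assuming the percentages above (the function and example numbers are illustrative, not the application’s actual logic):

```python
def reserved_memory_gb(idle_memory_gb: float, system: str) -> float:
    """Estimate how much memory may be reserved for the model plus KV Cache.
    The percentages follow the figures above; this is an illustration,
    not the application's actual code."""
    if system == "gpu":    # Windows/Linux with a discrete GPU
        return idle_memory_gb * 0.90
    if system == "arm":    # macOS with unified memory
        return idle_memory_gb * 0.80  # 80-85%; the lower bound keeps the estimate safe
    raise ValueError(f"unknown system type: {system}")


print(reserved_memory_gb(12.0, "gpu"))  # ~10.8 GB of 12 GB idle VRAM
print(reserved_memory_gb(32.0, "arm"))  # ~25.6 GB of 32 GB leftover RAM
```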
Performance Balancing
This article explains how Fortytwo applications support KV Cache size management. This lets you limit the node’s resource consumption so it can run alongside other resource-intensive applications with minimal performance impact.
General Recommendations for Selecting a Model
Evaluate your system:
GPUs are built for fast computation, but their VRAM is limited, and only a handful of GPU models ship with substantial VRAM. ARM devices can run large AI models, but their TPS is not as high.
- Modern-day GPU: TPS will generally be fast as long as the two rules below hold, so you may favor a bigger model size over KV Cache size.
- macOS or older GPU: TPS is already limited, so pay particular attention to the same two rules when choosing model size.
In either case, follow these two rules (illustrated in the sketch below):
- The model should always be smaller than the available idle VRAM.
- Always leave at least 2 GB of VRAM free for the KV Cache when launching your node.
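A quick sketch of those two rules, using the 2 GB figure above (the function name and example sizes are ours):

```python
def model_fits(model_gb: float, idle_vram_gb: float) -> bool:
    """Both rules above: the model must fit into idle VRAM
    and still leave at least 2 GB free for the KV Cache."""
    return model_gb < idle_vram_gb and (idle_vram_gb - model_gb) >= 2.0


print(model_fits(7.0, 8.0))  # False: only 1 GB would be left for the KV Cache
print(model_fits(5.5, 8.0))  # True: 2.5 GB left for the KV Cache
```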
Each model has different capabilities in speed and quality:
Consider a model’s innate quality and speed; your KV Cache size choice should match your model choice. Explore the model’s repository page for information on its speed and quality.
- Newer models: tend to be better in quality at the cost of slower speed compared to older models. For example, newer models like QN3 perform consistently better.
- Older models: newer doesn’t always mean better. A new model can perform well in benchmarks yet end up lacking in real-world applications.
Try to find a balance based on network requirements:
The queries in the network that your model will respond to differ in complexity and required expertise. Some questions are easy and short; others, such as dataset-quality questions, are long and complex.
- When expertise matters, like coding or medicine, select a model that can provide the right expertise.
- When output speed and request size matter, follow these numbers (a short check of them appears after the list):
Common requests
- Context Length: 3,000-4,000 tokens
- TPS: 15-30
High-quality requests
- Context Length: ~25,000 tokens
- TPS: 25-30
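To make those targets concrete, here is a small sketch that checks a node configuration against the two profiles above. The thresholds come from the lists; the dataclass and names are ours.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    context_tokens: int  # maximum context length the node runs with
    tps: float           # measured tokens per second


def serves_common_requests(cfg: NodeConfig) -> bool:
    # Common requests: roughly 3,000-4,000 tokens of context at 15-30 TPS.
    return cfg.context_tokens >= 4_000 and cfg.tps >= 15


def serves_high_quality_requests(cfg: NodeConfig) -> bool:
    # High-quality requests: ~25,000 tokens of context at 25-30 TPS.
    return cfg.context_tokens >= 25_000 and cfg.tps >= 25


cfg = NodeConfig(context_tokens=20_000, tps=35)
print(serves_common_requests(cfg), serves_high_quality_requests(cfg))  # True False
```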
Problem Case Studies
Let’s explore examples of situations to be avoided.
Example 1. Low Context Length: Powerful GPU, Big AI Model
Setup:
- System: GPU-based with Nvidia RTX 4070, 8 GB VRAM
- AI Model: 7 GB
- Output: an extremely small context window, with only about 500-700 MB of VRAM left for it:
8 GB VRAM minus the 7 GB taken up by the model leaves roughly 500-700 MB in practice, which fits roughly 1,500 tokens (see the sketch after this example).
- The model is fast and high-quality when responding to very small questions.
- The model cannot take on lengthy and complex questions and will reject them.
- If complex questions dominate the network, this node will rarely participate in inference rounds.
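A rough way to reproduce the numbers in this example: assuming the KV Cache costs on the order of 0.4-0.5 MB per token for a model of this size (an assumed figure that varies with architecture, attention layout, and cache quantization), the leftover VRAM translates into a context budget like this:

```python
def max_context_tokens(free_vram_mb: float, kv_mb_per_token: float = 0.45) -> int:
    """Estimate how many tokens fit into the leftover VRAM.
    kv_mb_per_token is an assumed figure; real values depend on the model's
    architecture and KV Cache quantization."""
    return int(free_vram_mb / kv_mb_per_token)


# Example 1: about 700 MB left after a 7 GB model on an 8 GB card.
print(max_context_tokens(700))  # roughly 1,500 tokens
```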
Example 2. Low Generation Speed: Medium GPU, Smaller Model
Setup:
- System: GPU-based with Nvidia GTX 1660, 8 GB VRAM
- AI Model: 4 GB
- Output: a big context window, since a lot of VRAM is left:
8 GB VRAM minus the 4 GB taken up by the model leaves roughly 4 GB, enough for about 20,000+ tokens. Yet the TPS (response generation speed) is low because the GPU is old (a timing sketch follows this example).
- This node can take lengthy questions and is likely to provide quality responses that might end up being the best in a given round.
- It will generate its responses slowly. This node will tend to lose rounds if other nodes provide their responses faster.
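To see why low TPS hurts even with a big context window, here is a sketch comparing response times. The response length and the two TPS figures are made-up numbers for illustration only.

```python
def response_time_seconds(response_tokens: int, tps: float) -> float:
    """Time to generate a response of a given length at a given speed."""
    return response_tokens / tps


slow_node = response_time_seconds(800, 8)    # assumed ~8 TPS on an aging GPU
fast_node = response_time_seconds(800, 40)   # a node hitting the 35-40 TPS target

print(f"slow node: {slow_node:.0f}s, fast node: {fast_node:.0f}s")
# slow node: 100s, fast node: 20s; the slow node often misses the round
```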
Example 3. Fast but Ignorant: A Tiny AI Model on a Powerful Mac
Setup:
- System: macOS with Unified Memory, 64 GB RAM
- AI Model: 1.5 GB
- Output: 180–200 TPS, context length is up to 32,000 tokens.
- Great response generation speed, huge context window, can provide responses to questions massive in size.
- The model itself is far from smart and tends to generate low-quality responses. It will be challenging to win rounds with it, as other nodes will provide better responses.
Conclusion
To keep your node successfully responding to questions and consistently winning inference rounds, you need a model that is:
- As new as possible. We strive to offer optimal model choices in the Featured models list in our applications; however, you are not limited in your choice.
- Suitable for your hardware and capable of maintaining a balance between
- a context length of 20,000 tokens,
- a generation speed of 35-40 tokens per second,
- response quality. Usually, a bigger model means higher quality, but bigger doesn’t necessarily mean better all around: a massive general-knowledge model will usually lose to a smaller specialized model.
- Every model has its pros and cons. Select your model based on the current network context: if the network is generating a specific dataset, it is probably best to select a model that excels in that dataset’s area of knowledge. The short sketch below pulls these thresholds together.
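As a closing illustration, the balance described above can be summarized as a simple checklist. The thresholds come from this article; the function and example numbers are ours.

```python
def node_is_well_balanced(context_tokens: int, tps: float,
                          model_gb: float, free_memory_gb: float) -> bool:
    """Checklist from this article: roughly 20,000 tokens of context,
    35-40 TPS, and a model that leaves room for the KV Cache."""
    return (
        context_tokens >= 20_000
        and tps >= 35
        and model_gb <= free_memory_gb - 3.0  # keep at least ~3 GB for the context
    )


print(node_is_well_balanced(20_000, 38, 6.5, 10))  # True
print(node_is_well_balanced(32_000, 12, 4.0, 8))   # False: generation is too slow
```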