How to Select the Right Model
Choosing a model is a balancing act. Simply selecting the biggest, 'smartest', or fastest model is not enough: you need to understand how AI models and the network work in order to make a choice that is optimal for your hardware.

TL;DR
- You generally want to let your node select a model for you automatically.
- If you regularly use your system for something other than node running, the node will try to adjust to your system's resource consumption.
How Models Work
When you load a model into your system's memory (VRAM for GPUs, RAM for ARM-based systems), two things are loaded: the model weights and the Key-Value Cache (KV Cache). The KV Cache is needed to accelerate generation, and its size depends directly on the size of your context window, that is, the number of tokens a model can process at once when producing an answer. Based on this, we arrive at two key points:
- The complexity and diversity of the queries your node can process depend on how big your context window is. A bigger context window means your node can take on more complex queries.
- Higher TPS (Tokens Per Second), or response generation speed, means your node has a better chance of providing the network with an answer on time and earning consistently.
How much memory the node reserves depends on your system (a rough budget sketch follows this list):
- GPU-based systems (primarily Windows and Linux): the node reserves 90% of idle VRAM.
- ARM-based systems with unified memory (primarily macOS): the node reserves 80-85% of leftover RAM.
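To make these numbers concrete, here is a minimal sketch (in Python) of how the reserved memory could split between model weights and the KV Cache. The function names (`memory_budget_gb`, `kv_cache_budget_gb`) and the use of the 80% lower bound for unified memory are illustrative assumptions, not the node's actual implementation.

```python
# Illustrative sketch only: mirrors the reservation percentages quoted above,
# not the node's actual memory-management code.

def memory_budget_gb(idle_memory_gb: float, system: str) -> float:
    """Memory the node would set aside for model weights plus KV Cache."""
    if system == "gpu":   # discrete GPU on Windows/Linux: 90% of idle VRAM
        return idle_memory_gb * 0.90
    if system == "arm":   # unified memory on macOS: 80-85% of leftover RAM (lower bound used)
        return idle_memory_gb * 0.80
    raise ValueError(f"unknown system type: {system!r}")

def kv_cache_budget_gb(idle_memory_gb: float, system: str, model_size_gb: float) -> float:
    """Whatever remains after the model weights is available for the KV Cache."""
    return memory_budget_gb(idle_memory_gb, system) - model_size_gb

# Example: 8 GB of idle VRAM with a 5 GB model leaves about 2.2 GB for the KV Cache.
print(round(kv_cache_budget_gb(8.0, "gpu", 5.0), 2))  # 2.2
```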
General Recommendations for Selecting a Model
1. Evaluate your system

GPUs are built for fast computation, but their VRAM is limited, and only a handful of GPU models offer substantial VRAM. ARM devices, on the other hand, can run large AI models, but their TPS is not as high.
- TPS will generally be high if the two rules below hold; in that case, you may favor a bigger model over a larger KV Cache (a quick check of these rules is sketched after this list).
- The model should always be smaller than the available idle VRAM.
- Always leave at least 2 GB of VRAM free for the KV Cache when launching your node.
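A minimal sketch of checking these two rules might look like this; the helper name `model_fits` is hypothetical, and the 2 GB reserve is simply the figure from the list above.

```python
# Hypothetical helper: checks the two GPU rules listed above.

def model_fits(idle_vram_gb: float, model_size_gb: float, kv_reserve_gb: float = 2.0) -> bool:
    """True if the model is smaller than idle VRAM and still leaves
    at least `kv_reserve_gb` of VRAM free for the KV Cache."""
    return (model_size_gb < idle_vram_gb
            and idle_vram_gb - model_size_gb >= kv_reserve_gb)

print(model_fits(8.0, 7.0))  # False: only ~1 GB would be left for the KV Cache
print(model_fits(8.0, 5.5))  # True: 2.5 GB left for the KV Cache
```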
2. Each model has different capabilities in speed and quality

Consider a model's innate quality and speed: your choice of KV Cache size should match your choice of model. Explore the model's repository page for information on its speed and quality.
- Newer models tend to be better in quality, at the cost of slower speed, compared to older models. For example, newer models like QN3 perform consistently better.
- Newer doesn't always mean better, though: a model can perform well in benchmarks but end up lacking in real-world applications.
3. Try to find a balance based on network requirements

The queries on the network that your model will try to respond to differ in complexity and required expertise. Some questions are easy and short; others, like dataset-quality questions, are long and complex.
- When expertise matters, like coding or medicine, select a model that can provide the right expertise.
- When output speed and request size matter, aim for these numbers (a quick check is sketched after these figures):
Common requests:
- Context Length: 3,000-4,000 tokens
- TPS: 15-30

High quality requests:
- Context Length: ~25,000 tokens
- TPS: 25-30
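As a rough sanity check, you can compare your node's measured context length and TPS against these targets. The thresholds below simply reuse the numbers above as lower bounds; the network's actual matching criteria may differ, and `supported_request_types` is a hypothetical helper.

```python
# Rough self-check against the target numbers above; not the network's actual criteria.

def supported_request_types(context_tokens: int, tps: float) -> list[str]:
    """Which of the request types listed above this configuration can comfortably serve."""
    supported = []
    if context_tokens >= 4_000 and tps >= 15:
        supported.append("common requests")
    if context_tokens >= 25_000 and tps >= 25:
        supported.append("high quality requests")
    return supported

print(supported_request_types(4_000, 20))    # ['common requests']
print(supported_request_types(30_000, 28))   # ['common requests', 'high quality requests']
```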
Problem Case Studies
Let's explore examples of situations to avoid.

Example 1. Low Context Length: Powerful GPU, Big AI Model
Setup:
- System: GPU-based with Nvidia RTX 4070, 8 GB VRAM
- AI Model: 7 GB
- Output: an extremely small context window, with only about 500–700 MB of VRAM left for it (8 GB of VRAM minus the 7 GB taken up by the model), which fits roughly 1,500 tokens. The arithmetic is sketched after this example.
- The model is fast and high-quality when responding to very small questions.
- The model cannot take on lengthy and complex questions and will reject them.
- If complex questions dominate the network, this node will rarely participate in inference rounds.
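Here is the arithmetic referenced in the setup above. The runtime overhead and the per-token KV Cache cost are illustrative assumptions chosen to reproduce the figures quoted in the example; real values depend on the model's architecture and quantization.

```python
# Worked arithmetic for Example 1. Overhead and per-token cost are illustrative
# assumptions, picked to match the ~500-700 MB and ~1,500-token figures above.

idle_vram_gb = 8.0          # VRAM in the example setup
model_size_gb = 7.0         # model weights
runtime_overhead_mb = 400   # assumed runtime/driver overhead
kv_mb_per_token = 0.4       # assumed KV Cache cost per token of context

kv_cache_mb = (idle_vram_gb - model_size_gb) * 1024 - runtime_overhead_mb
approx_context_tokens = int(kv_cache_mb / kv_mb_per_token)

print(f"~{kv_cache_mb:.0f} MB left for the KV Cache, roughly {approx_context_tokens} tokens of context")
# ~624 MB left for the KV Cache, roughly 1560 tokens of context
```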
Example 2. Low Generation Speed: Medium GPU, Smaller Model
Setup:
- System: GPU-based with Nvidia GTX 1660, 8 GB VRAM
- AI Model: 4 GB
- Output: a big context window, since a lot of VRAM is left over (8 GB of VRAM minus the 4 GB taken up by the model), enough for roughly 20,000+ tokens. Yet TPS (response generation speed) is low because of the GPU's age.
- This node can take lengthy questions and is likely to provide quality responses that might end up being the best in a given round.
- It will generate its responses slowly, so it will tend to lose rounds whenever other nodes respond faster (a rough latency comparison is sketched below).
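To see why slow generation loses rounds, compare how long two nodes would take to produce the same answer. The TPS values and the 800-token answer length below are purely illustrative.

```python
# Illustrative latency comparison; ignores prompt-processing time, which adds further delay.

def generation_time_s(output_tokens: int, tps: float) -> float:
    """Seconds needed to generate `output_tokens` at `tps` tokens per second."""
    return output_tokens / tps

answer_tokens = 800
print(f"Old GPU at 8 TPS:    {generation_time_s(answer_tokens, 8):.0f} s")   # 100 s
print(f"Newer GPU at 30 TPS: {generation_time_s(answer_tokens, 30):.0f} s")  # ~27 s
```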
Example 3. Fast but Ignorant: A Tiny AI Model on a Powerful Mac
Setup:
- System: macOS with Unified Memory, 64 GB RAM
- AI Model: 1.5 GB
- Output: 180–200 TPS, with a context length of up to 32,000 tokens.
- Great response generation speed and a huge context window; it can respond to questions of massive size.
- The model itself is far from smart and tends to generate low-quality responses. It will be challenging to win rounds with it, as other nodes will provide better responses.
Conclusion
To keep your node successfully responding to questions and consistently winning inference rounds, you need a model that is:
- as new as possible. We strive to offer optimal model choices in the Featured models list in our applications. However, you are not limited to that list.
- suitable for your hardware and capable of maintaining a balance between:
  - a context length of around 20,000 tokens,
  - a generation speed of 35-40 tokens per second, and
  - response quality. Usually, a bigger model means higher quality, but bigger doesn't necessarily mean better all around: a massive general-knowledge model will usually lose to a smaller specialized model.
- Every model has its pros and cons. Select your model based on the current network context: if the network is generating a specific dataset, it is probably best to select a model that excels in that dataset's area of knowledge. A quick check against the targets above is sketched below.
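As mentioned above, here is a quick, illustrative check of a candidate setup against these rough targets. The thresholds are only the numbers quoted in this conclusion, not hard network requirements, and `meets_recommended_balance` is a hypothetical helper.

```python
# Quick self-check against the rough targets in this conclusion (illustrative only).

def meets_recommended_balance(context_tokens: int, tps: float) -> bool:
    """True if the setup reaches roughly 20,000 tokens of context and at least ~35 TPS."""
    return context_tokens >= 20_000 and tps >= 35

print(meets_recommended_balance(25_000, 40))  # True
print(meets_recommended_balance(25_000, 12))  # False: enough context, but too slow
```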