How to Select the Right Model

Choosing a model is a balancing act: simply picking the biggest, 'smartest', or fastest model is not enough. You need to understand how AI models and the network work in order to make a choice that is optimal for your hardware.

TL;DR

  • You want your node to select a model for you automatically.
  • You regularly use your system for something other than running a node; the node will try to adapt to your system's resource consumption.
Want to know more? Keep reading.

How Models Work

When you load a model into your system’s memory (VRAM for GPUs, RAM for ARM-based systems), both the model weights and the Key-Value Cache (KV Cache) are loaded. The KV Cache is needed to accelerate generation, and it directly depends on the size of your context window, that is, the number of tokens a model can process at once before producing an answer. Based on this, we arrive at two key points:
  1. The complexity and diversity of the queries your node can process depend on how big your context window is. A bigger context window means your node can take on more complex queries.
  2. A higher TPS (Tokens Per Second), or response generation speed, means your node has a better chance of providing the network with an answer on time and earning consistently.
The main goal is to find a balance between these two parameters: providing an answer to a sufficiently complex query at sufficient speed. Our applications use an adaptive KV Cache size, so the node can adapt to your hardware. When you launch the node, it analyses your available resources and reserves the following:
  • GPU-based systems (primarily Windows, Linux) — reserves 90% of idle VRAM.
  • ARM-based systems with unified memory (primarily macOS) — reserves 80 to 85% of leftover RAM.
This is currently the only way the KV Cache size is controlled. It allows for the most efficient utilization of your processing power. Now we need to select a model of a suitable size and quality.
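To make this concrete, here is a rough sketch of how the reserved memory translates into a usable context window. The per-token KV Cache cost follows the standard transformer formula (2 tensors x layers x KV heads x head dimension x bytes per value); the model figures in the example are placeholders, not recommendations, so substitute the values from your model's config.

```python
def max_context_tokens(free_mem_gb, reserve_fraction, model_weights_gb,
                       num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate how many tokens of KV Cache fit into the reserved memory.

    Per-token KV Cache size = 2 (K and V) * layers * KV heads * head dim
                              * bytes per value (2 for fp16).
    """
    reserved_bytes = free_mem_gb * reserve_fraction * 1024**3
    weights_bytes = model_weights_gb * 1024**3
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    leftover_bytes = reserved_bytes - weights_bytes
    return max(0, int(leftover_bytes // kv_bytes_per_token))

# Example (illustrative numbers): 24 GB of free VRAM, 90% reserved,
# a ~15 GB model with 40 layers, 8 KV heads and a head dim of 128 in fp16.
print(max_context_tokens(24, 0.90, 15, 40, 8, 128))  # roughly 43,000 tokens
```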

General Recommendations for Selecting a Model

1

Evaluate your system: GPUs are made for quick calculations, but their VRAM is limited; only a handful of cards have substantial VRAM. While ARM devices can run large AI models, their TPS is not as high.
  1. TPS will generally be high as long as (2) and (3) hold true. In that case, you may favor model size over KV Cache size.
  2. The model should always be smaller than the available idle VRAM.
  3. Always leave at least 2 GB of VRAM free for the KV Cache when launching your node (a minimal check of these rules is sketched below).
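A minimal sketch of these rules, assuming you already know the model's size in memory and your idle VRAM (for example from nvidia-smi):

```python
KV_CACHE_HEADROOM_GB = 2.0  # rule (3): always keep at least 2 GB for the KV Cache

def model_fits(model_size_gb: float, free_vram_gb: float) -> bool:
    """Rules (2) and (3): the model must fit in idle VRAM with 2 GB to spare."""
    return model_size_gb + KV_CACHE_HEADROOM_GB <= free_vram_gb

# Example: a 13 GB model on a card with ~15 GB of VRAM currently free.
print(model_fits(13.0, 15.0))  # True  -> exactly 2 GB left for the KV Cache
print(model_fits(14.0, 15.0))  # False -> not enough headroom
```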
2

Each model has different capabilities in speed and quality: Consider a model's innate quality and speed; your KV Cache size choice should match your model choice. Explore the model's repo page for information on its speed and quality.
  1. Newer models tend to be better in quality, at the cost of slower speed, compared to older models. For example, newer models like QN3 perform consistently better.
  2. That said, newer doesn't always mean better: a model can perform well in benchmarks but end up lacking in real-world applications.
3

Try to find a balance based on network requirements: The queries your model will respond to on the network differ in complexity and in the required level of expertise. Some questions are short and easy; others, like dataset-quality questions, can be long and complex.
  • When expertise matters, for example in coding or medicine, select a model that can provide the right expertise.
  • When output speed and request size matter, follow these numbers (a quick self-check based on them is sketched after the list):

    Common requests

    • Context Length: 3,000-4,000 tokens
    • TPS: 15-30

    High quality requests

    • Context Length: ~25,000 tokens
    • TPS: 25-30
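As an illustration, these numbers can be turned into a quick self-check: given the context window your node can sustain and its typical TPS, which request types can it comfortably serve? The thresholds below come from this guide; the helper function itself is hypothetical.

```python
def serviceable_request_types(context_tokens: int, tps: float) -> list[str]:
    """Return the request types this node can comfortably handle."""
    types = []
    if context_tokens >= 4_000 and tps >= 15:    # common requests
        types.append("common")
    if context_tokens >= 25_000 and tps >= 25:   # high quality requests
        types.append("high quality")
    return types

print(serviceable_request_types(8_000, 20))    # ['common']
print(serviceable_request_types(30_000, 28))   # ['common', 'high quality']
```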

Problem Case Studies

Let’s explore examples of situations to be avoided.

Conclusion

To keep your node successfully responding to questions and consistently winning in inference rounds, you need a model that is:
  • as new as possible. We strive to offer optimal model choices in the Featured models list in our applications, but you are not limited to that list.
  • suitable for your hardware and capable of maintaining a balance between
    • a context length of 20,000 tokens,
    • a generation speed of 35-40 tokens per second (a quick way to measure this is sketched after this list),
    • response quality. A bigger model usually means higher quality, but bigger doesn’t necessarily mean better all around: a massive general-knowledge model will usually lose to a smaller specialized model.
  • suited to the current network context. Every model has its pros and cons; if the network is generating a specific dataset, it is probably best to select a model that excels in that dataset’s area of knowledge.
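If you want to check your node against the generation-speed target above, here is a rough measurement sketch. It assumes your local inference server exposes an OpenAI-compatible completions endpoint; adjust the URL, model name, and prompt to your actual setup.

```python
import time
import requests  # pip install requests

def measure_tps(url: str = "http://localhost:8000/v1/completions",
                model: str = "your-model-name", max_tokens: int = 256) -> float:
    """Rough TPS estimate: generated tokens divided by total request time
    (includes prompt processing, so true generation speed is slightly higher)."""
    payload = {
        "model": model,
        "prompt": "Explain the difference between RAM and VRAM.",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    start = time.time()
    response = requests.post(url, json=payload, timeout=300)
    elapsed = time.time() - start
    completion_tokens = response.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    print(f"Measured ~{measure_tps():.1f} tokens per second (target: 35-40)")
```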
Following these guidelines will help you select models that allow your node to participate in a larger number of rounds and consistently secure winning places by providing the best responses to requests from the network.