
How to Select the Right Model

Choosing a model is a balancing act. Simply selecting the biggest, ‘smartest’, or fastest model is not enough. You need to understand how AI models and the network work in order to make a choice that is optimal for your hardware.

TL;DR

  • Select Fortytwo App if:
    • You want to let your node select a model for you automatically.
    • You regularly use your system for something other than running the node; the node will try to adjust to your system’s resource consumption.
  • Select Fortytwo CLI if you want to choose a model yourself; the rest of this guide walks you through that choice.
Want to know more? Keep reading.

How Models Work

When you load a model into your system’s memory (VRAM on GPU systems, RAM on ARM-based systems), two things are loaded: the model weights and the Key-Value Cache (KV Cache). The KV Cache is necessary to accelerate generation, and its size depends directly on the size of your context window, that is, the number of tokens a model can process at once before providing an answer. Based on this, we arrive at two key points:
  1. The complexity and diversity of the queries your node can process depend on how big your context window is. A bigger context window means your node can take on more complex queries.
  2. A higher TPS (Tokens Per Second), or response generation speed, means your node has a higher chance of providing the network with an answer on time and earning consistently.
The main goal is to find a balance between these two parameters: to answer a sufficiently complex query with sufficient speed. In our applications we use an adaptive KV Cache size, so the node can adapt to your hardware. When you launch the node, it analyses your available resources and reserves the following:
  • GPU-based systems (primarily Windows, Linux) — reserves 90% of idle VRAM.
  • ARM-based systems with unified memory (primarily macOS) — reserves 80 to 85% of leftover RAM.
This reservation is the only way the KV Cache size is controlled at the moment, and it allows for the most efficient utilization of processing power. Now we need to select a model of a suitable size and quality.
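Before moving on, here is a minimal Python sketch of the memory budgeting described above; it is an illustration, not the node’s actual implementation. The reservation ratios come from the list above, while the kv_bytes_per_token default is an assumed placeholder that varies widely between models, precisions, and quantizations.

```python
# Minimal sketch of the memory budgeting described above (illustrative only).
# The reservation ratios come from this guide; kv_bytes_per_token is an
# assumed placeholder, since real KV Cache usage depends on the model.

GIB = 1024 ** 3

def reserved_memory_bytes(idle_bytes: int, system: str) -> int:
    """Memory the node reserves for model weights plus KV Cache."""
    ratio = 0.90 if system == "gpu" else 0.80  # GPU: 90% of idle VRAM; macOS: 80-85% of leftover RAM
    return int(idle_bytes * ratio)

def estimated_context_tokens(idle_bytes: int, model_bytes: int,
                             system: str = "gpu",
                             kv_bytes_per_token: int = 400_000) -> int:
    """Very rough upper bound on context length from the memory left after the model."""
    kv_budget = reserved_memory_bytes(idle_bytes, system) - model_bytes
    return max(kv_budget, 0) // kv_bytes_per_token

# Example: 8 GiB of idle VRAM with a 7 GiB model leaves only a couple hundred MB
# for the KV Cache, i.e. a context window of well under 2,000 tokens.
print(estimated_context_tokens(8 * GIB, 7 * GIB))
```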

General Recommendations for Selecting a Model

1. Evaluate your system: GPUs are made to do quick calculations, but their VRAM is limited, and only a handful of GPU models have substantial VRAM. ARM devices can run large AI models, but their TPS is not as high.
  • Modern-day GPU:
    1. TPS will generally be high as long as points (2) and (3) hold, so you may favor a bigger model over a larger KV Cache.
    2. The model should always be smaller than the available idle VRAM.
    3. Always leave at least 2 GB of VRAM free for the KV Cache when launching your node (a quick check is sketched below).
  • macOS or older GPU: these systems can hold larger models, but generation speed (TPS) is the limiting factor, so keep it in mind when choosing model size.
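For GPU systems, points (2) and (3) boil down to a simple check. The helper below is hypothetical and only meant to make the rule concrete:

```python
MIN_KV_HEADROOM_GB = 2.0  # point (3): leave at least 2 GB of VRAM for the KV Cache

def model_fits_gpu(idle_vram_gb: float, model_size_gb: float) -> bool:
    """Points (2) and (3): the model must fit in idle VRAM with at least 2 GB to spare."""
    return (model_size_gb < idle_vram_gb
            and idle_vram_gb - model_size_gb >= MIN_KV_HEADROOM_GB)

print(model_fits_gpu(idle_vram_gb=8.0, model_size_gb=7.0))  # False: only 1 GB left for the KV Cache
print(model_fits_gpu(idle_vram_gb=8.0, model_size_gb=4.0))  # True: 4 GB left for the KV Cache
```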
2. Each model has different capabilities in speed and quality: consider the innate quality and speed of a model, and remember that your model choice determines how much memory is left for the KV Cache. Explore the model’s repo page for information on its speed and quality.
  • Newer models tend to be better in quality at the cost of slower speed compared to older models. For example, newer models like QN3 perform consistently better.
  • Older models: newer doesn’t always mean better. Some newer models perform well in benchmarks but end up lacking in real-world applications.
3. Try to find a balance based on network requirements: the queries in the network that your model will respond to differ in complexity and required expertise. Some questions are short and easy; others, like dataset-quality questions, are long and complex.
  • When expertise matters, like coding or medicine, select a model that can provide the right expertise.
  • When output speed and request size matter, follow these numbers (see the sketch after this list):

    Common requests

    • Context Length: 3,000-4,000 tokens
    • TPS: 15-30

    High quality requests

    • Context Length: ~25,000 tokens
    • TPS: 25-30
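As a rough self-check, the hypothetical snippet below compares a node’s context length and TPS against the targets above:

```python
def request_tier(context_tokens: int, tps: float) -> str:
    """Classify a node against the rough targets listed above."""
    if context_tokens >= 25_000 and tps >= 25:
        return "high quality requests"
    if context_tokens >= 4_000 and tps >= 15:
        return "common requests"
    return "below the common-request targets"

print(request_tier(context_tokens=30_000, tps=28))  # high quality requests
print(request_tier(context_tokens=5_000, tps=20))   # common requests
print(request_tier(context_tokens=1_500, tps=40))   # below the common-request targets
```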

Problem Case Studies

Let’s explore examples of situations to be avoided.
Setup:
  • System: GPU-based with Nvidia RTX 4070, 8 GB VRAM
  • AI Model: 7 GB
  • Output: Extremely small context window: 8 GB of VRAM minus the 7 GB taken up by the model leaves only about 500–700 MB for the KV Cache, which fits roughly 1,500 tokens.
Result:
  • The model is fast and high-quality when responding to very small questions.
  • The model cannot take on lengthy and complex questions and will reject them.
  • If complex questions dominate the network, this node will rarely participate in inference rounds.
Setup:
  • System: GPU-based with Nvidia GTX 1660, 8 GB VRAM
  • AI Model: 4 GB
  • Output: Big context window, with a lot of VRAM left over: 8 GB of VRAM minus the 4 GB taken up by the model leaves enough for about 20,000+ tokens. Yet the TPS (response generation speed) is low because the GPU is old.
Result:
  • This node can take lengthy questions and is likely to provide quality responses that might end up being the best in a given round.
  • It will generate its responses slowly. This node will tend to lose rounds if other nodes provide their responses faster.
Setup:
  • System: macOS with Unified Memory, 64 GB RAM
  • AI Model: 1.5 GB
  • Output: 180–200 TPS, context length is up to 32,000 tokens.
Result:
  • Great response generation speed and a huge context window; it can respond to massive questions.
  • The model itself is far from smart and tends to generate low-quality responses. It will be challenging to win rounds with it, as other nodes will provide better responses.

Conclusion

To keep your node successfully responding to questions and consistently winning in inference rounds, you need a model that is:
  • as new as possible. We strive to offer optimal model choices in our Featured models list in our applications. However, you are not limited in your choice.
  • suitable for your hardware and capable of maintaining a balance between
    • a context length of around 20,000 tokens,
    • a generation speed of 35–40 tokens per second,
    • response quality. Usually, a bigger model means higher quality, but bigger doesn’t necessarily mean better all around: a massive general-knowledge model will usually lose to a smaller specialized model. (A quick check of the first two targets is sketched after this list.)
  • Every model has its pros and cons, so select your model based on the current network context. If the network is generating a specific dataset, it is probably best to select a model that excels in that dataset’s area of knowledge.
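As a final sanity check, this hypothetical helper tests a candidate setup against the context-length and speed targets above; response quality still has to be judged from the model’s repo page and the current network context:

```python
def meets_recommended_balance(context_tokens: int, tps: float) -> bool:
    """Check the rough balance suggested above: ~20,000 tokens of context at 35-40 TPS."""
    return context_tokens >= 20_000 and tps >= 35

print(meets_recommended_balance(context_tokens=20_000, tps=38))  # True
print(meets_recommended_balance(context_tokens=32_000, tps=12))  # False: too slow to win rounds
```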
Following these guidelines will help you select models that allow your node to participate in a larger number of rounds and consistently secure winning places by providing the best responses to requests from the network.