Changelog

[CAPSULE:0.1.22] ⎯ 2025-05-07

Major performance and capability upgrade, including dynamic context sizing, GPU offloading of the KV cache, and embedding truncation.

Added

  • Embedding truncation for improved memory efficiency.
  • Support for using the model's maximum context size.
  • Optimized input token decoding for better inference speed.
  • Tokenizer object now returned in /metadata.
  • Dynamic selection mechanism for context length.
  • GPU offloading support for KV cache.
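The embedding truncation mentioned above can be illustrated with a minimal sketch. This is a generic implementation of the technique, not CAPSULE's actual API; the function name and signature are illustrative:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an embedding and re-normalize.

    Truncation trades a small amount of retrieval accuracy for memory:
    e.g. a 1024-d float32 vector shrinks from 4 KiB to 1 KiB at dim=256.
    """
    if dim <= 0 or dim > len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    # Re-normalize so cosine similarity remains meaningful after truncation.
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

print(truncate_embedding([0.5, 0.5, 0.5, 0.5], 2))
```

Re-normalizing after truncation keeps the vector on the unit sphere, so downstream cosine-similarity comparisons still behave as expected.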

Changed

  • Updated development dependencies.

Fixed

  • Fixed batch prefill logic to improve prompt handling stability.
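Batch prefill in general means splitting a long prompt into fixed-size token batches before decoding begins. A minimal sketch of that splitting step, with an illustrative function name (not CAPSULE's internal code):

```python
def prefill_batches(tokens, n_batch):
    """Split prompt tokens into contiguous batches of at most n_batch.

    Each batch is fed to the model in turn during the prefill phase;
    the final batch may be shorter than n_batch.
    """
    if n_batch <= 0:
        raise ValueError("n_batch must be positive")
    return [tokens[i:i + n_batch] for i in range(0, len(tokens), n_batch)]

print(prefill_batches(list(range(10)), 4))
```

A common source of instability here is mishandling the short final batch, which is why the boundary arithmetic above iterates in steps of `n_batch` rather than assuming an even split.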