[CAPSULE:0.1.22] ⎯ 2025-05-07
Major performance and capability upgrade, including dynamic context selection, GPU offloading for the KV cache, and embedding truncation.
Added
- Embedding truncation for improved memory efficiency.
- Support for using the model's maximum context size.
- Optimized input token decoding for better inference speed.
- Tokenizer object now returned in /metadata.
- Dynamic selection mechanism for context length.
- GPU offloading support for KV cache.
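The changelog does not show the library's API, but as a rough conceptual illustration of embedding truncation (the first item above): an embedding vector is cut down to its leading dimensions and re-normalized to unit length, trading a little accuracy for memory. The function below is a hypothetical sketch, not CAPSULE's actual interface.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length.

    Hypothetical helper for illustration only; CAPSULE's real API may differ.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [0.6, 0.8, 0.0, 0.0]   # toy 4-dim embedding
small = truncate_embedding(full, 2)  # 2-dim, unit-length result
```

Re-normalizing after the cut keeps cosine-similarity comparisons between truncated vectors meaningful.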
Changed
- Updated development dependencies.
Fixed
- Fixed batch prefill logic to improve prompt handling stability.