[CAPSULE:0.1.22] ⎯ 2025-05-07
Major performance and capability upgrade, including dynamic context selection, GPU offloading for the KV cache, and embedding truncation.
Added
- Embedding truncation for improved memory efficiency.
- Support for using the model's maximum context size.
- Optimized input token decoding for better inference speed.
- Tokenizer object now returned in /metadata.
- Dynamic selection mechanism for context length.
- GPU offloading support for KV cache.
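The changelog does not show the library's API, but as a rough conceptual illustration of embedding truncation (the first item above): an embedding vector is cut down to its leading dimensions and re-normalized to unit length, trading a little accuracy for memory. The function below is a hypothetical sketch, not CAPSULE's actual interface.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length.

    Hypothetical helper for illustration only; CAPSULE's real API may differ.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [0.6, 0.8, 0.0, 0.0]   # toy 4-dim embedding
small = truncate_embedding(full, 2)  # 2-dim, unit-length result
```

Re-normalizing after the cut keeps cosine-similarity comparisons between truncated vectors meaningful.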
Changed
- Updated development dependencies.
Fixed
- Fixed batch prefill logic to improve prompt handling stability.