Based on this error, the token 49426_u32 is not decoding into a valid UTF-8 string. Rust’s String type can only contain valid UTF-8 sequences. If the token represents an incomplete or invalid Unicode code point, Rust will reject it.
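To illustrate the point, here is a minimal sketch of how a strict conversion behaves on an incomplete UTF-8 sequence. The byte values are an illustrative assumption (the first byte of "é"), not the actual bytes behind token 49426, and `strict_decode` is a hypothetical helper name.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: strict conversion, which is what `String` requires.
fn strict_decode(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

fn main() {
    // "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
    let complete = vec![0xC3, 0xA9];
    // Only the leading byte: an incomplete sequence, so not valid UTF-8.
    let truncated = vec![0xC3];

    println!("complete:  {:?}", strict_decode(complete));
    println!("truncated is an error: {}", strict_decode(truncated).is_err());
}
```

A single token can map to such a partial sequence when a multi-byte character is split across token boundaries, which would explain why decoding that token on its own fails.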
What is your use case, and what kind of output would you expect? We could consider returning Vec<u8> for raw bytes (see _decode_native for this functionality).
I'm currently translating Sebastian Raschka's LLMs from Scratch book, which uses PyTorch, to Rust/Candle. For this part, I'm just trying to get the equivalent of the code below in Rust.
It looks like the tiktoken library's Python decode method replaces any errors when converting bytes to a string (see here).
If I'm reading that correctly, then it would be good to have similar behaviour here.
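As a rough sketch of what similar behaviour could look like in Rust, the standard library's `String::from_utf8_lossy` substitutes U+FFFD (the replacement character) for invalid sequences instead of returning an error. The helper name `decode_lossy` and the sample bytes are assumptions for illustration; how the raw bytes are obtained from the tokenizer is not shown here.

```rust
/// Hypothetical helper: converts raw decoded bytes to a `String`,
/// replacing invalid UTF-8 sequences with U+FFFD rather than failing.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // "caf" followed by only the first byte of "é" (0xC3 0xA9).
    let truncated = [b'c', b'a', b'f', 0xC3];
    // The same text with the full two-byte sequence present.
    let complete = [b'c', b'a', b'f', 0xC3, 0xA9];

    // The truncated input keeps "caf" and gains one replacement character.
    println!("{}", decode_lossy(&truncated));
    // The complete input decodes unchanged.
    println!("{}", decode_lossy(&complete));
}
```

One design consideration: applying the lossy conversion to the concatenated bytes of all tokens, rather than token by token, should avoid replacement characters for code points that merely span a token boundary.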
I encountered this error when trying to decode a vec of token ids, and identified the culprit token id to be 49426.
Code to replicate: