Getting decoder error on specific token id when using GPT2 BPE tokenizer #97

Open
nerdai opened this issue Nov 30, 2024 · 2 comments

nerdai commented Nov 30, 2024

I encountered this error when trying to decode a vec of token ids, and identified the culprit token id to be 49426.

called `Result::unwrap()` on an `Err` value: Unable to decode into a valid UTF-8 string: incomplete utf-8 byte sequence from index 0

Code to replicate:

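use tiktoken_rs::get_bpe_from_model;

fn main() {
    let token_ids = vec![49426_u32];
    let tokenizer = get_bpe_from_model("gpt2").unwrap();
    // This unwrap is where the error above is raised for token 49426.
    let text = tokenizer.decode(token_ids).unwrap();
    println!("{text}");
}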

zurawiki (Owner) commented Dec 1, 2024

Thanks for surfacing this issue.

Based on this error, token 49426 does not decode into a valid UTF-8 string on its own. Rust's String type can only contain valid UTF-8, so if a token's bytes form an incomplete or invalid sequence (for example, only part of a multi-byte character), the conversion is rejected.

What is your use case, and what kind of output would you expect? We could consider returning Vec<u8> for raw bytes (see _decode_native for this functionality).
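To illustrate the distinction between a strict String conversion and raw bytes, here is a minimal standard-library sketch. It does not call tiktoken-rs; the byte values and the `decode_strict` helper are hypothetical stand-ins, chosen as the first two bytes of a three-byte UTF-8 character, and are not the actual bytes of token 49426.

```rust
// Hypothetical bytes: the first two bytes of the three-byte
// UTF-8 encoding of '€' (E2 82 AC). Not taken from the tokenizer.
const INCOMPLETE: [u8; 2] = [0xE2, 0x82];

/// Strict conversion: fails unless the bytes are complete, valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

fn main() {
    match decode_strict(&INCOMPLETE) {
        Ok(text) => println!("decoded: {text}"),
        // The error keeps the original bytes, so a caller can still use them.
        Err(err) => println!("strict decode failed: {err}; raw bytes: {:?}", err.as_bytes()),
    }
}
```

Returning the bytes instead of a String would leave it to the caller to decide how to handle sequences that are only valid once combined with neighbouring tokens.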

nerdai (Author) commented Dec 3, 2024

Ah, I see.

I'm currently translating Sebastian Raschka's LLMs from Scratch book from PyTorch to Rust/Candle. For this part, I'm just trying to get the Rust equivalent of the code below.

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
some_text = ...
ids = tokenizer.encode(some_text)
print(tokenizer.decode(ids))

It looks like the decode method in the Python tiktoken library replaces invalid bytes instead of raising an error when converting bytes to a string (see here).

If I'm reading that correctly, it would be good to have similar behaviour here.
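For reference, a rough sketch of what "replace on error" looks like with the Rust standard library. This is only an assumption about how such behaviour could be modelled, not how tiktoken-rs works today; `decode_lossy` is a hypothetical helper and the input bytes are made up for illustration.

```rust
/// Lossy conversion: invalid or incomplete sequences become
/// U+FFFD (the replacement character) instead of producing an error.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // Hypothetical input: "hi" followed by a truncated multi-byte sequence.
    let bytes = [b'h', b'i', 0xE2, 0x82];
    let text = decode_lossy(&bytes);
    println!("{text}");
}
```

Applied to the raw bytes of a token sequence, this would never fail, at the cost of silently substituting U+FFFD wherever the bytes do not form valid UTF-8.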
