During inference, the boi_token (the string "<|image start|>"), the resolution information, and the img_token (the string "<|image token|>") are provided directly as the prefix for generation, so the model never needs to predict them. (BTW, the mismatch between the string forms and the variable names is confusing and annoying, lol)
That said, I am also curious why the authors limited the supervision to the first visual token id and the last visual token id, while ignoring eol_token, eof_token, and eoi_token.
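To make the masking concrete, here is a minimal sketch of what I understand the labeling to do. It is not the repo's actual code: `build_labels` is a hypothetical helper, all token ids are made up, the resolution tokens are left out for brevity, and I am assuming "first/last visual token id" means an inclusive id range. The only thing taken as given is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(input_ids, first_visual_id, last_visual_id,
                 ignore_index=IGNORE_INDEX):
    """Keep a label only where the id lies in the inclusive visual-token range.

    Everything else (boi, img marker, eol, eof, eoi, text) is masked out,
    so it contributes nothing to the loss.
    """
    return [
        tok if first_visual_id <= tok <= last_visual_id else ignore_index
        for tok in input_ids
    ]


# Hypothetical ids for illustration only, NOT the real vocabulary.
BOI, IMG, EOL, EOF, EOI = 10, 11, 12, 13, 14
FIRST_VISUAL, LAST_VISUAL = 100, 199

# boi, img marker, two rows of two visual tokens (each row closed by eol),
# then eof and eoi.
input_ids = [BOI, IMG, 101, 150, EOL, 120, 199, EOL, EOF, EOI]
labels = build_labels(input_ids, FIRST_VISUAL, LAST_VISUAL)
print(labels)
# [-100, -100, 101, 150, -100, 120, 199, -100, -100, -100]
```

Under these assumptions, the prefix tokens fed at inference time and the eol/eof/eoi structure tokens are all masked in the same way, which is exactly the part I find surprising.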
The label consists only of image tokens; the special token <|image start|> is ignored. Why is the SFT loss computed this way?