In a recent commit, I noticed an inconsistency in how the `query_pre_attn_scalar` parameter is configured for the 9B and 27B models in this repository. Specifically:

- In the 9B model, `query_pre_attn_scalar` is not explicitly set and appears to fall back to the default derived from `head_dim` (256), rather than 224, which is what `hidden_size / num_attention_heads` would give.
- In the 27B model, `query_pre_attn_scalar` is explicitly set to 144 (`hidden_size / num_attention_heads`).

Could you provide some insight into the reasoning behind this difference? Is there a specific rationale for leaving `query_pre_attn_scalar` unset in the 9B model while explicitly setting it in the 27B model?
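For context, here is a minimal sketch of how `query_pre_attn_scalar` typically enters the attention computation; the function name and shapes are illustrative, not this repository's actual code. The queries are scaled by `query_pre_attn_scalar**-0.5` before the dot product, in place of the conventional `head_dim**-0.5`:

```python
import torch

def attn_logits(q, k, query_pre_attn_scalar):
    # q, k: [batch, num_heads, seq_len, head_dim]
    # Scale queries by query_pre_attn_scalar**-0.5 instead of the
    # usual head_dim**-0.5 before taking dot products with the keys.
    q = q * query_pre_attn_scalar ** -0.5
    return torch.matmul(q, k.transpose(-2, -1))
```

With `query_pre_attn_scalar = head_dim`, this reduces to standard scaled dot-product attention, so choosing 256 vs. 224 (or 144 vs. 128) changes the softmax temperature of every attention layer.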
This change aligns the model better with the official internal implementation, and the new values should be the correct ones, following the technical report: link to the technical report
I'm looking for clarification on why `query_pre_attn_scalar` was changed from 224 (`d_model / num_heads`) to `head_dim` = 256 specifically for the 9B model in the latest commit, while no corresponding change was applied to the 27B model, which still uses `d_model / num_heads` = 144 rather than `head_dim` = 128.
Could you please direct me to the section of the technical report or documentation where the rationale behind this decision is discussed?
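To make the discrepancy concrete, here is the arithmetic. The hidden sizes and head counts below are my reading of the published Gemma 2 configs, consistent with the 224 and 144 quoted above, but they are assumptions rather than quotes from this repo:

```python
# Assumed config values:
#   9B:  hidden_size = 3584, num_attention_heads = 16, head_dim = 256
#   27B: hidden_size = 4608, num_attention_heads = 32, head_dim = 128
for name, hidden_size, num_heads, head_dim in [
    ("9B", 3584, 16, 256),
    ("27B", 4608, 32, 128),
]:
    print(f"{name}: hidden_size / num_heads = {hidden_size // num_heads}, "
          f"head_dim = {head_dim}")
# 9B:  hidden_size / num_heads = 224, head_dim = 256  -> commit now uses 256
# 27B: hidden_size / num_heads = 144, head_dim = 128  -> config keeps 144
```

So the two models now sit on opposite sides of the same choice, which is what prompted the question.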