
[WebNN EP] Support GroupQueryAttention(GQA) #23416

Draft · wants to merge 5 commits into main

Conversation

@peishenyan (Contributor) commented Jan 17, 2025

Description

Adds support for GroupQueryAttention (GQA) by decomposing it into WebNN matmul, transpose, reshape, and other operations, following the logic of the GQA subgraph below.

                 query     key     value
                   |        |        |
           q_Reshape   k_Reshape   v_Reshape  (shape=B,S,H,N)
                   |        |        |
          q_Transpose  k_Transpose v_Transpose
           (0,2,1,3)    (0,2,3,1)   (perm=0,2,1,3)
             \           /           |     past_key
              \         /            |        |
present_key<---\----ScatterND <------|--------+
               |      |              |        |
               |  opt_k_transpose?   |    seqlens_k
               \  (0,1,3,2)          |        |
                \    /               |        +----past_value
                qk_MatMul            |       /
                     |    [B=h]      |      /
                     |   /           |     /
                  qk_Div         ScatterND -----> present_value
                      |              |
                      |              /
                     Add <----------/---------------finfo_min_mask
                      |            /
                    Softmax       /
                       \         /
                        \       /
                      qkv_MatMul
                             |
                          Transpose (perm=0,2,1,3)
                             |
                          Reshape---(shape=B,S,W)
                             |
                           output

 Abbreviations:
               B is batch_size, S is sequence_length, W is hidden_size,
               N is the number of attention heads, H is the head size, W = N*H, and h = Sqrt(H).
               B and S may be symbolic. "?" marks an optional node.
    GQA inputs: query, key, value, past_key, past_value, seqlens_k, total_sequence_length
    Note: If the data type of the inputs (Q/K/V and past K/V) is float16, we cast them to float32 to preserve precision.

Note: Currently, only past_sequence_length == total_sequence_length is supported for GQA.
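For reference, here is a minimal NumPy sketch of the decomposed computation in the subgraph above. This is not the WebNN EP implementation, just an illustration of the data flow. It assumes num_heads == kv_num_heads (the grouped case would repeat each KV head num_heads / kv_num_heads times), treats seqlens_k[b] as total_sequence_length - 1 for batch b (an assumption about its semantics), and omits the per-row causal mask that a multi-token prefill (S > 1) would also need.

```python
import numpy as np

def gqa_decomposed(query, key, value, past_key, past_value, seqlens_k, num_heads):
    # query/key/value: (B, S, W); past_key/past_value: (B, N, total_len, H)
    B, S, W = query.shape
    N = num_heads
    H = W // N

    # q/k/v_Reshape: (B, S, W) -> (B, S, N, H), then Transpose to (B, N, S, H)
    q = query.reshape(B, S, N, H).transpose(0, 2, 1, 3)
    k = key.reshape(B, S, N, H).transpose(0, 2, 1, 3)
    v = value.reshape(B, S, N, H).transpose(0, 2, 1, 3)

    # ScatterND: write the new K/V slices into the cache at the positions
    # implied by seqlens_k, producing present_key / present_value.
    present_key = past_key.copy()
    present_value = past_value.copy()
    for b in range(B):
        pos = seqlens_k[b] + 1 - S  # first new token position (assumed semantics)
        present_key[b, :, pos:pos + S, :] = k[b]
        present_value[b, :, pos:pos + S, :] = v[b]
    total_len = present_key.shape[2]

    # qk_MatMul + qk_Div: Q @ K^T / sqrt(H) -> (B, N, S, total_len)
    scores = q @ present_key.transpose(0, 1, 3, 2) / np.sqrt(H)

    # Add: the finfo_min mask hides cache slots beyond seqlens_k.
    mask = np.full((B, 1, 1, total_len), np.finfo(np.float32).min, dtype=np.float32)
    for b in range(B):
        mask[b, ..., : seqlens_k[b] + 1] = 0.0
    scores = scores + mask

    # Softmax over the key axis, then qkv_MatMul.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    attn = probs @ present_value  # (B, N, S, H)

    # Final Transpose (perm=0,2,1,3) + Reshape back to (B, S, W).
    out = attn.transpose(0, 2, 1, 3).reshape(B, S, W)
    return out, present_key, present_value
```

The loops over the batch stand in for the ScatterND updates and the mask construction from seqlens_k; in the actual graph these are expressed as single indexed-scatter and elementwise-add ops so that B and S can stay symbolic.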

Motivation and Context
