Question about the subscript index in Formula 22 (MTP) in the DeepSeek V3 Technical Report

#930
maomaommmaomaomm
opened 21 days ago

Question

In Formula 22, the subscript index $1:T-k$ suggests that the hidden-state indices range from 1 to $T-k$, where $T$ represents the input sequence length and $i:j$ denotes the slicing operation (inclusive of both the left and right boundaries).

[Image: Formula 22 from the report]

However, the last hidden state that could be generated from the model seems to be $h_{T-D-1}^k$ instead of $h_{T-k}^k$, where D is the number of sequential MTP modules.

Details

According to the report, for a sequence of length T=7 (as in the example in Figure 3 below) and a token index such as i=2, the model proceeds as follows:

  1. k=0: The first 2 tokens t1 and t2 are input into the Main Model to generate hidden states $h_1^0$ and $h_2^0$, where $h_2^0$ is used to predict token t3 and also gets passed to the MTP Module 1.
  2. k=1: The MTP Module 1 uses the combination of the embedding of token t3 along with the hidden state $h_2^0$ to generate an updated hidden state $h_2^1$ through the Transformer Block. This new hidden state $h_2^1$ is then used to predict token t4 and also gets passed to MTP Module 2.
  3. k=2: The MTP Module 2 uses the combination of the embedding of token t4 along with the hidden state $h_2^1$ to generate an updated hidden state $h_2^2$ through the Transformer Block. This new hidden state $h_2^2$ is used to predict token t5.

[Image: Figure 3 from the report]
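
To make my reading of these steps concrete, here is a minimal Python sketch of the bookkeeping only. The function name `trace_position` is made up for illustration, and this reflects my interpretation of the report, not the actual DeepSeek-V3 implementation: at depth k, position i consumes the embedding of token t(i+k) (for k ≥ 1) and predicts token t(i+k+1).

```python
def trace_position(i, num_mtp_modules):
    """List (k, hidden_state, embedded_token, predicted_token) for each depth k.

    Assumption (my reading of the report, not confirmed): depth k = 0 is the
    Main Model and takes no extra embedding; depth k >= 1 combines h_i^{k-1}
    with the embedding of t_{i+k} and predicts t_{i+k+1}.
    """
    steps = []
    for k in range(num_mtp_modules + 1):
        hidden = f"h_{i}^{k}"
        embedded = None if k == 0 else f"t{i + k}"
        predicted = f"t{i + k + 1}"
        steps.append((k, hidden, embedded, predicted))
    return steps


# The i = 2 example from the steps above, with D = 2 MTP modules.
for step in trace_position(2, 2):
    print(step)
```

For i=2 this reproduces the three steps listed above: t3, t4, and t5 are predicted at depths 0, 1, and 2.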

If the understanding above is correct, then the last hidden state that can be generated from this process for a sequence of length T=7 is $h_4^k$, which is used as follows:

  1. In the Main Model, $h_4^0$ is used to predict token t5 and gets passed to MTP Module 1
  2. In the MTP Module 1, $h_4^0$ gets updated to $h_4^1$, which is used to predict token t6 and gets passed to MTP Module 2
  3. In the MTP Module 2, $h_4^1$ gets updated to $h_4^2$, which is used to predict token t7, thus completing the sequence.

In this case, the hidden states generated in the process are $h_1^k$, $h_2^k$, $h_3^k$, $h_4^k$, where T=7 and k ∈ {0, 1, 2}. This does not match the range 1 : T - k (1:6 for k=1, or 1:5 for k=2) described in Formula 22. Instead, shouldn't it be something more like 1 : T - D - 1?
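
The mismatch can be checked with a short sketch. Both helper names are hypothetical, and the first function encodes only my assumption that every position must still have an in-sequence target at the deepest module (i + D + 1 ≤ T); the second simply restates the range written in Formula 22.

```python
def last_index_by_walkthrough(T, D):
    """Largest i whose deepest prediction t_{i+D+1} still lies within T tokens.

    Assumption: this is my reading of the process, not the report's definition.
    """
    return max(i for i in range(1, T + 1) if i + D + 1 <= T)


def formula_22_range(T, k):
    """Inclusive index range 1 : T - k as written in Formula 22."""
    return (1, T - k)


T, D = 7, 2
print("walkthrough range:", (1, last_index_by_walkthrough(T, D)))
for k in range(1, D + 1):
    print(f"Formula 22, k={k}:", formula_22_range(T, k))
```

With T=7 and D=2, the first range ends at 4 (which equals T - D - 1), while Formula 22 gives ranges ending at 6 and 5.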


I find this expression in Formula 22 confusing and would really appreciate it if anyone could help explain it. Thanks!