Question about the subscript index in Formula 22 (MTP) in the DeepSeek V3 Technical Report
Question
In Formula 22, the subscript index $1{:}T-k$ suggests that the hidden state indices range from $1$ to $T-k$, where $T$ is the input sequence length and $i{:}j$ denotes the slicing operation (inclusive of both the left and right boundaries).
However, the last hidden state the model can actually generate seems to be $h_{T-D-1}^{k}$ rather than $h_{T-k}^{k}$, where $D$ is the number of sequential MTP modules.
Details
According to the report, for a sequence of length $T=7$ (as in the example in Figure 3 below) and for a token index such as $i=2$, the model proceeds as follows:
- $k=0$: The first two tokens $t_1$ and $t_2$ are fed into the Main Model to generate hidden states $h_1^0$ and $h_2^0$, where $h_2^0$ is used to predict token $t_3$ and is also passed to MTP Module 1.
- $k=1$: MTP Module 1 combines the embedding of token $t_3$ with the hidden state $h_2^0$ to generate an updated hidden state $h_2^1$ through its Transformer Block. This new hidden state $h_2^1$ is then used to predict token $t_4$ and is also passed to MTP Module 2.
- $k=2$: MTP Module 2 combines the embedding of token $t_4$ with the hidden state $h_2^1$ to generate an updated hidden state $h_2^2$ through its Transformer Block. This new hidden state $h_2^2$ is used to predict token $t_5$.
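The per-position chain described above can be sketched in a few lines of Python (a minimal illustration of the indexing in this question, not the paper's implementation; the helper name `mtp_chain` is made up here, and the only rule assumed is the one from the walkthrough, that $h_i^k$ predicts $t_{i+k+1}$):

```python
def mtp_chain(i: int, D: int) -> list[tuple[int, str, str]]:
    """For prefix position i, list (depth k, hidden state, predicted token),
    following the walkthrough: at depth k, h_i^k predicts t_{i+k+1}."""
    return [(k, f"h_{i}^{k}", f"t_{i + k + 1}") for k in range(D + 1)]

# For i = 2 with D = 2 sequential MTP modules, as in the example:
for k, state, token in mtp_chain(2, 2):
    print(f"k={k}: {state} predicts {token}")
# k=0: h_2^0 predicts t_3
# k=1: h_2^1 predicts t_4
# k=2: h_2^2 predicts t_5
```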
If the understanding above is correct, then for a sequence of length $T=7$ the last hidden state that can be generated in this process is $h_4^k$, which is used as follows:
- In the Main Model, $h_4^0$ is used to predict token $t_5$ and is passed to MTP Module 1.
- In MTP Module 1, $h_4^0$ is updated to $h_4^1$, which is used to predict token $t_6$ and is passed to MTP Module 2.
- In MTP Module 2, $h_4^1$ is updated to $h_4^2$, which is used to predict token $t_7$, completing the sequence.
In this case, the hidden states generated in the process are $h_1^k$, $h_2^k$, $h_3^k$, and $h_4^k$, for $T=7$ and $k \in \{0, 1, 2\}$, which is nowhere near the range $1{:}T-k$ ($1{:}6$ or $1{:}5$) described in Formula 22. Shouldn't it instead be something like $1{:}T-D-1$?
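The counting argument above can be checked with a short sketch (variable names are my own; the only rule assumed is the one from the walkthrough, that a position's deepest state $h_i^D$ must still predict an existing token, i.e. $i + D + 1 \le T$):

```python
T = 7  # sequence length, as in the Figure 3 example
D = 2  # number of sequential MTP modules

# A position i can run through all D modules only if its deepest state
# h_i^D still predicts an existing token, i.e. i + D + 1 <= T.
positions = [i for i in range(1, T + 1) if i + D + 1 <= T]
print(positions)  # [1, 2, 3, 4] -> last valid index is T - D - 1 = 4

# Formula 22's slice 1 : T - k, by contrast, gives per depth k:
for k in range(D + 1):
    print(k, list(range(1, T - k + 1)))  # k=1 -> 1..6, k=2 -> 1..5
```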
I find this expression in Formula 22 really confusing and would appreciate it if anyone could help explain it. Thanks!