
Idea: “draft-and-verify” — use smaller models to generate draft tokens ahead of time (quick explanation from karpathy)

Intuitively:

  • we generate a small set of lookahead tokens, typically 2-5 tokens, with smaller speculator models
  • use the larger model to “verify” the input sequence + draft tokens (then replace tokens rejected by the rejection sampler)

In a sense, we verify these tokens in parallel instead of decoding them autoregressively.

A few techniques, such as ngrams and EAGLE, are supported in vLLM.

EAGLE

Extrapolation Algorithm for Greater Language-model Efficiency

Motivation:

  • speculative sampling relies on the draft model having a distribution similar to the target model’s.
    • use smaller models, e.g. Llama 3.2 3B as draft for Llama 3.3 70B.
    • the overhead of stepping through a whole separate model can outweigh the benefits

Difference between EAGLE-1 and EAGLE-3

  • EAGLE-1 is limited by its feature-prediction constraint, imposed by the LM head architecture
  • EAGLE-3 addresses this by using direct token prediction and relying on multi-layer feature fusion (“training-time test”), similar to the MLP Speculator

distribution skew

EAGLE does not involve any fine-tuning of the target model, so its preservation of the output distribution is theoretically guaranteed for both greedy and non-greedy sampling. This is not the case with Lookahead and Medusa.

EAGLE-1

Observations:

autoregression at the feature level 1 is simpler than at the token level, given that features exhibit more regularity.

uncertainty in the sampling process hinders the performance of predicting the next feature.

features are high-dimensional and continuous, so sampling “am” versus “always” results in different feature sequences.

EAGLE addresses this by feeding the token sequence from one time step ahead, i.e. the sampling outcomes, into the draft model:

  • predicting $f_{\text{always}}$ based on $f_{\text{I}}$ and $t_{\text{always}}$
  • predicting $f_{\text{am}}$ based on $f_{\text{I}}$ and $t_{\text{am}}$

notation.

  • “Features” refers to the second-to-top-layer features of the LLM, i.e. the hidden states before the LM head
  • Tokens are denoted by $t$, embeddings by $e$, features by $f$, distributions by $p$
  • Sequences are denoted $T_{i:j}$ for $(t_i, t_{i+1},\ldots, t_j)$ 2

architecture

  • [feature_seq, token_seq] # [bs, seq_len, hidden_dim], [bs, seq_len]
  • token_seq -> token_emb # [bs, seq_len] -> [bs, seq_len, hidden_dim]
  • fused_seq = concat(feature_seq, token_emb) # [bs, seq_len, 2*hidden_dim] 3
  • autoregressive_head:
    • FC layer reduce # [bs, seq_len, hidden_dim]
    • decoder layer features
  • uses tree attention to generate a draft tree of depth $m$ with more than $m$ tokens in $m$ forward passes. 4
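A minimal PyTorch sketch of this drafting head, just to make the shapes concrete (module names, `nhead`, and the generic encoder layer standing in for a LLaMA-style decoder block are illustrative assumptions, not EAGLE’s or vLLM’s actual implementation):

import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
  """Sketch of EAGLE-1's autoregressive head: fuse target-model features with
  (shifted) token embeddings, reduce with an FC layer, run one decoder layer."""
  def __init__(self, hidden_dim: int, vocab_size: int, nhead: int = 8):
    super().__init__()
    self.embed = nn.Embedding(vocab_size, hidden_dim)
    self.reduce = nn.Linear(2 * hidden_dim, hidden_dim)        # FC reduction
    # Stand-in for a single LLaMA-style decoder layer (causal mask omitted here).
    self.decoder = nn.TransformerEncoderLayer(
      d_model=hidden_dim, nhead=nhead, batch_first=True)
    # In EAGLE the LM head is shared with the target model; this is a stand-in.
    self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

  def forward(self, feature_seq: torch.Tensor, token_seq: torch.Tensor):
    # feature_seq: [bs, seq_len, hidden_dim], token_seq: [bs, seq_len]
    token_emb = self.embed(token_seq)                          # [bs, seq_len, hidden_dim]
    fused = torch.cat([feature_seq, token_emb], dim=-1)        # [bs, seq_len, 2*hidden_dim]
    hidden = self.reduce(fused)                                # [bs, seq_len, hidden_dim]
    next_features = self.decoder(hidden)                       # predicted next features
    logits = self.lm_head(next_features)                       # draft distribution
    return next_features, logits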

training

  • Smooth L1 loss:

    $L_\text{reg} = \text{Smooth L1}(f_{i+1}, \text{Draft\_Model}(T_{2:i+1}, F_{1:i}))$
  • classification loss to optimize given objectives:

    $\begin{aligned} p_{i+2} &= \text{Softmax}(\text{LM\_Head}(f_{i+1})) \\ \hat{p}_{i+2} &= \text{Softmax}(\text{LM\_Head}(\hat{f}_{i+1})) \\ L_{\text{cls}} &= \text{CrossEntropy}(p_{i+2}, \hat{p}_{i+2}) \end{aligned}$
  • Autoregressive head with loss $L = L_{\text{reg}} + w_{\text{cls}} L_{\text{cls}}$

    • set $w_{\text{cls}}=0.1$ given that the classification loss is an order of magnitude larger than the regression loss
  • Dataset: ShareGPT, 68k dialogue

  • Hyperparameter:

    • LR: $3\mathrm{e}{-5}$
    • AdamW with $(\beta_1, \beta_2)=(0.9, 0.95)$
    • gradient clipping: $0.5$
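A sketch of the combined objective above, assuming hypothetical tensors `draft_features` ($\hat{f}_{i+1}$) and `target_features` ($f_{i+1}$) plus a frozen `lm_head` (names are illustrative):

import torch
import torch.nn.functional as F

def eagle_loss(draft_features, target_features, lm_head, w_cls: float = 0.1):
  """L = L_reg + w_cls * L_cls, with w_cls = 0.1 as in the paper."""
  # Regression: Smooth L1 between predicted and true next features.
  l_reg = F.smooth_l1_loss(draft_features, target_features)
  # Classification: cross-entropy between the token distributions induced by
  # the true and predicted features through the (frozen) LM head.
  p = F.softmax(lm_head(target_features), dim=-1)                # p_{i+2}
  log_p_hat = F.log_softmax(lm_head(draft_features), dim=-1)     # log p_hat_{i+2}
  l_cls = -(p * log_p_hat).sum(dim=-1).mean()                    # CE(p, p_hat)
  return l_reg + w_cls * l_cls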

EAGLE-2

tl/dr: improves on EAGLE-1 by introducing a context-aware dynamic draft tree into the drafting model.

EAGLE-3

HASS

Falcon

MLP Speculator

DistillSpec

Medusa

ngrams

apoorvumang/prompt-lookup-decoding

also known as Prompt Lookup Decoding (PLD) or HF’s assisted generation

idea: use string matching against the prompt to generate candidate tokens, instead of using a draft model.

import torch

def find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
  # Return up to num_pred_tokens candidate tokens by matching the prompt's
  # trailing ngram against an earlier occurrence in input_ids.
  input_length = input_ids.size(1)
 
  for ngram_size in range(max_ngram_size, 0, -1):
    # Extract the last n tokens as our search ngram
    ngram = input_ids[0, -ngram_size:].tolist()
 
    # Create sliding windows of size ngram_size
    windows = input_ids.unfold(dimension=1, size=ngram_size, step=1)
 
    # Convert ngram to a tensor for comparison
    ngram_tensor = torch.tensor(ngram, device=input_ids.device).unsqueeze(0)
 
    # Find where the windows match the ngram
    matches = (windows == ngram_tensor).all(dim=2)
 
    # Get the indices of matches
    match_indices = matches.nonzero(as_tuple=True)[1]
 
    # Iterate through match indices to find a valid continuation
    for idx in match_indices:
      start_idx = idx + ngram_size
      end_idx = start_idx + num_pred_tokens
      # Ensure we don't go beyond the length of input_ids and avoid self-match
      if end_idx <= input_length and start_idx < input_length - ngram_size:
        return input_ids[0, start_idx:end_idx]
 
  # If no match is found, return an empty tensor
  return torch.tensor([], dtype=torch.long, device=input_ids.device)
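A quick usage sketch of the function above (the token IDs are made up; in practice `input_ids` comes from a tokenizer):

# The trailing ngram [7, 8] already appeared earlier in the prompt,
# so the tokens that followed it (9, 10, 11) become the draft candidates.
input_ids = torch.tensor([[1, 2, 7, 8, 9, 10, 11, 3, 4, 7, 8]])
candidates = find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=3)
print(candidates)  # tensor([ 9, 10, 11])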

lookahead decoding

see also: LMSYS blog

SPiRE

MagicDec


optimization

(Liu et al., 2024) propose SmartSpec, which optimizes for goodput.

speculative length

the number of effective tokens generated by the draft model per iteration

The improvement factor (IF) is determined by the value of $\alpha$.

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models (Mamou et al., 2024) proposes a dynamic speculation length (SL) to optimize decoding quality. fwiw num_speculative_tokens=5 has been found to be a pretty good balance in the latency/quality trade-off for larger models. They propose an oracle classifier per draft request to determine whether to increase/decrease the SL, as follows:

$C_i = \text{FFN}(\operatorname{Concat}(\text{top\_k}(y_i^D), \text{entropy}(y_i^D), i))$

where the classifier takes the probability vector $y_i^D$ of the draft model at token position $i$ and produces a confidence score $C_i$ 5
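A sketch of what such a classifier could look like (the feature construction, `k`, and layer sizes are assumptions for illustration; the paper’s exact architecture may differ):

import torch
import torch.nn as nn

class SpecLengthClassifier(nn.Module):
  """Predicts, from the draft distribution y_i^D at position i, a confidence
  C_i used to decide whether to keep drafting (increase SL) or stop."""
  def __init__(self, k: int = 10, hidden: int = 32):
    super().__init__()
    self.k = k
    self.ffn = nn.Sequential(
      nn.Linear(k + 2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

  def forward(self, y_draft: torch.Tensor, position: int) -> torch.Tensor:
    # y_draft: [vocab_size] probability vector of the draft model at step i.
    topk = torch.topk(y_draft, k=self.k).values                      # top-k probs
    entropy = -(y_draft * torch.log(y_draft + 1e-9)).sum(dim=-1, keepdim=True)
    pos = torch.tensor([float(position)], device=y_draft.device)
    features = torch.cat([topk, entropy, pos], dim=-1)               # Concat(top_k, entropy, i)
    return torch.sigmoid(self.ffn(features))                         # confidence C_i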

distributed sps

speculative sampling

aliases: SpS, speculative decoding.

Based on:

tl/dr

  • Latency is improved at the cost of increased ops, with $\gamma=5$ 8
  • This is not useful when computation resources are limited.
  • Walltime improvement of $\frac{1-\alpha^{\gamma +1}}{(1-\alpha)(\gamma c + 1)}$, where $\alpha = E(\beta)$ 9
  • Note that this is different from rejection sampling 10
  • A lenience factor $l$ allows a speed versus quality trade-off 11 when the draft model’s distribution differs from the target model’s. 12

goal and algorithm

Let $M_p$ be the target model for task $X$, and $p(x_t \mid x_{<t})$ the distribution we get from the model for a prefix $x_{<t}$

Let $M_q$ be the draft/approximation model for the same task, and $q(x_t \mid x_{<t})$ the distribution we get from the model for a prefix $x_{<t}$

Objective: use $M_q$ to generate $\gamma \in \mathbb{Z}^{+}$ completions, and use $M_p$ to verify the $\gamma$ tokens in parallel

  • Keep the sample when $q(x) \le p(x)$
  • Reject the sample when $q(x) > p(x)$ with probability $1-\frac{p(x)}{q(x)}$, and resample $x$ from $p^{'}(x) = \textit{norm}(\textit{max}(0, p(x) - q(x)))$ 13

Algorithm 4 SpeculativeDecodingStep

Input: $M_p,\; M_q,\; \textit{prefix}$

$\triangleright$ Sample $\gamma$ guesses $x_1,\dots,x_\gamma$ from $M_q$

for $i \gets 1$ to $\gamma$ do

  $q_i(x) \gets M_q\bigl(\textit{prefix} + [x_1,\dots,x_{i-1}]\bigr)$

  $x_i \sim q_i(x)$

end for

$\triangleright$ Run $M_p$ in parallel

$p_1(x),\dots,p_{\gamma+1}(x) \gets M_p(\textit{prefix}),\dots, M_p\bigl(\textit{prefix} + [x_1,\dots,x_\gamma]\bigr)$

$\triangleright$ Determine the number of accepted guesses $n$

$r_1,\dots,r_\gamma \sim U(0,1)$

$n \gets \min\bigl(\{\,i-1 \mid 1\le i\le\gamma,\; r_i > \frac{p_i(x)}{q_i(x)}\,\}\cup\{\gamma\}\bigr)$

$\triangleright$ Adjust $M_p$’s distribution if needed

$p'(x) \gets p_{n+1}(x)$

if $n < \gamma$ then

  $p'(x) \gets \mathrm{norm}\bigl(\max\bigl(0,\; p_{n+1}(x)-q_{n+1}(x)\bigr)\bigr)$

end if

$\triangleright$ Emit one token from $M_p$ and $n$ from $M_q$

$t \sim p'(x)$

return $\textit{prefix} + [x_1,\dots,x_n,t]$
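A Python sketch of the step above, assuming hypothetical `draft_model`/`target_model` callables that return next-token probability vectors (batching, KV caching, and EOS handling omitted):

import torch

@torch.no_grad()
def speculative_decoding_step(target_model, draft_model, prefix, gamma=5):
  """One SpeculativeDecodingStep: draft gamma tokens with M_q, verify them in
  parallel with M_p, and emit one extra token from the adjusted target."""
  xs, qs = [], []
  for _ in range(gamma):
    q = draft_model(prefix + xs)                 # q_i(x) over the vocab
    x = torch.multinomial(q, 1).item()           # x_i ~ q_i(x)
    xs.append(x)
    qs.append(q)

  # p_i(x) for i = 1..gamma+1, computed in one parallel pass over the drafts.
  ps = target_model(prefix, xs)                  # [gamma + 1, vocab]

  # n <- number of accepted guesses (the first rejection stops acceptance).
  n = gamma
  for i in range(gamma):
    if torch.rand(()) > ps[i][xs[i]] / qs[i][xs[i]]:
      n = i
      break

  # Adjust M_p's distribution if a draft token was rejected.
  p_prime = ps[n]
  if n < gamma:
    p_prime = torch.clamp(ps[n] - qs[n], min=0)
    p_prime = p_prime / p_prime.sum()            # norm(max(0, p - q))

  t = torch.multinomial(p_prime, 1).item()       # one token from M_p
  return prefix + xs[:n] + [t]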

acceptance probability

alias: acceptance rate

definition 3.1

acceptance rate $\beta_{x<t}$ given a prefix $x_{<t}$ is the probability of accepting $x_t \sim q(x_t\mid x_{<t})$ via speculative sampling.

$E(\beta)$ is the natural measure of how well $M_q$ approximates $M_p$

With $\alpha = E(\beta)$ and assuming the $\beta$ are i.i.d., (1) the number of generated tokens is a capped geometric variable with success probability $1 - \alpha$ and cap $\gamma + 1$:

$E(\text{\# generated tokens}) = \frac{1-\alpha^{\gamma +1}}{1-\alpha}$
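For instance, with the illustrative values $\alpha = 0.8$ and $\gamma = 5$: $\frac{1-0.8^{6}}{1-0.8} \approx \frac{0.738}{0.2} \approx 3.69$, i.e. each target-model step yields roughly 3.7 tokens on average.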

calculating $\alpha$

definition 3.2

Let the natural divergence $D_{LK}$ be:

$D_{LK}(p,q) = \sum_{x} |p(x) - M(x)| = \sum_{x} |q(x) - M(x)|$

where $M(x) = \frac{p(x) + q(x)}{2}$

Lemma 3.3

$D_{LK}(p,q) = 1 - \sum_{x} \min(p(x), q(x))$ 14

Corollary 3.4

$D_{LK}(p,q)$ is a symmetric divergence in $[0,1]$, where

$D_{LK}(p,q)=0 \Longleftrightarrow p=q$

$D_{LK}(p,q)=1 \Longleftrightarrow p \text{ and } q \text{ have disjoint support}$

Theorem 3.5

$\beta = 1 - D_{LK}(p,q)$ 15

Corollary 3.6

$\alpha = 1 - E(D_{LK}(p,q)) = E(\min(p,q))$
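Given matched pairs of target/draft distributions, $\alpha$ can therefore be estimated empirically as the average of $\sum_x \min(p(x), q(x))$ over observed positions; a minimal sketch:

import torch

def empirical_alpha(p_dists: torch.Tensor, q_dists: torch.Tensor) -> float:
  """p_dists, q_dists: [num_positions, vocab] target/draft distributions.
  Returns the mean of sum_x min(p(x), q(x)), an estimate of alpha."""
  return torch.minimum(p_dists, q_dists).sum(dim=-1).mean().item()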

walltime improvement

Under the i.i.d. assumption, speculative sampling reduces the number of calls to the target model by a factor of $\frac{1-\alpha^{\gamma +1}}{1-\alpha}$, assuming we run on compute resources that support the increased concurrency (GPUs).

For the walltime 16 analysis, assume we can run $\gamma + 1$ concurrent evaluations of $M_p$:

cost-efficient

let $c$ be the ratio between the time for a single run of $M_q$ and the time for a single run of $M_p$

$c$ is highly dependent on the hardware. From the paper, $c \approx 0$ to avoid expectancy biases

Theorem 3.8

The expected improvement factor in total walltime is $\frac{1-\alpha^{\gamma +1}}{(1-\alpha)(\gamma c + 1)}$ 17

Note that we assume the generation sequence is long enough here.
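For instance, with the illustrative values $\alpha = 0.8$, $\gamma = 5$, $c = 0.05$: $\frac{1-0.8^{6}}{(1-0.8)(5 \cdot 0.05 + 1)} \approx \frac{0.738}{0.25} \approx 2.95$, i.e. roughly a $3\times$ walltime improvement.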

Corollary 3.9

$\forall \alpha > c\ \exists\ \gamma$ such that we will get an improvement by a factor of $\frac{1+\alpha }{1+c}$

If we get an improvement for $\gamma$, we’d also get an improvement for any $0 < \gamma^{*} < \gamma$; hence we can use (3.8) with $\gamma = 1$, which yields $\frac{1+\alpha}{1+c}$

arithmetic operations

arithmetic operations per token

let $\hat{c}$ be the ratio of arithmetic operations per token of $M_q$ to that of $M_p$

Note that the number of operations will then grow by $\gamma +1$, given that we will produce at most $\gamma +1$ tokens per run.

Theorem 3.11

The expected factor of increase in the number of operations is $\frac{(1-\alpha)(\gamma \hat{c} + \gamma + 1)}{1-\alpha^{\gamma +1}}$ 18
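Continuing the illustrative values above ($\alpha = 0.8$, $\gamma = 5$, $\hat{c} = 0.05$): $\frac{(1-0.8)(5 \cdot 0.05 + 5 + 1)}{1-0.8^{6}} \approx \frac{1.25}{0.738} \approx 1.69$, so the roughly $3\times$ latency gain costs about $1.7\times$ more arithmetic operations.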

Remarque

  1. features here refer to the hidden states of the second-to-top decoder layer of the LLM, before the LM head. Not to be confused with features

  2. Vanilla autoregressive decoding at the token level is described by $T_{1:j} \rightarrow E_{1:j} \rightarrow f_j \rightarrow p_{j+1} \rightarrow t_{j+1}$:

    • the input $T_{1:j}$ is transformed into embeddings $E_{1:j}$
    • then into features $F_{1:j}$,
    • the LM head maps $f_j$ to a distribution $p_{j+1} = \text{LM\_Head}(f_j)$
    • the next token $t_{j+1}$ is sampled
  3. See vllm-project/vllm#20078

  4. Aligns with DistillSpec and Medusa

  5. This seems like a premature optimization. For use cases where batch sizes fluctuate, calculating an optimal speculative length is probably overkill when the improvement could be minimal.

  6. Note that by standard sampling we refer to methods such as argmax, top-k, nucleus, and temperature sampling, albeit each has a different way of processing the logits. We consider all of these as standard sampling from an adjusted distribution

  7. This work from DeepMind was performed concurrently and independently of Leviathan et al. (2023). The work at DeepMind focuses more on distributed settings for speculative decoding

  8. also referred to in practice as num_speculative_tokens

  9. or the natural measure of the acceptance rate $\beta$

  10. Rejection sampling follows an iterative sampling procedure that might look superficially similar to speculative sampling:

    1. Sample $x \sim q(x)$ and $r \sim U(0,1)$
    2. If $r < \frac{p(x)}{M q(x)}$, return $x$
    3. Otherwise, go to 1

    where $M = \operatorname{max}_{x} \frac{p(x)}{q(x)}$

    We could employ a non-iterative version of rejection sampling instead of speculative sampling here (go through steps 1 and 2, and otherwise sample from the unmodified $p(x)$ directly)

    Specifically, the expected accept probability:

    $E_{x\sim q(x)} \frac{p(x)}{M q(x)} = \sum_{x} p(x) \min_{x^{'}}{\frac{q(x^{'})}{p(x^{'})}} \le \sum_{x} p(x) \min\left(1, \frac{q(x)}{p(x)}\right) = \sum_{x} \min(p(x), q(x))$
  11. A lenience parameter $l \in [0,1]$ introduces a further trade-off. This is useful when the distribution of the draft model does not match the target model exactly.

    Specifically we have:

    $\begin{aligned} \alpha &= \mathbb{E}_{x\sim q(x)} \!\left[ \begin{cases} 1, & \text{if } l\,q(x) \le p(x),\\ \dfrac{p(x)}{l\,q(x)}, & \text{if } l\,q(x) > p(x) \end{cases} \right] \\ &= \mathbb{E}_{x\sim q(x)}\! \frac{p(x)}{\max\!\bigl(p(x),\,l\,q(x)\bigr)} \\ &= \sum_{x} \frac{p(x)\,q(x)}{\max\!\bigl(p(x),\,l\,q(x)\bigr)} \\ &= \frac{1}{l}\sum_{x} \min\!\bigl(p(x),\,l\,q(x)\bigr) \\ &= \sum_{x} \min\!\Bigl(\tfrac{p(x)}{l},\,q(x)\Bigr). \end{aligned}$

    Important

    this relies on $q$ being sampled from the given distribution, and $l$ increases $\alpha$

    In the case of greedy decoding (temperature=0), the draft essentially outputs $x^{'}_q = \operatorname{argmax} q(x)$, so scaling $q(x)$ by $l$ becomes a no-op, given that the argmax is unchanged in this case.

  12. Note that we can’t use temperature=0 (i.e. argmax sampling):

    • Instead we allow some lenience before standardizing the distribution (accept a token $x$ sampled from $M_q$ when $p(x) \ge l \cdot \max p$)
    • In this case, the empirical increases to $\alpha$ are similar to those at temperature=1
  13. On Correctness of Speculative Sampling (SpS)

    We will show that $\forall p(x) \text{ and } q(x)$, tokens sampled via speculative sampling from $p(x)$ and $q(x)$ are distributed identically to those sampled from $p(x)$ alone.

    Let $\beta$ be the acceptance probability

    Note that

    $p'(x) = \operatorname{norm}\!\bigl(\max(0,\;p(x)-q(x))\bigr) = \frac{p(x)-\min\!\bigl(q(x),\,p(x)\bigr)}{\sum_{x'}\!\bigl(p(x')-\min\!\bigl(q(x'),\,p(x')\bigr)\bigr)} = \frac{p(x)-\min\!\bigl(q(x),\,p(x)\bigr)}{1-\beta},$

    so the normalising constant for the adjusted distribution $p'(x)$ is $1-\beta$; the last equality follows immediately from Lemma 3.3 and Theorem 3.5.

    Now

    $P(x = x') \;=\; P(\text{guess accepted},\,x = x') \;+\; P(\text{guess rejected},\,x = x').$

    Where

    $P(\text{guess accepted},\,x = x') \;=\; q(x')\,\min\!\bigl(1,\tfrac{p(x')}{q(x')}\bigr) \;=\; \min\!\bigl(q(x'),\,p(x')\bigr),$

    and

    $P(\text{guess rejected},\,x = x') \;=\; (1-\beta)\,p'(x') \;=\; p(x') - \min\!\bigl(q(x'),\,p(x')\bigr).$

    Overall

    $\begin{aligned} P(x = x') &= \min\!\bigl(p(x'),\,q(x')\bigr) \;+\; p(x') - \min\!\bigl(p(x'),\,q(x')\bigr) \\ &= p(x'). \end{aligned}$

    \boxed{}

  14. $\begin{aligned} D_{LK}(p,q) &= \sum_{x} |p(x) - M(x)| = \sum_{x} \frac{|p-q|}{2} \\ &= 1- \sum_{x} \frac{p+q - |p-q|}{2} \\ &= 1 - \sum_{x} \min\bigl(p(x), q(x)\bigr) \end{aligned}$

    \boxed{}

  15. $\begin{aligned} \beta &= \mathbb{E}_{x \sim q(x)} \Biggl[ \begin{cases} 1 & \text{if } q(x) \le p(x), \\ \dfrac{p(x)}{q(x)} & \text{if } q(x) > p(x) \end{cases} \Biggr] \\ &= \sum_{x} \min\!\bigl(p(x),\,q(x)\bigr). \end{aligned} \qquad\square$
  16. also known as elapsed real time (Wikipedia). This is different from CPU time: walltime measures the actual time taken from the start of the program, whereas CPU time only measures the time during which the processor is actively working on a given task or process

  17. Denote the cost of running a single step of $M_p$ by $T$.

    Each run then costs $T c \gamma + T = T(c \gamma +1)$ (running $M_q$ $\gamma$ times and running $M_p$ once)

    Given that (1) produces $\frac{1-\alpha^{\gamma +1}}{1-\alpha}$ tokens,

    the cost to produce a token with speculative sampling is $\frac{(c \gamma +1)(1-\alpha )}{1-\alpha^{\gamma +1}} T$

    \boxed{}

  18. Denote by $\hat{T}$ the number of arithmetic operations done by standard decoding per token; speculative sampling then costs $\hat{T} \hat{c} \gamma + \hat{T}(\gamma +1)$ operations. Dividing by the expected number of tokens gives the desired result $\boxed{}$