Hands-On vLLM Thinking Token Budget

vLLM is a workhorse for running inference on nearly any LLM. One of the recent developments in the project is thinking_token_budget, a request-level argument that limits how many tokens the model will spend on thinking.

The first place to check for usage and a list of supported models is the docs. As of today they list only three models and a single example, which says little about which other models support a thinking budget or which reasoning start and end strings to use when serving a model. So I dug into the vLLM source code to work out which models support thinking_token_budget and how to find them.

Finding all the reasoning parsers

All of the reasoning parsers supported by vLLM can be found in the vllm/reasoning/__init__.py file. For each parser it shows the name (required when running vLLM), the filename (inside the same directory), and the class name (inside that file) used to load it.
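If you would rather pull that listing out programmatically, here is a minimal sketch. It assumes the registry is written as a dict literal mapping each parser name to a (filename, classname) tuple; I have not pinned down the exact variable name, so the sample text below is a hypothetical stand-in and extract_registry is a helper made up for this post, not part of vLLM.

```python
import ast

# Hypothetical excerpt: the variable name and entries are placeholders,
# not copied from vLLM. Only the shape (name -> (filename, classname)) matters.
SAMPLE_INIT = '''
_PARSERS = {
    "alpha": ("alpha_reasoning_parser", "AlphaReasoningParser"),
    "beta": ("beta_reasoning_parser", "BetaReasoningParser"),
}
'''


def _is_str(node):
    """True if the AST node is a string literal."""
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def extract_registry(source):
    """Return {parser_name: (filename, classname)} from dict literals in source."""
    registry = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        for key, value in zip(node.keys, node.values):
            # key is None for "**other" unpacking inside a dict literal
            if (
                key is not None
                and _is_str(key)
                and isinstance(value, ast.Tuple)
                and len(value.elts) == 2
                and all(_is_str(elt) for elt in value.elts)
            ):
                registry[key.value] = (value.elts[0].value, value.elts[1].value)
    return registry


for name, (filename, classname) in extract_registry(SAMPLE_INIT).items():
    print(f"{name}: {filename}.py -> {classname}")
```

To try it on a real checkout, read vllm/reasoning/__init__.py into a string and pass that in place of SAMPLE_INIT. If the registry turns out to be built differently in your version, the function simply returns an empty dict.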

---

Classes to look for

If we look into the source code, we can see two base classes used by all the parsers: ReasoningParser, and BaseThinkingReasoningParser, which inherits from ReasoningParser.

Looking at the code, any parser class that inherits directly from ReasoningParser is an older parser for a model that either does not support thinking_token_budget or does not use simple tokens to mark where reasoning starts and stops (take granite as an example). Another big giveaway is that their reasoning_start_str and reasoning_end_str properties are None.

So if we are looking for a model that supports a thinking budget, we have to look for reasoning parser classes that inherit from BaseThinkingReasoningParser. To use one, we have to find reasoning_start_str and reasoning_end_str within that class. Some of the classes define start_token and end_token instead, but either way those are the values to pass to the --reasoning-config flag.
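The same check can be sketched in Python with the standard ast module. This is a rough illustration, not how vLLM itself resolves parsers: the sample source, class names, and token strings are hypothetical, and find_thinking_tokens is a helper invented for this post. It assumes the tokens are returned as string literals from methods or properties named like the ones above.

```python
import ast

# Hypothetical parser source: class names and token strings are placeholders.
SAMPLE = '''
class OldParser(ReasoningParser):
    pass

class NewParser(BaseThinkingReasoningParser):
    @property
    def start_token(self) -> str:
        return "<think>"

    @property
    def end_token(self) -> str:
        return "</think>"
'''

TOKEN_NAMES = {"start_token", "end_token", "reasoning_start_str", "reasoning_end_str"}


def _base_names(cls):
    """Names of the direct base classes, for both Foo and module.Foo forms."""
    names = []
    for base in cls.bases:
        if isinstance(base, ast.Name):
            names.append(base.id)
        elif isinstance(base, ast.Attribute):
            names.append(base.attr)
    return names


def find_thinking_tokens(source):
    """Map each direct BaseThinkingReasoningParser subclass to its literal tokens."""
    found = {}
    for cls in ast.walk(ast.parse(source)):
        if not isinstance(cls, ast.ClassDef):
            continue
        if "BaseThinkingReasoningParser" not in _base_names(cls):
            continue
        tokens = {}
        for item in cls.body:
            if isinstance(item, ast.FunctionDef) and item.name in TOKEN_NAMES:
                # keep only plain string literals returned by the method
                for stmt in ast.walk(item):
                    if (
                        isinstance(stmt, ast.Return)
                        and isinstance(stmt.value, ast.Constant)
                        and isinstance(stmt.value.value, str)
                    ):
                        tokens[item.name] = stmt.value.value
        found[cls.name] = tokens
    return found


print(find_thinking_tokens(SAMPLE))
```

Two limits worth knowing: it only sees direct subclasses, so a parser that inherits from another parser is skipped, and a class that sets its tokens some other way (a class attribute, or a value computed at runtime) shows up with an empty dict. In both cases you still have to open the file and read it.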

Putting it together

vLLM has a hard requirement here: to use thinking budget control, we must pass --reasoning-parser along with --reasoning-config, and the config must contain the reasoning_start_str and reasoning_end_str tokens. You can find all the reasoning tokens from the parsers using this command:

rg -l 'BaseThinkingReasoningParser' vllm/reasoning/ | xargs rg -t py -A 2 'def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b'

First command: lists all files under vllm/reasoning/ that reference BaseThinkingReasoningParser.

Second command: prints each definition of start_token, end_token, reasoning_start_str, or reasoning_end_str (plus the 2 lines following it) from those files.

Finally

Now suppose I want to use thinking_token_budget for gemma4. I need to get:

name of the parser (from vllm/reasoning/__init__.py)
reasoning config (reasoning_start_str / start_token and reasoning_end_str / end_token) from vllm/reasoning/gemma4_reasoning_parser.py

vllm serve google/gemma-4-31B \
    --reasoning-parser gemma4 \
    --reasoning-config '{"reasoning_start_str": "<|channel>", "reasoning_end_str": "<channel|>"}'
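Hand-writing JSON inside shell quotes is easy to get wrong, so here is a small sketch that builds the same command from Python using only the standard library. build_serve_command is a hypothetical helper for this post; the flags and values are the ones from the command above.

```python
import json
import shlex


def build_serve_command(model, parser, start, end):
    """Return a shell-safe `vllm serve` command line as a single string."""
    # json.dumps takes care of JSON escaping for the token strings
    config = json.dumps({"reasoning_start_str": start, "reasoning_end_str": end})
    # shlex.join quotes each argument so the JSON survives the shell intact
    return shlex.join(
        [
            "vllm", "serve", model,
            "--reasoning-parser", parser,
            "--reasoning-config", config,
        ]
    )


print(build_serve_command("google/gemma-4-31B", "gemma4", "<|channel>", "<channel|>"))
```

This pays off mostly when the start or end string contains quotes or other characters the shell would otherwise interpret.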

When the server is up and running, I can send my request with thinking_token_budget:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B",
    "messages": [
      { "role": "user", "content": "How many r's in strawberry?" }
    ],
    "thinking_token_budget": 10
  }'
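For anyone calling the server from code instead of curl, the same request can be sketched with Python's standard library. build_request and send are hypothetical helpers for this post; the URL, model name, and payload mirror the curl call above. Nothing is sent until send is called, so the snippet runs even without a server.

```python
import json
import urllib.request


def build_request(base_url, model, prompt, budget):
    """Build a chat completion request that carries thinking_token_budget."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "thinking_token_budget": budget,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_request(
    "http://localhost:8000", "google/gemma-4-31B", "How many r's in strawberry?", 10
)
print(req.full_url)
print(req.data.decode("utf-8"))
# reply = send(req)  # uncomment once the vLLM server is running
```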