Hands-On vLLM Thinking Token Budget


Backend Engineer 🚀 Cloud Native Enthusiast ☁

vLLM is a workhorse for running inference on just about any LLM under the sun. One recent addition to the project is thinking_token_budget, a request-level argument that controls how many tokens the model spends on thinking.

The first place to check for usage and a list of supported models is the docs. As of today they list only three models and a single example, which does not tell you which models support a thinking budget or which reasoning start and end strings to use when serving a model. So I dug into the vLLM source code and worked out which models support thinking_token_budget and how to find them.

Finding all the reasoning parsers

All of the reasoning parsers supported by vLLM are conveniently listed in the vllm/reasoning/__init__.py file. For each parser it shows the name (required when running vLLM), the filename (inside the same directory), and the class name (inside that file) used to load it.

Figure: List of reasoning parsers in vLLM

Classes to look for

If we look into the source code, we can see two base classes used by all the parsers: ReasoningParser, and BaseThinkingReasoningParser, which inherits from ReasoningParser.

Judging by the code, any parser class that inherits directly from ReasoningParser is an older parser for a model that either does not support thinking_token_budget or does not use simple tokens to mark the start and end of reasoning (granite is one example). Another big giveaway is that its reasoning_start_str and reasoning_end_str properties are None.

So if we want a model that supports a thinking budget, we have to look for reasoning parser classes that inherit from BaseThinkingReasoningParser. To use one, we need the reasoning_start_str and reasoning_end_str values from that class. Some classes define start_token and end_token instead, but either way those are the values to pass to the --reasoning-config flag.
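The lookup described above can also be sketched in Python with the standard ast module. This is only an illustration: the SAMPLE source below is made up for the example and is not copied from vLLM, and the helper names (find_thinking_tokens, _base_names) are hypothetical. It also assumes the tokens are returned as plain string literals, which may not hold for every parser.

```python
import ast

# Property names the article says carry the reasoning delimiters.
TOKEN_NAMES = {"start_token", "end_token", "reasoning_start_str", "reasoning_end_str"}


def _base_names(cls: ast.ClassDef) -> set[str]:
    """Return the names of a class's direct bases (plain or dotted)."""
    names = set()
    for base in cls.bases:
        if isinstance(base, ast.Name):
            names.add(base.id)
        elif isinstance(base, ast.Attribute):
            names.add(base.attr)
    return names


def find_thinking_tokens(source: str, base: str = "BaseThinkingReasoningParser") -> dict:
    """Map class name -> {property name: returned string literal}."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        # Only keep classes that list `base` as a direct parent.
        if not (isinstance(node, ast.ClassDef) and base in _base_names(node)):
            continue
        tokens = {}
        for item in node.body:
            if isinstance(item, ast.FunctionDef) and item.name in TOKEN_NAMES:
                # Record any string literal returned by the method.
                for stmt in ast.walk(item):
                    if (isinstance(stmt, ast.Return)
                            and isinstance(stmt.value, ast.Constant)
                            and isinstance(stmt.value.value, str)):
                        tokens[item.name] = stmt.value.value
        found[node.name] = tokens
    return found


# Hypothetical parser source, invented for this example (not vLLM code).
SAMPLE = '''
class DemoReasoningParser(BaseThinkingReasoningParser):
    @property
    def start_token(self) -> str:
        return "<think>"

    @property
    def end_token(self) -> str:
        return "</think>"

class LegacyParser(ReasoningParser):
    pass
'''

print(find_thinking_tokens(SAMPLE))
```

To try it on a real checkout, read each file under vllm/reasoning/ and pass its text to find_thinking_tokens. Classes that only inherit indirectly, or that build their tokens dynamically, will show up with an empty result or not at all, so treat the output as a starting point and confirm against the source.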

Putting it together

As a hard requirement, vLLM needs both --reasoning-parser and --reasoning-config, with the latter carrying the reasoning_start_str and reasoning_end_str tokens, to enable thinking budget control. You can list the reasoning tokens from all the parsers with this command:

rg -l 'BaseThinkingReasoningParser' vllm/reasoning/ | xargs rg -t py -A 2 'def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b' 

Command Breakdown

  • rg: ripgrep, a fast, recursive grep tool

  • -l: print only the filenames that contain a match, not the matching lines themselves

  • 'BaseThinkingReasoningParser': the search pattern, finds files containing this class/symbol name

  • vllm/reasoning/: the directory to search in, scoped to vLLM's reasoning module

Result: A list of file paths that reference BaseThinkingReasoningParser.


  • |: Passes the stdout of the left command as stdin to the right command.

  • xargs: reads lines from stdin (the filenames from step 1) and passes them as arguments to the next command

  • rg: ripgrep again, now doing a second search

  • -t py: restrict matches to Python files only (.py extension)

  • -A 2: print 2 lines after each match, useful for seeing the method body or return value

  • 'def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b': the regex pattern, matches Python method definitions for any of these four method names

  • \b: a word boundary, ensures the match doesn't continue into a longer name (e.g., won't match reasoning_end_str_extra)

Result: For each file from step 1, shows the definition lines of these four methods plus 2 lines of context after each.

Finally

Now suppose I want to use thinking_token_budget with gemma4. I need to get:

  • name of the parser (from vllm/reasoning/__init__.py)

  • the reasoning config (reasoning_start_str / start_token and reasoning_end_str / end_token) from vllm/reasoning/gemma4_reasoning_parser.py

vllm serve google/gemma-4-31B \
    --reasoning-parser gemma4 \
    --reasoning-config '{"reasoning_start_str": "<|channel>", "reasoning_end_str": "<channel|>"}'
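Quoting JSON by hand inside a shell command is easy to get wrong, so here is a small sketch that builds the same command line in Python. The helper name build_serve_command is hypothetical. The flags and token strings are the ones used in this post, and the result is only as correct as the values you pass in.

```python
import json
import shlex


def build_serve_command(model: str, parser: str, start: str, end: str) -> str:
    """Return a shell-safe `vllm serve` command line using the flags from this post."""
    # Serialise the two delimiters as the JSON object --reasoning-config expects.
    config = json.dumps({"reasoning_start_str": start, "reasoning_end_str": end})
    args = ["vllm", "serve", model,
            "--reasoning-parser", parser,
            "--reasoning-config", config]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


print(build_serve_command("google/gemma-4-31B", "gemma4", "<|channel>", "<channel|>"))
```

The printed line can be pasted into a terminal. Swapping in another parser name and its start/end strings is the only change needed for a different model.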

Once the server is up and running, I can send a request with thinking_token_budget:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B",
    "messages": [
      { "role": "user", "content": "How many r's in strawberry?" }
    ],
    "thinking_token_budget": 10
  }'
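The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above. The helper names build_chat_request and send are hypothetical, and the URL and model name are simply the ones used in this post.

```python
import json
import urllib.request


def build_chat_request(prompt: str, budget: int,
                       model: str = "google/gemma-4-31B",
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a chat completion request that carries thinking_token_budget."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "thinking_token_budget": budget,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request does not touch the network; send() does.
req = build_chat_request("How many r's in strawberry?", budget=10)
print(req.full_url)
```

With the server from the previous step running, send(req) returns the decoded response, which makes it easy to compare replies across different budget values.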