Hands-On vLLM Thinking Token Budget

vLLM is a workhorse for running inference on just about any LLM under the sun. One recent addition to the project is thinking_token_budget, a request-level argument that limits how many tokens the model spends on thinking.
The first place to check for usage and a list of supported models is the docs. As of today they list only three models and a single example, which does not tell you much about which other models support a thinking budget or which reasoning start and end strings to use when serving the model. So I dug into the vLLM source code to work out which models support thinking_token_budget and how to find them.
Finding all the reasoning parsers
All of the reasoning parsers supported by vLLM are conveniently listed in the vllm/reasoning/__init__.py file. For each parser it shows the name (required when running vLLM), the filename (inside the same directory), and the class name (inside that file) used to load it.
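To make the name, filename, and class name relationship concrete, here is a minimal sketch of such a listing. The PARSERS dictionary, its entries, the class names, and the locate_parser helper are all hypothetical and written for illustration; they are not copied from vLLM, so check the real __init__.py for the actual values.

```python
# Hypothetical registry shaped like the listing described above:
# parser name -> (filename without .py, class name).
# The entries are illustrative assumptions, not vLLM's actual table.
PARSERS = {
    "gemma4": ("gemma4_reasoning_parser", "Gemma4ReasoningParser"),
    "granite": ("granite_reasoning_parser", "GraniteReasoningParser"),
}


def locate_parser(name: str) -> tuple[str, str]:
    """Return (source file path, class name) for a parser name."""
    filename, class_name = PARSERS[name]
    return f"vllm/reasoning/{filename}.py", class_name


path, class_name = locate_parser("gemma4")
print(path, class_name)
```

The parser name is what goes on the command line, while the path tells us which file to open to read the class.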
---
Classes to look for
If we look into the source code, we can see two base classes used by all the parsers: ReasoningParser, and BaseThinkingReasoningParser, which inherits from ReasoningParser.
Any parser class that inherits directly from ReasoningParser is an older parser for a model that either does not support thinking_token_budget or has no simple tokens marking where reasoning starts and stops (take granite as an example). Another big giveaway is that their reasoning_start_str and reasoning_end_str properties are null.
So if we are looking for a model that supports a thinking budget, we have to look for reasoning parser classes that inherit from BaseThinkingReasoningParser. To use one, we have to find reasoning_start_str and reasoning_end_str within that class. Some classes use start_token and end_token instead, but either way those are the values to pass to the --reasoning-config flag.
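The inheritance check can be sketched with toy classes. Everything below is a simplified stand-in that I wrote to mirror the description above: the two base classes share names with vLLM's but none of its code, and OldStyleParser and ThinkTagParser are hypothetical.

```python
class ReasoningParser:
    """Toy stand-in for the generic base class: no start/end strings."""
    reasoning_start_str = None
    reasoning_end_str = None


class BaseThinkingReasoningParser(ReasoningParser):
    """Toy stand-in: subclasses are expected to expose simple tokens."""

    @property
    def start_token(self) -> str:
        raise NotImplementedError

    @property
    def end_token(self) -> str:
        raise NotImplementedError


class OldStyleParser(ReasoningParser):
    """Hypothetical older parser that inherits directly from the base."""


class ThinkTagParser(BaseThinkingReasoningParser):
    """Hypothetical parser for a model that wraps reasoning in tags."""

    @property
    def start_token(self) -> str:
        return "<think>"

    @property
    def end_token(self) -> str:
        return "</think>"


def supports_thinking_budget(parser_cls: type) -> bool:
    # The rule of thumb from the text: inherit from the "thinking" base.
    return issubclass(parser_cls, BaseThinkingReasoningParser)


def reasoning_config(parser: BaseThinkingReasoningParser) -> dict:
    # Map the parser's tokens onto the keys used by --reasoning-config.
    return {
        "reasoning_start_str": parser.start_token,
        "reasoning_end_str": parser.end_token,
    }


print(supports_thinking_budget(OldStyleParser))
print(reasoning_config(ThinkTagParser()))
```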
Putting it together
As a hard requirement, vLLM needs --reasoning-parser to be passed along with --reasoning-config, where we supply the reasoning_start_str and reasoning_end_str values, before thinking budget control can be used. You can find all the reasoning tokens from the parsers using this command:
rg -l 'BaseThinkingReasoningParser' vllm/reasoning/ | xargs rg -t py -A 2 'def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b'
Command Breakdown
rg: ripgrep, a fast, recursive grep tool
-l: print only the filenames that contain a match, not the matching lines themselves
'BaseThinkingReasoningParser': the search pattern, finds files containing this class/symbol name
vllm/reasoning/: the directory to search in, scoped to vLLM's reasoning module
Result: A list of file paths that reference BaseThinkingReasoningParser.
|: passes the stdout of the left command as stdin to the right command
xargs: reads lines from stdin (the filenames from step 1) and passes them as arguments to the next command
rg: ripgrep again, now doing a second search
-t py: restrict matches to Python files only (.py extension)
-A 2: print 2 lines after each match, useful for seeing the method body or return value
'def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b': the regex pattern, matches Python method definitions for any of these four method names
\b: a word boundary, ensures the match doesn't continue into a longer name (e.g., won't match reasoning_end_str_extra)
Result: For each file from step 1, shows the definition lines of these four methods plus 2 lines of context after each.
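If ripgrep is not installed, the same two-step search can be approximated with the Python standard library. This is a rough sketch and not an exact equivalent: find_reasoning_tokens is a name I made up, it only walks .py files, and unlike rg it does not merge overlapping context lines.

```python
import re
from pathlib import Path

# Same regex as the rg command: one of four method names, then a word boundary.
DEF_PATTERN = re.compile(
    r"def (start_token|end_token|reasoning_start_str|reasoning_end_str)\b"
)


def find_reasoning_tokens(root: str, context: int = 2) -> dict[str, list[str]]:
    """Map each matching file to its definition lines plus trailing context."""
    results: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        # Step 1 (rg -l): keep only files that mention the base class.
        if "BaseThinkingReasoningParser" not in text:
            continue
        lines = text.splitlines()
        snippets = []
        for i, line in enumerate(lines):
            # Step 2 (rg -A 2): the definition line plus `context` lines after it.
            if DEF_PATTERN.search(line):
                snippets.append("\n".join(lines[i : i + 1 + context]))
        if snippets:
            results[str(path)] = snippets
    return results


if __name__ == "__main__":
    # Assumes the script runs from the root of a vLLM checkout.
    for filename, snippets in find_reasoning_tokens("vllm/reasoning").items():
        print(filename)
        for snippet in snippets:
            print(snippet)
            print("--")
```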
Finally
Now suppose I want to use thinking_token_budget with gemma4. I need to get:
the name of the parser (from vllm/reasoning/__init__.py)
the reasoning config (reasoning_start_str/start_token and reasoning_end_str/end_token) from vllm/reasoning/gemma4_reasoning_parser.py
vllm serve google/gemma-4-31B \
--reasoning-parser gemma4 \
--reasoning-config '{"reasoning_start_str": "<|channel>", "reasoning_end_str": "<channel|>"}'
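Hand-writing JSON inside shell quotes is easy to get wrong, so here is a small sketch that builds the same command line programmatically. build_serve_command is a hypothetical helper of mine; it only uses the two flags and two config keys shown above and assumes nothing else about the vLLM CLI.

```python
import json
import shlex


def build_serve_command(model: str, parser: str, start: str, end: str) -> str:
    """Return a shell-safe `vllm serve` command with a JSON reasoning config."""
    config = json.dumps(
        {"reasoning_start_str": start, "reasoning_end_str": end}
    )
    args = [
        "vllm", "serve", model,
        "--reasoning-parser", parser,
        "--reasoning-config", config,
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


print(build_serve_command("google/gemma-4-31B", "gemma4", "<|channel>", "<channel|>"))
```

Swapping in another parser then only means changing the four arguments, with the quoting handled for us.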
Once the server is up and running, I can send my request with thinking_token_budget:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B",
"messages": [
{ "role": "user", "content": "How many r'\''s in strawberry?" }
],
"thinking_token_budget": 10
}'
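The same request can be sent from Python using only the standard library, which also sidesteps shell quoting around the prompt. This is a minimal sketch: build_request and send are hypothetical helper names, and the URL, model, and thinking_token_budget field simply mirror the curl call above.

```python
import json
import urllib.request


def build_request(
    prompt: str,
    budget: int,
    model: str = "google/gemma-4-31B",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a chat completion request carrying a thinking token budget."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "thinking_token_budget": budget,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the vLLM server from the previous step to be running.
    print(send(build_request("How many r's in strawberry?", budget=10)))
```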



