Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

Abstract

Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current Large Language Models (LLMs), however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in semantic space. This discrete constraint limits the expressive power and ceiling of such reasoning models and often causes incomplete exploration of reasoning paths, because standard Chain-of-Thought (CoT) methods sample one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. Each concept token is a probability-weighted mixture of token embeddings; these mixtures form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning.
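The abstract describes a concept token as a probability-weighted mixture of token embeddings. The following is a minimal sketch of that single step, not the authors' implementation: the three-token vocabulary, the 2-dimensional embeddings, the logits, and the helper names `softmax` and `concept_token` are all hypothetical, chosen only to keep the arithmetic visible.

```python
import math

def softmax(logits):
    """Turn raw next-token logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_token(probs, embedding_table):
    """Probability-weighted mixture of token embeddings (illustrative helper)."""
    dim = len(embedding_table[0])
    mixed = [0.0] * dim
    for p, emb in zip(probs, embedding_table):
        for j in range(dim):
            mixed[j] += p * emb[j]
    return mixed

# Hypothetical toy vocabulary: 3 tokens, 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits

# Soft step: feed back a mixture that keeps every candidate's contribution.
soft = concept_token(probs, embedding_table)

# Discrete step for comparison: keep only the single most likely token.
hard = embedding_table[probs.index(max(probs))]

print("soft concept token:", soft)
print("discrete token embedding:", hard)
```

With a one-hot distribution the mixture reduces to an ordinary token embedding, so discrete decoding can be seen as a special case of this mixing step.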

Accuracy and Generation Length on Mathematical Datasets

	Accuracy ↑					Generation Length ↓
	MATH 500	AIME 2024	GSM8K	GPQA Diamond	Avg.	MATH 500	AIME 2024	GSM8K	GPQA Diamond	Avg.
QwQ-32B [1]
CoT Thinking	97.66	76.88	96.67	64.17	83.84	4156	12080	1556	8095	6472
CoT Thinking (Greedy)	97.00	80.00	96.57	65.15	84.68	3827	11086	1536	7417	5967
Soft Thinking	98.00	83.33	96.81	67.17	86.32	3644	10627	1391	7213	5719
DeepSeek-R1-Distill-Qwen-32B [2]
CoT Thinking	94.50	72.08	95.61	63.10	81.32	3543	9347	875	6218	4995
CoT Thinking (Greedy)	93.00	63.33	95.30	59.09	77.68	3651	8050	1048	8395	5286
Soft Thinking	95.00	76.66	95.83	64.64	83.03	3373	6620	785	4722	3875
DeepSeek-R1-Distill-Llama-70B [3]
CoT Thinking	94.70	70.40	94.82	65.34	81.31	3141	8684	620	5500	4486
CoT Thinking (Greedy)	94.61	73.33	93.60	66.16	81.92	2877	9457	606	4443	4345
Soft Thinking	94.80	73.33	94.90	66.66	82.42	3021	6644	597	4470	3683

Table 1: Comparison of Soft Thinking and various baseline methods on accuracy and generation length of correct answers across mathematical datasets. Best results are highlighted in bold.
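The headline figures in the abstract (up to 2.48 points of accuracy, up to 22.4% fewer tokens) can be reproduced from the "Avg." columns of Table 1. The short check below assumes, without confirmation from the source, that the accuracy figure comes from the QwQ-32B rows and the token figure from the DeepSeek-R1-Distill-Qwen-32B rows, both measured against sampled CoT Thinking; the values themselves are copied from the table.

```python
# Average accuracy, QwQ-32B (Table 1, "Avg." under Accuracy).
cot_acc, soft_acc = 83.84, 86.32

# Average generation length, DeepSeek-R1-Distill-Qwen-32B (Table 1, "Avg." under Generation Length).
cot_len, soft_len = 4995, 3875

# Absolute gain in accuracy, in percentage points.
acc_gain = round(soft_acc - cot_acc, 2)

# Relative reduction in generated tokens, in percent.
token_reduction = round((cot_len - soft_len) / cot_len * 100, 1)

print(f"accuracy gain: {acc_gain:.2f} points")      # accuracy gain: 2.48 points
print(f"token reduction: {token_reduction:.1f}%")   # token reduction: 22.4%
```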

	Accuracy ↑				Generation Length ↓
Method	HumanEval	MBPP	LiveCodeBench	Avg.	HumanEval	MBPP	LiveCodeBench	Avg.
QwQ-32B [1]
CoT Thinking	97.63	97.49	62.00	85.70	2557	2154	9986	4899
CoT Thinking (Greedy)	95.73	96.50	57.35	83.19	2396	2069	7034	3833
Soft Thinking	98.17	97.66	62.72	86.18	2638	2157	7535	4110
DeepSeek-R1-Distill-Qwen-32B [2]
CoT Thinking	97.25	95.13	57.33	83.23	3095	2761	8376	4744
CoT Thinking (Greedy)	87.19	87.54	43.36	72.70	2294	1703	4702	2900
Soft Thinking	97.56	95.33	59.50	84.13	2713	2534	6255	3834
DeepSeek-R1-Distill-Llama-70B [3]
CoT Thinking	97.71	94.77	56.94	83.14	2711	2386	8319	4472
CoT Thinking (Greedy)	92.07	91.82	48.02	77.30	2192	1979	5438	3203
Soft Thinking	98.17	94.94	58.42	83.84	2498	2214	6512	3741

BibTeX

@misc{zhang2025softthinkingunlockingreasoning,
  title={Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space},
  author={Zhen Zhang and Xuehai He and Weixiang Yan and Ao Shen and Chenyang Zhao and Shuohang Wang and Yelong Shen and Xin Eric Wang},
  year={2025},
  eprint={2505.15778},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.15778},
}

Soft Thinking Pipeline

Soft Thinking replaces discrete tokens with abstract concept tokens, enabling reasoning in continuous concept space.

An example of Soft Thinking and CoT

A comparison between standard CoT and Soft Thinking on a multiplication problem. For readability and interpretability, we display the highest-probability token at each Soft Thinking step; the full distribution is visualized in the heatmap. Red text marks repetitive, redundant words.

Probability Distribution of Soft Thinking at Each Step

An example illustrating the probability distribution of our proposed Soft Thinking method. At each step, top-k token candidates and their probabilities are shown. Red boxes indicate the selected tokens that form the final generated sequence for readability and interpretability.
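The two captions above describe how a Soft Thinking trace is made readable: at each step the top-k candidates and their probabilities are shown, and the highest-probability token is the one displayed. A minimal sketch of that display logic follows; the per-step distributions are invented for illustration and the helper names `top_k` and `readable_trace` are hypothetical, not taken from the authors' code.

```python
def top_k(dist, k):
    """Return the k most probable (token, probability) pairs, most probable first."""
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]

def readable_trace(steps, k=3):
    """For each step, pair the displayed (top-1) token with its top-k candidates."""
    trace = []
    for dist in steps:
        candidates = top_k(dist, k)
        displayed = candidates[0][0]  # highest-probability token, shown to the reader
        trace.append((displayed, candidates))
    return trace

# Hypothetical per-step next-token distributions (token -> probability).
steps = [
    {"43": 0.60, "forty": 0.30, "4": 0.10},
    {"times": 0.55, "*": 0.40, "x": 0.05},
]

trace = readable_trace(steps, k=2)
text = " ".join(token for token, _ in trace)

for displayed, candidates in trace:
    print(displayed, candidates)
print(text)
```

Only the displayed text is discrete here; in the method as described, the model still reasons over the full mixture at each step, and the top-1 token serves purely as a human-readable view.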
