Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance.
We present WirelessMathLM, demonstrating that compact models (0.5B–7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property—verifiable correctness—that enables effective reinforcement learning without human feedback.
We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start.
Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using ≈100× fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales, with positive transfer to general mathematics benchmarks.
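To make the training signal concrete, below is a minimal sketch of how a binary verification reward can feed GRPO's group-relative advantage. The function names and the choice of population standard deviation are our own simplifications for illustration, not the paper's released training code.

```python
import statistics


def binary_reward(verified: bool) -> float:
    """Binary verification reward: 1.0 if the final answer verifies, else 0.0."""
    return 1.0 if verified else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against the group of
    completions sampled for the same prompt, (r - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions agree (all correct or all wrong): no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Example: 8 sampled completions for one problem, 3 of which verify as correct.
rewards = [binary_reward(v) for v in (True, False, False, True, False, True, False, False)]
print(group_relative_advantages(rewards))
```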
Key Highlights
Verification-Based RL
First to use binary verification rewards for domain-specific mathematical reasoning without human feedback
Compact Models
7B model approaches GPT-4o performance while using ≈100× fewer parameters than DeepSeek-R1
WirelessMathBench-XL
4,027 problems from 970 papers spanning 6 communication eras with automated verification
Positive Transfer
Domain-specific training improves general math performance by +8.4 points on average
Key Results


WirelessMathBench-XL Dataset
WirelessMathBench-XL is a comprehensive benchmark for evaluating mathematical reasoning in wireless communications. The dataset spans six major communication eras and covers diverse problem types including information theory, signal processing, optimization, and network analysis.
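For orientation, the snippet below sketches what a single benchmark item might look like when loaded from a JSONL file. The field names and values are illustrative assumptions on our part, not the released schema.

```python
import json

# Illustrative only: a hypothetical JSONL record for one benchmark item.
# Field names are our assumptions, not the official schema.
example_item = {
    "problem_id": "wmbxl-000123",
    "source_paper": "placeholder-paper-id",
    "era": "5G",                                # one of the six communication eras
    "topic": "information_theory",
    "question_type": "equation_completion",     # mcq | fill_in_the_blank | equation_completion
    "question": "Derive the achievable uplink rate R = [MASK] for the given SINR expression.",
    "reference_answer": r"\log_2\left(1 + \mathrm{SINR}\right)",
}


def load_benchmark(path: str) -> list[dict]:
    """Read one problem per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```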

Experimental Setup & Results
Comprehensive Baseline Comparison
We benchmark against a broad set of baselines spanning proprietary and open-source models:
Proprietary Models
- GPT-5 (57.87% overall) - Best proprietary performance
- GPT-4o (40.37%) - Close to our 7B model performance
- Claude-4.0-Sonnet (53.75%)
- Gemini-2.5-Flash (54.25%)
- Grok-4-Fast (54.89%)
Open-Source General
- DeepSeek-R1 (671B, 57.37%) - Best open-source performance
- DeepSeek-V3.1 (671B, 56.87%)
- Llama-3.3-70B (38.37%)
- Qwen2.5-72B (37.50%)
Math-Specialized
- Qwen2.5-Math-72B (42.13%)
- DeepSeekMath-7B-RL (21.50%)
Performance by Question Type
Accuracy is reported separately for three answer formats: multiple-choice questions (MCQ), fill-in-the-blank, and full equation completion.
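Evaluation across these formats relies on automated answer checking. The sketch below shows one plausible way to implement such a checker (exact option matching for MCQ, SymPy-based symbolic equivalence for expressions). It is our illustration of the idea, not the project's actual verifier; note that `parse_latex` additionally requires the `antlr4-python3-runtime` package.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex


def check_mcq(prediction: str, reference: str) -> bool:
    """Multiple choice: compare the selected option label."""
    return prediction.strip().upper() == reference.strip().upper()


def check_expression(prediction: str, reference: str) -> bool:
    """Fill-in-the-blank / equation completion: two LaTeX expressions are
    judged equivalent if their difference simplifies to zero."""
    try:
        diff = parse_latex(prediction) - parse_latex(reference)
        return simplify(diff) == 0
    except Exception:
        return False  # unparsable or non-symbolic output counts as incorrect
```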
Training Configuration
Hardware & Time
- 4 × NVIDIA A6000 GPUs
- 0.5B model: 14 hours
- 3B model: 40 hours
- 7B model: 61 hours
Hyperparameters
- 40 epochs (240 steps)
- Learning rate: 10⁻⁶
- Temperature: 0.6 (validation), 1.0 (training)
- KL penalty β = 0.01
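For reference, the settings listed above can be summarized in a single configuration object, as in the sketch below. The key names are ours, chosen for readability, and do not correspond to any particular training framework; the values are taken from the lists above.

```python
# Summary of the reported GRPO setup as a plain dict; key names are ours.
grpo_training_config = {
    "algorithm": "GRPO",
    "initialization": "base checkpoint (no supervised warm-start)",
    "epochs": 40,                       # about 240 optimizer steps in total
    "learning_rate": 1e-6,
    "kl_penalty_beta": 0.01,
    "temperature": {"training": 1.0, "validation": 0.6},
    "reward": "binary verification (1 if the answer verifies, else 0)",
    "hardware": "4 x NVIDIA A6000 GPUs",
    "wall_clock_hours": {"0.5B": 14, "3B": 40, "7B": 61},
}
```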
Transfer Learning Results
Positive Transfer to General Mathematics
Surprisingly, domain-specific training on wireless mathematics enhances general mathematical reasoning without catastrophic forgetting.
WirelessMathLM-7B Performance Gains
Key Insights
- No Catastrophic Forgetting: Specialized training strengthens rather than degrades fundamental mathematical capabilities
- Consistent Gains: Improvements across diverse mathematical domains suggest robust transfer
- Scale-Dependent Effects: 3B model shows even larger relative improvements (+39.9% on MATH-500)
- Average Improvement: +8.4 points across all general mathematics benchmarks
Qualitative Analysis
Solution Quality Assessment
A detailed analysis of 800 solutions from WirelessMathLM-7B reveals sophisticated mathematical reasoning capabilities developed through GRPO training.
Advanced Reasoning Capabilities
Domain-Specific Knowledge Integration
Strong competence in applying wireless-specific mathematical frameworks, including conjugate beamforming, information-theoretic bounds, and signal processing formulations.
Constraint Awareness
Automatically incorporates non-negativity constraints for power allocations, maintains causality in signal processing, and respects dimensionality requirements.
Physical Intuition Integration
Solutions frequently connect mathematical expressions to underlying physical phenomena, demonstrating deep understanding beyond pattern matching.
Method Justification
Correct solutions routinely include explicit rationales for the chosen approach, along with detailed step-by-step derivations.
Citation
If you use WirelessMathLM, WirelessMathBench-XL, or our methodology in your research, please cite our paper:
@article{li2025wirelessmathlm,
  title={WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning},
  author={Li, Xin and Liu, Mengbing and Zhu, Yiyang and Zhang, Wenhe and Wei, Li and An, Jiancheng and Yuen, Chau},
  journal={arXiv preprint},
  year={2025}
}