Abstract

Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance.

We present WirelessMathLM, demonstrating that compact models (0.5B–7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property—verifiable correctness—that enables effective reinforcement learning without human feedback.

We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start.

Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using ≈100× fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales, with positive transfer to general mathematics benchmarks.

Key Highlights

Verification-Based RL

First to use binary verification rewards for domain-specific mathematical reasoning without human feedback

Compact Models

7B model approaches GPT-4o performance while using ≈100× fewer parameters than DeepSeek-R1

WirelessMathBench-XL

4,027 problems from 970 papers spanning 6 communication eras with automated verification

Positive Transfer

Domain-specific training improves general math performance by +8.4 points on average

Key Results

  • 39.5%: WirelessMathLM-7B accuracy on WirelessMathBench-XL
  • 100×: fewer parameters than DeepSeek-R1 while approaching its performance
  • +103%: performance improvement for the 3B model with GRPO training
  • +8.4: average point improvement on general math benchmarks
Model Performance Comparison
Figure 1: Performance comparison across different model sizes and training methods on WirelessMathBench-XL.
GRPO Training Impact
Figure 2: Impact of Group Relative Policy Optimization (GRPO) training on model performance across scales.

WirelessMathBench-XL Dataset

  • 4,027 problems
  • 970 source papers
  • 6 communication eras

WirelessMathBench-XL is a comprehensive benchmark for evaluating mathematical reasoning in wireless communications. The dataset spans six major communication eras and covers diverse problem types including information theory, signal processing, optimization, and network analysis.
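A single benchmark item can be pictured as a record like the one below. This is a hypothetical schema for illustration only; the field names, identifier format, and paper reference are our own assumptions, not the released dataset's layout.

```python
# Hypothetical record layout for one WirelessMathBench-XL item.
# All field names and values here are illustrative placeholders.
example_item = {
    "problem_id": "wmb-xl-000001",         # assumed identifier format
    "source_paper": "<arXiv id>",          # one of the 970 source papers
    "era": "5G",                           # one of the 6 communication eras
    "question_type": "fill_in_the_blank",  # mcq | fill_in_the_blank | equation_completion
    "problem": "Complete the achievable rate: R = log2(1 + ____)",
    "reference_answer": "SNR",
    "verifiable": True,                    # enables automated binary rewards
}

# The three question types match the formats evaluated below.
assert example_item["question_type"] in {
    "mcq", "fill_in_the_blank", "equation_completion"
}
```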

Techniques Distribution
Distribution of mathematical techniques and problem types in the dataset.

Experimental Setup & Results

Comprehensive Baseline Comparison

We benchmark against a broad set of baselines spanning proprietary and open-source models:

Proprietary Models

  • GPT-5 (57.87% overall) - Best proprietary performance
  • GPT-4o (40.37%) - Close to our 7B model performance
  • Claude-4.0-Sonnet (53.75%)
  • Gemini-2.5-Flash (54.25%)
  • Grok-4-Fast (54.89%)

Open-Source General

  • DeepSeek-R1 (671B, 57.37%) - Largest and best-performing open model
  • DeepSeek-V3.1 (671B, 56.87%)
  • Llama-3.3-70B (38.37%)
  • Qwen2.5-72B (37.50%)

Math-Specialized

  • Qwen2.5-Math-72B (42.13%)
  • DeepSeekMath-7B-RL (21.50%)

Performance by Question Type

Multiple Choice Questions (MCQ)

  • WirelessMathLM-7B: 53.4%
  • GPT-4o: 54.1%
  • DeepSeek-R1: 65.4%

Fill-in-the-Blank

  • WirelessMathLM-7B: 37.0%
  • Base model: 14.3% (+159% relative improvement)

Full Equation Completion

  • WirelessMathLM-7B: 36.1%
  • GPT-5-mini: 40.3%
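The three answer formats above call for different checks during automated grading. The dispatcher below is our own simplified sketch, not the paper's released verifier: a real checker would test symbolic equivalence of expressions (e.g. via a computer algebra system), whereas here equations are compared only after whitespace normalization.

```python
def grade(question_type: str, answer: str, reference: str) -> float:
    """Binary grading sketch for the three WirelessMathBench-XL formats.
    Symbolic-equivalence checking is omitted; expressions are compared
    as whitespace-normalized, lowercased strings."""
    def norm(s: str) -> str:
        return "".join(s.split()).lower()

    if question_type == "mcq":
        # Letter choice: case-insensitive exact match.
        return 1.0 if answer.strip().upper() == reference.strip().upper() else 0.0
    if question_type in ("fill_in_the_blank", "equation_completion"):
        # Expression: naive normalized match, a stand-in for a symbolic check.
        return 1.0 if norm(answer) == norm(reference) else 0.0
    return 0.0
```

For example, `grade("mcq", "b", "B")` and `grade("equation_completion", "log2(1+SNR)", "log2(1 + SNR)")` both return 1.0, while any unparseable or mismatched answer scores 0.0, which is exactly the all-or-nothing signal binary verification rewards require.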

Training Configuration

Hardware & Time

  • 4 × NVIDIA A6000 GPUs
  • 0.5B model: 14 hours
  • 3B model: 40 hours
  • 7B model: 61 hours

Hyperparameters

  • 40 epochs (240 steps)
  • Learning rate: 10⁻⁶
  • Temperature: 0.6 (validation), 1.0 (training)
  • KL penalty β = 0.01
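The group-relative advantage at the heart of GRPO reduces to a few lines. This is an illustrative sketch assuming G rollouts per prompt, each scored by the binary verification reward; the KL penalty with β = 0.01 enters the loss separately and is not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO advantage sketch: normalize each rollout's reward by the
    mean and (population) std of its group, i.e. the G samples drawn
    for the same prompt. With binary verification rewards, each entry
    is either 0.0 or 1.0."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group scored [1, 0, 0, 1], the correct rollouts get advantage ≈ +1 and the incorrect ones ≈ −1. Note that all-correct and all-wrong groups normalize to zero advantage, so gradient signal comes only from prompts where sampled solutions disagree.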

Transfer Learning Results

Positive Transfer to General Mathematics

Surprisingly, domain-specific training on wireless mathematics enhances general mathematical reasoning without catastrophic forgetting.

WirelessMathLM-7B Performance Gains

  • MATH 500: 52.0% → 67.0% (+28.8%)
  • Minerva-Math: 12.1% → 14.3% (+18.2%)
  • OlympiadBench: 25.3% → 30.2% (+19.4%)
  • AMC: 27.7% → 41.0% (+48.0%)
  • AIME24: 6.7% → 13.3% (+98.5%)

Key Insights

  • No Catastrophic Forgetting: Specialized training strengthens rather than degrades fundamental mathematical capabilities
  • Consistent Gains: Improvements across diverse mathematical domains suggest robust transfer
  • Scale-Dependent Effects: 3B model shows even larger relative improvements (+39.9% on MATH 500)
  • Average Improvement: +8.4 points across all general mathematics benchmarks

Qualitative Analysis

Solution Quality Assessment

Comprehensive analysis of 800 solutions from WirelessMathLM-7B reveals sophisticated mathematical reasoning capabilities developed through GRPO training.

  • 99.1% of solutions demonstrate clear step-by-step reasoning with logical connectives
  • 87% of correct responses properly identify the underlying problem type and select an appropriate methodology
  • 100% of solutions maintain dimensional consistency in matrix operations and physical constraints

Advanced Reasoning Capabilities

Domain-Specific Knowledge Integration

Strong competency in applying wireless-specific mathematical frameworks including conjugate beamforming, information-theoretic bounds, and signal processing formulations.

Constraint Awareness

Automatically incorporates non-negativity constraints for power allocations, maintains causality in signal processing, and respects dimensionality requirements.

Physical Intuition Integration

Solutions frequently connect mathematical expressions to underlying physical phenomena, demonstrating deep understanding beyond pattern matching.

Method Justification

Correct solutions routinely include explicit rationales for chosen approaches with detailed step-by-step derivations.

Citation

If you use WirelessMathLM, WirelessMathBench-XL, or our methodology in your research, please cite our paper:

@article{li2025wirelessmathlm,
  title={WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning},
  author={Li, Xin and Liu, Mengbing and Zhu, Yiyang and Zhang, Wenhe and Wei, Li and An, Jiancheng and Yuen, Chau},
  journal={arXiv preprint},
  year={2025}
}

Resources

Paper

Read the full paper on arXiv (coming soon)

Access Paper

Code

Source code and training scripts (will be released upon publication)

View Code

Dataset

WirelessMathBench-XL benchmark dataset

Download Dataset

Models

WirelessMathLM models

Download Models