# Benchmarks & Quality Report

Comprehensive performance analysis of SynapseX models.

This document provides detailed benchmark results, methodology, and comparisons with state-of-the-art models. It is updated with each major release.
## Executive Summary

### SynapseX-14B v2.0 Highlights
| Metric | Score | vs Base Model | vs GPT-4o-mini |
|---|---|---|---|
| Quantum Computing Tasks | 84.8% | +38.2% | +12.4% |
| Molecular Science Tasks | 82.6% | +41.7% | +8.7% |
| Physics Problems | 82.8% | +23.6% | +6.2% |
| MMLU (General) | 79.2% | +0.3% | -3.4% |
| GSM8K (Math) | 88.4% | +3.2% | -5.1% |
| HumanEval (Code) | 78.3% | +3.6% | -10.2% |
**Key insight:** SynapseX delivers state-of-the-art performance in scientific domains while remaining competitive on general capabilities. (The "vs" columns above report relative change in score; the domain tables below report absolute percentage-point, pp, differences.)
## Domain-Specific Benchmarks

### Quantum Computing Evaluation (QuantumBench v1.0)
Our custom benchmark suite for quantum computing capabilities:
#### Task Categories

| Category | Questions | SynapseX-14B | Qwen2.5-14B | Δ |
|---|---|---|---|---|
| Circuit Design | 150 | 84.2% | 62.1% | +22.1pp |
| Error Correction | 100 | 76.8% | 51.3% | +25.5pp |
| Algorithm Design | 120 | 89.1% | 71.4% | +17.7pp |
| State Calculation | 200 | 91.3% | 68.9% | +22.4pp |
| Gate Decomposition | 80 | 82.7% | 59.2% | +23.5pp |
| Noise Analysis | 50 | 78.4% | 48.6% | +29.8pp |
| Quantum ML | 100 | 81.2% | 54.3% | +26.9pp |
#### Example Questions

**Circuit Design Example**

Question: Design a quantum circuit that implements the Grover diffusion operator for a 3-qubit system. Provide the circuit in Qiskit.

Expected: A correct implementation with H gates, X gates, and a multi-controlled Z gate.

SynapseX Response Quality: ✅ Correct circuit, explained steps, included a visualization.
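For reference, one standard Qiskit construction of the 3-qubit diffusion operator, matching the expected answer (a sketch; several equivalent decompositions exist):

```python
from qiskit import QuantumCircuit

def diffusion_operator() -> QuantumCircuit:
    """Grover diffusion operator (inversion about the mean) on 3 qubits."""
    qc = QuantumCircuit(3, name="diffuser")
    qc.h([0, 1, 2])   # rotate |s> to |000>
    qc.x([0, 1, 2])   # rotate |000> to |111>
    qc.h(2)           # multi-controlled Z as H-CCX-H on the target qubit
    qc.ccx(0, 1, 2)
    qc.h(2)
    qc.x([0, 1, 2])   # undo the X layer
    qc.h([0, 1, 2])   # undo the H layer
    return qc

print(diffusion_operator().draw())
```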
**Error Correction Example**

Question: Explain the stabilizer formalism for the 5-qubit code and derive the logical operators.

Expected: The correct stabilizer generators and the logical X_L and Z_L operators.

SynapseX Response Quality: ✅ Complete derivation with a step-by-step explanation.
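For reference, the [[5,1,3]] code's generators are the cyclic shifts of XZZXI, and its logical operators are transversal. A self-contained sanity check of the expected answer in plain Python (no quantum library required):

```python
# The four stabilizer generators of the [[5,1,3]] code (cyclic shifts of XZZXI)
# and its transversal logical operators.
GENERATORS = ["XZZXI", "IXZZX", "XIXZZ", "ZXIXZ"]
LOGICAL_X, LOGICAL_Z = "XXXXX", "ZZZZZ"

def commutes(p: str, q: str) -> bool:
    """Pauli strings commute iff they anticommute on an even number of sites."""
    anti = sum(1 for a, b in zip(p, q) if "I" not in (a, b) and a != b)
    return anti % 2 == 0

assert all(commutes(a, b) for a in GENERATORS for b in GENERATORS)
assert all(commutes(g, L) for g in GENERATORS for L in (LOGICAL_X, LOGICAL_Z))
assert not commutes(LOGICAL_X, LOGICAL_Z)  # logical X and Z must anticommute
print("Stabilizer generators and logical operators check out.")
```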
### Molecular Science Evaluation (MolBench v1.0)

#### Task Categories

| Category | Questions | SynapseX-14B | Qwen2.5-14B | Δ |
|---|---|---|---|---|
| SMILES Prediction | 200 | 87.4% | 64.2% | +23.2pp |
| Retrosynthesis | 150 | 79.8% | 52.7% | +27.1pp |
| Property Prediction | 180 | 83.6% | 61.8% | +21.8pp |
| Reaction Mechanism | 120 | 85.2% | 58.3% | +26.9pp |
| Drug-Target | 100 | 76.9% | 48.5% | +28.4pp |
| Toxicity Analysis | 80 | 79.3% | 55.1% | +24.2pp |
### Physics Evaluation (PhysicsBench v1.0)

| Category | Questions | SynapseX-14B | Qwen2.5-14B | Δ |
|---|---|---|---|---|
| Quantum Mechanics | 150 | 86.7% | 72.1% | +14.6pp |
| Statistical Mechanics | 100 | 81.4% | 65.3% | +16.1pp |
| Condensed Matter | 80 | 78.9% | 59.8% | +19.1pp |
| Electrodynamics | 120 | 84.2% | 71.6% | +12.6pp |
| Thermodynamics | 100 | 82.8% | 68.4% | +14.4pp |
## Standard LLM Benchmarks

### MMLU (Massive Multitask Language Understanding)
5-shot evaluation across 57 subjects:
| Model | MMLU Score |
|---|---|
| GPT-4o | 88.7% |
| Claude 3.5 Sonnet | 88.3% |
| GPT-4o-mini | 82.0% |
| **SynapseX-14B** | **79.2%** |
| Qwen2.5-14B-Instruct | 79.0% |
| Qwen2.5-7B-Instruct | 75.4% |
| Llama-3.1-8B-Instruct | 69.4% |
| Mistral-7B-Instruct-v0.3 | 63.4% |
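For context, 5-shot evaluation prepends five solved exemplars from the subject's dev split to each test question. A minimal sketch of the conventional prompt format (the exemplar dict layout is illustrative, not this suite's exact harness):

```python
def build_mmlu_prompt(subject: str, exemplars: list[dict],
                      question: str, choices: list[str]) -> str:
    """Format a 5-shot MMLU prompt: five solved exemplars, then the test item."""
    header = (
        f"The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    blocks = []
    for ex in exemplars:  # five {"question", "choices", "answer"} dicts (dev split)
        opts = "\n".join(f"{k}. {v}" for k, v in zip("ABCD", ex["choices"]))
        blocks.append(f"{ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{k}. {v}" for k, v in zip("ABCD", choices))
    blocks.append(f"{question}\n{opts}\nAnswer:")
    return header + "\n\n".join(blocks)
```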
STEM Subjects (where SynapseX excels):
| Subject | SynapseX-14B | Qwen2.5-14B | GPT-4o-mini |
|---|---|---|---|
| Physics | 86.2% | 78.1% | 81.4% |
| Chemistry | 84.7% | 76.3% | 79.8% |
| Computer Science | 82.1% | 79.4% | 84.2% |
| Mathematics | 81.9% | 78.6% | 83.1% |
| Biology | 79.3% | 77.8% | 80.6% |
### GSM8K (Grade School Math)
Mathematical reasoning benchmark:
| Model | GSM8K Score |
|---|---|
| GPT-4o | 95.8% |
| GPT-4o-mini | 93.2% |
| **SynapseX-14B** | **88.4%** |
| Qwen2.5-14B-Instruct | 85.7% |
| Qwen2.5-7B-Instruct | 82.6% |
| Llama-3.1-8B-Instruct | 76.6% |
| Mistral-7B-Instruct-v0.3 | 58.4% |
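For context, exact match on GSM8K compares the final number in a completion against the gold answer, which the dataset marks with `####`. A minimal scoring sketch (the last-number extraction heuristic is the common convention, and an assumption about this harness):

```python
import re

def extract_answer(completion: str) -> str | None:
    """Heuristic: take the last number in the completion as the prediction."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return nums[-1].replace(",", "").rstrip(".") if nums else None

def gold_answer(solution: str) -> str:
    """GSM8K gold solutions end with '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def exact_match(completion: str, solution: str) -> bool:
    return extract_answer(completion) == gold_answer(solution)

assert exact_match("...so she earns 18 - 8 = 10 dollars. The answer is 10.", "#### 10")
```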
### HumanEval (Code Generation)
Python code generation from docstrings:
| Model | Pass@1 |
|---|---|
| GPT-4o | 90.2% |
| Claude 3.5 Sonnet | 89.0% |
| GPT-4o-mini | 87.2% |
| **SynapseX-14B** | **78.3%** |
| Qwen2.5-14B-Instruct | 75.6% |
| Llama-3.1-8B-Instruct | 62.8% |
| Mistral-7B-Instruct-v0.3 | 35.4% |
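For reference, Pass@1 is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): with c of n sampled completions passing the tests, pass@k = 1 - C(n-c, k) / C(n, k). A direct transcription:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given c correct out of n sampled completions."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=7, k=1))  # -> 0.7
```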
### MT-Bench (Multi-Turn Conversation)

Multi-turn dialogue quality, scored by a GPT-4 judge on a 1-10 scale:
| Model | Score | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 9.1 | 9.3 | 9.0 | 9.2 | 8.8 | 9.4 | 9.0 | 9.2 | 8.9 |
| GPT-4o-mini | 8.5 | 8.6 | 8.3 | 8.5 | 8.2 | 8.9 | 8.4 | 8.6 | 8.5 |
| SynapseX-14B | 8.35 | 8.1 | 8.0 | 8.4 | 8.5 | 8.6 | 8.2 | 8.9 | 8.1 |
| Qwen2.5-7B-Instruct | 8.07 | 8.0 | 7.9 | 8.1 | 8.0 | 8.2 | 8.0 | 8.2 | 8.0 |
| Llama-3.1-8B-Instruct | 8.0 | 7.9 | 7.8 | 8.0 | 7.8 | 8.1 | 8.0 | 8.1 | 8.0 |
## Performance Benchmarks

### Latency Analysis
Tested with standardized prompts (512 input tokens, 256 output tokens):
#### First Token Latency (TTFT)
| Provider | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|
| **SynapseX** | **45** | **82** | **120** |
| Together.ai | 95 | 180 | 250 |
| Azure OpenAI | 120 | 280 | 420 |
| Replicate | 180 | 350 | 520 |
| GCP Vertex AI | 220 | 480 | 680 |
| AWS Bedrock | 280 | 520 | 780 |
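Numbers of this kind can be reproduced against any OpenAI-compatible endpoint by timing the first streamed chunk. A minimal sketch (the base URL, key, and model id are placeholders, not a documented SynapseX endpoint):

```python
import time
from openai import OpenAI

# Placeholder endpoint, key, and model id -- substitute your own deployment.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def measure_ttft_ms(prompt: str) -> float:
    """Time from request to the first streamed chunk, in milliseconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="synapsex-14b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for _chunk in stream:  # the first chunk marks time-to-first-token
        return (time.perf_counter() - start) * 1000.0
    return float("nan")
```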
#### Throughput (Tokens/Second)
| Provider | Batch=1 | Batch=4 | Batch=8 | Batch=16 |
|---|---|---|---|---|
| SynapseX | 120 | 380 | 680 | 1,100 |
| Together.ai | 85 | 260 | 420 | 680 |
| Azure OpenAI | 72 | 220 | 360 | 580 |
| Replicate | 65 | 180 | 290 | 460 |
| GCP Vertex AI | 48 | 140 | 220 | 350 |
| AWS Bedrock | 42 | 120 | 190 | 300 |
#### Cold Start Time

| Provider | Cold Start | Warm Start | Cold Start vs. SynapseX |
|---|---|---|---|
| SynapseX | 8s | 0.5s | Baseline |
| Together.ai | 12s | 0.8s | 1.5x slower |
| Replicate | 45s | 1.2s | 5.6x slower |
| Azure OpenAI | 45s | 0.9s | 5.6x slower |
| GCP Vertex AI | 165s | 2.1s | 20x slower |
| AWS Bedrock | 180s | 2.5s | 22x slower |
## Methodology

### Benchmark Configuration
All benchmarks were run with:
```yaml
evaluation:
  temperature: 0.0   # deterministic for reproducibility
  max_tokens: 2048
  num_runs: 3        # scores averaged across 3 runs
  timeout: 120s

hardware:
  gpu: AMD Instinct MI300X
  memory: 192GB HBM3

environment:
  framework: vLLM 0.6.0
  precision: BF16
  batch_size: dynamic
```
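For context, this configuration maps directly onto vLLM's offline API, where `temperature: 0.0` means greedy decoding. A minimal sketch (the Hugging Face model id `softquantus/synapsex-14b` is an assumption, not a published id):

```python
from vllm import LLM, SamplingParams

# temperature=0.0 -> greedy decoding, matching the evaluation config above.
params = SamplingParams(temperature=0.0, max_tokens=2048)

# Assumed Hugging Face model path, for illustration only.
llm = LLM(model="softquantus/synapsex-14b", dtype="bfloat16")

outputs = llm.generate(["State the stabilizers of the 5-qubit code."], params)
print(outputs[0].outputs[0].text)
```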
### Reproducibility
All benchmark code and results are available:
```bash
# Clone and run benchmarks
git clone https://github.com/softquantus/synapsex-benchmarks
cd synapsex-benchmarks

# Install dependencies
pip install -r requirements.txt

# Run the full benchmark suite
python run_benchmarks.py \
  --model synapsex-14b \
  --benchmarks all \
  --output results/
```
### Statistical Significance
- All reported scores are averages across 3 independent runs
- Error bars represent 95% confidence intervals
- Differences marked as significant have p < 0.05 (paired t-test); a worked sketch follows this list
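With three runs per configuration, the reported intervals and tests reduce to a few lines of SciPy. A sketch with illustrative per-run scores (not the actual run data):

```python
import numpy as np
from scipy import stats

# Illustrative per-run scores (not the actual run data).
synapsex = np.array([88.1, 88.4, 88.7])
baseline = np.array([85.4, 85.7, 86.0])

# 95% confidence interval for the mean of 3 runs (t distribution, df = 2).
mean, sem = synapsex.mean(), stats.sem(synapsex)
ci_low, ci_high = stats.t.interval(0.95, df=len(synapsex) - 1, loc=mean, scale=sem)

# Paired t-test across runs; differences are called significant at p < 0.05.
t_stat, p_value = stats.ttest_rel(synapsex, baseline)
print(f"mean={mean:.1f}, 95% CI=({ci_low:.1f}, {ci_high:.1f}), p={p_value:.4f}")
```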
## Training Data Quality

### Dataset Composition
The quality of training data directly impacts model performance:
**Premium Dataset v1.0 Statistics**

| Metric | Value |
|---|---|
| Total Examples | 764,842 |
| Estimated Tokens | ~1.2 billion |
| Average Instruction Length | 598.2 characters |
| Average Response Length | 987.3 characters |
| Technical Density | 0.78% |
| Information Density | 42.7% |
| Code Examples | 21,795 (2.8%) |
| With System Prompts | 371,107 (48.5%) |
| Unique Sources | 18 datasets |
| Deduplication Rate | 17% |
### Quality Filtering Pipeline
1. **Length filter:** remove examples shorter than 50 or longer than 10,000 tokens
2. **Language filter:** keep English only (detected with `langdetect`)
3. **Deduplication:** MinHash signatures plus semantic clustering (see the sketch after this list)
4. **Density scoring:** rank examples by information and technical density
5. **Category balancing:** cap any single category at 10.1% of the dataset
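A minimal sketch of the first three stages, using the real `langdetect` and `datasketch` libraries (thresholds mirror the list above; the helper names are illustrative):

```python
from langdetect import detect
from datasketch import MinHash

def length_ok(n_tokens: int) -> bool:
    """Stage 1: drop examples shorter than 50 or longer than 10,000 tokens."""
    return 50 <= n_tokens <= 10_000

def is_english(text: str) -> bool:
    """Stage 2: keep English-only examples."""
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on empty/ambiguous input
        return False

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Stage 3: word-level MinHash; near-duplicates have high Jaccard estimates."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

# Pairs with minhash(a).jaccard(minhash(b)) above a threshold (e.g. 0.8)
# are routed to the semantic-clustering pass for final deduplication.
```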
## Competitive Analysis

### vs. Foundation Models
| Capability | SynapseX-14B | Qwen2.5-14B | Llama-3.1-70B | GPT-4o-mini |
|---|---|---|---|---|
| Quantum Computing | ★★★★★ | ★★★ | ★★★ | ★★★★ |
| Molecular Science | ★★★★★ | ★★★ | ★★★ | ★★★★ |
| Physics | ★★★★★ | ★★★★ | ★★★★ | ★★★★ |
| General Knowledge | ★★★★ | ★★★★ | ★★★★★ | ★★★★★ |
| Coding | ★★★★ | ★★★★ | ★★★★ | ★★★★★ |
| Multi-turn | ★★★★ | ★★★★ | ★★★★★ | ★★★★★ |
| Cost Efficiency | ★★★★★ | ★★★★ | ★★★ | ★★★★ |
### vs. Specialized Scientific Models
| Model | Quantum | Molecular | Physics | General |
|---|---|---|---|---|
| SynapseX-14B | 84.8% | 82.6% | 82.8% | 79.2% |
| ChemLLM-7B | 45.2% | 78.4% | 61.3% | 52.1% |
| BioMistral-7B | 38.6% | 71.2% | 54.8% | 61.4% |
| SciLLM-8B | 52.3% | 68.9% | 72.1% | 65.3% |
## Appendix

### Benchmark Definitions
| Benchmark | Description | Metric |
|---|---|---|
| MMLU | 57-subject knowledge test | 5-shot accuracy |
| GSM8K | Grade school math word problems | Exact match accuracy |
| HumanEval | Python code generation | Pass@1 |
| MT-Bench | Multi-turn conversation quality | GPT-4 judge score (1-10) |
| QuantumBench | Custom quantum computing tasks | Task accuracy |
| MolBench | Custom molecular science tasks | Task accuracy |
| PhysicsBench | Graduate-level physics problems | Task accuracy |
### Version History
| Version | Date | Key Changes |
|---|---|---|
| v2.0 | Dec 2025 | Premium dataset (765K), physics domain |
| v1.5 | Sep 2025 | Multi-domain expansion |
| v1.0 | Jun 2025 | Initial release, DPO training |
| v0.1 | Mar 2025 | Prototype, SFT only |
## Contact
For benchmark questions or custom evaluations:
- Technical: benchmarks@synapsex.ai
- Enterprise: enterprise@synapsex.ai
- Research: research@synapsex.ai
Last updated: December 2025
SynapseX – Enterprise AI for Science