DeepSeek V3 was unexpectedly released recently. It's a decently big (685 billion parameters) model and apparently outperforms Claude 3.5 Sonnet and GPT-4o on a lot of benchmarks. And they released the base model! Very cool. Some notes:
- They don't make this comparison, but the GPT-4 technical report has some benchmarks of the original GPT-4-0314 where it seems to significantly outperform DSv3 (notably on WinoGrande, HumanEval and HellaSwag). I can't easily find evaluations of current-generation cost-optimized models like 4o and Sonnet on these. Is this just because GPT-4 benefits a lot from post-training whereas DeepSeek evaluated their base model, or is the model still worse in some hard-to-test way? GPT-4 is reportedly ~1.8T parameters, trained on roughly as much data.
- It's conceivable that the original GPT-4 is still the largest model (by total parameter count) trained for a useful amount of time. The big labs seem to have mostly focused on optimizing inference costs, and this shows that their SOTA models can mostly be matched at ~600B parameters. We can't rule out larger, better models that haven't been publicly released or announced, of course.
- DeepSeek has absurd engineers. They have 2048 H800s (slightly crippled H100s for China). LLaMA 3.1 405B is roughly competitive on benchmarks and apparently used 16384 H100s for a similar amount of time. Part of the gap comes from standard optimizations like Mixture of Experts (though their implementation is finer-grained than usual; see the sketch after these notes) and newer ones like Multi-Token Prediction, but mostly from fixing everything that made their runs slow. They avoid tensor parallelism (interconnect-heavy) by carefully compacting everything so it fits on fewer GPUs, design their own optimized pipeline parallelism, write their own PTX (roughly, Nvidia GPU assembly) for low-overhead communication so they can overlap it better, fix some precision issues with FP8 in software, casually implement a new FP12 format to store activations more compactly, and include a section suggesting hardware design changes they'd like made.
- It should in principle be significantly cheaper to host than LLaMA-3.1-405B, which is already available at around $0.8/million tokens (rough arithmetic below).
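To make the fine-grained MoE point concrete, here is a minimal, generic sketch of expert routing in PyTorch. This is my own illustration, not DeepSeek's code: the sizes (64 routed experts, 6 active, 2 shared) and the softmax top-k gating are placeholders, and the real model uses many more, smaller routed experts per layer plus tricks like auxiliary-loss-free load balancing and overlapped all-to-all communication.

```python
# Minimal sketch of fine-grained mixture-of-experts routing (illustrative,
# not DeepSeek's implementation). The idea: instead of a few large experts,
# use many small ones and activate only a handful per token, so total
# parameters grow while per-token compute stays roughly fixed.
import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=64, n_active=6,
                 d_expert=256, n_shared=2):
        super().__init__()
        # Many narrow routed experts...
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts))
        # ...plus a couple of "shared" experts every token passes through.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.n_active = n_active

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.n_active, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        out = sum(e(x) for e in self.shared)
        # Naive per-expert dispatch; real implementations batch tokens per
        # expert and overlap the all-to-all communication with compute.
        for k in range(self.n_active):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = FineGrainedMoE()
print(moe(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

Per token, only the shared experts and 6 of the 64 routed experts actually run, which is how the total parameter count can balloon while per-token cost stays close to that of a much smaller dense model.

The hosting claim also follows from simple arithmetic. This is my own back-of-envelope, not anything from the paper, apart from the widely quoted ~37B activated parameters per token:

```python
# Back-of-envelope: per-token compute, MoE DeepSeek V3 vs dense LLaMA-3.1-405B,
# using the usual rough estimate of ~2 FLOPs per (active) parameter per token.
dsv3_active_params = 37e9    # activated per token (the paper's figure)
llama_params = 405e9         # dense: every token touches all parameters

ratio = (2 * llama_params) / (2 * dsv3_active_params)
print(f"LLaMA-405B needs ~{ratio:.0f}x more compute per token")  # ~11x
```

Memory for the full set of weights and achievable batch sizes still matter a lot for real serving costs, but the per-token compute gap is why it should undercut 405B hosting.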