Liang Wenfeng co-signs paper in DeepSeek’s big move after its first funding round: generation speed up by as much as 85%

After completing a 50 billion yuan financing round, DeepSeek today released its first new open-source work.

DeepSeek has just open-sourced a set of engineering solutions that make existing models run faster: the DeepSeek-V4 Pro DSpark and DeepSeek-V4 Flash DSpark models, the speculative decoding framework DSpark, and the speculative decoding training framework DeepSpec.

▲ Screenshot of the DeepSeek-V4 Pro DSpark open-source release page

According to the simultaneously released paper “DSpark: Confidence Scheduled Speculative Decoding with Semi Autoregressive Generation”, which is signed by Liang Wenfeng and was completed jointly with Peking University, DSpark has been deployed in the DeepSeek-V4 online serving system, where it reduces the computation wasted on futile verification under real user traffic.

Compared with the mature production baseline (MTP-1), DSpark raises per-user generation speed by 60% to 85% while maintaining overall throughput. More importantly, under strict interactive latency constraints, DSpark avoids a sharp drop in throughput, reaches performance levels that were previously unattainable, and pushes out the Pareto frontier of the entire serving system.

▲ Screenshot of the DSpark paper

Hugging Face address:

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

GitHub address:

https://github.com/deepseek-ai/DeepSpec

Paper address:

https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

According to the model card on Hugging Face, DeepSeek-V4 Pro DSpark and DeepSeek-V4 Flash DSpark are not new models. They add a speculative decoding module to the original versions to speed up inference and reduce costs.

Speculative decoding, put simply, is a lossless acceleration technique for large-model inference whose core process is to draft first and verify afterward. It decouples draft generation from verification by the target model: a cheap drafter proposes several tokens, and the target model checks them, so the final output matches what the target model would have produced on its own.
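The draft-then-verify loop can be sketched as follows. This is a minimal illustration under stated assumptions, not DSpark’s implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and acceptance uses the simple greedy rule (keep draft tokens until the first one that differs from the target’s own choice). Under that rule the output is identical to decoding with the target alone, which is the sense of “lossless” here.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model is a function from a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify step; returns the tokens to append to the prefix."""
    # Draft: the cheap model proposes k tokens.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verify: a real engine scores all k positions in one batched target pass;
    # this sketch loops position by position for clarity.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)

    # Every draft token passed, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


def toy_target(seq: List[Token]) -> Token:
    """Assumed target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    """Assumed draft model: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = generate([0], toy_draft, toy_target, k=3, max_new=12)
```

Because every emitted token is either confirmed or supplied by the target, a weaker drafter only reduces the speedup and never changes the result.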

Today’s mainstream parallel drafters can generate very long token sequences in a single forward pass, but because the drafted tokens do not depend on one another, the acceptance rate of later tokens in the draft falls off quickly. Moreover, if the entire long candidate sequence is verified indiscriminately, valuable batch compute is wasted on tokens that are likely to be rejected, which significantly lowers overall throughput in high-concurrency serving scenarios.
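A small calculation illustrates why verifying a long draft indiscriminately is wasteful. It assumes, purely for illustration, that each draft position has a known acceptance probability given that all earlier positions passed; the 0.8 figure and the draft length of 16 are made-up numbers, not measurements from the paper.

```python
from typing import Sequence


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """Expected number of accepted draft tokens.

    accept_probs[i] is the assumed probability that token i passes,
    given that every earlier draft token passed. A token only counts
    if the whole prefix before it was accepted.
    """
    expected = 0.0
    prefix_prob = 1.0
    for p in accept_probs:
        prefix_prob *= p          # probability the prefix up to this token survives
        expected += prefix_prob   # contribution of this position
    return expected


# Illustrative numbers only: 16 drafted tokens, each passing with probability 0.8.
draft_len = 16
accepted = expected_accepted_length([0.8] * draft_len)
wasted = draft_len - accepted     # verified positions that are not expected to be kept
```

With these assumed numbers, about 3.9 of the 16 verified positions are expected to be kept, so most of the verification compute for that request goes to tokens that are discarded.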

To address this, DeepSeek proposes the DSpark speculative decoding framework, which combines high-throughput parallel generation with an adaptive, load-aware verification mechanism. To ensure draft quality, DSpark adopts a semi-autoregressive architecture: it pairs a parallel backbone network with a lightweight sequential module that establishes dependencies between the drafted tokens, mitigating the decay in acceptance rate toward the end of the draft.

▲ DSpark architecture and decoding process

To optimize system efficiency, DSpark introduces a confidence-scheduled verification mechanism: based on the estimated prefix acceptance probability and the throughput characteristics of the engine, it dynamically adjusts the verification length for each request. In multi-domain offline benchmarks, DSpark significantly increases the effective accepted sequence length compared with the current best autoregressive and parallel drafters.
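One way to picture per-request scheduling is the sketch below. It is an assumption-laden illustration rather than the scheduler described in the paper: `choose_verify_length`, the cost functions, and the probabilities are all hypothetical. The idea is to pick the verification length that maximizes expected tokens produced per unit of engine cost, so a heavily loaded engine verifies shorter prefixes.

```python
from typing import Callable, Sequence


def choose_verify_length(prefix_pass_probs: Sequence[float],
                         step_cost: Callable[[int], float]) -> int:
    """Pick how many draft tokens to verify for one request.

    prefix_pass_probs[i]: estimated probability that the first i + 1 draft
        tokens are all accepted (assumed to come from a confidence estimator).
    step_cost(k): hypothetical engine cost of a step that verifies k draft
        tokens under the current load.
    """
    # Verifying nothing is plain decoding: one target token per step.
    expected_tokens = 1.0
    best_k = 0
    best_rate = expected_tokens / step_cost(0)

    for k, prob in enumerate(prefix_pass_probs, start=1):
        # Each extra verified position adds its prefix pass probability
        # to the expected number of tokens the step produces.
        expected_tokens += prob
        rate = expected_tokens / step_cost(k)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


# Illustrative estimates: confidence drops quickly along the draft.
probs = [0.9, 0.7, 0.4, 0.1]


def light_load(k: int) -> float:
    """Assumed cost when the engine is lightly loaded: extra tokens are cheap."""
    return 10.0 + 0.1 * (k + 1)


def heavy_load(k: int) -> float:
    """Assumed cost when the batch is saturated: every extra token is expensive."""
    return 1.0 + 1.0 * (k + 1)


k_light = choose_verify_length(probs, light_load)
k_heavy = choose_verify_length(probs, heavy_load)
```

Under these made-up cost models the same draft is verified in full when the engine is idle and cut short when it is saturated, which is the kind of load-aware behavior the paragraph above describes.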

As shown in the figure below, DeepSeek provides a minimal inference example for both the DeepSeek-V4 Pro DSpark and DeepSeek-V4 Flash DSpark models.

▲ The minimal inference example provided by DeepSeek

Overall, after deploying the DSpark versions of the DeepSeek-V4 models, users can expect improvements in generation speed, first-token latency, concurrency, and other aspects.

Now for DeepSpec, a full-stack codebase and toolchain for training and evaluating draft models for speculative decoding. It includes data preparation tools, draft model implementations, training code, and evaluation scripts, and is released under the MIT license.

▲ Screenshot of the DeepSpec open-source page

DeepSpec’s workflow runs its stages in sequence, with the output of each stage feeding into the next:

1. Data preparation: Download prompts, regenerate target answers, and build the target cache.

2. Training: Train a draft model on the cached target outputs.

3. Evaluation: Measure the acceptance of speculative decoding on benchmark tasks.
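The three stages can be mimicked with a toy pipeline. Everything here is hypothetical and is not DeepSpec’s API or command-line interface: `prepare_data`, `train_draft`, and `evaluate` are illustrative names, the “target model” is a string function, and the “draft model” is a next-token lookup table. The point is only to show each stage consuming the previous stage’s output.

```python
from typing import Callable, Dict, List

Cache = Dict[str, List[str]]


def prepare_data(prompts: List[str], target: Callable[[str], List[str]]) -> Cache:
    """Stage 1: regenerate target answers and cache them by prompt."""
    return {prompt: target(prompt) for prompt in prompts}


def train_draft(cache: Cache) -> Dict[str, str]:
    """Stage 2: fit a toy draft model on the cached target outputs.

    The "model" maps each token to the next token seen most often after it.
    """
    counts: Dict[str, Dict[str, int]] = {}
    for answer in cache.values():
        for cur, nxt in zip(answer, answer[1:]):
            followers = counts.setdefault(cur, {})
            followers[nxt] = followers.get(nxt, 0) + 1
    return {cur: max(nxts, key=nxts.get) for cur, nxts in counts.items()}


def evaluate(draft: Dict[str, str], cache: Cache) -> float:
    """Stage 3: fraction of next tokens the draft gets right.

    A stand-in for an acceptance metric; a real evaluation would use
    held-out benchmark tasks rather than the training cache.
    """
    hits = 0
    total = 0
    for answer in cache.values():
        for cur, nxt in zip(answer, answer[1:]):
            hits += int(draft.get(cur) == nxt)
            total += 1
    return hits / total if total else 0.0


def toy_target(prompt: str) -> List[str]:
    """Assumed target model: upper-cases the prompt and splits it into tokens."""
    return prompt.upper().split()


cache = prepare_data(["a b c", "a b d"], toy_target)   # stage 1 output
draft_model = train_draft(cache)                        # consumes the cache
acceptance = evaluate(draft_model, cache)               # consumes the trained draft
```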

Currently, DeepSpec supports three draft models: DSpark, DFlash, and Eagle3.

At the end of the page, the DeepSpec team also thanks SpecForge (Apache-2.0), DFlash (MIT), Qwen3, and Gemma.

 

▲ DeepSeek Acknowledgements

As can be seen, DeepSeek has released not only the models but also a complete training framework, so developers and enterprises can use it to train draft models for Qwen3, Gemma, and other models of their own.

Conclusion: Inference is growing in importance, putting engineering capability to the test

Although this DeepSeek release is low-key and not a new model iteration, its practical value is considerable. DeepSeek has released an engineering solution that makes existing models run faster, which is expected to deliver faster, lower-cost inference and lower the barrier to adopting speculative decoding.

The large-model race has entered a system-level contest in which training and inference matter equally. This is also DeepSeek’s first foray into inference optimization since completing its financing. The strategic intent is clear: not only to accelerate model iteration and productization, but also to seize the high ground in the competition over compute efficiency.
