Jesse createthis

## DeepSeek_V3_2.md

      
Created November 19, 2025 19:43

DeepSeek_V3_2 pdf converted to markdown using DeepSeek OCR

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

DeepSeek-AI
research@deepseek.com

Abstract

We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. The model checkpoints are available at https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp.

  
## mqa_attn_return_logits_kernel.cu
#include <tl_templates/cuda/cuda_fp8.h>
#include <tl_templates/cuda/gemm.h>
#include <tl_templates/cuda/copy.h>
#include <tl_templates/cuda/reduce.h>
#include <tl_templates/cuda/ldsm.h>
#include <tl_templates/cuda/threadblock_swizzle.h>
#include <tl_templates/cuda/debug.h>
#ifdef ENABLE_BF16
#include <tl_templates/cuda/cuda_bf16_fallbacks.cuh>
#endif

## dump_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch
import os
import sys
from typing import Optional

# Optional TVM runtime import to dump CUDA/PTX sources
import tilelang
from tilelang import tvm

## bench_topk_tilelang.py
#!/usr/bin/env python3
import argparse
import time
import torch

# TileLang example kernels
from examples.deepseek_v32.topk_selector import tl_topk, tl_topk_impl

def bench_tl_topk(seq_len: int, topk: int = 256, batch: int = 1, iters: int = 50, warmup: int = 5):
    torch.cuda.synchronize()

## bench_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch

# Prefer local examples path resolution if running from repo root
try:
    from examples.deepseek_v32.utils import per_custom_dims_cast_to_fp8 as _to_fp8
    def to_fp8(x):
        # Cast along last dim to FP8 E4M3 to match kernel expectations
        # Handle both (x, dims, use_ue8m0) and (x, dims) signatures and return the scaled tensor only.

## topk_selector_analysis.md

      
Last active October 12, 2025 16:28

Analysis of topk_selector.py by DeepSeek V3.1-Terminus when given DSA context

This code implements a high-performance Top-K selection algorithm using TileLang for GPU acceleration. I'll explain it line by line, focusing on the radix-based selection approach.
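
Before the line-by-line walk-through, the general idea of radix-based selection can be illustrated in plain Python. This is a minimal sketch of the textbook technique, not the TileLang kernel: it histograms one digit of an order-preserving integer key per pass, accepts every bucket that lies entirely inside the top k, and refines only the boundary bucket. The names `float_to_key` and `radix_topk`, and the 8-bit digit width, are assumptions made for illustration.

```python
import struct

def float_to_key(x: float) -> int:
    """Map a float32 to a uint32 whose integer order matches the float order."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def radix_topk(values, k, bits_per_pass=8):
    """Return the indices of the k largest values (in no particular order)."""
    if k <= 0:
        return []
    if k >= len(values):
        return list(range(len(values)))
    keys = [float_to_key(v) for v in values]
    mask = (1 << bits_per_pass) - 1
    candidates = list(range(len(values)))
    selected = []
    shift = 32 - bits_per_pass
    while shift >= 0:
        # Histogram the current digit of every remaining candidate.
        hist = [0] * (mask + 1)
        for i in candidates:
            hist[(keys[i] >> shift) & mask] += 1
        # Walk from the largest digit down until the buckets cover what is still needed.
        need = k - len(selected)
        threshold, taken = mask, 0
        while taken + hist[threshold] < need:
            taken += hist[threshold]
            threshold -= 1
        # Buckets above the threshold are fully in the top k; the threshold
        # bucket is the boundary and is refined on the next digit.
        selected.extend(i for i in candidates if ((keys[i] >> shift) & mask) > threshold)
        candidates = [i for i in candidates if ((keys[i] >> shift) & mask) == threshold]
        shift -= bits_per_pass
    # Remaining candidates are identical in all 32 bits; take as many as needed.
    selected.extend(candidates[: k - len(selected)])
    return selected
```

For example, `radix_topk([0.5, -1.0, 3.25, 2.0, 3.25, -7.5, 0.0, 10.0], 3)` selects the positions of `10.0` and the two `3.25` entries. A GPU version would build the histograms in parallel, which is the part a kernel DSL such as TileLang is suited to express.
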
1. Imports and Configuration

import torch
import tilelang
import tilelang.language as T
pass_configs = {
 tilelang.PassConfigKey.TL_DISABLE_THREAD_STORAGE_SYNC: True,

  
## fp8_lighting_indexer_analysis.md

      
Last active October 12, 2025 14:13

Analysis of fp8_lighting_indexer.py by DeepSeek V3.1-Terminus when given DSA context

This code implements the DeepSeek Sparse Attention (DSA) lightning indexer, which computes index scores for efficient attention using FP8 precision. I'll explain it line by line, breaking it into logical sections. The code uses TileLang (a DSL for GPU kernels) and PyTorch for high-performance computation.
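
As a rough reference for what an index score is, the sketch below assumes the form given in the DeepSeek-V3.2-Exp report: for a query token t and a token s, a weighted sum over indexer heads of ReLU applied to the query-key dot product, with one key shared by all heads. It uses plain Python lists instead of FP8 tensors and ignores quantization entirely. The function name `index_scores`, the argument layout, and the choice to let a token score itself (s <= t) are assumptions for illustration, not the kernel's interface.

```python
import math

def index_scores(q, k, w):
    """Compute I[t][s] = sum_j w[t][j] * relu(dot(q[t][j], k[s])) for s <= t.

    q: [T][H][D] per-head indexer queries
    k: [T][D]    indexer keys, shared across heads
    w: [T][H]    per-head weights for each query token
    Positions with s > t are masked with -inf (causal masking, assumed here).
    """
    num_tokens = len(q)
    scores = [[-math.inf] * num_tokens for _ in range(num_tokens)]
    for t in range(num_tokens):
        for s in range(t + 1):
            total = 0.0
            for j, q_head in enumerate(q[t]):
                dot = sum(a * b for a, b in zip(q_head, k[s]))
                total += w[t][j] * max(dot, 0.0)  # ReLU on the per-head logit
            scores[t][s] = total
    return scores
```

A real kernel would compute the per-head logits as one batched matrix multiply and fuse the ReLU and weighted sum, but the arithmetic per (t, s) pair is the same as in this loop.
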
1. Imports and Utility Functions

# ruff: noqa
import itertools
import tilelang
from tilelang import language as T
import torch

  
## deepseek_sparse_attention.md

      
Last active October 12, 2025 16:16

ds v3.2-exp first page - markdown

1. Architecture

Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2-Exp is the introduction of DeepSeek Sparse Attention (DSA) through continued training.

Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.
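
To make the two-component split concrete, here is a minimal sketch that assumes the index scores for one query are already available. The selection step keeps the k highest-scoring positions up to the query position, and attention is then evaluated over only those positions. The function names are hypothetical, and ordinary single-head softmax attention stands in for the model's actual attention layer.

```python
import heapq
import math

def select_tokens(score_row, t, k):
    """Indices of the k highest index scores among positions 0..t."""
    return heapq.nlargest(k, range(t + 1), key=lambda s: score_row[s])

def sparse_attend(query, keys, values, selected):
    """Softmax attention for one query, restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, keys[s])) for s in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    norm = sum(weights)
    dim = len(values[0])
    return [
        sum(wt * values[s][d] for wt, s in zip(weights, selected)) / norm
        for d in range(dim)
    ]
```

With this split, the attention cost per query depends on k rather than on the full context length, which is where the long-context savings come from.
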
The lightning indexer computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$ and a preceding token $\mathbf{h}_s\in\mathbb{R}^d$, determining which tokens are selected by the query token:
$$

  
## deepseek_v3_2_exp_chat_template.jinja
{% if not add_generation_prompt is defined %}
  {% set add_generation_prompt = false %}
{% endif %}
{% if not thinking is defined %}
  {% set thinking = false %}
{% endif %}
{% set ns = namespace(is_first=false, is_tool=false, system_prompt='', is_first_sp=true, is_last_user=false, is_only_sys=false, is_prefix=false) %}
{%- for message in messages %}
  {%- if message['role'] == 'system' %}
    {%- if ns.is_first_sp %}

## parse_json_tool_calls.md

      
Last active September 8, 2025 01:13

parse_json_tool_calls update_cursor true vs false

Input number line (per character)

<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>get_time<｜tool▁sep｜>{"city": "Tokyo"}<｜tool▁call▁end｜><｜tool▁calls▁end｜>
|　                　| 　               　|        |　        　|                 |　             　| 　              　|
0　                　19　               　38       47　       　58                76　            　92　            　110
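
The positions on the number line above can be recomputed from the string itself. The sketch below locates each special marker and reports its first and last index (inclusive), either in characters or in bytes. The helper name `marker_spans` is hypothetical, and UTF-8 is assumed for the byte view.

```python
TEXT = (
    "<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>get_time<｜tool▁sep｜>"
    '{"city": "Tokyo"}<｜tool▁call▁end｜><｜tool▁calls▁end｜>'
)
MARKERS = [
    "<｜tool▁calls▁begin｜>",
    "<｜tool▁call▁begin｜>",
    "<｜tool▁sep｜>",
    "<｜tool▁call▁end｜>",
    "<｜tool▁calls▁end｜>",
]

def marker_spans(text, markers, encoding=None):
    """Return (first, last) inclusive indices of each marker, in order.

    With encoding=None the indices count characters; otherwise they count
    bytes of the text encoded with the given encoding.
    """
    spans = []
    pos = 0
    for marker in markers:
        start = text.index(marker, pos)
        end = start + len(marker)
        if encoding is None:
            spans.append((start, end - 1))
        else:
            byte_start = len(text[:start].encode(encoding))
            byte_end = len(text[:end].encode(encoding))
            spans.append((byte_start, byte_end - 1))
        pos = end
    return spans

char_spans = marker_spans(TEXT, MARKERS)
byte_spans = marker_spans(TEXT, MARKERS, encoding="utf-8")
```

The character spans come out as (0, 19), (20, 38), (47, 58), (76, 92) and (93, 110), which matches the marks on the number line. The byte spans are larger because each `｜` and `▁` occupies three bytes in UTF-8, so a cursor tracked in bytes and one tracked in characters diverge after the first marker.
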

Input number line (per byte)