Jesse createthis

@createthis
createthis / DeepSeek_V3_2.md
Created November 19, 2025 19:43
DeepSeek_V3_2 pdf converted to markdown using DeepSeek OCR

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

DeepSeek-AI

research@deepseek.com

Abstract

We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. The model checkpoints are available at https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp.

@createthis
createthis / mqa_attn_return_logits_kernel.cu
Created November 12, 2025 18:52
mqa_attn_return_logits_kernel.cu
#include <tl_templates/cuda/cuda_fp8.h>
#include <tl_templates/cuda/gemm.h>
#include <tl_templates/cuda/copy.h>
#include <tl_templates/cuda/reduce.h>
#include <tl_templates/cuda/ldsm.h>
#include <tl_templates/cuda/threadblock_swizzle.h>
#include <tl_templates/cuda/debug.h>
#ifdef ENABLE_BF16
#include <tl_templates/cuda/cuda_bf16_fallbacks.cuh>
#endif
@createthis
createthis / dump_indexer_tilelang.py
Created November 12, 2025 18:50
dump_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch
import os
import sys
from typing import Optional
# Optional TVM runtime import to dump CUDA/PTX sources
import tilelang
from tilelang import tvm
@createthis
createthis / bench_topk_tilelang.py
Created November 12, 2025 13:17
bench_topk_tilelang.py
#!/usr/bin/env python3
import argparse
import time
import torch
# TileLang example kernels
from examples.deepseek_v32.topk_selector import tl_topk, tl_topk_impl
def bench_tl_topk(seq_len: int, topk: int = 256, batch: int = 1, iters: int = 50, warmup: int = 5):
    torch.cuda.synchronize()
@createthis
createthis / bench_indexer_tilelang.py
Created November 12, 2025 13:14
bench_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch
# Prefer local examples path resolution if running from repo root
try:
    from examples.deepseek_v32.utils import per_custom_dims_cast_to_fp8 as _to_fp8
    def to_fp8(x):
        # Cast along last dim to FP8 E4M3 to match kernel expectations
        # Handle both (x, dims, use_ue8m0) and (x, dims) signatures and return the scaled tensor only.
@createthis
createthis / topk_selector_analysis.md
Last active October 12, 2025 16:28
Analysis of topk_selector.py by DeepSeek V3.1-Terminus when given DSA context

This code implements a high-performance Top-K selection algorithm using TileLang for GPU acceleration. I'll explain it line by line, focusing on the radix-based selection approach.
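To make the radix-based idea concrete, here is a minimal pure-Python sketch. It is not the TileLang kernel: the function name `radix_topk_indices`, the 16-bit key width, and the 8-bit digit size are assumptions chosen for illustration, and the keys are assumed to be non-negative integers already. The sketch builds buckets for the most significant digit, accepts every bucket that lies entirely above the cut, and refines only the bucket that straddles the cut using the next digit.

```python
def radix_topk_indices(keys, k, key_bits=16, digit_bits=8):
    """Return indices of the k largest keys (result order is not sorted).

    Assumes keys are non-negative integers below 2**key_bits and that
    key_bits is a multiple of digit_bits (illustrative choices).
    """
    if k <= 0:
        return []
    candidates = list(range(len(keys)))
    if k >= len(candidates):
        return candidates

    selected = []
    mask = (1 << digit_bits) - 1
    # Process digits from most significant to least significant.
    for shift in range(key_bits - digit_bits, -1, -digit_bits):
        # Group the remaining candidates by their current digit.
        buckets = [[] for _ in range(mask + 1)]
        for i in candidates:
            buckets[(keys[i] >> shift) & mask].append(i)

        need = k - len(selected)
        # Walk buckets from the largest digit value downwards.
        for digit in range(mask, -1, -1):
            bucket = buckets[digit]
            if len(bucket) <= need:
                # The whole bucket is above the cut: accept it.
                selected.extend(bucket)
                need -= len(bucket)
                if need == 0:
                    return selected
            else:
                # This bucket straddles the cut: refine it with the next digit.
                candidates = bucket
                break

    # Remaining candidates have identical keys; break ties by position.
    selected.extend(candidates[: k - len(selected)])
    return selected


keys = [5, 300, 7, 65535, 300, 0, 1024]
top3 = radix_topk_indices(keys, 3)
print(sorted((keys[i] for i in top3), reverse=True))
```

Only the straddling bucket is revisited on each pass, so the work shrinks quickly after the first digit, which is the usual appeal of radix selection over a full sort.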

1. Imports and Configuration

import torch
import tilelang
import tilelang.language as T
pass_configs = {
 tilelang.PassConfigKey.TL_DISABLE_THREAD_STORAGE_SYNC: True,
@createthis
createthis / fp8_lighting_indexer_analysis.md
Last active October 12, 2025 14:13
Analysis of fp8_lighting_indexer.py by DeepSeek V3.1-Terminus when given DSA context

This code implements the DeepSeek Sparse Attention (DSA) lightning indexer, which computes index scores for efficient attention using FP8 precision. I'll explain it line by line, breaking it into logical sections. The code uses TileLang (a DSL for GPU kernels) and PyTorch for high-performance computation.
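As a rough illustration of what an index score is, the sketch below computes, for one query token, a weighted sum over indexer heads of ReLU-activated query-key dot products. This form is an assumption taken from the DSA description rather than from the code preview here, and the sketch deliberately ignores FP8 quantization, scaling factors, and causal masking. The name `index_scores` and its argument layout are hypothetical.

```python
def index_scores(q_heads, weights, keys):
    """Score every key against one query token.

    q_heads: list of H query vectors (one per indexer head) for the query token
    weights: list of H per-head weights for the query token
    keys:    list of S key vectors, one per preceding token
    Returns a list of S scores: sum_j weights[j] * relu(q_heads[j] . key).
    """
    scores = []
    for key in keys:
        total = 0.0
        for q, w in zip(q_heads, weights):
            dot = sum(a * b for a, b in zip(q, key))
            total += w * max(dot, 0.0)  # ReLU on the per-head dot product
        scores.append(total)
    return scores


q_heads = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.5, 2.0]
keys = [[2.0, -1.0], [-3.0, 4.0]]
print(index_scores(q_heads, weights, keys))
```

In a sparse-attention setting, scores like these would then feed a top-k selection step that decides which preceding tokens the query attends to.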

1. Imports and Utility Functions

# ruff: noqa
import itertools
import tilelang
from tilelang import language as T
import torch
@createthis
createthis / deepseek_sparse_attention.md
Last active October 12, 2025 16:16
ds v3.2-exp first page - markdown

1. Architecture

Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2-Exp is the introduction of DeepSeek Sparse Attention (DSA) through continued training.

Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.

The lightning indexer computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$ and a preceding token $\mathbf{h}_s\in\mathbb{R}^d$, which determines which tokens the query token selects:

$$

@createthis
createthis / deepseek_v3_2_exp_chat_template.jinja
Created October 7, 2025 23:38
DeepSeek V3.2-Exp chat_template.jinja
{% if not add_generation_prompt is defined %}
{% set add_generation_prompt = false %}
{% endif %}
{% if not thinking is defined %}
{% set thinking = false %}
{% endif %}
{% set ns = namespace(is_first=false, is_tool=false, system_prompt='', is_first_sp=true, is_last_user=false, is_only_sys=false, is_prefix=false) %}
{%- for message in messages %}
{%- if message['role'] == 'system' %}
{%- if ns.is_first_sp %}
@createthis
createthis / parse_json_tool_calls.md
Last active September 8, 2025 01:13
parse_json_tool_calls update_cursor true vs false

Input number line (per character)

<|tool▁calls▁begin|><|tool▁call▁begin|>get_time<|tool▁sep|>{"city": "Tokyo"}<|tool▁call▁end|><|tool▁calls▁end|>
|                  |                  |        |          |                 |               |                 |
0                  19                 38       47         58                76              92              110

Input number line (per byte)
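Character and byte offsets diverge as soon as the text contains a multi-byte character such as `▁` (U+2581, three bytes in UTF-8). The following minimal sketch shows the difference on a short sample string; the helper name `char_and_byte_offsets` and the sample are illustrative assumptions, not part of the parser being discussed.

```python
def char_and_byte_offsets(text, needle):
    """Return (character offset, UTF-8 byte offset) of needle in text, or None."""
    char_pos = text.find(needle)
    if char_pos < 0:
        return None
    # The byte offset is the encoded length of everything before the match.
    byte_pos = len(text[:char_pos].encode("utf-8"))
    return char_pos, byte_pos


sample = "tool\u2581call\u2581begin"  # contains two U+2581 separators
print(char_and_byte_offsets(sample, "begin"))  # (10, 14)
```

Each `▁` before the match adds two extra bytes, so a cursor tracked in bytes cannot be used directly to index a Python `str`, and vice versa.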