Fusaaaann / jina_markdown_tokenizer_regex.ts
Created August 15, 2024 09:59
Chunks text with regular expressions, using all available semantic cues. Refactored from https://gist.github.com/hanxiao/3f60354cf6dc5ac698bc9154163b4e6a
const moo = require('moo');
// Named constants in place of magic numbers
const MAX_HEADING_LENGTH = 6;
const MAX_HEADING_CONTENT_LENGTH = 200;
const MAX_HEADING_UNDERLINE_LENGTH = 200;
const MAX_HTML_HEADING_ATTRIBUTES_LENGTH = 100;
const MAX_LIST_ITEM_LENGTH = 200;
const MAX_NESTED_LIST_ITEMS = 5;
const MAX_LIST_INDENT_SPACES = 7;