Duck Block Specification

Version: 2.0 Status: Stable Last Updated: 2025-01

A format-agnostic specification for representing documents as sequences of typed blocks in DuckDB. This specification enables interoperability between document format extensions, allowing documents to be converted, transformed, and analyzed using SQL.

Overview

Documents are represented as ordered sequences of blocks. Each block is a row with standardized columns, enabling:

Format conversion: Read HTML → transform → write Markdown
Cross-format queries: Analyze structure across document types
Unified tooling: Common utilities work with any compliant extension
SQL-based transformation: Filter, aggregate, and manipulate documents

Design Principles

Minimal core: Small set of universal element types that map across formats
Extensible: Formats can define additional element types with namespaced identifiers
Lossless where possible: Preserve source structure; note where normalization occurs
SQL-native: Use DuckDB types naturally (VARCHAR, INTEGER, MAP, LIST)

Core Schema (duck_block shape)

All compliant extensions MUST produce rows with these columns:

Column	Type	Required	Description
`kind`	VARCHAR	Yes	'block' or 'inline'
`element_type`	VARCHAR	Yes	Element type identifier
`content`	VARCHAR	Yes	Primary content of the element
`level`	INTEGER	No	Hierarchy level, nesting depth, or NULL
`encoding`	VARCHAR	Yes	Content encoding (see below)
`attributes`	MAP(VARCHAR, VARCHAR)	Yes	Type-specific metadata
`element_order`	INTEGER	Yes	Sequential position in document (0-indexed)

Optional columns: | Column | Type | Description | |--------|------|-------------| | file_path | VARCHAR | Source file path (when reading multiple files) |

Column Semantics

kind

Distinguishes between block-level and inline elements: - 'block': Block-level elements (headings, paragraphs, code blocks, lists, tables) - 'inline': Inline elements (bold, italic, links, images within text)

Block elements are rendered with trailing newlines. Inline elements are concatenated without trailing newlines.

element_type

Identifies the element's semantic type. Uses either: - Core type: One of the universal types defined below - Namespaced type: Format-specific type as format:type (e.g., html:div, md:footnote)

content

The primary textual content of the element. Interpretation depends on encoding: - For text: Raw text content, may include inline formatting - For json: Valid JSON string - For yaml: Valid YAML string - For html: HTML fragment - For xml: XML fragment

level

Document nesting depth or heading level: - Top-level blocks: 1 - Nested structures: Depth from root (2+) - Frontmatter/metadata: 0 - Not applicable: NULL

Note: For headings, the H1-H6 level should preferably be stored in attributes['heading_level']. If not present, the level field is used as a fallback.

encoding

Declares how content should be interpreted:

Encoding	Description	Use Cases
`text`	Plain text, possibly with inline markup	Paragraphs, headings, code
`json`	JSON-encoded structured data	Lists, tables, complex structures
`yaml`	YAML-encoded structured data	Frontmatter, metadata
`html`	HTML fragment	Preserved HTML in markdown
`xml`	XML fragment	Preserved XML content

attributes

Key-value metadata specific to the element type. All values are VARCHAR (convert as needed).

Common attributes: | Attribute | Description | Used By | |-----------|-------------|---------| | id | Element identifier | headings, anchors | | heading_level | Heading level (takes priority over level field) | headings | | language | Code language | code blocks | | ordered | List ordering ('true'/'false') | lists | | start | List start number | ordered lists | | src | Source URL | images, embeds | | href | Link URL | links | | alt | Alternative text | images | | title | Title text | links, images |

Core Block Types (kind = 'block')

These element types MUST be recognized by all compliant extensions:

Type	Description	Content Encoding	Level	Key Attributes
`heading`	Section heading	text	1-6	`id`, `heading_level`
`paragraph`	Text paragraph	text	NULL	-
`code`	Code block	text	NULL	`language`
`blockquote`	Quoted content	text	depth (1+)	-
`list`	List of items	json	1	`ordered`, `start`
`table`	Tabular data	json	NULL	-
`hr`	Thematic break	text (empty)	NULL	-
`metadata`	Document metadata	yaml or json	0	-
`frontmatter`	Document metadata	yaml	0	-
`image`	Block-level image	text (empty)	NULL	`src`, `alt`, `title`
`raw`	Raw format-specific content	varies	NULL	`format`

Block Type Details

heading

Section or document headings. The H1-H6 level is stored in attributes['heading_level'] (preferred). If heading_level is not present, the level field can be used as a fallback.

kind: 'block'
element_type: 'heading'
content: 'Introduction'
level: 1
encoding: 'text'
attributes: {'id': 'introduction', 'heading_level': '1'}

paragraph

Block of text. Inline formatting (bold, italic, links) is preserved in content.

kind: 'block'
element_type: 'paragraph'
content: 'This is a paragraph with **bold** and *italic* text.'
level: NULL
encoding: 'text'
attributes: {}

code

Preformatted code or monospace text.

kind: 'block'
element_type: 'code'
content: 'def hello():\n    print("Hello")'
level: NULL
encoding: 'text'
attributes: {'language': 'python'}

list

Ordered or unordered list. Content is JSON array.

kind: 'block'
element_type: 'list'
content: '["Item 1", "Item 2", "Item 3"]'
level: 1
encoding: 'json'
attributes: {'ordered': 'false'}

table

Tabular data as JSON with headers and rows.

kind: 'block'
element_type: 'table'
content: '{"headers": ["Name", "Age"], "rows": [["Alice", "30"], ["Bob", "25"]]}'
level: NULL
encoding: 'json'
attributes: {}

metadata / frontmatter

Document metadata (frontmatter, head elements, etc.). Level 0 indicates document-level.

kind: 'block'
element_type: 'frontmatter'
content: 'title: My Document\nauthor: Jane Doe'
level: 0
encoding: 'yaml'
attributes: {}

Core Inline Types (kind = 'inline')

Inline elements for rich text composition:

Type	Description	Output	Key Attributes
`text`	Plain text	as-is	-
`bold` / `strong`	Bold text	`text`	-
`italic` / `em`	Italic text	`text`	-
`code`	Inline code	`text`	-
`link`	Hyperlink	`[text](href)`	`href`, `title`
`image`	Inline image	`![alt](src)`	`src`, `title`
`strikethrough` / `del`	Strikethrough	`~~text~~`	-
`superscript` / `sup`	Superscript	`^text^`	-
`subscript` / `sub`	Subscript	`~text~`	-
`linebreak` / `br`	Hard line break	`<br>`	-
`math`	Math expression	$text$	`display`: inline/block

Inline Element Example

-- Compose rich text from inline elements
SELECT duck_blocks_to_md([
    {kind: 'inline', element_type: 'text', content: 'Check out ', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 0},
    {kind: 'inline', element_type: 'link', content: 'our docs', level: NULL, encoding: 'text', attributes: MAP{'href': 'https://example.com'}, element_order: 1},
    {kind: 'inline', element_type: 'text', content: ' for ', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 2},
    {kind: 'inline', element_type: 'bold', content: 'more info', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 3}
]);
-- Returns: 'Check out [our docs](https://example.com) for **more info**'

Format-Specific Extensions

Extensions MAY define additional element types using namespaced identifiers: format:type

Namespace Registry

Namespace	Format	Extension
`md`	Markdown	duckdb_markdown
`html`	HTML	duckdb_webbed
`xml`	XML	duckdb_webbed

Extension Examples

Markdown-specific: - md:footnote - Footnote definition - md:task_list - Task/checkbox list - md:html_block - Raw HTML in markdown

HTML-specific: - html:div - Generic div container - html:script - Script element

Writer Requirements

Extensions providing COPY TO functionality MUST:

Accept rows matching the core schema
Handle all core element types (or error gracefully)
Support custom column name mapping via options
Handle block/inline transitions appropriately

Standard COPY TO Options

Option	Type	Default	Description
`kind_column`	VARCHAR	'kind'	Column containing kind
`element_type_column`	VARCHAR	'element_type'	Column containing element type
`content_column`	VARCHAR	'content'	Column containing content
`level_column`	VARCHAR	'level'	Column containing level
`encoding_column`	VARCHAR	'encoding'	Column containing encoding
`attributes_column`	VARCHAR	'attributes'	Column containing attributes

SQL Usage Examples

Reading and Writing

-- Read markdown into duck_block rows
SELECT * FROM read_markdown_blocks('doc.md');

-- Round-trip: read, transform, write
COPY (
    SELECT kind, element_type, upper(content) as content, level, encoding, attributes
    FROM read_markdown_blocks('input.md')
    WHERE element_type != 'hr'
    ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

Converting Blocks to Markdown

-- Convert blocks back to markdown string
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('doc.md') b;

-- Build document programmatically
SELECT duck_blocks_to_md([
    {kind: 'block', element_type: 'heading', content: 'Title', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
    {kind: 'block', element_type: 'paragraph', content: 'Body text.', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 1}
]);

Converting Blocks to Sections

-- Get hierarchical sections from blocks
SELECT s.section_id, s.level, s.title
FROM (
    SELECT unnest(duck_blocks_to_sections(list(b ORDER BY element_order))) as s
    FROM read_markdown_blocks('doc.md') b
);

Version History

Version	Date	Changes
2.0	2025-01	Added `kind` column, renamed to `element_type`/`element_order`, added inline support
1.0	2024-12	Initial specification with `block_type`/`block_order`

Reference Implementation

The DuckDB Markdown extension (duckdb_markdown) serves as the reference implementation:

Repository: https://github.com/teaguesterling/duckdb_markdown
Documentation: https://duckdb-markdown.readthedocs.io/
Reader: read_markdown_blocks()
Writer: COPY TO ... (FORMAT MARKDOWN, markdown_mode 'blocks')
Converters: duck_block_to_md(), duck_blocks_to_md(), duck_blocks_to_sections()