Duck Block Specification
Version: 2.0 Status: Stable Last Updated: 2025-01
A format-agnostic specification for representing documents as sequences of typed blocks in DuckDB. This specification enables interoperability between document format extensions, allowing documents to be converted, transformed, and analyzed using SQL.
Overview
Documents are represented as ordered sequences of blocks. Each block is a row with standardized columns, enabling:
- Format conversion: Read HTML → transform → write Markdown
- Cross-format queries: Analyze structure across document types
- Unified tooling: Common utilities work with any compliant extension
- SQL-based transformation: Filter, aggregate, and manipulate documents
Design Principles
- Minimal core: Small set of universal element types that map across formats
- Extensible: Formats can define additional element types with namespaced identifiers
- Lossless where possible: Preserve source structure; note where normalization occurs
- SQL-native: Use DuckDB types naturally (VARCHAR, INTEGER, MAP, LIST)
Core Schema (duck_block shape)
All compliant extensions MUST produce rows with these columns:
| Column | Type | Required | Description |
|---|---|---|---|
kind |
VARCHAR | Yes | 'block' or 'inline' |
element_type |
VARCHAR | Yes | Element type identifier |
content |
VARCHAR | Yes | Primary content of the element |
level |
INTEGER | No | Hierarchy level, nesting depth, or NULL |
encoding |
VARCHAR | Yes | Content encoding (see below) |
attributes |
MAP(VARCHAR, VARCHAR) | Yes | Type-specific metadata |
element_order |
INTEGER | Yes | Sequential position in document (0-indexed) |
Optional columns:
| Column | Type | Description |
|--------|------|-------------|
| file_path | VARCHAR | Source file path (when reading multiple files) |
Column Semantics
kind
Distinguishes between block-level and inline elements: - 'block': Block-level elements (headings, paragraphs, code blocks, lists, tables) - 'inline': Inline elements (bold, italic, links, images within text)
Block elements are rendered with trailing newlines. Inline elements are concatenated without trailing newlines.
element_type
Identifies the element's semantic type. Uses either:
- Core type: One of the universal types defined below
- Namespaced type: Format-specific type as format:type (e.g., html:div, md:footnote)
content
The primary textual content of the element. Interpretation depends on encoding:
- For text: Raw text content, may include inline formatting
- For json: Valid JSON string
- For yaml: Valid YAML string
- For html: HTML fragment
- For xml: XML fragment
level
Document nesting depth or heading level: - Top-level blocks: 1 - Nested structures: Depth from root (2+) - Frontmatter/metadata: 0 - Not applicable: NULL
Note: For headings, the H1-H6 level should preferably be stored in attributes['heading_level']. If not present, the level field is used as a fallback.
encoding
Declares how content should be interpreted:
| Encoding | Description | Use Cases |
|---|---|---|
text |
Plain text, possibly with inline markup | Paragraphs, headings, code |
json |
JSON-encoded structured data | Lists, tables, complex structures |
yaml |
YAML-encoded structured data | Frontmatter, metadata |
html |
HTML fragment | Preserved HTML in markdown |
xml |
XML fragment | Preserved XML content |
attributes
Key-value metadata specific to the element type. All values are VARCHAR (convert as needed).
Common attributes:
| Attribute | Description | Used By |
|-----------|-------------|---------|
| id | Element identifier | headings, anchors |
| heading_level | Heading level (takes priority over level field) | headings |
| language | Code language | code blocks |
| ordered | List ordering ('true'/'false') | lists |
| start | List start number | ordered lists |
| src | Source URL | images, embeds |
| href | Link URL | links |
| alt | Alternative text | images |
| title | Title text | links, images |
Core Block Types (kind = 'block')
These element types MUST be recognized by all compliant extensions:
| Type | Description | Content Encoding | Level | Key Attributes |
|---|---|---|---|---|
heading |
Section heading | text | 1-6 | id, heading_level |
paragraph |
Text paragraph | text | NULL | - |
code |
Code block | text | NULL | language |
blockquote |
Quoted content | text | depth (1+) | - |
list |
List of items | json | 1 | ordered, start |
table |
Tabular data | json | NULL | - |
hr |
Thematic break | text (empty) | NULL | - |
metadata |
Document metadata | yaml or json | 0 | - |
frontmatter |
Document metadata | yaml | 0 | - |
image |
Block-level image | text (empty) | NULL | src, alt, title |
raw |
Raw format-specific content | varies | NULL | format |
Block Type Details
heading
Section or document headings. The H1-H6 level is stored in attributes['heading_level'] (preferred). If heading_level is not present, the level field can be used as a fallback.
kind: 'block'
element_type: 'heading'
content: 'Introduction'
level: 1
encoding: 'text'
attributes: {'id': 'introduction', 'heading_level': '1'}
paragraph
Block of text. Inline formatting (bold, italic, links) is preserved in content.
kind: 'block'
element_type: 'paragraph'
content: 'This is a paragraph with **bold** and *italic* text.'
level: NULL
encoding: 'text'
attributes: {}
code
Preformatted code or monospace text.
kind: 'block'
element_type: 'code'
content: 'def hello():\n print("Hello")'
level: NULL
encoding: 'text'
attributes: {'language': 'python'}
list
Ordered or unordered list. Content is JSON array.
kind: 'block'
element_type: 'list'
content: '["Item 1", "Item 2", "Item 3"]'
level: 1
encoding: 'json'
attributes: {'ordered': 'false'}
table
Tabular data as JSON with headers and rows.
kind: 'block'
element_type: 'table'
content: '{"headers": ["Name", "Age"], "rows": [["Alice", "30"], ["Bob", "25"]]}'
level: NULL
encoding: 'json'
attributes: {}
metadata / frontmatter
Document metadata (frontmatter, head elements, etc.). Level 0 indicates document-level.
kind: 'block'
element_type: 'frontmatter'
content: 'title: My Document\nauthor: Jane Doe'
level: 0
encoding: 'yaml'
attributes: {}
Core Inline Types (kind = 'inline')
Inline elements for rich text composition:
| Type | Description | Output | Key Attributes |
|---|---|---|---|
text |
Plain text | as-is | - |
bold / strong |
Bold text | **text** |
- |
italic / em |
Italic text | *text* |
- |
code |
Inline code | `text` |
- |
link |
Hyperlink | [text](href) |
href, title |
image |
Inline image |  |
src, title |
strikethrough / del |
Strikethrough | ~~text~~ |
- |
superscript / sup |
Superscript | ^text^ |
- |
subscript / sub |
Subscript | ~text~ |
- |
linebreak / br |
Hard line break | <br> |
- |
math |
Math expression | $text$ |
display: inline/block |
Inline Element Example
-- Compose rich text from inline elements
SELECT duck_blocks_to_md([
{kind: 'inline', element_type: 'text', content: 'Check out ', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'inline', element_type: 'link', content: 'our docs', level: NULL, encoding: 'text', attributes: MAP{'href': 'https://example.com'}, element_order: 1},
{kind: 'inline', element_type: 'text', content: ' for ', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 2},
{kind: 'inline', element_type: 'bold', content: 'more info', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 3}
]);
-- Returns: 'Check out [our docs](https://example.com) for **more info**'
Format-Specific Extensions
Extensions MAY define additional element types using namespaced identifiers: format:type
Namespace Registry
| Namespace | Format | Extension |
|---|---|---|
md |
Markdown | duckdb_markdown |
html |
HTML | duckdb_webbed |
xml |
XML | duckdb_webbed |
Extension Examples
Markdown-specific:
- md:footnote - Footnote definition
- md:task_list - Task/checkbox list
- md:html_block - Raw HTML in markdown
HTML-specific:
- html:div - Generic div container
- html:script - Script element
Writer Requirements
Extensions providing COPY TO functionality MUST:
- Accept rows matching the core schema
- Handle all core element types (or error gracefully)
- Support custom column name mapping via options
- Handle block/inline transitions appropriately
Standard COPY TO Options
| Option | Type | Default | Description |
|---|---|---|---|
kind_column |
VARCHAR | 'kind' | Column containing kind |
element_type_column |
VARCHAR | 'element_type' | Column containing element type |
content_column |
VARCHAR | 'content' | Column containing content |
level_column |
VARCHAR | 'level' | Column containing level |
encoding_column |
VARCHAR | 'encoding' | Column containing encoding |
attributes_column |
VARCHAR | 'attributes' | Column containing attributes |
SQL Usage Examples
Reading and Writing
-- Read markdown into duck_block rows
SELECT * FROM read_markdown_blocks('doc.md');
-- Round-trip: read, transform, write
COPY (
SELECT kind, element_type, upper(content) as content, level, encoding, attributes
FROM read_markdown_blocks('input.md')
WHERE element_type != 'hr'
ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');
Converting Blocks to Markdown
-- Convert blocks back to markdown string
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('doc.md') b;
-- Build document programmatically
SELECT duck_blocks_to_md([
{kind: 'block', element_type: 'heading', content: 'Title', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'block', element_type: 'paragraph', content: 'Body text.', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 1}
]);
Converting Blocks to Sections
-- Get hierarchical sections from blocks
SELECT s.section_id, s.level, s.title
FROM (
SELECT unnest(duck_blocks_to_sections(list(b ORDER BY element_order))) as s
FROM read_markdown_blocks('doc.md') b
);
Version History
| Version | Date | Changes |
|---|---|---|
| 2.0 | 2025-01 | Added kind column, renamed to element_type/element_order, added inline support |
| 1.0 | 2024-12 | Initial specification with block_type/block_order |
Reference Implementation
The DuckDB Markdown extension (duckdb_markdown) serves as the reference implementation:
- Repository: https://github.com/teaguesterling/duckdb_markdown
- Documentation: https://duckdb-markdown.readthedocs.io/
- Reader:
read_markdown_blocks() - Writer:
COPY TO ... (FORMAT MARKDOWN, markdown_mode 'blocks') - Converters:
duck_block_to_md(),duck_blocks_to_md(),duck_blocks_to_sections()