Markdown Document Block Implementation

This document describes the Markdown-specific implementation of the Document Block Specification.

Extension: duckdb_markdown Namespace: md Spec Version: 2.0

Overview

The markdown extension implements the duck_block spec for CommonMark and GitHub Flavored Markdown documents. It provides:

read_markdown_blocks() - Parse markdown into block rows (duck_block shape)
COPY TO with markdown_mode 'blocks' or 'duck_block' - Render blocks to markdown files
duck_block_to_md() - Convert single duck_block to markdown string
duck_blocks_to_md() - Convert list of duck_blocks to markdown string
duck_blocks_to_sections() - Convert blocks to hierarchical sections

Duck Block Type

The extension uses the duck_block struct shape from the duck_block_utils specification:

STRUCT(
    kind          VARCHAR,              -- 'block' or 'inline'
    element_type  VARCHAR,              -- 'heading', 'paragraph', 'bold', 'link', etc.
    content       VARCHAR,              -- Text content
    level         INTEGER,              -- Document nesting depth (1 for top-level)
    encoding      VARCHAR,              -- 'text', 'json', 'yaml', 'html', 'xml'
    attributes    MAP(VARCHAR, VARCHAR),-- Key-value metadata (e.g., heading_level, language)
    element_order INTEGER               -- Position in sequence
)

Note: The duck_block type is defined by the duck_block_utils extension. This extension uses the shape but does not register the type name.

Supported Element Types

Block Types (kind = 'block')

element_type	Markdown Element	Notes
`heading`	`#`, `##`, etc. (ATX style)	Level in `attributes['heading_level']` (1-6), falls back to `level` field
`paragraph`	Text blocks	Inline formatting preserved
`code`	Fenced code blocks	Language in `attributes['language']`
`blockquote`	`>` quoted blocks	Level = nesting depth
`list`	`-`, `*`, `1.` lists	JSON array of items
`table`	GFM tables	JSON `{headers, rows}`
`hr`	`---`, `***`, `___`	Normalized to `---` on output
`metadata`	YAML frontmatter	Level 0, encoding 'yaml'
`frontmatter`	YAML frontmatter	Alias for `metadata`
`image`	`![alt](src "title")`	Details in attributes
`raw`	Raw HTML	Preserved HTML in markdown
`html`	Raw HTML	Alias for raw
`md:html_block`	Raw HTML	Markdown-specific namespace

Inline Types (kind = 'inline')

element_type	Markdown Output	Attributes
`link`	`[text](href "title")`	`href`, `title`
`image`	`![alt](src "title")`	`src`, `title`
`bold` / `strong`	`text`	-
`italic` / `em`	`text`	-
`code`	`text`	-
`text`	plain text	-
`space`	word separator	-
`softbreak`	soft line break	-
`linebreak` / `br`	hard break	-
`strikethrough` / `del`	`~~text~~`	-
`superscript` / `sup`	`^text^`	-
`subscript` / `sub`	`~text~`	-
`underline`	`<u>text</u>`	HTML fallback
`smallcaps`	`<span style="...">`	HTML fallback
`math`	$text$ or `$$text$$`	`display`: inline/block
`quoted`	`"text"` or `'text'`	`quote_type`: single/double
`cite`	`[@key]`	`key`
`note`	`[^note]`	-

Reader: read_markdown_blocks()

read_markdown_blocks(
    path VARCHAR,                    -- File path or glob pattern
    include_filepath := false,       -- Add file_path column (alias: filename)
    include_raw_html := false        -- Include raw HTML blocks
)

Output Schema

Returns rows with duck_block shape:

Column	Type	Description
`kind`	VARCHAR	Always 'block' for this reader
`element_type`	VARCHAR	Block type identifier
`content`	VARCHAR	Block content
`level`	INTEGER	Document nesting depth (1 for top-level, 0 for frontmatter)
`encoding`	VARCHAR	'text', 'json', or 'yaml'
`attributes`	MAP(VARCHAR, VARCHAR)	Block metadata (heading_level, language, id, etc.)
`element_order`	INTEGER	Position in document (1-indexed)
`file_path`	VARCHAR	Source file (when enabled)

Note on heading levels: For headings, the actual H1-H6 level is stored in attributes['heading_level'], while level indicates document nesting depth (always 1 for top-level blocks). This matches the duck_block_utils convention.

Examples

-- Basic usage
SELECT * FROM read_markdown_blocks('README.md');

-- Filter to headings with heading level
SELECT content, attributes['heading_level'] as heading_level
FROM read_markdown_blocks('doc.md')
WHERE element_type = 'heading'
ORDER BY element_order;

-- Extract code blocks by language
SELECT content, attributes['language'] as lang
FROM read_markdown_blocks('tutorial.md')
WHERE element_type = 'code';

-- Multi-file analysis
SELECT file_path, element_type, count(*) as count
FROM read_markdown_blocks('docs/**/*.md', include_filepath := true)
GROUP BY file_path, element_type;

Writer: COPY TO Blocks

COPY query TO 'output.md' (
    FORMAT MARKDOWN,
    markdown_mode 'blocks',  -- or 'duck_block'
    -- Column mapping (defaults shown)
    kind_column 'kind',                -- 'block' or 'inline' (default: 'kind')
    element_type_column 'element_type',
    content_column 'content',
    level_column 'level',
    encoding_column 'encoding',
    attributes_column 'attributes'
);

The kind column determines how elements are rendered: - 'block' - Element is rendered with trailing newlines (paragraphs, headings, code blocks, etc.) - 'inline' - Element is rendered without trailing newlines (bold, italic, links, etc.)

When transitioning from inline to block elements, a paragraph break (\n\n) is automatically inserted.

Rendering Rules

element_type	Markdown Output
`heading`	`#` × heading_level + space + content (uses `attributes['heading_level']`, falls back to `level`)
`paragraph`	content + blank line
`code`	``` + language + newline + content + ```
`blockquote`	`>` prefix per line
`list`	`-` or `N.` per item
`list_item`	`-` for unordered, `N.` for ordered (uses `attributes['ordered']` and `attributes['item_number']`)
`table`	`\\| cell \\|` format with separator
`hr`	`---`
`metadata`	`---` + content + `---`
`image`	`![alt](src "title")`

Examples

-- Round-trip a document
COPY (
    SELECT kind, element_type, content, level, encoding, attributes
    FROM read_markdown_blocks('input.md')
    ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

-- Transform headings
COPY (
    SELECT
        kind,
        element_type,
        CASE WHEN element_type = 'heading'
             THEN upper(content)
             ELSE content END as content,
        level, encoding, attributes
    FROM read_markdown_blocks('doc.md')
    ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

-- Generate from scratch
COPY (
    SELECT * FROM (VALUES
        ('block', 'metadata', 'title: Generated Doc', 0, 'yaml', MAP{}::MAP(VARCHAR, VARCHAR), 0),
        ('block', 'heading', 'Introduction', 1, 'text', MAP{'id': 'intro'}::MAP(VARCHAR, VARCHAR), 1),
        ('block', 'paragraph', 'Welcome to this guide.', NULL, 'text', MAP{}::MAP(VARCHAR, VARCHAR), 2),
        ('block', 'code', 'print("hello")', NULL, 'text', MAP{'language': 'python'}::MAP(VARCHAR, VARCHAR), 3)
    ) AS t(kind, element_type, content, level, encoding, attributes, element_order)
) TO 'generated.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

Duck Block Conversion Functions

Convert blocks back to Markdown without writing to files. These functions enable in-memory document transformation pipelines.

duck_block_to_md(block)

Converts a single duck_block struct to a Markdown string.

SELECT duck_block_to_md({
    kind: 'block',
    element_type: 'heading',
    content: 'Hello World',
    level: 1,
    encoding: 'text',
    attributes: MAP{},
    element_order: 0
});
-- Returns: '# Hello World\n\n'

duck_blocks_to_md(blocks[])

Converts a list of blocks to a complete Markdown document string.

-- Using the function
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('source.md') b;

-- Transform and render in one query
SELECT duck_blocks_to_md(
    list(b ORDER BY element_order)
)
FROM read_markdown_blocks('source.md') b;

-- Build document programmatically
SELECT duck_blocks_to_md([
    {kind: 'block', element_type: 'heading', content: 'Title', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
    {kind: 'block', element_type: 'paragraph', content: 'Body.', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 1}
]);

duck_blocks_to_sections(blocks[])

Converts blocks to a list of section structs with hierarchy information.

Output Schema: | Column | Type | Description | |--------|------|-------------| | section_id | VARCHAR | Section identifier (from id attribute or generated) | | section_path | VARCHAR | Hierarchical path ("Parent > Child > Grandchild") | | level | INTEGER | Heading level (0 for frontmatter) | | title | VARCHAR | Section heading text | | content | MARKDOWN | Rendered content until next heading |

-- Extract sections from blocks
SELECT s.section_id, s.level, s.title, length(s.content) as len
FROM (
    SELECT unnest(duck_blocks_to_sections(
        list(b ORDER BY element_order)
    )) as s
    FROM read_markdown_blocks('doc.md') b
);

-- Pipeline: read blocks -> filter -> convert to sections
SELECT s.*
FROM (
    SELECT unnest(duck_blocks_to_sections(
        list(b ORDER BY element_order)
    )) as s
    FROM read_markdown_blocks('tutorial.md') b
    WHERE b.element_type IN ('heading', 'paragraph', 'code')
);

Inline Elements in Blocks

The duck_block_to_md and duck_blocks_to_md functions support both block and inline elements. Use kind: 'inline' for inline formatting:

-- Compose rich inline content
SELECT duck_blocks_to_md([
    {kind: 'inline', element_type: 'text', content: 'Check out ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
    {kind: 'inline', element_type: 'link', content: 'our docs', level: 1, encoding: 'text', attributes: MAP{'href': 'https://docs.example.com'}, element_order: 1},
    {kind: 'inline', element_type: 'text', content: ' for ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 2},
    {kind: 'inline', element_type: 'bold', content: 'more info', level: 1, encoding: 'text', attributes: MAP{}, element_order: 3}
]);
-- Returns: 'Check out [our docs](https://docs.example.com) for **more info**'

Using Inline Elements Within Block Content

Combine inline rendering with block functions for rich document generation:

-- Paragraph with formatted content
SELECT duck_block_to_md({
    kind: 'block',
    element_type: 'paragraph',
    content: duck_blocks_to_md([
        {kind: 'inline', element_type: 'text', content: 'Visit ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
        {kind: 'inline', element_type: 'link', content: 'GitHub', level: 1, encoding: 'text', attributes: MAP{'href': 'https://github.com'}, element_order: 1},
        {kind: 'inline', element_type: 'text', content: ' for projects.', level: 1, encoding: 'text', attributes: MAP{}, element_order: 2}
    ])::VARCHAR,
    level: NULL,
    encoding: 'text',
    attributes: MAP{},
    element_order: 0
});
-- Returns: 'Visit [GitHub](https://github.com) for projects.\n\n'

Round-Trip Fidelity

Preserved

Block order and structure
Heading levels (1-6)
Code block languages
List ordering (ordered vs unordered)
Table structure (headers and rows)
Frontmatter content
Inline formatting within blocks

Normalized

Whitespace between blocks → single blank line
Heading style → ATX (#) only
Horizontal rule style → ---
Code fence style → triple backticks
List markers → - for unordered, 1. for ordered

Comparison: Sections vs Blocks

The extension provides two document representations:

Aspect	read_markdown_sections	read_markdown_blocks
Granularity	Hierarchical sections	Flat block sequence
Use case	Document navigation	Document transformation
Content	Section content (configurable)	Individual blocks
Hierarchy	Parent/child relationships	Sequential order only
Round-trip	Lossy (section-level)	High fidelity

Choose sections for: TOC generation, section extraction, hierarchical queries Choose blocks for: Document transformation, format conversion, block-level analysis

Version History

Version	Date	Changes
2.1	2025-01	Fixed level/heading_level: `level` is now document depth (1 for top-level), heading H1-H6 stored in `attributes['heading_level']`. Added `list_item` element type support. Fixed inline-to-block transitions. Added `filename` parameter alias.
2.0	2025-01	Unified on `duck_block` shape, removed `markdown_doc_block` type
1.3	2025-01	Added unified `doc_element` type with conversion functions
1.2	2024-12	Added duck_block conversion functions, `duck_block` COPY mode alias
1.1	2024-12	Aligned with doc_block_spec v1.0, added metadata type
1.0	2024	Initial implementation