Markdown Document Block Implementation
This document describes the Markdown-specific implementation of the Document Block Specification.
Extension: duckdb_markdown
Namespace: md
Spec Version: 2.0
Overview
The markdown extension implements the duck_block spec for CommonMark and GitHub Flavored Markdown documents. It provides:
read_markdown_blocks()- Parse markdown into block rows (duck_block shape)COPY TOwithmarkdown_mode 'blocks'or'duck_block'- Render blocks to markdown filesduck_block_to_md()- Convert single duck_block to markdown stringduck_blocks_to_md()- Convert list of duck_blocks to markdown stringduck_blocks_to_sections()- Convert blocks to hierarchical sections
Duck Block Type
The extension uses the duck_block struct shape from the duck_block_utils specification:
STRUCT(
kind VARCHAR, -- 'block' or 'inline'
element_type VARCHAR, -- 'heading', 'paragraph', 'bold', 'link', etc.
content VARCHAR, -- Text content
level INTEGER, -- Document nesting depth (1 for top-level)
encoding VARCHAR, -- 'text', 'json', 'yaml', 'html', 'xml'
attributes MAP(VARCHAR, VARCHAR),-- Key-value metadata (e.g., heading_level, language)
element_order INTEGER -- Position in sequence
)
Note: The duck_block type is defined by the duck_block_utils extension. This extension uses the shape but does not register the type name.
Supported Element Types
Block Types (kind = 'block')
| element_type | Markdown Element | Notes |
|---|---|---|
heading |
#, ##, etc. (ATX style) |
Level in attributes['heading_level'] (1-6), falls back to level field |
paragraph |
Text blocks | Inline formatting preserved |
code |
Fenced code blocks | Language in attributes['language'] |
blockquote |
> quoted blocks |
Level = nesting depth |
list |
-, *, 1. lists |
JSON array of items |
table |
GFM tables | JSON {headers, rows} |
hr |
---, ***, ___ |
Normalized to --- on output |
metadata |
YAML frontmatter | Level 0, encoding 'yaml' |
frontmatter |
YAML frontmatter | Alias for metadata |
image |
 |
Details in attributes |
raw |
Raw HTML | Preserved HTML in markdown |
html |
Raw HTML | Alias for raw |
md:html_block |
Raw HTML | Markdown-specific namespace |
Inline Types (kind = 'inline')
| element_type | Markdown Output | Attributes |
|---|---|---|
link |
[text](href "title") |
href, title |
image |
 |
src, title |
bold / strong |
**text** |
- |
italic / em |
*text* |
- |
code |
`text` |
- |
text |
plain text | - |
space |
word separator | - |
softbreak |
soft line break | - |
linebreak / br |
hard break | - |
strikethrough / del |
~~text~~ |
- |
superscript / sup |
^text^ |
- |
subscript / sub |
~text~ |
- |
underline |
<u>text</u> |
HTML fallback |
smallcaps |
<span style="..."> |
HTML fallback |
math |
$text$ or $$text$$ |
display: inline/block |
quoted |
"text" or 'text' |
quote_type: single/double |
cite |
[@key] |
key |
note |
[^note] |
- |
Reader: read_markdown_blocks()
read_markdown_blocks(
path VARCHAR, -- File path or glob pattern
include_filepath := false, -- Add file_path column (alias: filename)
include_raw_html := false -- Include raw HTML blocks
)
Output Schema
Returns rows with duck_block shape:
| Column | Type | Description |
|---|---|---|
kind |
VARCHAR | Always 'block' for this reader |
element_type |
VARCHAR | Block type identifier |
content |
VARCHAR | Block content |
level |
INTEGER | Document nesting depth (1 for top-level, 0 for frontmatter) |
encoding |
VARCHAR | 'text', 'json', or 'yaml' |
attributes |
MAP(VARCHAR, VARCHAR) | Block metadata (heading_level, language, id, etc.) |
element_order |
INTEGER | Position in document (1-indexed) |
file_path |
VARCHAR | Source file (when enabled) |
Note on heading levels: For headings, the actual H1-H6 level is stored in attributes['heading_level'], while level indicates document nesting depth (always 1 for top-level blocks). This matches the duck_block_utils convention.
Examples
-- Basic usage
SELECT * FROM read_markdown_blocks('README.md');
-- Filter to headings with heading level
SELECT content, attributes['heading_level'] as heading_level
FROM read_markdown_blocks('doc.md')
WHERE element_type = 'heading'
ORDER BY element_order;
-- Extract code blocks by language
SELECT content, attributes['language'] as lang
FROM read_markdown_blocks('tutorial.md')
WHERE element_type = 'code';
-- Multi-file analysis
SELECT file_path, element_type, count(*) as count
FROM read_markdown_blocks('docs/**/*.md', include_filepath := true)
GROUP BY file_path, element_type;
Writer: COPY TO Blocks
COPY query TO 'output.md' (
FORMAT MARKDOWN,
markdown_mode 'blocks', -- or 'duck_block'
-- Column mapping (defaults shown)
kind_column 'kind', -- 'block' or 'inline' (default: 'kind')
element_type_column 'element_type',
content_column 'content',
level_column 'level',
encoding_column 'encoding',
attributes_column 'attributes'
);
The kind column determines how elements are rendered:
- 'block' - Element is rendered with trailing newlines (paragraphs, headings, code blocks, etc.)
- 'inline' - Element is rendered without trailing newlines (bold, italic, links, etc.)
When transitioning from inline to block elements, a paragraph break (\n\n) is automatically inserted.
Rendering Rules
| element_type | Markdown Output |
|---|---|
heading |
# × heading_level + space + content (uses attributes['heading_level'], falls back to level) |
paragraph |
content + blank line |
code |
``` + language + newline + content + ``` |
blockquote |
> prefix per line |
list |
- or N. per item |
list_item |
- for unordered, N. for ordered (uses attributes['ordered'] and attributes['item_number']) |
table |
\| cell \| format with separator |
hr |
--- |
metadata |
--- + content + --- |
image |
 |
Examples
-- Round-trip a document
COPY (
SELECT kind, element_type, content, level, encoding, attributes
FROM read_markdown_blocks('input.md')
ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');
-- Transform headings
COPY (
SELECT
kind,
element_type,
CASE WHEN element_type = 'heading'
THEN upper(content)
ELSE content END as content,
level, encoding, attributes
FROM read_markdown_blocks('doc.md')
ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');
-- Generate from scratch
COPY (
SELECT * FROM (VALUES
('block', 'metadata', 'title: Generated Doc', 0, 'yaml', MAP{}::MAP(VARCHAR, VARCHAR), 0),
('block', 'heading', 'Introduction', 1, 'text', MAP{'id': 'intro'}::MAP(VARCHAR, VARCHAR), 1),
('block', 'paragraph', 'Welcome to this guide.', NULL, 'text', MAP{}::MAP(VARCHAR, VARCHAR), 2),
('block', 'code', 'print("hello")', NULL, 'text', MAP{'language': 'python'}::MAP(VARCHAR, VARCHAR), 3)
) AS t(kind, element_type, content, level, encoding, attributes, element_order)
) TO 'generated.md' (FORMAT MARKDOWN, markdown_mode 'blocks');
Duck Block Conversion Functions
Convert blocks back to Markdown without writing to files. These functions enable in-memory document transformation pipelines.
duck_block_to_md(block)
Converts a single duck_block struct to a Markdown string.
SELECT duck_block_to_md({
kind: 'block',
element_type: 'heading',
content: 'Hello World',
level: 1,
encoding: 'text',
attributes: MAP{},
element_order: 0
});
-- Returns: '# Hello World\n\n'
duck_blocks_to_md(blocks[])
Converts a list of blocks to a complete Markdown document string.
-- Using the function
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('source.md') b;
-- Transform and render in one query
SELECT duck_blocks_to_md(
list(b ORDER BY element_order)
)
FROM read_markdown_blocks('source.md') b;
-- Build document programmatically
SELECT duck_blocks_to_md([
{kind: 'block', element_type: 'heading', content: 'Title', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'block', element_type: 'paragraph', content: 'Body.', level: NULL, encoding: 'text', attributes: MAP{}, element_order: 1}
]);
duck_blocks_to_sections(blocks[])
Converts blocks to a list of section structs with hierarchy information.
Output Schema:
| Column | Type | Description |
|--------|------|-------------|
| section_id | VARCHAR | Section identifier (from id attribute or generated) |
| section_path | VARCHAR | Hierarchical path ("Parent > Child > Grandchild") |
| level | INTEGER | Heading level (0 for frontmatter) |
| title | VARCHAR | Section heading text |
| content | MARKDOWN | Rendered content until next heading |
-- Extract sections from blocks
SELECT s.section_id, s.level, s.title, length(s.content) as len
FROM (
SELECT unnest(duck_blocks_to_sections(
list(b ORDER BY element_order)
)) as s
FROM read_markdown_blocks('doc.md') b
);
-- Pipeline: read blocks -> filter -> convert to sections
SELECT s.*
FROM (
SELECT unnest(duck_blocks_to_sections(
list(b ORDER BY element_order)
)) as s
FROM read_markdown_blocks('tutorial.md') b
WHERE b.element_type IN ('heading', 'paragraph', 'code')
);
Inline Elements in Blocks
The duck_block_to_md and duck_blocks_to_md functions support both block and inline elements. Use kind: 'inline' for inline formatting:
-- Compose rich inline content
SELECT duck_blocks_to_md([
{kind: 'inline', element_type: 'text', content: 'Check out ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'inline', element_type: 'link', content: 'our docs', level: 1, encoding: 'text', attributes: MAP{'href': 'https://docs.example.com'}, element_order: 1},
{kind: 'inline', element_type: 'text', content: ' for ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 2},
{kind: 'inline', element_type: 'bold', content: 'more info', level: 1, encoding: 'text', attributes: MAP{}, element_order: 3}
]);
-- Returns: 'Check out [our docs](https://docs.example.com) for **more info**'
Using Inline Elements Within Block Content
Combine inline rendering with block functions for rich document generation:
-- Paragraph with formatted content
SELECT duck_block_to_md({
kind: 'block',
element_type: 'paragraph',
content: duck_blocks_to_md([
{kind: 'inline', element_type: 'text', content: 'Visit ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'inline', element_type: 'link', content: 'GitHub', level: 1, encoding: 'text', attributes: MAP{'href': 'https://github.com'}, element_order: 1},
{kind: 'inline', element_type: 'text', content: ' for projects.', level: 1, encoding: 'text', attributes: MAP{}, element_order: 2}
])::VARCHAR,
level: NULL,
encoding: 'text',
attributes: MAP{},
element_order: 0
});
-- Returns: 'Visit [GitHub](https://github.com) for projects.\n\n'
Round-Trip Fidelity
Preserved
- Block order and structure
- Heading levels (1-6)
- Code block languages
- List ordering (ordered vs unordered)
- Table structure (headers and rows)
- Frontmatter content
- Inline formatting within blocks
Normalized
- Whitespace between blocks → single blank line
- Heading style → ATX (
#) only - Horizontal rule style →
--- - Code fence style → triple backticks
- List markers →
-for unordered,1.for ordered
Comparison: Sections vs Blocks
The extension provides two document representations:
| Aspect | read_markdown_sections | read_markdown_blocks |
|---|---|---|
| Granularity | Hierarchical sections | Flat block sequence |
| Use case | Document navigation | Document transformation |
| Content | Section content (configurable) | Individual blocks |
| Hierarchy | Parent/child relationships | Sequential order only |
| Round-trip | Lossy (section-level) | High fidelity |
Choose sections for: TOC generation, section extraction, hierarchical queries Choose blocks for: Document transformation, format conversion, block-level analysis
Version History
| Version | Date | Changes |
|---|---|---|
| 2.1 | 2025-01 | Fixed level/heading_level: level is now document depth (1 for top-level), heading H1-H6 stored in attributes['heading_level']. Added list_item element type support. Fixed inline-to-block transitions. Added filename parameter alias. |
| 2.0 | 2025-01 | Unified on duck_block shape, removed markdown_doc_block type |
| 1.3 | 2025-01 | Added unified doc_element type with conversion functions |
| 1.2 | 2024-12 | Added duck_block conversion functions, duck_block COPY mode alias |
| 1.1 | 2024-12 | Aligned with doc_block_spec v1.0, added metadata type |
| 1.0 | 2024 | Initial implementation |