Ecosystem Integration

The DuckDB Markdown extension is the first released part of a planned document-processing ecosystem. This page describes the related extensions and the cross-format workflows they are intended to enable.

Extension Status

Extension                 Purpose                        Status
duckdb_markdown           Markdown processing            Released
duckdb_webbed             HTML/XML processing            Planned
duckdb_duck_block_utils   Block manipulation utilities   Planned

Current Capabilities (duckdb_markdown)

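The markdown extension already provides block-level document processing: reading Markdown into blocks, converting blocks back to Markdown, and grouping blocks into hierarchical sections: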

LOAD markdown;

-- Read markdown into duck_block rows
SELECT * FROM read_markdown_blocks('README.md');

-- Convert blocks back to markdown
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('doc.md') b;

-- Convert blocks to hierarchical sections
SELECT unnest(duck_blocks_to_sections(list(b ORDER BY element_order)))
FROM read_markdown_blocks('doc.md') b;

Block-Level Transformations

-- Filter and transform blocks
SELECT duck_blocks_to_md(list(b ORDER BY element_order))
FROM read_markdown_blocks('doc.md') b
WHERE element_type IN ('heading', 'paragraph', 'code');

-- Extract code blocks with language
SELECT content, attributes['language'] AS lang
FROM read_markdown_blocks('tutorial.md')
WHERE element_type = 'code';

-- Round-trip with modifications
COPY (
    SELECT kind, element_type,
           CASE WHEN element_type = 'heading' THEN upper(content) ELSE content END AS content,
           level, encoding, attributes
    FROM read_markdown_blocks('input.md')
    ORDER BY element_order
) TO 'output.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

Planned: Cross-Format Conversion

Once the webbed and duck_block_utils extensions are released, cross-format workflows like the following will be possible:

Markdown to HTML (Planned)

-- Future: Convert markdown blocks to HTML
LOAD markdown;
LOAD webbed;

SELECT duck_blocks_to_html(
    list(b ORDER BY element_order)
)
FROM read_markdown_blocks('README.md') b;

HTML to Markdown (Planned)

-- Future: Convert HTML to markdown via duck_block
LOAD markdown;
LOAD webbed;

SELECT duck_blocks_to_md(
    html_to_duck_blocks('<h1>Title</h1><p>Content</p>')
);

Planned: Block Utilities

The duck_block_utils extension will provide format-agnostic block manipulation:

Planned Functions

Function                               Description
duck_blocks_filter(blocks, types[])    Keep only specified element types
duck_blocks_exclude(blocks, types[])   Remove specified element types
duck_blocks_to_text(blocks)            Extract plain text content
duck_blocks_toc(blocks)                Generate table of contents
duck_blocks_validate(blocks)           Check schema compliance
duck_blocks_stats(blocks)              Block type statistics

Example Usage (Planned)

LOAD markdown;
LOAD duck_block_utils;

-- Generate table of contents
SELECT * FROM duck_blocks_toc(
    (SELECT list(b ORDER BY element_order) FROM read_markdown_blocks('README.md') b)
);

-- Get block type distribution
SELECT * FROM duck_blocks_stats(
    (SELECT list(b ORDER BY element_order) FROM read_markdown_blocks('docs/**/*.md') b)
);
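
Until duck_block_utils is released, roughly similar results can be obtained with plain SQL over read_markdown_blocks. The queries below are a minimal sketch, not the planned API: they assume that heading blocks carry a 1-based heading level in the level column, and the output column names (toc_line, block_count) are made up for illustration. The actual output shape of duck_blocks_toc and duck_blocks_stats may differ.

```sql
LOAD markdown;

-- Approximate table of contents: one indented line per heading.
-- Assumes level holds the 1-based heading level for heading blocks.
SELECT repeat('  ', level - 1) || '- ' || content AS toc_line
FROM read_markdown_blocks('README.md')
WHERE element_type = 'heading'
ORDER BY element_order;

-- Approximate block type distribution: number of blocks per element type.
SELECT element_type, count(*) AS block_count
FROM read_markdown_blocks('README.md')
GROUP BY element_type
ORDER BY block_count DESC, element_type;
```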

The duck_block Specification

All ecosystem extensions share the common duck_block structure:

STRUCT(
    kind          VARCHAR,              -- 'block' or 'inline'
    element_type  VARCHAR,              -- 'heading', 'paragraph', 'bold', etc.
    content       VARCHAR,              -- Text content
    level         INTEGER,              -- Heading level or nesting depth
    encoding      VARCHAR,              -- 'text', 'json', 'yaml'
    attributes    MAP(VARCHAR, VARCHAR),-- Element metadata
    element_order INTEGER               -- Position in sequence
)

This shared structure enables:

  • Format conversion: Read one format, write another
  • Cross-format queries: Analyze structure across document types
  • Unified tooling: Common utilities work with any format
  • SQL-based transformation: Filter, aggregate, and manipulate documents
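
As a rough illustration of the SQL-based transformation point, the sketch below builds a few rows with the duck_block columns by hand and applies the same heading transformation used earlier on this page. It needs no extension. The row values are invented for illustration, and the NULL level for non-heading blocks is an assumption rather than something the specification is known to require.

```sql
-- Hand-built rows with the duck_block columns; all values are illustrative only.
WITH blocks AS (
    SELECT *
    FROM (VALUES
        ('block', 'heading',   'Title',      1,                     'text', MAP {'id': 'title'},     1),
        ('block', 'paragraph', 'Content',    CAST(NULL AS INTEGER), 'text', MAP {'id': 'intro'},     2),
        ('block', 'code',      'SELECT 42;', CAST(NULL AS INTEGER), 'text', MAP {'language': 'sql'}, 3)
    ) AS t(kind, element_type, content, level, encoding, attributes, element_order)
)
-- Upper-case heading text and leave every other block unchanged.
SELECT element_order,
       element_type,
       CASE WHEN element_type = 'heading' THEN upper(content) ELSE content END AS content
FROM blocks
ORDER BY element_order;
```

Because the query only touches the shared columns, the same statement would apply to blocks produced by any reader that emits this structure.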

See Duck Block Specification for complete details.