Skip to content

Data Formatting and Preparation

Data formatting is a crucial step in preparing content for Large Language Models (LLMs). Proper formatting ensures that the input data is clean, structured, and optimized for model processing, leading to better results and more accurate responses.

Why Proper Formatting Matters

Importance of Data Formatting

  • Improves model comprehension and response quality
  • Reduces noise and irrelevant information
  • Maintains semantic structure and relationships
  • Ensures consistent input format for LLMs
  • Preserves important metadata while removing unnecessary formatting

Available Tools

MarkItDown

Microsoft MarkItDown @microsoft/markitdown

A versatile Python-based conversion tool that supports:

  • PDF documents
  • Microsoft Office files (Word, PowerPoint, Excel)
  • Images (with EXIF and OCR capabilities)
  • Audio files (metadata and transcription)
  • HTML documents
  • Text-based formats (CSV, JSON, XML)
  • ZIP archives

Perfect for batch processing and creating standardized markdown content for LLM consumption.

DOM-to-Semantic-Markdown

DOM-to-Semantic-Markdown @romansky/dom-to-semantic-markdown

Specialized tool for converting HTML/DOM content to semantic markdown:

  • Preserves document structure and hierarchy
  • Extracts metadata and semantic relationships
  • Optimized output for LLM processing
  • Supports various metadata extraction modes
  • Ideal for web content processing

Best Practices

Formatting Guidelines

  1. Remove unnecessary styling and formatting
  2. Preserve semantic structure and relationships
  3. Maintain clear document hierarchy
  4. Include relevant metadata
  5. Use consistent markdown formatting
  6. Validate output quality before LLM processing